Replicate

Salesforce BLIP. Vision-language model for image captioning and visual question answering. Given an image it writes a short natural-language caption, or answers a question about the image when one is supplied. A widely used baseline for automatic captioning.

replicate · blip · captioning

CLIP Interrogator

pharmapsychotic's CLIP Interrogator. Takes an image and produces a Stable-Diffusion-style text prompt by combining BLIP captioning with CLIP to rank likely subjects, artists, mediums and styles. Commonly used to reverse-engineer a prompt from an existing picture.

replicate · clip-interrogator · captioning

Depth Anything v2

Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.

FLUX 1.1 Pro

Black Forest Labs' flagship text-to-image model. Generates faster than FLUX.1 Pro with better prompt adherence, strong photorealism and reliable spatial composition. Runs as a hosted Replicate model.

Flux 1.1 Pro Ultra

high-quality · photorealistic

FLUX 1.1 Pro in ultra mode. Up to 4 megapixel images with raw mode for photorealism.

€0.60 · 15.0s

FLUX 1.1 Pro Ultra

FLUX 1.1 Pro in Ultra mode by Black Forest Labs. Generates up to 4 megapixel images with a raw mode for less processed, more natural-looking photography. Best FLUX option when output resolution and fine detail matter.

Flux Dev

Black Forest Labs' development model. Fast, high-quality image generation with LoRA support.

€0.50 · 10.0s

popular · fast · lora

Google Imagen 4

Image · Google DeepMind

Google DeepMind's Imagen 4 text-to-image model, hosted on Replicate. Sharp detail, accurate text rendering, and strong prompt adherence across photographic and illustrated styles. Outputs up to 2K resolution.

replicate · google · imagen

Google Veo 2

Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.

€5.00 · 120.0s

high-quality · popular

Google Veo 3 (Replicate)

Google's Veo 3 served via Replicate. Text-to-video with native synchronized audio generation. High-fidelity motion and scene coherence in short clips.

€8.00

replicate · google · veo

Google Veo 3.1

Latest Veo with image-to-video and context-aware audio

€6.00 · 92.0s

popular · audio · i2v

HunyuanVideo

Video · Tencent

Tencent's HunyuanVideo, a 13B open-weights text-to-video diffusion transformer. Produces high-motion, photorealistic clips with smooth temporal consistency and was one of the first open models to rival closed systems on motion quality.

€5.00 · 120.0s

replicate · tencent · hunyuan

Icons (SDXL Flat Pop)

SDXL fine-tune by galleri5 for slick flat icons and pop constructivist graphics with thick edges. Trained on Bing generations, it produces clean single-subject icon art that suits app icons, badges and UI glyphs. Raster output, not true vector.

replicate · icon · logo

Ideogram v3 Quality

Image · Ideogram

The highest-quality tier of Ideogram v3. Improved photorealism and prompt adherence over v2 while keeping Ideogram's best-in-class text rendering. Supports style references and inline text layout.

replicate · ideogram · text-to-image

Incredibly Fast Whisper

Whisper Large v3 wrapped with Hugging Face Transformers optimizations (batched inference, flash attention) for very high throughput. Transcribes hours of audio in minutes on a single GPU. Maintained by Vaibhav Srivastav. Good when you need bulk transcription fast.

replicate · whisper · stt

InstantID

InstantID makes realistic portraits of a real person from a single reference photo without per-user training. Combines a face encoder with an IdentityNet adapter on SDXL to keep identity and pose while following a text prompt, so it is fast and tuning-free.

avatar · portrait · instant-id

Kling v2.1

Kuaishou's Kling v2.1, generating 5 and 10 second videos at 720p or 1080p from text or an image. Known for cinematic camera work and realistic physical motion, available on Replicate via the official KwaiVGI account.

€6.00

replicate · kuaishou · kling

Kling v2.1 Master

Kuaishou's premium Kling v2.1 Master. Generates 1080p 5s and 10s clips from text or an image with strong dynamics and prompt adherence. The top tier of the Kling 2.1 family.

€6.00

replicate · kuaishou · kling

Kling v3

Cinematic video up to 15s with multi-shot and native audio

€2.00 · 120.0s

popular · audio · i2v

Kling v3 Omni

Most versatile: multi-reference images, video editing, native audio

€2.50 · 120.0s

popular · audio · i2v

Midjourney V7

high-quality · aesthetic · popular

The latest Midjourney model. Industry-leading aesthetic quality and prompt adherence for image generation.

€3.00 · 30.0s

MiniMax Hailuo 02

Video · Minimax

MiniMax Hailuo 02 on Replicate. Text-to-video and image-to-video producing 6s or 10s clips at 768p standard or 1080p pro. Known for accurate real-world physics and stable motion.

replicate · minimax · hailuo

MusicGen

Audio · Meta

Meta's music generation model. Generate up to 1 minute of music from text descriptions.

€1.50 · 30.0s

music · popular

Professional Headshot (FLUX Kontext)

Turns any single selfie into a clean professional headshot using FLUX Kontext image editing. Keeps the person's face while swapping to business attire, a studio background and even lighting. Aimed at LinkedIn-style profile photos.

avatar · portrait · headshot

Recraft 20B SVG

Recraft's faster, cheaper vector model. Outputs editable SVG paths instead of raster pixels, so logos, icons and flat illustrations scale to any size without blur. Defaults to a vector_illustration style and supports line art and engraving looks. Hosted API only.

replicate · recraft · svg

Recraft V3

Recraft's text-to-image model, which topped the Artificial Analysis text-to-image arena (hosted on Hugging Face) at release. Strong long-text rendering, brand-style consistency, and precise control over image dimensions and color palettes.

replicate · recraft · text-to-image

Recraft Vectorize

Recraft's raster-to-vector converter. Takes a PNG or JPG and traces it into a clean SVG with precise vector paths, aimed at logos, icons and graphics that need to scale. Image-to-SVG counterpart to Recraft's text-to-SVG models.

svg · vector · recraft

Runway Gen 4.5

Top-ranked for motion quality and visual fidelity

€1.00 · 30.0s

popular · top-quality

Runway Gen-4 Turbo

Runway's Gen-4 Turbo on Replicate. Fast image-to-video generation producing 5s and 10s clips at 720p with strong character and scene consistency across shots.

replicate · runway · gen-4

SAM 2 (Segment Anything 2)

Multimodal · Meta

Meta Segment Anything 2. Promptable segmentation across images and video with temporal memory. Zero-shot, point/box/mask prompts, fast on a single H100.

replicate · segmentation · meta

Stable Diffusion XL

Stability AI's SDXL 1.0 with the optional refiner. A 3.5B-parameter base model that forms a 6.6B-parameter ensemble pipeline when paired with the refiner, it became the default open image model before FLUX. Good for fine-tuning and LoRAs, broad community support.

replicate · sdxl · stability-ai

Sticker Maker

fofr's sticker generator that outputs graphics with transparent backgrounds, so the result drops straight into chat apps or print sheets. Runs an SDXL-based pipeline at high speed (default 17 steps) and returns die-cut style art without manual background removal.

replicate · sticker · transparent

Whisper

STT · OpenAI

OpenAI's Whisper running on Replicate. General-purpose speech recognition trained on 680k hours of multilingual audio. Transcribes 99 languages and can translate them into English, is robust to accents and background noise, and outputs plain text, segments, or word-level timestamps.
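As a rough illustration of what the segment output is good for, the sketch below turns a list of timed segments into SRT subtitles. The segment shape (`start`, `end`, `text` keys, times in seconds) is an assumption made for this example, not a confirmed schema of the hosted model; check the actual response before relying on it.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from segments with assumed 'start', 'end', 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


# Hypothetical segments, invented for this example.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " This is a test."},
]
srt_text = segments_to_srt(example_segments)
print(srt_text)
```

Word-level timestamps could be handled the same way by emitting one block per word or per short group of words.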

replicate · openai · whisper

851-Labs Background Remover

Background removal model from 851-Labs that outputs a clean cutout with a transparent alpha channel. One of the most-run background removers on Replicate, handles people, products and objects on busy backgrounds.

background-removal · 851-labs · cutout

Ad Inpaint (Product Photo)

Product advertising photo generator. You upload a cut-out product shot and a prompt describing the scene; it places the product on a new generated background with matching lighting and shadows, so a plain packshot becomes an ecommerce or ad-ready hero image without a photo studio.

product · ecommerce · product-photo

AnimateDiff

Plug-and-play motion module that animates personalized Stable Diffusion models without further training. 16-frame clips at 512x512.

replicate · animation · animatediff

AnimateDiff Lightning

ByteDance distillation of AnimateDiff. 4-step sampling gives over 10x faster inference at quality comparable to the multi-step base model.

replicate · animation · bytedance

AudioLDM 2

TTS · AudioLDM

Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.

audioldm · music-generation · diffusion

AuraFlow v0.3

fal.ai's fully open-source 6.8B flow-based text-to-image model. Up to 1536x1536 resolution.

auraflow · text-to-image · open-weights

Bark

Audio · Suno

Suno's text-to-audio model. Generates realistic speech, music, and sound effects.

€0.50 · 15.0s

speech · sound-effects

BiRefNet Background Removal

BiRefNet high-resolution dichotomous image segmentation for background removal. Bilateral reference network that produces sharp matting on fine detail like hair, fur and thin structures, often cleaner than older U2Net or rembg models.

background-removal · birefnet · segmentation

BRIA Remove Background

BRIA AI's commercial background removal model trained on fully licensed data. Produces accurate cutouts for e-commerce and design, with attention to clean edges around products and people.

background-removal · bria · ecommerce

BRIA RMBG-1.4

BRIA's first commercial-safe background-removal model. Trained on fully-licensed data, suitable for production e-commerce and design pipelines.

replicate · background-removal · bria

BRIA RMBG-2.0

BRIA's professional background-removal model trained on fully-licensed data. Commercial-safe.

bria · image-edit · background-removal

Bringing Old Photos Back to Life

Image · Microsoft

Microsoft Research pipeline by Ziyu Wan et al. that restores scanned old photos, removing scratches, dust and fading and optionally enhancing faces in one pass.

restore · old-photo · scratch-removal

ByteDance Seedance 1 Pro

Video · ByteDance

ByteDance's Seedance 1 Pro on Replicate. Text-to-video and image-to-video producing 5s or 10s clips at 480p or 1080p. Strong motion quality and prompt following from the Seedance family.

replicate · bytedance · seedance

Cartoonify

catacolabs Cartoonify turns a photo into a flat cartoon illustration. Takes a single image and returns a stylized cartoon version with clean shapes and bold outlines. Straightforward one-input model for avatars and profile pictures.

avatar · portrait · cartoon

CCSR (Content-Consistent SR)

Content-Consistent Super-Resolution model. Reduces hallucination compared to typical diffusion-based upscalers while keeping perceptual quality high.

replicate · upscaling · image-restore

Champ Human Animation

replicate · animation · human-motion

Champ controllable human image animation. Uses 3D parametric guidance (SMPL) for realistic full-body motion transfer from a single reference image.

€0.12

Chatterbox

Resemble AI's open Chatterbox TTS. Zero-shot voice cloning from a short audio prompt with an exaggeration control for emotion intensity, plus CFG weight to balance pacing and fidelity.

replicate · resemble-ai · tts

Clarity Upscaler

High-resolution image upscaler with creative detail re-imagination via SD-based hallucination. Strong for photography and product shots.

replicate · upscaling · creative

Code Llama 13B Instruct

Meta's 13B Code Llama tuned for instruction following. A faster mid-size option for code generation and completion, supporting infilling for inserting code at a cursor position. Served on Replicate per call.
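To make the infilling idea concrete, here is a minimal sketch that splits a source file at a cursor position and builds a fill-in-the-middle prompt. The `<PRE>`, `<SUF>`, `<MID>` layout follows the format published with Code Llama; the `<CURSOR>` marker is a hypothetical placeholder invented for this example, and whether a given hosted endpoint accepts this raw string (rather than separate prefix and suffix fields) is an assumption to verify.

```python
def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Arrange code before and after the cursor in prefix-suffix-middle order."""
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"


# "<CURSOR>" is a made-up marker showing where code should be inserted.
source = "def add(a, b):\n    <CURSOR>\n\nprint(add(1, 2))\n"
prefix, suffix = source.split("<CURSOR>", 1)

prompt = build_infill_prompt(prefix, suffix)
print(prompt)
```

The model is expected to generate the missing middle section, which the caller then splices back between `prefix` and `suffix`.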

Code Llama 34B Instruct

Meta's 34B Code Llama tuned for instruction following. A balance of size and quality for code generation, completion, and explanation, with strong coverage of Python, JavaScript, and other common languages. Runs on Replicate per call.

Code Llama 70B Instruct

Meta's largest Code Llama, a 70B Llama-2 derivative specialized for programming and tuned to follow instructions in chat form. Handles code generation, completion, and explanation across common languages. Served on Replicate as a per-call endpoint.
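As a sketch of what a per-call endpoint looks like from the client side, the snippet below builds (but does not send) an HTTP request against Replicate's model predictions route. The model slug `meta/codellama-70b-instruct`, the `prompt` input field, and the placeholder token are assumptions for illustration; the exact input schema should be taken from the model's page.

```python
import json
import urllib.request

API_ROOT = "https://api.replicate.com/v1"


def build_prediction_request(owner, name, model_input, token):
    """Build a POST request that would create one prediction for a hosted model."""
    url = f"{API_ROOT}/models/{owner}/{name}/predictions"
    body = json.dumps({"input": model_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Slug, input field and token below are illustrative assumptions.
request = build_prediction_request(
    "meta",
    "codellama-70b-instruct",
    {"prompt": "Write a Python function that reverses a string."},
    token="YOUR_API_TOKEN",
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Each call creates one prediction, which matches the per-call framing used in the Code Llama entries.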

Code Llama 7B Instruct

Meta's smallest Code Llama at 7B parameters, tuned for instruction following. The cheapest and fastest member of the family for quick code generation, completion, and infilling. Served on Replicate per call.

CodeFormer

Robust face-restoration model using a transformer-based codebook prior. Handles severe degradation, occlusion, and old-photo restoration with adjustable fidelity-quality tradeoff.

replicate · face-restore · upscaling

CogVideoX-5B

CogVideoX-5B from Tsinghua/Zhipu AI, an open 5B-parameter text-to-video diffusion transformer. Generates 6-second 720x480 clips at 8 fps with coherent motion and is widely used in research for its open weights and reproducibility.

replicate · cogvideox · zhipu

CogVideoX-5B (open)

Zhipu/Tsinghua's 5B open text-to-video model. 720x480 @ 8fps, 6s clips, image-to-video variant available.

zhipu · tsinghua · cogvideox

CogVLM2 19B

Tsinghua CogVLM2 19B with Llama-3 8B base plus 11B vision expert. Strong document understanding and visual reasoning, 8k context.

Consistent Character

fofr's model generates the same character in many poses and angles from one reference image. Useful for building an avatar set or character sheet where the face and design stay consistent across outputs. Can produce a grid or individual images.

avatar · portrait · character

ControlNet Canny

ControlNet conditioned on Canny edge maps. Preserves composition and outlines while restyling with Stable Diffusion 1.5 or SDXL backbones.

ControlNet Depth

ControlNet conditioned on depth maps. Preserves the 3D scene layout while letting the prompt change style, lighting and content.

DDColor

DDColor by Xiaoyang Kang et al. colorizes black-and-white photos using dual decoders that jointly learn pixel colors and semantic color queries, giving vivid and natural results on old images.

colorize · restore · ddcolor

DeepSeek Coder 33B Instruct (GGUF)

Quantized GGUF build of DeepSeek's 33B code model, trained on roughly 2T tokens that are about 87 percent code. Designed for repository-level completion and project-aware generation thanks to a 16k context window. Runs on Replicate as a per-call endpoint.

deepseek · coding · instruct

DeepSeek-VL 7B

DeepSeek-VL 7B chat model. Vision-language model with hybrid vision encoder and strong real-world visual question answering performance.

Donut Document

Naver CLOVA Donut OCR-free document-understanding transformer. End-to-end JSON extraction from forms, receipts and invoices without explicit OCR.

Dots OCR

Rednote Hilab Dots OCR. End-to-end document parsing model with layout, text and reading-order prediction in one transformer.

DreamGaussian

Generative Gaussian-splatting model for fast image-to-3D synthesis. Produces textured meshes in two minutes via differentiable rasterization.

€0.09

DynamiCrafter

replicate · animation · image-to-video

Tencent DynamiCrafter. Animates still images into short videos preserving texture and structure, with strong open-domain coverage.

€0.09

EasyOCR

JaidedAI EasyOCR. Simple Python OCR wrapper supporting 80+ languages with deep-learning text detection and recognition.

EchoMimic

replicate · lipsync · ant-group

Ant Group EchoMimic. Lifelike audio-driven portrait animation with editable landmark conditioning for fine-grained motion control.

€0.10

Ecommerce Virtual Try-On

Try-on pipeline aimed at ecommerce listings. You give it a photo containing clothing on a body pose plus a separate face image; it composes a person wearing that clothing with the supplied face, controllable by prompt, CFG, and output size. Useful for generating on-model product shots from a flat garment image.

product · virtual-try-on · vton

ESRGAN Classic

Enhanced Super-Resolution GAN, the original 2018 architecture. Produces sharp 4x upscales with strong perceptual quality on natural images.

replicate · upscaling · esrgan

F5-TTS

Open-source flow-matching TTS with strong zero-shot voice cloning. Code MIT, weights CC-BY-NC.

f5 · tts · open-weights

F5-TTS

F5-TTS, a flow-matching TTS that clones a voice from a reference clip plus its transcript and reads new text in that voice. Fast non-autoregressive synthesis with optional silence removal.

replicate · f5-tts · tts

Face to Many

fofr's face stylizer converts a face photo into 3D render, emoji, pixel art, video-game character, claymation or toy styles. Uses InstantID plus style LoRAs on SDXL to keep the likeness while applying a chosen art style. Popular for fun avatars.

avatar · portrait · stylize

Face to Sticker

fofr's model turns a face photo into a die-cut sticker with a white border and transparent background. Uses InstantID to hold the likeness and outputs a clean PNG suitable for chat stickers or print. Simple single-image input.

avatar · portrait · sticker

FILM Frame Interpolation

Video · Google Research

Google FILM frame interpolation. Synthesizes high-quality intermediate frames between near-duplicate inputs, designed for large motion gaps.

replicate · upscale · frame-interpolation

Florence-2 Large

Microsoft Florence-2 Large. Unified prompt-based vision foundation model for captioning, detection, segmentation and OCR with a single 770M-param backbone.

Florence-2 Segmentation

Microsoft Florence-2 unified vision model with referring expression segmentation. Text-prompted region and mask generation in one model.

FLUX PuLID

PuLID identity customization running on FLUX.1-dev. Inserts a face from one reference photo into prompt-driven scenes using contrastive alignment, giving higher likeness and detail than SDXL-era ID adapters. Good for realistic avatars and character portraits.

avatar · portrait · pulid

Flux Schnell

The fastest Flux model. Generate images in under 2 seconds. Great for prototyping.

€0.03 · 2.0s

fast · affordable

FLUX.1 [dev]

The open-weight 12B rectified-flow transformer from Black Forest Labs. Close to FLUX Pro quality with a guidance-distilled checkpoint released under a non-commercial license. The most widely fine-tuned base in the FLUX family.

FLUX.1 [schnell]

The fastest FLUX model from Black Forest Labs, distilled to produce images in 1 to 4 steps. Apache 2.0 licensed for commercial use. Built for high-volume generation and real-time previews.

FLUX.1 [Schnell]

Black Forest Labs' fastest open-weights image model. Apache-2.0 licensed, ~1-4 step inference.

flux · black-forest-labs · open-weights

FLUX.1 Canny

FLUX structural control via Canny edge maps. Preserve composition while restyling.

FLUX.1 Canny [dev]

Open-weight edge-guided FLUX model from Black Forest Labs. Extracts Canny edges from a control image and regenerates it from your prompt while holding the original composition and outlines, so you can restyle a scene without changing its structure.

FLUX.1 Depth

FLUX structural control via depth maps. Keep 3D scene layout while changing style/content.

FLUX.1 Depth [dev]

Open-weight depth-guided FLUX model from Black Forest Labs. Derives a depth map from the control image and regenerates from your prompt while preserving 3D spatial layout, useful for re-texturing rooms, products, or scenes without moving objects.

FLUX.1 Fill

Black Forest Labs' inpainting/outpainting model for FLUX. Fill masked regions with prompt-guided content.

FLUX.1 Fill [dev]

Black Forest Labs' open-weight inpainting and outpainting model, guidance-distilled from FLUX.1 Fill [pro]. You supply an image plus a mask and a prompt; it fills the masked region or extends the canvas with prompt-guided content that matches lighting and texture.
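For the "extend the canvas" case, the usual preparation step is to pad the image and build a matching mask that marks only the new border as fillable. The sketch below does this on plain nested lists of grayscale values so it runs without an imaging library. The convention that 255 (white) means "fill" and 0 (black) means "keep" is an assumption here; confirm it against the model's input documentation.

```python
def make_outpaint_inputs(image, pad_left, pad_right, pad_top, pad_bottom, fill=0):
    """Pad a grayscale image (list of rows) and return (canvas, mask).

    In the mask, 255 marks new border pixels to generate and 0 marks
    original pixels to preserve (assumed convention).
    """
    height, width = len(image), len(image[0])
    new_w = width + pad_left + pad_right
    new_h = height + pad_top + pad_bottom
    canvas = [[fill] * new_w for _ in range(new_h)]
    mask = [[255] * new_w for _ in range(new_h)]
    for y in range(height):
        for x in range(width):
            canvas[y + pad_top][x + pad_left] = image[y][x]
            mask[y + pad_top][x + pad_left] = 0
    return canvas, mask


# Tiny 2x2 example image, widened by one pixel on each side.
tiny = [[10, 20], [30, 40]]
canvas, mask = make_outpaint_inputs(tiny, pad_left=1, pad_right=1, pad_top=0, pad_bottom=0)
print(canvas)
print(mask)
```

For plain inpainting the canvas is the original image and the mask is drawn by hand or produced by a segmentation model; only the mask construction changes.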

FLUX.1 Kontext [dev]

Open-weight version of FLUX.1 Kontext by Black Forest Labs. Instruction-based editing: pass an input image and a plain text edit ('change the jacket to red', 'remove the person on the left') and it applies the change while keeping the rest of the scene and identity consistent.

flux · kontext · black-forest-labs

FLUX.1 Redux

FLUX image-variation adapter. Generate variations and remixes from a reference image.

FLUX.1-dev Inpainting

FLUX.1-dev inpainting wrapper that fills masked parts of an image from a prompt. Useful when you want FLUX-quality fills with a simple image plus mask plus prompt interface and adjustable mask strength.

flux · image-edit · inpainting

GFPGAN v1.4

Image · Tencent ARC

Tencent ARC face-restoration GAN. Reconstructs realistic facial detail in low-quality or compressed photos using a pretrained StyleGAN2 prior.

replicate · face-restore · upscaling

GLPN Depth

Global-Local Path Networks depth-estimation model. Combines hierarchical transformer encoder with selective feature fusion for sharp boundaries.

Google Veo 3 Fast

Faster, cheaper Veo 3 with audio

€3.20 · 59.0s

fast · audio

Google Veo 3.1 Fast

Faster Veo 3.1 with image-to-video and audio

€3.20 · 59.0s

fast · audio · i2v

GOT-OCR 2.0

StepFun GOT-OCR 2.0. Unified end-to-end OCR-2.0 model handling text, formulas, charts, sheet music and geometric shapes in one architecture.

Granite Code 20B

replicate · code-generation · ibm

IBM Granite 20B Code Instruct. Larger Granite code model balancing quality and inference cost for enterprise CI/CD code-review automation.

€0.006

Granite Code 8B

IBM Granite 8B Code Instruct. Trained on permissively-licensed code, strong on multi-language code completion and instruction-following.

replicate · code-generation · ibm

Grok Imagine Video

xAI video with native audio and lip-sync, up to 15s

€1.50 · 90.0s

audio · i2v · xai

Grounded-SAM

Grounding DINO plus SAM. Open-vocabulary text-prompted detection and segmentation in one pipeline for fully-automatic mask generation.

Hailuo 2.3

Video · Minimax

MiniMax model for realistic human motion and VFX

€0.50 · 60.0s

i2v · 1080p

Hunyuan3D 2.0

Image · Tencent

Tencent's Hunyuan3D 2.0 image-to-3D pipeline. Two-stage shape and texture generation producing high-resolution textured meshes.

€0.21

Hunyuan3D 2.1

Refreshed Hunyuan3D 2.1 with improved texture fidelity and PBR-material support. Image-to-3D with textured GLB output.

€0.24

HunyuanVideo

Video · Tencent

Tencent's 13B open-weights video diffusion transformer. SOTA among open video models at release.

tencent · hunyuan · text-to-video

IC-Light (Product Relighting)

Lvmin Zhang's IC-Light packaged by zsxkib. Relights a product or portrait from a text prompt or a chosen light direction while keeping the subject's shape and detail intact, so a flat product photo can be given studio, window, or dramatic side lighting without re-shooting.

product · ic-light · relight

Idefics3 8B

Hugging Face Idefics3 8B. Llama-3 based open-source vision-language model with strong document QA and chart-understanding performance.

€0.007

Ideogram v2

Image · Ideogram

Ideogram's text-to-image model known for accurate in-image text and typography. Handles posters, logos, and signage where other models garble lettering. Supports magic prompt expansion and multiple aspect ratios.

replicate · ideogram · text-to-image

Ideogram v3 Turbo

Image · Ideogram

Ideogram's fast v3 model, the fastest and cheapest tier of the v3 family. Known for accurate in-image text rendering and reliable typography, which most diffusion models still get wrong. Hosted API only.

replicate · ideogram · text-to-image

IDM-VTON (Virtual Try-On)

IDM-VTON virtual try-on from the ECCV 2024 paper. You give it a photo of a person and a garment image; it dresses the person in that garment while preserving pose, body shape, and the garment's pattern and text. Good for showing a clothing product on a model for an ecommerce listing.

product · virtual-try-on · vton

InstantMesh

Image-to-3D mesh generator from sparse-view diffusion. Produces textured meshes in under one minute on a single A100.

€0.12

InstructPix2Pix

Berkeley InstructPix2Pix. Edits an image from natural-language instructions in a single forward pass. Trained on GPT-3 plus Stable Diffusion synthetic pairs.

IP-Adapter FaceID Plus v2

Tencent's face-identity conditioning adapter for SD/SDXL. Face embedding + CLIP for ID-consistent generation.

tencent · image-edit · face-id

Janus Pro 7B

DeepSeek's unified multimodal model. Decouples visual encoding into separate pathways for understanding and generation while sharing a single transformer.

deepseek · janus · open-weights

Kling v1.6 Pro

Kuaishou's Kling v1.6 Pro on Replicate. Generates 5s and 10s clips in 1080p from text or an image, with cinematic motion and physics realism. The widely used pro tier of the 1.6 generation.

replicate · kuaishou · kling

Kokoro TTS 82M

Open-weights 82M-parameter TTS. Punches above its size class on naturalness benchmarks at a fraction of the inference cost of larger models.

kokoro · tts · open-weights

Kuaishou Kolors

Kuaishou's bilingual (CN/EN) latent diffusion text-to-image model with strong text rendering.

kuaishou · text-to-image · open-weights

LivePortrait

Kuaishou LivePortrait. Efficient portrait animation driven by reference videos with stitching, retargeting and motion-control parameters.

€0.08

replicate · lipsync · kuaishou

Llama 3.2 Vision 11B (Ollama)

Meta Llama 3.2 11B Vision served via Ollama on Replicate. Open-weights multimodal model for image captioning, document and chart reading, and visual question answering.

replicate · meta · llama

Llama 3.2 Vision 90B

Meta Llama 3.2 90B Vision. Largest open-weights Llama vision model. Strong visual reasoning, chart, OCR and document understanding.

LLaVA 1.6 Vicuna 13B

LLaVA 1.6 (LLaVA-NeXT) with a Vicuna-13B language backbone. Open vision-language chat model that describes images, answers questions, reads charts and reasons about scenes. Version 1.6 adds higher input resolution and better OCR and reasoning than LLaVA 1.5.

replicate · llava · captioning

LLaVA v1.6 34B

LLaVA v1.6 on a Nous-Hermes-2 34B base, served on Replicate. Open-source vision-language assistant for image question answering, description and visual reasoning at higher resolution.

replicate · llava · vision-understanding

LogoAI (SDXL Logo Generator)

SDXL fine-tune by mejiabrayan aimed at logo generation. Produces simple, centered mark and wordmark style logos from a text prompt. Useful for quick brand concepts and mockups. Raster PNG output, not vector.

replicate · logo · icon

Lotus-G

Lotus generative depth model. Treats depth as a generation task using a diffusion model, producing higher-fidelity depth on textured surfaces.

LTX-Video (Lightricks)

Video · Lightricks

Lightricks' 2B DiT video model. Realtime generation on consumer GPUs (~6s @ H100, 24fps).

lightricks · ltx · text-to-video

Luma Ray Flash 2

Video · Luma AI

Fast, affordable video with I2V support

€0.50 · 45.0s

fast · budget · i2v

Luma Ray-2 720p

Video · Luma AI

Luma Labs' Ray-2 at 720p on Replicate. Text and image-to-video producing 5s and 9s clips with fast, coherent motion and strong camera control. Successor to Dream Machine.

replicate · luma · ray-2

MagicAnimate

replicate · animation · human-motion

ByteDance MagicAnimate. Temporally consistent human-image animation driven by a DensePose motion sequence with strong identity preservation.

€0.10

Magicoder S CL 7B

Code · Community

UIUC Magicoder S CL 7B. CodeLlama-7B fine-tuned with OSS-Instruct synthetic data. Strong HumanEval Plus and MBPP Plus performance per parameter.

replicate · code-generation · open-weights

MAGNeT

Audio · Community

MAGNeT is Meta's masked, non-autoregressive audio generator. Instead of predicting tokens left to right it fills masked audio tokens in parallel over a few decoding steps, so generation is faster than autoregressive MusicGen at similar quality. This Replicate packaging exposes the text-to-music and text-to-sound variants.
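The parallel fill-in idea can be sketched with a toy loop: start fully masked, guess every masked position at once, keep the most confident guesses, and leave the rest masked for the next step. This is a simplified illustration of confidence-based masked decoding in general, not MAGNeT's actual implementation; `toy_predict` is a hypothetical stand-in that returns random tokens and confidences, and the cosine schedule is an assumption chosen for the example.

```python
import math
import random

MASK = None  # placeholder for a position that has not been decoded yet


def toy_predict(tokens, rng):
    """Hypothetical model stand-in: one (token, confidence) guess per masked slot."""
    return {
        i: (rng.randrange(1024), rng.random())
        for i, tok in enumerate(tokens)
        if tok is MASK
    }


def masked_parallel_decode(length, steps, seed=0):
    """Fill a fully masked sequence in a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    remaining_per_step = []
    for step in range(1, steps + 1):
        predictions = toy_predict(tokens, rng)  # all masked slots at once
        if step == steps:
            keep_masked = 0  # final step commits everything left
        else:
            # Cosine schedule: how many slots stay masked after this step.
            keep_masked = math.floor(length * math.cos(math.pi / 2 * step / steps))
        # Commit the most confident guesses, leave the rest masked.
        ranked = sorted(predictions, key=lambda i: predictions[i][1], reverse=True)
        for i in ranked[: len(ranked) - keep_masked]:
            tokens[i] = predictions[i][0]
        remaining_per_step.append(keep_masked)
    return tokens, remaining_per_step


decoded, remaining = masked_parallel_decode(length=16, steps=4)
print(remaining)
```

The number of model calls equals `steps` rather than the sequence length, which is where the speed advantage over left-to-right decoding comes from.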

meta · magnet · non-autoregressive

MAGNeT MusicGen

meta · music-generation · magnet

Meta MAGNeT non-autoregressive music generator. Up to 7x faster than MusicGen with comparable quality via masked generative transformers.

€0.007

Magnific-Style Upscaler

replicate · upscaling · creative

Detail-hallucinating upscaler in the Magnific style. Adds plausible high-frequency texture using a Stable Diffusion refiner conditioned on the low-res input.

€0.06

Marigold

ETH Zurich Marigold. Diffusion-based monocular depth-estimation model fine-tuned from Stable Diffusion with strong fine-detail recovery.

Marker PDF Extract

Marker PDF-to-Markdown conversion pipeline. Combines layout, OCR and equation models to produce clean Markdown with preserved tables and formulas.

Mask2Former

Meta Mask2Former universal image-segmentation transformer. Single architecture for panoptic, instance and semantic segmentation tasks.

MiDaS v3.1

Intel MiDaS v3.1 relative depth-estimation model. Robust zero-shot single-image depth across diverse domains and resolutions.

MiniCPM-V 2.6

OpenBMB MiniCPM-V 2.6. 8B vision-language model with strong single-image, multi-image and video understanding plus OCR capabilities.

Minimax Video

Video · Minimax

MiniMax's video generation model. Fast, high-quality video output with text-to-video capabilities.

€2.50 · 90.0s

fast · affordable

Mochi 1

Genmo's Mochi 1, an open text-to-video model with high-fidelity motion built on a 10B Asymmetric Diffusion Transformer. Released under Apache 2.0, it was the largest open video model at launch and is strong on smooth, physically plausible movement.

replicate · genmo · mochi

Mochi 1

Genmo's 10B open-weights text-to-video model. AsymmDiT architecture, 5.4s @ 480p.

genmo · mochi · text-to-video

Molmo 7B

Allen AI Molmo 7B-D on Replicate. Open vision-language model trained on the PixMo data, notable for pointing at and locating objects in images, not just describing them.

replicate · allenai · molmo

Moondream2

Moondream2 small vision-language model on Replicate. About 1.9B params, designed to run on edge devices, handles captioning, visual QA and short OCR-style reads at very low cost.

replicate · moondream · vision-understanding

MuseTalk

Tencent MuseTalk real-time lip-sync model. Audio-driven mouth-region editing in latent space at 30+ fps on a single GPU.

€0.06

replicate · lipsync · tencent

MusicGen Large

TTS · Meta

Meta's 3.3B-parameter MusicGen Large. Text-conditioned music generation with single-stage autoregressive transformer, supports melody conditioning.

meta · music-generation · open-weights

olmOCR

Allen AI olmOCR. Open-source 7B vision-language model fine-tuned for high-fidelity document parsing including math, code and tables.

OOTDiffusion (Try-On)

OOTDiffusion virtual try-on. Takes a clear photo of a model and an upper-body garment and renders the garment onto the person using an outfitting-fusion diffusion approach that keeps the garment's texture and the model's pose. A lightweight alternative to IDM-VTON for clothing previews.

product · virtual-try-on · vton

OpenPose

CMU OpenPose multi-person 2D pose estimator. Real-time keypoint detection for body, hand, face and foot using Part Affinity Fields.

replicate · pose · vision-understanding

OpenVoice v2

MyShell OpenVoice v2. Multilingual zero-shot voice cloning with accurate tone-color reproduction and style/emotion control.

myshell · tts · voice-cloning

PaddleOCR v3

Baidu PaddleOCR v3 PP-OCR pipeline. Lightweight detector plus recognizer optimized for production use with 80+ language support.

Parler-TTS

Hugging Face Parler-TTS Mini. Lightweight TTS conditioned on a natural-language style description for fine-grained control over voice characteristics.

parler · tts · huggingface

Phind CodeLlama 34B v2

Phind CodeLlama 34B v2. Highly tuned CodeLlama variant focused on retrieval-augmented developer assistant workflows.

replicate · code-generation · phind

PhotoMaker

Image · Tencent ARC

Tencent ARC PhotoMaker. Identity-preserving stylized photo generation from a stacked-ID embedding. Realistic re-styling of a subject in seconds.

PixVerse v5.6

Physics-accurate video generation up to 1080p

€0.50 · 60.0s

i2v · 1080p · physics

Playground v2.5 (1024px Aesthetic)

Image · Playground AI

Playground AI's diffusion model tuned for aesthetics. SDXL-based architecture trained on the EDM formulation, rated by users as more visually pleasing than SDXL in their study. Strong on vivid color and contrast.

replicate · playground · text-to-image

Point-E

OpenAI Point-E text-to-point-cloud system. Fast 3D point-cloud generation from text, optionally lifted to a mesh via marching cubes.

replicate · 3d-generation · openai

Qwen-Image-Edit

Image · Alibaba / Qwen

Alibaba Qwen's instruction-driven image editor. Extends Qwen-Image's text-rendering ability to editing, so it handles both semantic edits (swap objects, change style) and precise text edits inside the image while preserving the original layout and unedited regions.

qwen · alibaba · image-edit

Qwen2-VL 7B Instruct

Alibaba Qwen2-VL 7B served on Replicate. Open-weights vision-language model that chats about images and video, with dynamic resolution and strong OCR and document QA for its size.

replicate · qwen · alibaba

Real-ESRGAN 4x

AI upscaler that increases image resolution by up to 4x while preserving texture and detail. Trained on purely synthetic degradation data designed to mimic real-world damage, reducing common ESRGAN artifacts.

replicate · upscaling · image-restore

Real-ESRGAN Anime 4x

Real-ESRGAN variant fine-tuned for anime, manga, and illustrated artwork. 4x upscaling with cartoon-aware artifact suppression.

replicate · upscaling · anime

Recraft V3

State-of-the-art image generation optimized for design and branding. Supports SVG vector output.

€0.60 · 12.0s

design · vector · branding

Recraft V3 SVG

Recraft's V3 variant that outputs vector SVG instead of raster pixels. Generates clean, editable logos, icons and illustrations that scale without quality loss, which is unusual among image models. Hosted API only.

replicate · recraft · text-to-image

Recraft V4 SVG

Recraft V4 SVG turns a text prompt into production-ready SVG vector art with clean geometry and structured, editable layers. Newer generation than V3 with improved design quality on logos, icons and flat illustration. Returns true vector paths, not a traced bitmap.

svg · vector · recraft

Rembg

Open-source background-removal tool wrapping U2Net. Produces alpha mattes for photos, products and people with no manual masking.

replicate · background-removal · matting

Remove Background (lucataco)

Lucataco's remove-bg, a rembg-based background removal model that returns the foreground subject on a transparent background. A popular, low-cost option for quick product and portrait cutouts.

background-removal · lucataco · rembg

Remove Object (LaMa)

Object removal and cleanup using LaMa inpainting. Paint a mask over an unwanted object, logo or person and the model fills the area with plausible background, erasing it from the photo.

background-removal · object-removal · lama
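To illustrate the "paint a mask" step described for LaMa, here is a minimal sketch that builds a rectangular mask in pure Python. The helper name `rect_mask` is hypothetical, and the convention that 255 marks the region to fill and 0 marks pixels to keep is an assumption; check the hosted model's input documentation before relying on it.

```python
def rect_mask(width: int, height: int, box: tuple[int, int, int, int]) -> list[list[int]]:
    """Return a height x width grid with 255 inside `box` and 0 elsewhere.

    `box` is (left, top, right, bottom) in pixels, right/bottom exclusive.
    Assumption: 255 = area to inpaint, 0 = area to preserve.
    """
    left, top, right, bottom = box
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("box must lie inside the image and have positive area")
    return [
        [255 if left <= x < right and top <= y < bottom else 0 for x in range(width)]
        for y in range(height)
    ]


# A 4x3 image with a 2x1 region marked for removal.
mask = rect_mask(4, 3, (1, 1, 3, 2))
for row in mask:
    print(row)
```

In practice the grid would be saved as a grayscale image with the same dimensions as the photo and passed alongside it.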

Replit Code v1 3B

Replit's 3B code-completion model, trained on a permissively licensed code subset of The Stack across 20 programming languages. Built for low-latency autocomplete rather than chat. Served on Replicate per call.

replit · coding · completion

RIFE Frame Interpolation

Real-Time Intermediate Flow Estimation. Doubles or quadruples FPS of an existing video via learned optical-flow-based frame interpolation.

replicate · upscale · frame-interpolation
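As a rough illustration of what doubling or quadrupling the frame rate means for timing, the sketch below computes where the in-between frames land. It covers only the timestamp arithmetic; synthesizing the frames themselves from optical flow is the part RIFE provides. The function name `interpolated_timestamps` is hypothetical.

```python
from fractions import Fraction


def interpolated_timestamps(frame_count: int, fps: int, factor: int) -> list[Fraction]:
    """Timestamps in seconds of every frame after interpolation.

    `factor` is the FPS multiplier (2 doubles, 4 quadruples), so
    factor - 1 new frames are inserted between each pair of source frames.
    """
    if frame_count < 1 or fps < 1 or factor < 1:
        raise ValueError("frame_count, fps and factor must be positive")
    out_fps = fps * factor
    # N source frames have N - 1 gaps; the last frame gets no successor.
    out_count = (frame_count - 1) * factor + 1
    return [Fraction(i, out_fps) for i in range(out_count)]


# Three frames at 30 fps, doubled to 60 fps.
stamps = interpolated_timestamps(3, 30, 2)
print([str(t) for t in stamps])
```

Exact fractions are used instead of floats so that timestamps stay drift-free over long clips.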

Riffusion

TTS · Riffusion

Stable-Diffusion-based real-time music generator. Operates on spectrogram images and then resynthesizes audio, which enables seamless transitions and looping.

riffusion · music-generation · open-weights

RVC Voice Conversion

rvc · voice-conversion · voice-cloning

Retrieval-based Voice Conversion. Converts a source recording into a target speaker's voice, preserving pitch, prosody and rhythm.

€0.006

SadTalker

replicate · lipsync · talking-head

Stylized audio-driven talking-head generator. Synthesizes 3D motion coefficients from audio to animate a single portrait image with natural head movements.

€0.07

SDXL Emoji

SDXL fine-tune by fofr trained on Apple emoji art. Generates rounded, glossy emoji and icon style graphics from a text prompt, useful for custom reactions, app glyphs and playful icon sets. Raster output.

replicate · emoji · icon

SDXL Inpainting

SDXL inpainting built on the Hugging Face Diffusers inpaint pipeline. Replace or remove masked regions of an image with prompt-conditioned content at SDXL resolution. A cheap, well-understood baseline for object removal and local edits.

sdxl · stability-ai · image-edit

SeamlessM4T

Meta's SeamlessM4T multimodal translation model. Takes speech or text input and produces transcription or translation across about 100 languages, including speech-to-text and speech-to-speech. One model covers ASR plus cross-lingual translation without chaining separate systems.

replicate · meta · seamless

SeamlessM4T v2 Large (Speech)

Meta SeamlessM4T v2 Large speech mode. Speech-to-speech, speech-to-text, and text-to-speech translation across 100+ languages in a single unified model.

replicate · translation · meta

SeamlessM4T v2 Large (Text)

Text & Chat · Community

Meta SeamlessM4T v2 Large. Universal multilingual translation across 100+ languages with text-to-text mode for documents and chat.

€0.006

replicate · translation · meta

Seedance Lite

Video · ByteDance

Budget ByteDance video, fast and cheap

€0.50 · 70.0s

budget · i2v · fast

Seedance Pro

Video · ByteDance

ByteDance video with T2V and I2V, up to 1080p

€1.00 · 95.0s

i2v · 1080p

SegFormer B5

NVIDIA SegFormer-B5 semantic segmentation. Hierarchical transformer encoder with a lightweight MLP decoder, with strong ADE20K and Cityscapes results.

€0.007

Shap-E (OpenAI)

OpenAI Shap-E text/image to 3D. Generates implicit neural representations renderable as textured meshes or NeRFs.

replicate · 3d-generation · openai

Spark TTS

Spark-TTS, an efficient TTS model with disentangled control over speaker, content and style. Strong cross-lingual zero-shot performance.

spark · tts · voice-cloning

Stable Audio Open 1.0

Audio · Replicate

Stability AI's Stable Audio Open generates short audio from text prompts, tuned for sound effects, drum loops, instrument riffs and production elements rather than full songs. Open weights, latent diffusion over a 44.1kHz audio autoencoder, with a configurable seconds_total up to about 47 seconds.

stability-ai · stable-audio · sound-effects
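The `seconds_total` limit and the 44.1kHz sample rate mentioned for Stable Audio Open can be checked before a request is sent. The sketch below is illustrative only: the helper names are hypothetical, the `prompt` key is an assumed input name, and "about 47 seconds" is treated as a hard cap of 47 for simplicity.

```python
SAMPLE_RATE_HZ = 44_100  # 44.1kHz, as stated in the description
MAX_SECONDS = 47         # assumption: "about 47 seconds" taken as a hard cap


def build_audio_request(prompt: str, seconds_total: float) -> dict:
    """Validate inputs and return a request payload (assumed field names)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 0 < seconds_total <= MAX_SECONDS:
        raise ValueError(f"seconds_total must be in (0, {MAX_SECONDS}]")
    return {"prompt": prompt, "seconds_total": seconds_total}


def expected_samples(seconds_total: float) -> int:
    """Number of samples per channel for a clip of the given length."""
    return round(seconds_total * SAMPLE_RATE_HZ)


payload = build_audio_request("tight 120 bpm drum loop", 10)
print(payload, expected_samples(payload["seconds_total"]))
```

Validating locally avoids paying for a call that the model would reject or silently truncate.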

Stable Diffusion 3.5 Large

Stability AI's 8B MMDiT-based flagship. Open weights at 1MP with improved typography and prompt adherence over SDXL. The largest model in the SD 3.5 release line.

replicate · stability-ai · stable-diffusion

Stable Diffusion 3.5 Large Turbo

Distilled, 4-step version of SD 3.5 Large from Stability AI. Keeps most of the large model's quality and text rendering at a fraction of the inference time. Open weights under the Stability Community License.

replicate · stability-ai · stable-diffusion

Stable Diffusion XL

Stability AI's SDXL model via Replicate. High-quality image generation with extensive customization.

€0.20 · 8.0s

open-source · customizable

StarCoder2 15B

Code · Community

BigCode StarCoder2 15B code-generation flagship. Trained on 4T tokens of Stack v2 data with grouped-query attention and 16k context.

replicate · code-generation · bigcode

StarVector 8B (image-to-SVG)

StarVector 8B is a multimodal model that generates SVG code directly from an input image. Rather than tracing pixels, it predicts the SVG markup token by token, which can produce compact, semantically structured paths for icons and simple graphics. Research model from the StarVector project.

svg · vector · starvector

StreamingT2V

replicate · animation · long-form

Picsart StreamingT2V. Generates long, consistent videos by chaining short autoregressive clips with motion and appearance memory.

€0.15

StyleTTS 2

Style-based TTS using diffusion and adversarial training. Human-level naturalness in zero-shot voice synthesis from a 3-5s reference clip.

styletts · tts · voice-cloning

Suno Bark

TTS · Suno

Suno's text-prompted generative audio model. Speech, music, ambient sound and effects with non-verbal cues like laughter or sighs.

suno · bark · music-generation

SUPIR

SUPIR by Fanghua Yu et al. is a large diffusion-based restoration model that recovers photorealistic detail from heavily degraded images and can be steered with a text prompt describing the scene.

restore · super-resolution · supir

SUPIR Upscaler

replicate · upscaling · image-restore

SUPIR (Scaling-Up Image Restoration) photo-real restoration model. Combines SDXL prior with language-guided controls for severely degraded inputs.

€0.06

Swin2SR

Transformer-based image super-resolution using Swin-V2 attention. Handles classical, lightweight, real-world, and compressed-input variants with 2x/4x upscaling.

replicate · upscaling · transformer

SwinIR Video

SwinIR transformer-based super-resolution and denoising applied per-frame to video. Handles classic, real-world and lightweight upscaling.

replicate · upscale · transformer

ToonCrafter

replicate · animation · tooncrafter

Tencent ToonCrafter generative cartoon interpolation model. Synthesizes smooth in-between frames between two cartoon keyframes.

€0.08

Tortoise TTS

Multi-voice expressive TTS. Slow but high-quality with strong prosody and natural intonation. Trained for long-form narration use cases.

tortoise · tts · expressive

TRELLIS (3D)

Microsoft TRELLIS image-to-3D model. Generates textured 3D assets in GLB or Gaussian-splat format from a single reference image.

€0.18

TripoSR

Stability AI and Tripo single-image 3D reconstruction model. Generates 3D meshes from a single image in roughly half a second.

Udio V1.5

Audio · Replicate

AI music generation with studio-quality output. Generate full songs with vocals, instruments, and production.

€2.00 · 60.0s

music · vocals · high-quality

V-Express

Tencent V-Express. Audio-driven portrait animation with progressive training, weak-condition learning, and expressive lip sync.

€0.09

replicate · lipsync · tencent

Vectorizer (VTracer)

PNG/JPG to SVG vectorizer built on VTracer, the open-source raster-to-vector engine. Traces a bitmap into layered color regions and clean paths with controls for color count, area threshold and path simplification. Fast, deterministic alternative to model-based vectorizers.

svg · vector · vtracer

VideoCrafter

replicate · upscale · video-generation

Tencent VideoCrafter latent video diffusion. Text-to-video and image-to-video generation up to 2s at 1024x576 with strong motion fidelity.

€0.07

Wan 2.1 (Alibaba)

Alibaba's Wan 2.1 open-weights video diffusion model. 14B diffusion-transformer model that supports T2V and I2V.

alibaba · wan · text-to-video

Wan 2.1 I2V 720p

Image-to-video variant of Alibaba's Wan 2.1 14B at 720p, accelerated by WaveSpeedAI. Animates a still input image into a short clip driven by a text prompt, keeping the source composition while adding motion.

replicate · wan · alibaba

Wan 2.1 T2V 720p (Accelerated)

Accelerated inference for Alibaba's Wan 2.1 14B text-to-video at 720p, hosted by WaveSpeedAI on Replicate. Part of the open Wan 2.1 suite of video foundation models, served here for high-resolution output with faster generation.

replicate · alibaba · wan

Wan 2.2 Image-to-Video

Ultra-cheap I2V. Upload an image and animate it.

€0.10 · 30.0s

budget · i2v · fast

Wan 2.2 Text-to-Video

Ultra-cheap T2V. Generate video from a text prompt for pennies.

€0.10 · 30.0s

budget · fast

Wav2Lip

Lip-sync model that re-syncs the lip movement in a target video to an arbitrary audio track. Works across speaker identities and languages, and is trained against a lip-sync discriminator.

replicate · lipsync · video-edit

Whisper Diarization

Whisper Large v3 Turbo combined with pyannote 4.0 for speaker diarization. Built by Thomas Mol. Returns JSON with timestamped, speaker-labeled segments (who said what), handy for meeting notes, interviews, and podcasts.

replicate · whisper · stt
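Speaker-labeled segments like these are often merged into longer turns before they go into meeting notes. A minimal sketch follows, assuming each segment is a dict with `speaker`, `start`, `end` and `text` keys; that shape and the helper name `merge_speaker_turns` are illustrative assumptions, not the model's documented output schema.

```python
def merge_speaker_turns(segments: list[dict]) -> list[dict]:
    """Join consecutive segments spoken by the same speaker into one turn."""
    turns: list[dict] = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker continues: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append(
                {"speaker": seg["speaker"], "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns


# Hypothetical sample in the assumed shape.
sample = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.8, "text": "Morning, everyone."},
    {"speaker": "SPEAKER_00", "start": 1.9, "end": 3.2, "text": "Let's get started."},
    {"speaker": "SPEAKER_01", "start": 3.5, "end": 4.1, "text": "Sounds good."},
]
turns = merge_speaker_turns(sample)
for turn in turns:
    print(f'[{turn["start"]:.1f}-{turn["end"]:.1f}] {turn["speaker"]}: {turn["text"]}')
```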

WhisperX

STT · Replicate

WhisperX (Large v3) with forced alignment for accurate word-level timestamps plus optional speaker diarization. Uses VAD to cut long files into segments and a wav2vec2 aligner to pin each word to its exact time. Useful for subtitles and per-speaker transcripts.

replicate · whisperx · stt
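For the subtitle use case, word-level timestamps can be grouped into SRT cues. The sketch below assumes a flat list of dicts with `word`, `start` and `end` keys (times in seconds); that shape and the helper names are assumptions for illustration, so adapt them to the JSON the hosted model actually returns.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group aligned words into cues of at most `max_words` words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i : i + max_words]
        text = " ".join(w["word"].strip() for w in chunk)
        start, end = srt_time(chunk[0]["start"]), srt_time(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line, as SRT expects.
    return "\n".join(cues)


# Hypothetical sample in the assumed shape.
words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 1.25},
]
print(words_to_srt(words))
```

A fixed word count per cue is the simplest grouping rule; splitting on pauses or punctuation usually reads better but needs more logic.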

WizardCoder 33B

Code · Community

WizardLM WizardCoder 33B v1.1. Evol-Instruct fine-tune of DeepSeek-Coder-33B with strong code-generation benchmark performance.

replicate · code-generation · wizardlm

XTTS v2

Coqui's XTTS v2 multilingual TTS with voice cloning from 6 seconds of reference audio. Supports 17 languages and emotion transfer.

coqui · tts · voice-cloning

Yi-VL 34B

Multimodal · 01.AI

01.AI Yi-VL 34B vision-language model. Bilingual (CN/EN) image understanding, strong CMMMU and MMMU performance among open-weights VLMs.

ZoeDepth

Intel ZoeDepth metric depth-estimation model. Combines relative-depth pretraining with metric fine-tuning for absolute distance in real units.