Replicate
Replicate is an open-model hosting platform that serves thousands of open-source models including Flux, Stable Diffusion, Llama, and Whisper variants via a unified API.
166 models from Replicate on Railwail
Access every Replicate model through Railwail's OpenAI-compatible API.
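As a rough illustration of what calling a hosted model through an OpenAI-compatible API could look like, the sketch below builds (but does not send) an image-generation request in the OpenAI style. The base URL, the `/images/generations` path, and the model ID `flux-schnell` are assumptions made for illustration and are not taken from this page; check the Railwail documentation for the real values.

```python
import json
import urllib.request

# Hypothetical values: neither the base URL nor the model ID comes from this page.
BASE_URL = "https://api.railwail.example/v1"
MODEL_ID = "flux-schnell"


def build_image_request(prompt: str, api_key: str, model: str = MODEL_ID,
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Build, without sending, an OpenAI-style image-generation request."""
    payload = {"model": model, "prompt": prompt, "n": 1}
    return urllib.request.Request(
        url=f"{base_url}/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_image_request("a lighthouse at dusk", api_key="YOUR_API_KEY")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending it would be urllib.request.urlopen(req), which needs a real
    # endpoint and a valid key.
```

Keeping request construction separate from sending makes it easy to inspect the payload first, and under this assumed scheme switching models only changes the `model` field.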
166 models available
Depth Anything v2
Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.
Flux 1.1 Pro Ultra
FLUX 1.1 Pro in ultra mode. Up to 4 megapixel images with raw mode for photorealism.
Flux Dev
Black Forest Labs' development model. Fast, high-quality image generation with LoRA support.
Google Veo 2
Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.
Google Veo 3.1
Latest Veo with image-to-video and context-aware audio
Kling v3
Cinematic video up to 15s with multi-shot and native audio
Kling v3 Omni
Most versatile: multi-reference images, video editing, native audio
Midjourney V7
The latest Midjourney model. Industry-leading aesthetic quality and prompt adherence for image generation.
MusicGen
Meta's music generation model. Generate up to 1 minute of music from text descriptions.
Runway Gen 4.5
Top-ranked for motion quality and visual fidelity
SAM 2 (Segment Anything 2)
Meta Segment Anything 2. Promptable segmentation across images and video with temporal memory. Zero-shot, point/box/mask prompts, fast on a single H100.
AnimateDiff
Plug-and-play motion module that animates personalized Stable Diffusion models without further training. 16-frame clips at 512x512.
AnimateDiff Evolved
Community fork of AnimateDiff with improved motion modules, beta scheduler control and ControlNet integration for richer animation control.
AnimateDiff Lightning
ByteDance distillation of AnimateDiff. 4-step sampling for over 10x faster inference, with quality comparable to the multi-step base model.
AudioCraft
Meta's AudioCraft framework wrapping MusicGen, AudioGen and EnCodec. Unified text-to-audio research toolkit for music and sound effects.
AudioLDM 2
Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.
AuraFlow v0.3
fal.ai's fully open-source 6.8B flow-based text-to-image model. Up to 1536x1536 resolution.
Bark
Suno's text-to-audio model. Generates realistic speech, music, and sound effects.
BRIA RMBG-1.4
BRIA's first commercial-safe background-removal model. Trained on fully-licensed data, suitable for production e-commerce and design pipelines.
BRIA RMBG-2.0
BRIA's professional background-removal model trained on fully-licensed data. Commercial-safe.
CCSR (Content-Consistent SR)
Content-Consistent Super-Resolution model. Reduces hallucination compared to typical diffusion-based upscalers while keeping perceptual quality high.
Champ Human Animation
Champ controllable human image animation. Uses 3D parametric guidance (SMPL) for realistic full-body motion transfer from a single reference image.
Clarity Upscaler
High-resolution image upscaler with creative detail re-imagination via SD-based hallucination. Strong for photography and product shots.
CodeFormer
Robust face-restoration model using a transformer-based codebook prior. Handles severe degradation, occlusion, and old-photo restoration with adjustable fidelity-quality tradeoff.
CogVideoX-5B (open)
Zhipu/Tsinghua's 5B open text-to-video model. 720x480 @ 8fps, 6s clips, image-to-video variant available.
CogVLM2 19B
Tsinghua CogVLM2 19B with Llama-3 8B base plus 11B vision expert. Strong document understanding and visual reasoning, 8k context.
ControlNet Canny
ControlNet conditioned on Canny edge maps. Preserves composition and outlines while restyling with Stable Diffusion 1.5 or SDXL backbones.
ControlNet Depth
ControlNet conditioned on depth maps. Preserves the 3D scene layout while letting the prompt change style, lighting and content.
DeepSeek-VL 7B
DeepSeek-VL 7B chat model. Vision-language model with hybrid vision encoder and strong real-world visual question answering performance.
Detectron2
Meta Detectron2 object-detection and segmentation toolkit. Mask R-CNN, Cascade R-CNN, Panoptic FPN and many other model variants in one wrapper.
DINOv2
Meta DINOv2 self-supervised vision backbone. Pretrained features for classification, segmentation and depth without task-specific fine-tuning.
Donut Document
Naver CLOVA Donut OCR-free document-understanding transformer. End-to-end JSON extraction from forms, receipts and invoices without explicit OCR.
Dots OCR
Rednote Hilab Dots OCR. End-to-end document parsing model with layout, text and reading-order prediction in one transformer.
DreamGaussian
Generative Gaussian-splatting model for fast image-to-3D synthesis. Produces textured meshes in two minutes via differentiable rasterization.
DreamGaussian 4D
4D Gaussian-splatting generator extending DreamGaussian to video. Image-conditioned dynamic 3D scenes with view-consistent motion.
DWPose
DWPose whole-body 2D pose estimator. Two-stage knowledge-distilled model with strong accuracy on face, hands and body keypoints simultaneously.
DynamiCrafter
Tencent DynamiCrafter. Animates still images into short videos preserving texture and structure, with strong open-domain coverage.
EasyOCR
JaidedAI EasyOCR. Simple Python OCR wrapper supporting 80+ languages with deep-learning text detection and recognition.
EchoMimic
Ant Group EchoMimic. Lifelike audio-driven portrait animation with editable landmark conditioning for fine-grained motion control.
ESRGAN Classic
Enhanced Super-Resolution GAN, the original 2018 architecture. Produces sharp 4x upscales with strong perceptual quality on natural images.
F5-TTS
Open-source flow-matching TTS with strong zero-shot voice cloning. Code MIT, weights CC-BY-NC.
FILM Frame Interpolation
Google FILM frame interpolation. Synthesizes high-quality intermediate frames between near-duplicate inputs, designed for large motion gaps.
Florence-2 Large
Microsoft Florence-2 Large. Unified prompt-based vision foundation model for captioning, detection, segmentation and OCR with a single 770M-param backbone.
Florence-2 Segmentation
Microsoft Florence-2 unified vision model with referring expression segmentation. Text-prompted region and mask generation in one model.
Flux Schnell
The fastest Flux model. Generate images in under 2 seconds. Great for prototyping.
FLUX.1 [Schnell]
Black Forest Labs' fastest open-weights image model. Apache-2.0 licensed, ~1-4 step inference.
FLUX.1 Canny
FLUX structural control via Canny edge maps. Preserve composition while restyling.
FLUX.1 Depth
FLUX structural control via depth maps. Keep 3D scene layout while changing style/content.
FLUX.1 Fill
Black Forest Labs' inpainting/outpainting model for FLUX. Fill masked regions with prompt-guided content.
FLUX.1 Redux
FLUX image-variation adapter. Generate variations and remixes from a reference image.
GFPGAN v1.4
Tencent ARC face-restoration GAN. Reconstructs realistic facial detail in low-quality or compressed photos using a pretrained StyleGAN2 prior.
GLPN Depth
Global-Local Path Networks depth-estimation model. Combines hierarchical transformer encoder with selective feature fusion for sharp boundaries.
Google Veo 3 Fast
Faster, cheaper Veo 3 with audio
Google Veo 3.1 Fast
Faster Veo 3.1 with image-to-video and audio
GOT-OCR 2.0
StepFun GOT-OCR 2.0. Unified end-to-end OCR-2.0 model handling text, formulas, charts, sheet music and geometric shapes in one architecture.
Granite Code 20B
IBM Granite 20B Code Instruct. Larger Granite code model balancing quality and inference cost for enterprise CI/CD code-review automation.
Granite Code 34B
IBM Granite 34B Code Instruct. Largest Granite code-instruction model. Top-tier among Apache-2.0 code LLMs on HumanEval, MBPP and MultiPL-E.
Granite Code 3B
IBM Granite 3B Code Instruct. Apache-2.0 small code-instruction model. Strong on Python, Java, JavaScript and Go for enterprise IDE integrations.
Granite Code 8B
IBM Granite 8B Code Instruct. Trained on permissively-licensed code, strong on multi-language code completion and instruction-following.
Grok Imagine Video
xAI video with native audio and lip-sync, up to 15s
Grounded-SAM
Grounding DINO plus SAM. Open-vocabulary text-prompted detection and segmentation in one pipeline for fully-automatic mask generation.
Hailuo 2.3
MiniMax model for realistic human motion and VFX
HRNet Pose
Microsoft HRNet high-resolution pose-estimation backbone. Parallel multi-resolution streams yield strong accuracy on COCO keypoint benchmarks.
Hunyuan3D 2.0
Tencent's Hunyuan3D 2.0 image-to-3D pipeline. Two-stage shape and texture generation producing high-resolution textured meshes.
Hunyuan3D 2.1
Refreshed Hunyuan3D 2.1 with improved texture fidelity and PBR-material support. Image-to-3D with textured GLB output.
HunyuanVideo
Tencent's 13B open-weights video diffusion transformer. SOTA among open video models at release.
HunyuanVideo
Tencent's open-source video generation model. Strong visual quality with diverse style support.
Idefics3 8B
Hugging Face Idefics3 8B. Llama-3 based open-source vision-language model with strong document QA and chart-understanding performance.
InstantMesh
Image-to-3D mesh generator from sparse-view diffusion. Produces textured meshes in under one minute on a single A100.
InstructPix2Pix
Berkeley InstructPix2Pix. Edits an image from natural-language instructions in a single forward pass. Trained on GPT-3 plus Stable Diffusion synthetic pairs.
InternVL 2.5
OpenGVLab InternVL 2.5 78B. Open-source vision-language model approaching GPT-4o on MMMU, OCRBench and Math-Vista benchmarks.
IP-Adapter FaceID Plus v2
Tencent's face-identity conditioning adapter for SD/SDXL. Face embedding + CLIP for ID-consistent generation.
Janus Pro 7B
DeepSeek's unified multimodal model. Decouples vision encoding for both understanding and generation tasks.
Kokoro TTS 82M
Open-weights 82M-parameter TTS. Punches above its size class on naturalness benchmarks at a fraction of the inference cost of larger models.
Kuaishou Kolors
Kuaishou's bilingual (CN/EN) latent diffusion text-to-image model with strong text rendering.
LayoutLMv3
Microsoft LayoutLMv3 multimodal document model. Unified text/image masking pretraining for form understanding, receipts and document QA.
LivePortrait
Kuaishou LivePortrait. Efficient portrait animation driven by reference videos with stitching, retargeting and motion-control parameters.
Llama 3.2 Vision 90B
Meta Llama 3.2 90B Vision. Largest open-weights Llama vision model. Strong visual reasoning, chart, OCR and document understanding.
LLaVA-OneVision 72B
LMMs-Lab LLaVA-OneVision 72B. Unified single-image, multi-image and video instruction-tuned VLM with task-transfer across modalities.
Lotus-G
Lotus generative depth model. Treats depth as a generation task using a diffusion model, producing higher-fidelity depth on textured surfaces.
LTX-Video (Lightricks)
Lightricks' 2B DiT video model. Realtime generation on consumer GPUs (~6s @ H100, 24fps).
Luma Ray Flash 2
Fast, affordable video with I2V support
M2M-100 12B
Meta M2M-100 12B many-to-many translation model. Direct translation between 100 languages without pivoting through English.
MADLAD-400 3B
Google MADLAD-400 3B multilingual translation model. 419 languages supported, trained on a 5T-token multilingual corpus with strong low-resource performance.
MagicAnimate
ByteDance MagicAnimate. Temporally consistent human-image animation driven by a DensePose motion sequence with strong identity preservation.
Magicoder S CL 7B
UIUC Magicoder S CL 7B. CodeLlama-7B fine-tuned with OSS-Instruct synthetic data. Strong HumanEval Plus and MBPP Plus performance per parameter.
MAGNeT MusicGen
Meta MAGNeT non-autoregressive music generator. Up to 7x faster than MusicGen with comparable quality via masked generative transformers.
Magnific-Style Upscaler
Detail-hallucinating upscaler in the Magnific style. Adds plausible high-frequency texture using a Stable Diffusion refiner conditioned on the low-res input.
Marigold
ETH Zurich Marigold. Diffusion-based monocular depth-estimation model fine-tuned from Stable Diffusion with strong fine-detail recovery.
Marker PDF Extract
Marker PDF-to-Markdown conversion pipeline. Combines layout, OCR and equation models to produce clean Markdown with preserved tables and formulas.
Mask2Former
Meta Mask2Former universal image-segmentation transformer. Single architecture for panoptic, instance and semantic segmentation tasks.
mBART 50 Many-to-Many
Meta mBART-50 many-to-many translation model. 50 supported languages with strong performance on news and conversational text.
MediaPipe Pose
Google MediaPipe Pose. Lightweight on-device-friendly 33-keypoint 3D pose estimator with optional segmentation mask output.
MiDaS v3.1
Intel MiDaS v3.1 relative depth-estimation model. Robust zero-shot single-image depth across diverse domains and resolutions.
MiniCPM-V 2.6
OpenBMB MiniCPM-V 2.6. 8B vision-language model with strong single-image, multi-image and video understanding plus OCR capabilities.
Minimax Video
MiniMax's video generation model. Fast, high-quality video output with text-to-video capabilities.
MMPose
OpenMMLab MMPose toolbox. Wraps RTMPose, HRNet, HigherHRNet and many other pose models behind a unified inference API.
Mochi 1
Genmo's 10B open-weights text-to-video model. AsymmDiT architecture, 5.4s @ 480p.
MOFA-Video
Motion-Field-Adapter video generator. Controllable image animation from trajectories, keypoints or audio with a strong identity preservation prior.
MuseTalk
Tencent MuseTalk real-time lip-sync model. Audio-driven mouth-region editing in latent space at 30+ fps on a single GPU.
MusicGen Large
Meta's 3.3B-parameter MusicGen Large. Text-conditioned music generation with single-stage autoregressive transformer, supports melody conditioning.
MusicGen Medium
Meta MusicGen Medium (1.5B params). Strong quality-to-speed tradeoff for text-to-music with optional melody guidance.
MusicGen Small
Meta MusicGen Small (300M params). Fast text-to-music generation suitable for prototyping and low-latency demos.
NLLB-200 3B
Meta's No Language Left Behind 3.3B translation model. Direct translation between any pair of 200+ languages including many low-resource African and Asian languages.
NLLB-200 Distilled 600M
Meta's distilled 600M NLLB. Same 200-language coverage as the 3B model with a fraction of the parameters, ideal for edge or high-throughput deployment.
olmOCR
Allen AI olmOCR. Open-source 7B vision-language model fine-tuned for high-fidelity document parsing including math, code and tables.
OpenPose
CMU OpenPose multi-person 2D pose estimator. Real-time keypoint detection for body, hand, face and foot using Part Affinity Fields.
OpenVoice v1
MyShell OpenVoice v1. Cross-lingual voice cloning with flexible style control: emotion, accent, rhythm, pauses, and intonation.
OpenVoice v2
MyShell OpenVoice v2. Multilingual zero-shot voice cloning with accurate tone-color reproduction and style/emotion control.
PaddleOCR v3
Baidu PaddleOCR v3 PP-OCR pipeline. Lightweight detector plus recognizer optimized for production use with 80+ language support.
Parler-TTS
Hugging Face Parler-TTS Mini. Lightweight TTS conditioned on a natural-language style description for fine-grained control over voice characteristics.
Parler-TTS Large
Parler-TTS Large v1. 2.2B parameters, natural-language style prompting and improved prosody over the Mini variant.
Phi-3.5 Vision
Microsoft Phi-3.5 Vision Instruct. Small (4.2B) multimodal model with strong document, OCR and multi-image reasoning at low cost.
Phind CodeLlama 34B v2
Phind CodeLlama 34B v2. Highly tuned CodeLlama variant focused on retrieval-augmented developer assistant workflows.
PhotoMaker
Tencent ARC PhotoMaker. Identity-preserving stylized photo generation from a stacked-ID embedding. Realistic re-styling of a subject in seconds.
PixVerse v5.6
Physics-accurate video generation up to 1080p
Point-E
OpenAI Point-E text-to-point-cloud system. Fast 3D point-cloud generation from text, optionally lifted to a mesh via marching cubes.
Real-CUGAN
Real-CUGAN anime-focused upscaler. 2x/3x/4x super-resolution tuned for animation, line-art, and illustrated content.
Real-ESRGAN 4x
AI upscaler that increases image resolution up to 4x while preserving texture and detail. Trained on synthetic and real data to reduce common ESRGAN artifacts.
Real-ESRGAN Anime 4x
Real-ESRGAN variant fine-tuned for anime, manga, and illustrated artwork. 4x upscaling with cartoon-aware artifact suppression.
Recraft V3
State-of-the-art image generation optimized for design and branding. SVG vector output support.
Rembg
Open-source background-removal tool wrapping U2Net. Produces alpha mattes for photos, products and people with no manual masking.
RIFE Frame Interpolation
Real-Time Intermediate Flow Estimation. Doubles or quadruples FPS of an existing video via learned optical-flow-based frame interpolation.
Riffusion
Stable-Diffusion-based real-time music generator. Operates on spectrogram images and then resynthesizes audio, which enables seamless transitions and looping.
RVC Voice Conversion
Retrieval-based Voice Conversion. Converts a source recording into a target speaker's voice, preserving pitch, prosody and rhythm.
SadTalker
Stylized audio-driven talking-head generator. Synthesizes 3D motion coefficients from audio to animate a single portrait image with natural head movements.
SAM HQ
ETH Zurich SAM-HQ. High-quality mask refinement on top of SAM. Sharper edges and finer structure than the original Segment Anything model.
SeamlessM4T v2 Large (Speech)
Meta SeamlessM4T v2 Large speech mode. Speech-to-speech, speech-to-text, and text-to-speech translation across 100+ languages in a single unified model.
SeamlessM4T v2 Large (Text)
Meta SeamlessM4T v2 Large. Universal multilingual translation across 100+ languages with text-to-text mode for documents and chat.
Seedance Lite
Budget ByteDance video, fast and cheap
Seedance Pro
ByteDance video with T2V and I2V, up to 1080p
Segformer B5
NVIDIA SegFormer-B5 semantic segmentation. Hierarchical transformer encoder with lightweight MLP decoder, strong ADE20K and Cityscapes results.
Shap-E (OpenAI)
OpenAI Shap-E text/image to 3D. Generates implicit neural representations renderable as textured meshes or NeRFs.
Spark TTS
Spark efficient TTS with disentangled control over speaker, content and style. Strong cross-lingual zero-shot performance.
Stable Diffusion XL
Stability AI's SDXL model via Replicate. High-quality image generation with extensive customization.
StarCoder2 15B
BigCode StarCoder2 15B code-generation flagship. Trained on 4T tokens of Stack v2 data with grouped-query attention and 16k context.
StarCoder2 3B
BigCode StarCoder2 3B code-generation model. Trained on 17 programming languages from The Stack v2. Weights released under the BigCode OpenRAIL-M license.
StarCoder2 7B
BigCode StarCoder2 7B code-generation model. 16k context, trained on 17 programming languages from The Stack v2, strong fill-in-the-middle (FIM) performance.
StreamingT2V
Picsart StreamingT2V. Generates long, consistent videos by chaining short autoregressive clips with motion and appearance memory.
StyleTTS 2
Style-based TTS using diffusion and adversarial training. Human-level naturalness in zero-shot voice synthesis from a 3-5s reference clip.
Suno Bark
Suno's text-prompted generative audio model. Speech, music, ambient sound and effects with non-verbal cues like laughter or sighs.
SUPIR Upscaler
SUPIR (Scaling-Up Image Restoration) photo-real restoration model. Combines SDXL prior with language-guided controls for severely degraded inputs.
Swin2SR
Transformer-based image super-resolution using Swin-V2 attention. Handles classical, lightweight, real-world, and compressed-input variants with 2x/4x upscaling.
SwinIR Video
SwinIR transformer-based super-resolution and denoising applied per-frame to video. Handles classic, real-world and lightweight upscaling.
T2I-Adapter Color
Tencent T2I-Adapter color-guided generation for SDXL. Lightweight adapter that conditions image generation on a color reference image.
ToonCrafter
Tencent ToonCrafter generative cartoon interpolation model. Synthesizes smooth in-between frames between two cartoon keyframes.
Tortoise TTS
Multi-voice expressive TTS. Slow but high-quality with strong prosody and natural intonation. Trained for long-form narration use cases.
TowerInstruct 13B
Unbabel TowerInstruct 13B. Llama-2-based multilingual translation and post-editing model. Strong terminology consistency for enterprise localization.
Transparent Background
PyTorch background-removal tool supporting multiple modes: base, fast and high-quality. Produces RGBA outputs and is suitable for batch processing.
TRELLIS (3D)
Microsoft TRELLIS image-to-3D model. Generates textured 3D assets in GLB or Gaussian-splat format from a single reference image.
TripoSR
Stability AI and Tripo single-image 3D reconstruction model. Generates 3D meshes from a single image in roughly half a second.
TrOCR Large
Microsoft TrOCR large transformer-based OCR. End-to-end visual encoder plus text decoder, trained on synthetic and printed real-world data.
U2Net Saliency
Salient-object detection network used for background removal and matting. Nested U-Net architecture trained on DUTS-TR for general scenes.
Udio V1.5
AI music generation with studio-quality output. Generate full songs with vocals, instruments, and production.
V-Express
Tencent V-Express. Audio-driven portrait animation with progressive training, weak-condition learning, and expressive lip sync.
VideoCrafter
Tencent VideoCrafter latent video diffusion. Text-to-video and image-to-video generation up to 2s at 1024x576 with strong motion fidelity.
ViTPose
ViTPose plain-vision-transformer pose estimator. State-of-the-art keypoint accuracy on MS-COCO with a minimal architecture.
Wan 2.1 (Alibaba)
Alibaba's Wan 2.1 open-weights video diffusion model. 14B diffusion transformer, supports T2V and I2V.
Wan 2.2 Image-to-Video
Ultra-cheap I2V. Upload an image and animate it.
Wan 2.2 Text-to-Video
Ultra-cheap T2V for pennies
Wav2Lip
Lip-sync model that re-syncs a target video's lip movement to an arbitrary audio track. Robust to identity and language with a lip-sync discriminator loss.
WizardCoder 33B
WizardLM WizardCoder 33B v1.1. Evol-Instruct fine-tune of DeepSeek-Coder-33B with strong code-generation benchmark performance.
XTTS v2
Coqui's XTTS v2 multilingual TTS with voice cloning from 6 seconds of reference audio. Supports 17 languages and emotion transfer.
Yi-Coder 9B
01.AI Yi-Coder 9B chat model. Strong multilingual code completion and chat, 128k context, competitive with code-specialized models 2x its size.
Yi-VL 34B
01.AI Yi-VL 34B vision-language model. Bilingual (CN/EN) image understanding, strong CMMMU and MMMU performance among open-weights VLMs.
ZoeDepth
Intel ZoeDepth metric depth-estimation model. Combines relative-depth pretraining with metric fine-tuning for absolute distance in real units.
Frequently asked questions
How is Replicate pricing handled on Railwail?
Are there rate limits when using Replicate via Railwail?
Which regions does Replicate support through Railwail?
Is there a sandbox or free tier to test Replicate models?
Start building with Replicate today
Free credits on sign-up. No credit card required. Access Replicate and 27+ other providers through a single API.