AI Models
Browse and explore all available AI models.
All AI models, one API — your unified catalog
Railwail gives you a single OpenAI-compatible endpoint that talks to every major AI model on the market — GPT-5 and Claude 4.6 Sonnet for reasoning, Gemini 3 Pro for the longest contexts, FLUX 1.1 Pro for photorealistic images, Veo 3 for video with synced audio, Whisper Large V3 for speech-to-text, ElevenLabs V3 and Cartesia Sonic for voice, Voyage 3 for embeddings, π-0 and OpenVLA for robotics. You pick a model, change one parameter in your request, and ship. No new SDK, no new auth flow, no provider lock-in — the catalog above lists every model we route to, with live per-token prices in EUR and the SLAs we observe in production.
The pricing is transparent and on-demand: you see the input and output rate before you call, you pay per token (or per call, or per second for video and audio), and there are no monthly minimums, no seat fees, and no surprise overage charges. Every new account starts with free credits so you can run real workloads — not just hello-world prompts — before deciding which model fits your product. Switching between flagships is a one-line change: replace `model: "gpt-5"` with `model: "claude-4-6-sonnet"` or `model: "gemini-3-pro"` and the rest of your code keeps working. That same surface covers cheap, fast budget tiers like GPT-5 Mini, Claude Haiku, Gemini Flash, DeepSeek V3, and Qwen 2.5 Coder when latency or unit cost matters more than peak quality.
The infrastructure runs in EU data centers under a DPA — DSGVO-compliant by default, no training on customer prompts, and per-provider data-residency guarantees listed on every model card so your compliance team can sign off without a six-week review. Compared with OpenRouter or Together AI, the differentiation is European hosting, EUR billing, and provider-failover routing that automatically reroutes a request to a healthy backend when a single provider has a regional incident. The catalog covers eight categories — text, image, video, audio, speech-to-text, embeddings, code, multimodal, and vision-language-action robotics — so a single integration handles your chatbot, your image pipeline, your transcripts, and your RAG retriever without juggling five SDKs.
Top picks across all categories
Resolved against the live catalog
€1.00 / 1M input tokens
Learn more500ms p50 latency
Learn moreFeatured this month
Learn more275 models available
Claude Opus 4
Anthropic's most powerful model. Exceptional at complex analysis, agentic tasks, and extended reasoning.
Claude Opus 4.7
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Claude Sonnet 4
Anthropic's most capable model. Excellent for complex analysis, coding, math, and creative writing.
Claude Sonnet 4.6
Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.
Codestral
Mistral's code-specialized model. Optimized for code generation, completion, and understanding across 80+ languages.
DeepSeek V3.1
DeepSeek's refreshed V3.1 release. 671B MoE / 37B active. Tops open-weights leaderboards on coding and reasoning.
DeepSeek V4 Pro
DeepSeek's April 2026 flagship. 1.6T MoE / 49B active params, 1M context, rivals top closed-source models on STEM and coding at a fraction of the price.
Depth Anything v2
Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.
ElevenLabs Multilingual V2
ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.
Flux 1.1 Pro Ultra
FLUX 1.1 Pro in ultra mode. Up to 4 megapixel images with raw mode for photorealism.
Flux Dev
Black Forest Labs' development model. Fast, high-quality image generation with LoRA support.
Gemini 2.0 Flash
Google's fastest multimodal model. Supports text, images, audio, and video input.
Gemini 2.5 Pro
Google's latest thinking model. Excels at reasoning, coding, math, and science with massive context window.
Gemini 3 Flash
Google's April 2026 fast multimodal model. Combines Gemini 3 Pro's reasoning with Flash-tier latency and price. Default model in the Gemini app.
Gemini 3.1 Pro
Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.
Google Imagen 4
Google's Imagen 4. Text-to-image with strong photorealism and improved typography support.
Google Imagen 4 Ultra
Premium Imagen 4 tier. Highest fidelity, prompt adherence and typography quality from Google.
Google Veo 2
Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.
Google Veo 3
Google's Veo 3. High-fidelity text-to-video with native audio generation, up to 8s clips.
Google Veo 3.1
Latest Veo with image-to-video and context-aware audio
GPT-4.1
OpenAI's newest flagship model. Improved reasoning, instruction following, and coding over GPT-4o.
GPT-4o
OpenAI's most capable multimodal model. Excellent for complex reasoning, coding, and creative tasks.
GPT-5.4
OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.
GPT-5.4 Mini
OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.
Grok 4
xAI's flagship reasoning model with vision and tool use. 256k context, strong at complex reasoning and STEM tasks.
Grok 4.3
xAI's May 2026 flagship. 1M context, vision, always-on reasoning, real-time X/web retrieval via DeepSearch.
Ideogram 3.0
Ideogram's flagship text-to-image model with industry-leading text rendering and prompt adherence.
Kimi K2 (Moonshot)
Moonshot AI's 1T-parameter MoE model. Industry-leading agentic coding and tool-use benchmarks.
Kling v3
Cinematic video up to 15s with multi-shot and native audio
Kling v3 Omni
Most versatile: multi-reference images, video editing, native audio
Midjourney V7
The latest Midjourney model. Industry-leading aesthetic quality and prompt adherence for image generation.
MiniMax-01
MiniMax's 456B hybrid lightning-attention model with native 4M-token context. Industry-leading long-context.
MusicGen
Meta's music generation model. Generate up to 1 minute of music from text descriptions.
o3-mini
OpenAI's reasoning model optimized for STEM tasks, coding, and math. Uses chain-of-thought reasoning.
OpenAI Sora 2
OpenAI's second-generation Sora video model. Realistic motion, improved physics, audio support.
Perplexity Sonar Pro
Perplexity's premium web-grounded search model with multi-step reasoning over live sources.
Qwen 3 235B Instruct
Alibaba's Qwen 3 flagship MoE: 235B total / 22B active. Strong reasoning and tool use, open-weights.
Runway Gen 4.5
Top-ranked for motion quality and visual fidelity
SAM 2 (Segment Anything 2)
Meta Segment Anything 2. Promptable segmentation across images and video with temporal memory. Zero-shot, point/box/mask prompts, fast on a single H100.
Sora
OpenAI video generation model. Create realistic and imaginative videos from text prompts up to 20 seconds.
Text Embedding 3 Large
OpenAI's most powerful embedding model. 3072 dimensions for maximum accuracy.
Voyage AI voyage-3
Voyage's general-purpose embedding model. 1024 dims, 32k context, strong retrieval performance.
Whisper Large V3
OpenAI's Whisper model. State-of-the-art speech recognition supporting 99+ languages.
Whisper Large v3 Turbo
OpenAI's distilled Whisper Large v3. ~216x realtime, 99+ languages, MIT-licensed weights.
AI21 Jamba 1.5 Large
AI21's flagship hybrid Mamba-Transformer model with a 256k context window for long-document tasks.
AI21 Jamba 1.5 Mini
Cost-efficient hybrid Mamba-Transformer model with 256k context. Tuned for high-throughput RAG.
AnimateDiff
Plug-and-play motion module that animates personalized Stable Diffusion models without further training. 16-frame clips at 512x512.
AnimateDiff Evolved
Community fork of AnimateDiff with improved motion modules, beta scheduler control and ControlNet integration for richer animation control.
AnimateDiff Lightning
ByteDance distillation of AnimateDiff. 4-step sampling for over 10x faster inference at comparable quality to multi-step base model.
AudioCraft
Meta's AudioCraft framework wrapping MusicGen, AudioGen and EnCodec. Unified text-to-audio research toolkit for music and sound effects.
AudioLDM 2
Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.
AuraFlow v0.3
fal.ai's fully open-source 6.8B flow-based text-to-image model. Up to 1536x1536 resolution.
Bark
Suno's text-to-audio model. Generates realistic speech, music, and sound effects.
BRIA RMBG-1.4
BRIA's first commercial-safe background-removal model. Trained on fully-licensed data, suitable for production e-commerce and design pipelines.
BRIA RMBG-2.0
BRIA's professional background-removal model trained on fully-licensed data. Commercial-safe.
Cartesia Sonic
Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.
CCSR (Content-Consistent SR)
Content-Consistent Super-Resolution model. Reduces hallucination compared to typical diffusion-based upscalers while keeping perceptual quality high.
Champ Human Animation
Champ controllable human image animation. Uses 3D parametric guidance (SMPL) for realistic full-body motion transfer from a single reference image.
Clarity Upscaler
High-resolution image upscaler with creative detail re-imagination via SD-based hallucination. Strong for photography and product shots.
Claude Haiku 3.5
Anthropic's fast and affordable model. Great for quick tasks, summarization, and simple coding.
Claude Haiku 4.5
Anthropic's fastest and cheapest 4.x model. Strong vision and tool use at ultra-low latency, ideal for high-concurrency workloads.
CodeFormer
Robust face-restoration model using a transformer-based codebook prior. Handles severe degradation, occlusion, and old-photo restoration with adjustable fidelity-quality tradeoff.
CogVideoX-5B (open)
Zhipu/Tsinghua's 5B open text-to-video model. 720x480 @ 8fps, 6s clips, image-to-video variant available.
CogVLM2 19B
Tsinghua CogVLM2 19B with Llama-3 8B base plus 11B vision expert. Strong document understanding and visual reasoning, 8k context.
Cohere Aya 23 35B
Open-weights multilingual research model from Cohere covering 23 languages. 35B parameters.
Cohere Command Light (legacy)
Cohere's fast lightweight chat model (deprecated Sep 2025). Kept as comparison tombstone.
Cohere Command R (08-2024)
Cohere's mid-tier RAG/tool model. Cost-efficient sibling of Command R+ with 128k context.
Cohere Command R+ (08-2024)
Cohere's flagship RAG- and tool-optimized chat model. 128k context, refreshed August 2024.
Cohere embed-multilingual-v3
Cohere's multilingual embedding model. Supports 100+ languages with separate search and classification modes.
ControlNet Canny
ControlNet conditioned on Canny edge maps. Preserves composition and outlines while restyling with Stable Diffusion 1.5 or SDXL backbones.
ControlNet Depth
ControlNet conditioned on depth maps. Preserves the 3D scene layout while letting the prompt change style, lighting and content.
DALL-E 3
OpenAI's latest image generation model. Excellent at following complex prompts with high fidelity.
Deepgram Nova-3
Deepgram's flagship STT. First to offer realtime multilingual transcription with self-serve customization.
DeepSeek Coder V2
DeepSeek's specialized coding model. Excellent at code generation, debugging, and explanation.
DeepSeek R1
DeepSeek's reasoning model with chain-of-thought capabilities. Excellent for complex problem-solving.
DeepSeek V3
Powerful open-weight model from DeepSeek. Strong at coding, math, and Chinese/English tasks.
DeepSeek V4 Flash
Efficiency-optimized variant of DeepSeek V4. 284B MoE / 13B active, 1M context, ultra-low pricing for high-throughput workloads.
DeepSeek-VL 7B
DeepSeek-VL 7B chat model. Vision-language model with hybrid vision encoder and strong real-world visual question answering performance.
Detectron2
Meta Detectron2 object-detection and segmentation toolkit. Mask R-CNN, Cascade R-CNN, panoptic FPN and many other model variants in one wrapper.
DINOv2
Meta DINOv2 self-supervised vision backbone. Pretrained features for classification, segmentation and depth without task-specific fine-tuning.
Donut Document
Naver CLOVA Donut OCR-free document-understanding transformer. End-to-end JSON extraction from forms, receipts and invoices without explicit OCR.
Dots OCR
Rednote Hilab Dots OCR. End-to-end document parsing model with layout, text and reading-order prediction in one transformer.
DreamGaussian
Generative Gaussian-splatting model for fast image-to-3D synthesis. Produces textured meshes in two minutes via differentiable rasterization.
DreamGaussian 4D
4D Gaussian-splatting generator extending DreamGaussian to video. Image-conditioned dynamic 3D scenes with view-consistent motion.
DWPose
DWPose whole-body 2D pose estimator. Two-stage knowledge-distilled model with strong accuracy on face, hands and body keypoints simultaneously.
DynamiCrafter
Tencent DynamiCrafter. Animates still images into short videos preserving texture and structure, with strong open-domain coverage.
EasyOCR
JaidedAI EasyOCR. Simple Python OCR wrapper supporting 80+ languages with deep-learning text detection and recognition.
EchoMimic
Ant Group EchoMimic. Lifelike audio-driven portrait animation with editable landmark conditioning for fine-grained motion control.
Edge TTS
Microsoft Edge neural voices accessed via the open-source edge-tts wrapper. 400+ voices across 100+ locales, suitable for batch generation.
ElevenLabs Scribe v1
ElevenLabs' STT. 99 languages, word-level timestamps, speaker diarization, audio-event tagging.
ElevenLabs v3 (alpha)
ElevenLabs' v3 alpha TTS. Most expressive voice model with audio tags and laughter, higher latency.
ESRGAN Classic
Enhanced Super-Resolution GAN, the original 2018 architecture. Produces sharp 4x upscales with strong perceptual quality on natural images.
F5-TTS
Open-source flow-matching TTS with strong zero-shot voice cloning. Code MIT, weights CC-BY-NC.
FILM Frame Interpolation
Google FILM frame interpolation. Synthesizes high-quality intermediate frames between near-duplicate inputs, designed for large motion gaps.
Florence-2 Large
Microsoft Florence-2 Large. Unified prompt-based vision foundation model for captioning, detection, segmentation and OCR with a single 770M-param backbone.
Florence-2 Segmentation
Microsoft Florence-2 unified vision model with referring expression segmentation. Text-prompted region and mask generation in one model.
Flux Schnell
The fastest Flux model. Generate images in under 2 seconds. Great for prototyping.
FLUX.1 [Schnell]
Black Forest Labs' fastest open-weights image model. Apache-2.0 licensed, ~1-4 step inference.
FLUX.1 Canny
FLUX structural control via Canny edge maps. Preserve composition while restyling.
FLUX.1 Depth
FLUX structural control via depth maps. Keep 3D scene layout while changing style/content.
FLUX.1 Fill
Black Forest Labs' inpainting/outpainting model for FLUX. Fill masked regions with prompt-guided content.
FLUX.1 Redux
FLUX image-variation adapter. Generate variations and remixes from a reference image.
Gemini Robotics (2025)
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Gemini Robotics-ER
Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.
Get3D (NVIDIA)
NVIDIA GET3D generative model for textured 3D shapes. Trained on category-specific datasets producing meshes with high-quality textures.
GFPGAN v1.4
Tencent ARC face-restoration GAN. Reconstructs realistic facial detail in low-quality or compressed photos using a pretrained StyleGAN2 prior.
GLPN Depth
Global-Local Path Networks depth-estimation model. Combines hierarchical transformer encoder with selective feature fusion for sharp boundaries.
Google RT-2-X
Google's VLA from RT-X collaboration. Trained on Open-X-Embodiment (22 robots, 527 skills), positive transfer.
Google Veo 3 Fast
Faster cheaper Veo 3 with audio
Google Veo 3.1 Fast
Faster Veo 3.1 with image-to-video and audio
GOT-OCR 2.0
StepFun GOT-OCR 2.0. Unified end-to-end OCR-2.0 model handling text, formulas, charts, sheet music and geometric shapes in one architecture.
GPT-4o Mini
Small, fast, and affordable model for lightweight tasks. Great balance of speed and capability.
GPT-5.4 Nano
OpenAI's smallest and cheapest GPT-5.4 variant. Built for high-volume classification, extraction and coding subagents at edge-grade latency.
Granite Code 20B
IBM Granite 20B Code Instruct. Larger Granite code model balancing quality and inference cost for enterprise CI/CD code-review automation.
Granite Code 34B
IBM Granite 34B Code Instruct. Largest Granite code-instruction model. Top-tier among Apache-2.0 code LLMs on HumanEval, MBPP and MultiPL-E.
Granite Code 3B
IBM Granite 3B Code Instruct. Apache-2.0 small code-instruction model. Strong on Python, Java, JavaScript and Go for enterprise IDE integrations.
Granite Code 8B
IBM Granite 8B Code Instruct. Trained on permissively-licensed code, strong on multi-language code completion and instruction-following.
Grok 2 Vision
xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.
Grok 3
xAI's flagship model. Strong at reasoning, coding, and real-time knowledge with web search capabilities.
Grok 4.1 Fast
xAI's cost-efficient high-throughput model. 2M context, optional reasoning, optimized for agentic loops and real-time apps.
Grok Imagine Video
xAI video with native audio and lip-sync, up to 15s
Grounded-SAM
Grounding DINO plus SAM. Open-vocabulary text-prompted detection and segmentation in one pipeline for fully-automatic mask generation.
Hailuo / MiniMax Video-01
MiniMax's Hailuo video-01. 6s 1280x720 clips with strong cinematic motion and physical realism.
Hailuo 2.3
Minimax model for realistic human motion and VFX
HRNet Pose
Microsoft HRNet high-resolution pose-estimation backbone. Parallel multi-resolution streams yield strong accuracy on COCO keypoint benchmarks.
Hunyuan3D 2.0
Tencent's Hunyuan3D 2.0 image-to-3D pipeline. Two-stage shape and texture generation producing high-resolution textured meshes.
Hunyuan3D 2.1
Refreshed Hunyuan3D 2.1 with improved texture fidelity and PBR-material support. Image-to-3D with textured GLB output.
HunyuanVideo
Tencent's 13B open-weights video diffusion transformer. SOTA among open video models at release.
HunyuanVideo
Tencent's open-source video generation model. Strong visual quality with diverse style support.
Idefics3 8B
Hugging Face Idefics3 8B. Llama-3 based open-source vision-language model with strong document QA and chart-understanding performance.
Ideogram 2.0 Turbo
Ideogram's fast text-to-image variant. Strong typography and logo rendering at low latency.
InstantMesh
Image-to-3D mesh generator from sparse-view diffusion. Produces textured meshes in under one minute on a single A100.
InstructPix2Pix
Berkeley InstructPix2Pix. Edits an image from natural-language instructions in a single forward pass. Trained on GPT-3 plus Stable Diffusion synthetic pairs.
InternVL 2.5
OpenGVLab InternVL 2.5 78B. Open-source vision-language model approaching GPT-4o on MMMU, OCRBench and Math-Vista benchmarks.
IP-Adapter FaceID Plus v2
Tencent's face-identity conditioning adapter for SD/SDXL. Face embedding + CLIP for ID-consistent generation.
Janus Pro 7B
DeepSeek's unified multimodal model. Decouples vision encoding for both understanding and generation tasks.
Jina Embeddings v3 (Multilingual)
Jina's frontier multilingual embedding model. 570M params, 8192 ctx, 89 languages, Matryoshka dims 128-1024.
Kling 1.6 Pro
Kuaishou's Kling 1.6 Pro. Premium cinematic motion and physics realism, ~$0.07/sec.
Kokoro TTS 82M
Open-weights 82M-parameter TTS. Punches above its size class on naturalness benchmarks at a fraction of the inference cost of larger models.
Kuaishou Kolors
Kuaishou's bilingual (CN/EN) latent diffusion text-to-image model with strong text rendering.
LayoutLMv3
Microsoft LayoutLMv3 multimodal document model. Unified text/image masking pretraining for form understanding, receipts and document QA.
LeRobot SmolVLA
HuggingFace's 450M VLA pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.
LivePortrait
Kuaishou LivePortrait. Efficient portrait animation driven by reference videos with stitching, retargeting and motion-control parameters.
Llama 3.2 90B Vision (multimodal)
Meta's flagship vision-language model. 90B parameters, image understanding + chat, strong VQA performance.
Llama 3.2 Vision 90B
Meta Llama 3.2 90B Vision. Largest open-weights Llama vision model. Strong visual reasoning, chart, OCR and document understanding.
Llama 3.3 70B
Meta's open-source 70B parameter model. Strong all-around performance with multilingual support.
LLaVA-OneVision 72B
LMMs-Lab LLaVA-OneVision 72B. Unified single-image, multi-image and video instruction-tuned VLM with task-transfer across modalities.
Lotus-G
Lotus generative depth model. Treats depth as a generation task using a diffusion model, producing higher-fidelity depth on textured surfaces.
LTX-Video (Lightricks)
Lightricks' 2B DiT video model. Realtime generation on consumer GPUs (~6s @ H100, 24fps).
Luma Dream Machine v1.6
Luma's Dream Machine 1.6. 720p text/image-to-video with strong motion and camera control.
Luma Ray Flash 2
Fast affordable video with I2V support
M2M-100 12B
Meta M2M-100 12B many-to-many translation model. Direct translation between 100 languages without pivoting through English.
MADLAD-400 3B
Google MADLAD-400 3B multilingual translation model. 419 languages supported, trained on a 5T-token multilingual corpus with strong low-resource performance.
MagicAnimate
ByteDance MagicAnimate. Temporally consistent human-image animation driven by a DensePose motion sequence with strong identity preservation.
Magicoder S CL 7B
UIUC Magicoder S CL 7B. CodeLlama-7B fine-tuned with OSS-Instruct synthetic data. Strong HumanEval Plus and MBPP Plus performance per parameter.
MAGNeT MusicGen
Meta MAGNeT non-autoregressive music generator. Up to 7x faster than MusicGen with comparable quality via masked generative transformers.
Magnific-Style Upscaler
Detail-hallucinating upscaler in the Magnific style. Adds plausible high-frequency texture using a Stable Diffusion refiner conditioned on the low-res input.
Marigold
ETH Zurich Marigold. Diffusion-based monocular depth-estimation model fine-tuned from Stable Diffusion with strong fine-detail recovery.
Marker PDF Extract
Marker PDF-to-Markdown conversion pipeline. Combines layout, OCR and equation models to produce clean Markdown with preserved tables and formulas.
Mask2Former
Meta Mask2Former universal image-segmentation transformer. Single architecture for panoptic, instance and semantic segmentation tasks.
mBART 50 Many-to-Many
Meta mBART-50 many-to-many translation model. 50 supported languages with strong performance on news and conversational text.
MediaPipe Pose
Google MediaPipe Pose. Lightweight on-device-friendly 33-keypoint 3D pose estimator with optional segmentation mask output.
Microsoft Phi-3.5 MoE Instruct
Mixture-of-experts Phi-3.5: 42B total / 6.6B active params. 128k context, multilingual.
MiDaS v3.1
Intel MiDaS v3.1 relative depth-estimation model. Robust zero-shot single-image depth across diverse domains and resolutions.
MiniCPM-V 2.6
OpenBMB MiniCPM-V 2.6. 8B vision-language model with strong single-image, multi-image and video understanding plus OCR capabilities.
Minimax Video
MiniMax's video generation model. Fast, high-quality video output with text-to-video capabilities.
Mistral Large
Mistral's flagship model. Strong reasoning, multilingual, and coding capabilities.
Mistral OCR
Mistral OCR API. Document-understanding model with strong table and equation extraction, and structured JSON output.
Mistral Pixtral Large (124B)
Mistral's 124B multimodal flagship. 123B decoder + 1B vision encoder, 128k ctx, up to 30 images per request.
MMPose
OpenMMLab MMPose toolbox. Wraps RTMPose, HRNet, HigherHRNet and many other pose models behind a unified inference API.
Mochi 1
Genmo's 10B open-weights text-to-video model. AsymmDiT architecture, 5.4s @ 480p.
MOFA-Video
Motion-Field-Adapter video generator. Controllable image animation from trajectories, keypoints or audio with a strong identity preservation prior.
MuseTalk
Tencent MuseTalk real-time lip-sync model. Audio-driven mouth-region editing in latent space at 30+ fps on a single GPU.
MusicGen Large
Meta's 3.3B-parameter MusicGen Large. Text-conditioned music generation with single-stage autoregressive transformer, supports melody conditioning.
MusicGen Medium
Meta MusicGen Medium (1.5B params). Strong quality-to-speed tradeoff for text-to-music with optional melody guidance.
MusicGen Small
Meta MusicGen Small (300M params). Fast text-to-music generation suitable for prototyping and low-latency demos.
mxbai-embed-large-v1
Mixedbread's open-source 335M embedding model. Top MTEB benchmark for English retrieval at release.
NLLB-200 3B
Meta's No Language Left Behind 3.3B translation model. Direct translation between any pair of 200+ languages including many low-resource African and Asian languages.
NLLB-200 Distilled 600M
Meta's distilled 600M NLLB. Same 200-language coverage as the 3B model with a fraction of the parameters, ideal for edge or high-throughput deployment.
Nous Hermes 3 405B
Full-parameter fine-tune of Llama 3.1 405B by Nous Research. Steerable, uncensored, strong tool use.
Nous Hermes 3 70B
Llama-3.1-70B fine-tune from Nous Research with strong tool/agent capabilities and uncensored alignment.
NVIDIA Cosmos-Predict-1
NVIDIA's world foundation model for physical AI. Diffusion-based video prediction for robotics simulation.
Octo Base
Berkeley/Stanford 93M transformer diffusion policy. Pretrained on 800k Open-X-Embodiment episodes.
Octo Small
Compact 27M variant of Octo. Faster inference on consumer GPUs, designed for low-latency control.
olmOCR
Allen AI olmOCR. Open-source 7B vision-language model fine-tuned for high-fidelity document parsing including math, code and tables.
OpenAI TTS-1
OpenAI's text-to-speech model. Six built-in voices with natural intonation.
OpenAI TTS-1 HD
OpenAI's high-definition TTS model. Better quality for production use cases.
OpenPose
CMU OpenPose multi-person 2D pose estimator. Real-time keypoint detection for body, hand, face and foot using Part Affinity Fields.
OpenVLA-7B
Stanford/Berkeley open VLA trained on 970k Open-X-Embodiment episodes. Supports LoRA fine-tuning.
OpenVoice v1
MyShell OpenVoice v1. Cross-lingual voice cloning with flexible style control: emotion, accent, rhythm, pauses, and intonation.
OpenVoice v2
MyShell OpenVoice v2. Multilingual zero-shot voice cloning with accurate tone-color reproduction and style/emotion control.
PaddleOCR v3
Baidu PaddleOCR v3 PP-OCR pipeline. Lightweight detector plus recognizer optimized for production use with 80+ language support.
Parler-TTS
Hugging Face Parler-TTS Mini. Lightweight TTS conditioned on a natural-language style description for fine-grained control over voice characteristics.
Parler-TTS Large
Parler-TTS Large v1. 2.2B parameters, natural-language style prompting and improved prosody over the Mini variant.
Perplexity Sonar
Perplexity's fastest and cheapest web-grounded chat model. Live-source citations included.
Perplexity Sonar Reasoning
Perplexity's reasoning model with chain-of-thought and integrated web search.
Phi-3.5 Vision
Microsoft Phi-3.5 Vision Instruct. Small (4.2B) multimodal model with strong document, OCR and multi-image reasoning at low cost.
Phind CodeLlama 34B v2
Phind CodeLlama 34B v2. Highly tuned CodeLlama variant focused on retrieval-augmented developer assistant workflows.
PhotoMaker
Tencent ARC PhotoMaker. Identity-preserving stylized photo generation from a stacked-ID embedding. Realistic re-styling of a subject in seconds.
Physical Intelligence Pi-0-FAST
Autoregressive π-0 variant using FAST action tokenizer. Faster inference at competitive task success.
Physical Intelligence π-0
Physical Intelligence's flagship VLA flow-matching policy. Generalist robot control, pretrained on 10k+ hrs robot data.
Physical Intelligence π-0.5
Upgraded π-0 with open-world generalization via knowledge insulation. Weights and fine-tuning open-sourced.
Pika 2.0 (Official)
Pika Labs' 2.0 release. Cinematic text/image-to-video with scene composition controls.
PixVerse v5.6
Physics-accurate video generation up to 1080p
Playground v3 (Design)
Playground's text-to-image model focused on graphic design aesthetics and embedded typography.
PlayHT 2.0
PlayHT's 2.0 generative voice model. Multi-lingual expressive speech synthesis with sub-second latency and high-fidelity voice cloning.
Point-E
OpenAI Point-E text-to-point-cloud system. Fast 3D point-cloud generation from text, optionally lifted to a mesh via marching cubes.
Qwen 2.5 72B
Alibaba's powerful open-source model. Excellent at coding, math, and multilingual tasks.
Qwen 2.5-Max
Alibaba's flagship pretrained MoE model. Top-tier reasoning and code performance via DashScope API.
Qwen2-VL-72B Instruct
Alibaba's 72B vision-language model with M-RoPE and dynamic resolution. Strong document and video understanding.
RDT-1B
Tsinghua's 1B diffusion-transformer bimanual manipulation policy. Predicts next 64 actions per inference.
Real-CUGAN
Real-CUGAN anime-focused upscaler. 2x/3x/4x super-resolution tuned for animation, line-art, and illustrated content.
Real-ESRGAN 4x
AI-Upscaler that increases image resolution up to 4x while preserving texture and detail. Trained on synthetic and real data to reduce common ESRGAN artifacts.
Real-ESRGAN Anime 4x
Real-ESRGAN variant fine-tuned for anime, manga, and illustrated artwork. 4x upscaling with cartoon-aware artifact suppression.
Recraft V3
State-of-the-art image generation optimized for design and branding. SVG vector output support.
Recraft V3 Realistic
Recraft's high-prompt-adherence raster image model. Strong layout control and brand-style consistency.
Recraft V3 SVG
Recraft's vector/SVG generation model. Editable illustrations and icons from text.
Reka Core
Reka's frontier multimodal model supporting text, image, video and audio inputs.
Reka Edge
Reka's small on-device-friendly multimodal model. ~7B parameters, 16k context.
Reka Flash
Reka's 21B dense multimodal model balancing speed and quality. Up to 128k context.
Rembg
Open-source background-removal tool wrapping U2Net. Produces alpha mattes for photos, products and people with no manual masking.
RIFE Frame Interpolation
Real-Time Intermediate Flow Estimation. Doubles or quadruples FPS of an existing video via learned optical-flow-based frame interpolation.
Riffusion
Stable-Diffusion-based real-time music generator. Operates on spectrogram images then resynthesizes audio, enables seamless transitions and looping.
Runway Gen-3 Alpha Turbo
Runway's faster, cheaper Gen-3 variant. Image-to-video at 5 credits/sec (~$0.05/sec).
RVC Voice Conversion
Retrieval-based Voice Conversion. Converts a source recording into a target speaker's voice, preserving pitch, prosody and rhythm.
SadTalker
Stylized audio-driven talking-head generator. Synthesizes 3D motion coefficients from audio to animate a single portrait image with natural head movements.
SAM HQ
ETH Zurich SAM-HQ. High-quality mask refinement on top of SAM. Sharper edges and finer structure than the original Segment Anything model.
SeamlessM4T v2 Large (Speech)
Meta SeamlessM4T v2 Large speech mode. Speech-to-speech, speech-to-text, and text-to-speech translation across 100+ languages in a single unified model.
SeamlessM4T v2 Large (Text)
Meta SeamlessM4T v2 Large. Universal multilingual translation across 100+ languages with text-to-text mode for documents and chat.
Seedance Lite
Budget ByteDance video, fast and cheap
Seedance Pro
ByteDance video with T2V and I2V, up to 1080p
Segformer B5
NVIDIA SegFormer-B5 semantic segmentation. Hierarchical transformer encoder with lightweight MLP decoder, strong ADE20k and Cityscapes results.
Shap-E (OpenAI)
OpenAI Shap-E text/image to 3D. Generates implicit neural representations renderable as textured meshes or NeRFs.
Snowflake Arctic Instruct
Snowflake's open MoE model: 480B total / 17B active params with dense+MoE hybrid architecture.
Spark TTS
Spark efficient TTS with disentangled control over speaker, content and style. Strong cross-lingual zero-shot performance.
Stable Audio 2
Stability AI's Stable Audio 2.0. Text-to-music up to 3 minutes of full-length, structured tracks at 44.1 kHz.
Stable Diffusion 3.5 Large (Stability)
Stability AI's 8B-parameter flagship SD3.5 model. Strong prompt adherence and aesthetic quality.
Stable Diffusion 3.5 Large Turbo
Distilled 4-step variant of SD3.5 Large. 8B params, ~4x faster inference at competitive quality.
Stable Diffusion 3.5 Medium
Stability AI's 2.5B-parameter SD3.5 with strong quality/speed trade-off. Consumer-GPU friendly.
Stable Diffusion XL
Stability AI's SDXL model via Replicate. High-quality image generation with extensive customization.
StarCoder2 15B
BigCode StarCoder2 15B code-generation flagship. Trained on 4T tokens of Stack v2 data with grouped-query attention and 16k context.
StarCoder2 3B
BigCode StarCoder2 3B code-generation model. Trained on The Stack v2, supports 600+ programming languages. Apache-2.0 licensed for commercial use.
StarCoder2 7B
BigCode StarCoder2 7B code-generation model. 16k context, 600+ programming languages, strong fill-in-the-middle (FIM) performance.
StreamingT2V
Picsart StreamingT2V. Generates long, consistent videos by chaining short autoregressive clips with motion and appearance memory.
StyleTTS 2
Style-based TTS using diffusion and adversarial training. Human-level naturalness in zero-shot voice synthesis from a 3-5s reference clip.
Suno Bark
Suno's text-prompted generative audio model. Speech, music, ambient sound and effects with non-verbal cues like laughter or sighs.
SUPIR Upscaler
SUPIR (Scaling-Up Image Restoration) photo-real restoration model. Combines SDXL prior with language-guided controls for severely degraded inputs.
Swin2SR
Transformer-based image super-resolution using Swin-V2 attention. Handles classical, lightweight, real-world, and compressed-input variants with 2x/4x upscaling.
SwinIR Video
SwinIR transformer-based super-resolution and denoising applied per-frame to video. Handles classic, real-world and lightweight upscaling.
T2I-Adapter Color
Tencent T2I-Adapter color-guided generation for SDXL. Lightweight adapter that conditions image generation on a color reference image.
Text Embedding 3 Small
OpenAI's compact embedding model. 1536 dimensions, great for semantic search and RAG.
TII Falcon 180B Chat
TII's 180B causal decoder chat model fine-tuned on Ultrachat, Platypus and Airoboros.
ToonCrafter
Tencent ToonCrafter generative cartoon interpolation model. Synthesizes smooth in-between frames between two cartoon keyframes.
Tortoise TTS
Multi-voice expressive TTS. Slow but high-quality with strong prosody and natural intonation. Trained for long-form narration use cases.
TowerInstruct 13B
Unbabel TowerInstruct 13B. Llama-2-based multilingual translation and post-editing model. Strong terminology consistency for enterprise localization.
Transparent Background
PyTorch background-removal tool supporting multiple modes: base, fast and high-quality. Produces RGBA outputs and is suitable for batch processing.
TRELLIS (3D)
Microsoft TRELLIS image-to-3D model. Generates textured 3D assets in GLB or Gaussian-splat format from a single reference image.
TripoSR
Stability AI and Tripo single-image 3D reconstruction model. Generates 3D meshes from a single image in roughly half a second.
TrOCR Large
Microsoft TrOCR large transformer-based OCR. End-to-end visual encoder plus text decoder, trained on synthetic and printed real-world data.
U2Net Saliency
Salient-object detection network used for background removal and matting. Nested U-Net architecture trained on DUTS-TR for general scenes.
Udio V1.5
AI music generation with studio-quality output. Generate full songs with vocals, instruments, and production.
V-Express
Tencent V-Express. Audio-driven portrait animation with progressive training, weak-condition learning, and expressive lip sync.
VideoCrafter
Tencent VideoCrafter latent video diffusion. Text-to-video and image-to-video generation up to 2s at 1024x576 with strong motion fidelity.
ViTPose
ViTPose plain-vision-transformer pose estimator. State-of-the-art keypoint accuracy on MS-COCO with a minimal architecture.
Voyage AI voyage-code-3
Voyage's code-specialized embedding model. Up to 32k context, Matryoshka 256-2048 dims, int8/binary support.
Wan 2.1 (Alibaba)
Alibaba's Wan 2.1 open-weights video diffusion model. 14B MoE-based, supports T2V and I2V.
Wan 2.2 Image-to-Video
Ultra-cheap I2V. Upload image and animate it.
Wan 2.2 Text-to-Video
Ultra-cheap T2V for pennies
Wav2Lip
Lip-sync model that re-syncs a target video's lip movement to an arbitrary audio track. Robust to identity and language with a lip-sync discriminator loss.
WizardCoder 33B
WizardLM WizardCoder 33B v1.1. Evol-Instruct fine-tune of DeepSeek-Coder-33B with strong code-generation benchmark performance.
XTTS v2
Coqui's XTTS v2 multilingual TTS with voice cloning from 6 seconds of reference audio. Supports 17 languages and emotion transfer.
Yi Large
01.AI's larger general-purpose chat model with 32k context window and strong bilingual performance.
Yi-Coder 9B
01.AI Yi-Coder 9B chat model. Strong multilingual code completion and chat, 128k context, competitive with code-specialized models 2x its size.
Yi-VL 34B
01.AI Yi-VL 34B vision-language model. Bilingual (CN/EN) image understanding, strong CMMMU and MMMU performance among open-weights VLMs.
ZoeDepth
Intel ZoeDepth metric depth-estimation model. Combines relative-depth pretraining with metric fine-tuning for absolute distance in real units.
Frequently asked questions
Explore more
Dig deeper into pricing, benchmarks, and tooling.
Start Building with AI
Access all models through a single API. OpenAI-compatible, no vendor lock-in.