Google Veo 3.1
Latest Veo with image-to-video and context-aware audio
Google Veo 3.1 is a video generation AI model from Google DeepMind, priced at €0.000 per 1M input tokens with an unknown context window.
Examples
See what Google Veo 3.1 can generate
Aerial
"Sweeping aerial view of coastal city at golden hour"
Pricing
API Integration
Use our OpenAI-compatible API to integrate Google Veo 3.1 into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("veo-3-1", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("veo-3-1", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("veo-3-1", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Google DeepMind's Google Veo 3.1
Google DeepMind, formed in 2023 from the merger of Google Brain and DeepMind under Demis Hassabis, runs the Veo video-generation programme. After the headline launch of Veo 3 (May 2025, Google I/O) with native audio, the team shipped Veo 3.1 in late 2025 as an incremental quality and capability upgrade. Veo 3.1 improves prompt adherence, motion physics, character consistency across extended sequences and audio fidelity (including richer ambient soundscapes and more controllable dialogue). The model is exposed via Vertex AI, the Gemini API, Google Labs (VideoFX, Whisk) and the Flow filmmaking surface aimed at professional creators.
Veo 3.1 retains the joint audio-video diffusion architecture introduced in Veo 3: video is encoded into a spatio-temporal latent space via a 3D causal VAE and denoised by a transformer-based diffusion model, while a coupled audio diffusion module generates synchronized music, ambient sound and dialogue. Improvements in 3.1 are reported across motion physics (water, fabric, crowds), audio quality and the ability to stay consistent across extended sequences via reference-frame and subject-reference conditioning. Native clips run up to 8 seconds at 1080p with extensions for longer sequences and a separate cascaded super-resolution stage for 4K. Text conditioning uses Gemini-family encoders; image and reference-frame conditioning add identity and layout control. Training expands on the Veo 3 corpus with additional curated multilingual audio-video data and refined recaptioning.
- Parameters: Undisclosed
- Context: unknown
- All Veo 3 features plus improved physics, audio fidelity and consistency
- Up to 8-second 1080p clips natively, 4K via cascaded super-resolution
- Reference-frame and subject-reference conditioning
- Joint audio-video diffusion with music, ambient sound and dialogue lip-sync
- Rich cinematographic prompt vocabulary
- Multilingual prompts via Gemini text encoders
- Available via Vertex AI, Gemini API, VideoFX, Whisk and Flow
- SynthID audio + visual watermarking
- Best for: high-end commercial creative, longer sequences, branded campaigns with sound.
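To make the "cinematographic prompt vocabulary" point concrete, here is a minimal sketch of composing a prompt from separate fields. The helper `buildVideoPrompt` and its field names (`movement`, `shot`, `subject`, `lighting`) are hypothetical and are not part of the railwail SDK or any Google API; the sketch only assembles a string in the style of the aerial example shown earlier on this page.

```javascript
// Hypothetical helper: not part of the railwail SDK or any Google API.
// It only joins descriptive fields into a single prompt string.
function buildVideoPrompt({ movement, shot, subject, lighting } = {}) {
  // Camera movement and shot type come first, e.g. "Sweeping aerial view".
  const camera = [movement, shot].filter(Boolean).join(" ");
  const parts = [
    camera,
    subject ? `of ${subject}` : "",
    lighting ? `at ${lighting}` : "",
  ];
  // Drop empty fields so missing values do not leave stray spaces.
  const text = parts.filter(Boolean).join(" ");
  if (!text) {
    throw new Error("prompt needs at least one field");
  }
  return text;
}

const prompt = buildVideoPrompt({
  movement: "Sweeping",
  shot: "aerial view",
  subject: "coastal city",
  lighting: "golden hour",
});
console.log(prompt);
```

Keeping the fields separate may make it easier to vary one element, such as the lighting, while holding the rest of the shot description fixed.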
Training data: Expanded curated multilingual audio-video corpus including licensed footage, public web video (including YouTube under Google's terms) and synthetic data, with multi-granularity captions from Gemini vision-language models.
License: Proprietary commercial licence via Google Cloud / Vertex AI and Gemini API; commercial use under Google's generative-AI terms; mandatory SynthID watermarking.
Known limitations
- 8-second native clip limit
- Audio is short-form and English-leaning
- Strict moderation on people, brands and political content
- Closed model with no peer-reviewed paper
- Per-clip cost higher than Veo 3 Fast or Veo 3.1 Fast tiers
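Because of the 8-second native clip limit, a longer sequence has to be planned as several segments. The following is a rough sketch of that arithmetic only; `planClips` is a hypothetical helper, and how segments are actually extended or stitched together is not specified on this page.

```javascript
// Hypothetical planning helper; the 8-second cap is the native clip limit
// listed above. It does not call any API, it only computes segment timings.
const MAX_CLIP_SECONDS = 8;

function planClips(totalSeconds, maxClip = MAX_CLIP_SECONDS) {
  if (!Number.isFinite(totalSeconds) || totalSeconds <= 0) {
    throw new RangeError("totalSeconds must be a positive number");
  }
  if (!Number.isFinite(maxClip) || maxClip <= 0) {
    throw new RangeError("maxClip must be a positive number");
  }
  const clips = [];
  let start = 0;
  while (start < totalSeconds) {
    // The final segment may be shorter than the cap.
    const duration = Math.min(maxClip, totalSeconds - start);
    clips.push({ index: clips.length, start, duration });
    start += duration;
  }
  return clips;
}

// A 20-second sequence splits into segments of 8, 8 and 4 seconds.
console.log(planClips(20));
```

The segment count also gives a quick way to estimate cost when pricing is per clip: multiply the number of planned segments by the per-clip price for the chosen tier.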
Related Models
Google Veo 2
Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.
Google Veo 3
Google's Veo 3. High-fidelity text-to-video with native audio generation, up to 8s clips.
Kling v3
Cinematic video up to 15s with multi-shot and native audio
Kling v3 Omni
Most versatile: multi-reference images, video editing, native audio
Start using Google Veo 3.1 today
Get started with free credits. No credit card required. Access Google Veo 3.1 and 100+ other models through a single API.