The Score: how Veo 3.1 plus Eleven Music finally make AI video feel finished

Silent video is half a story

For the last two years, "AI video" has meant a 4–8 second clip with no audio. A neat tech demo, but useless for a real social post — no voiceover, no music, no foley, nothing. You'd render the clip, drop it into CapCut, manually add a backing track, manually time-sync the VO, and ship.

Vephon 2.0 closes that gap with three upgrades: video with native audio, generated music, and expressive voice.

Veo 3.1 with native audio

We've moved video generation to Google's Veo 3.1 and exposed all three tiers:

Light — $0.05/sec. Use for preview renders and review cycles.
Fast — $0.15/sec. The default, balanced quality.
Flagship — $0.40/sec. 4K with native audio synthesis.
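To make the per-second pricing concrete, here is a minimal sketch of the arithmetic. The prices come from the tier list above; the function name and the dictionary keys are illustrative and not part of any Vephon or Veo API.

```python
from decimal import Decimal

# Per-second prices copied from the tier list above.
# The lowercase tier keys are illustrative names, not API identifiers.
PRICE_PER_SECOND = {
    "light": Decimal("0.05"),
    "fast": Decimal("0.15"),
    "flagship": Decimal("0.40"),
}


def render_cost(tier: str, seconds: int) -> Decimal:
    """Return the dollar cost of rendering `seconds` of video on `tier`."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return PRICE_PER_SECOND[tier] * seconds


# Compare what a one-minute render costs on each tier.
for tier in PRICE_PER_SECOND:
    print(f"{tier:>8}: ${render_cost(tier, 60):.2f} for 60 seconds")
```

At these rates a 60-second render comes to $3.00 on Light, $9.00 on Fast, and $24.00 on Flagship, which is why iterating on Light and rendering the final cut on Flagship is the cheaper workflow.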

Veo 3.1's flagship tier generates audio with the video — ambient sound, on-screen voice, motion-synced foley — instead of leaving you to mix it in post. The audio is timed to the visual, not slapped on afterwards.

We also added scene extension: take any existing clip, supply the last frame as the anchor, and Veo extends the scene by 8–30 more seconds while preserving lighting, camera, and character. The 30-second cap on AI video disappears.
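If you want a longer piece, one way to think about it is as a chain of extensions. The sketch below plans how many extensions a target length needs, assuming each extension call adds between 8 and 30 seconds as described above. The function `plan_extensions` is hypothetical; it only does the arithmetic and does not call Veo or Vephon.

```python
import math

MIN_EXTENSION = 8   # seconds, lower bound quoted above
MAX_EXTENSION = 30  # seconds, upper bound quoted above


def plan_extensions(current_seconds: int, target_seconds: int) -> list[int]:
    """Split the missing runtime into extension lengths of 8 to 30 seconds."""
    remaining = target_seconds - current_seconds
    if remaining <= 0:
        return []  # The clip is already long enough.
    if remaining < MIN_EXTENSION:
        # The shortest allowed extension overshoots; trim the tail afterwards.
        return [MIN_EXTENSION]
    # Use as few extensions as possible, then spread the time evenly
    # so that lengths differ by at most one second.
    count = math.ceil(remaining / MAX_EXTENSION)
    base, extra = divmod(remaining, count)
    return [base + 1 if i < extra else base for i in range(count)]


print(plan_extensions(8, 60))  # [26, 26]
```

Spreading the time evenly, instead of taking 30 seconds and leaving a short remainder, keeps every planned extension inside the 8 to 30 second range.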

Eleven Music for the soundtrack

The second piece of "feeling finished" is the music. We integrated Eleven Music so you can request a track by style and mood:

upbeat lofi hip-hop for a coffee shop, 110 BPM,
warm vinyl, jazz piano, no vocals

Tracks run up to 5 minutes. Add lyrics and the vocals can use a voice clone, so a brand mascot's singing voice matches its narration voice. The model is trained on licensed data, so tracks are cleared for commercial use.

Voice with emotion

The third leg of the stool is voice. We're now on ElevenLabs v3, the expressive model that interprets inline emotion tags:

[whispering] I shouldn't be telling you this — [normal]
but the lavender oat-milk latte is back. [excited]
And we made twelve mugs to give away.

Tags like [whispering], [excited], [sighs], and [laughs] direct how a line is performed; they are not read aloud as text. We also ship a sanitizer that strips unrecognized brackets, so script authors can experiment without the model reading "[verb=cook]" out loud.
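The idea behind such a sanitizer can be sketched in a few lines. This is not Vephon's actual implementation: the allowlist below is an assumption built from the tags mentioned in this post, and the real set of tags ElevenLabs v3 recognizes is larger and should be taken from its documentation.

```python
import re

# Hypothetical allowlist: the tags named in this post, including [normal]
# from the sample script. The real list of supported tags may differ.
ALLOWED_TAGS = {"whispering", "excited", "sighs", "laughs", "normal"}

# Matches one bracketed chunk with no nested brackets, e.g. "[excited]".
BRACKET = re.compile(r"\[([^\[\]]*)\]")


def sanitize_script(script: str) -> str:
    """Keep allowlisted emotion tags and drop every other bracketed chunk."""

    def keep_or_drop(match: re.Match) -> str:
        tag = match.group(1).strip().lower()
        # Re-emit known tags in normalized form; replace the rest with nothing.
        return f"[{tag}]" if tag in ALLOWED_TAGS else ""

    cleaned = BRACKET.sub(keep_or_drop, script)
    # Collapse the double spaces left behind by removed chunks.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()


print(sanitize_script("[whispering] Fresh today [verb=cook] only. [laughs]"))
# [whispering] Fresh today only. [laughs]
```

An allowlist is the safer design choice here: a new or misspelled tag is silently dropped instead of being spoken, at the cost of having to update the list when the voice model adds tags.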

What this changes

The practical implication: a one-minute branded short can now be written, generated, scored, and narrated entirely inside Vephon — no post-production tools required.

The aesthetic implication: the baseline for AI video output just rose. Synced audio, a soundtrack, and expressive narration used to require an editing pass in CapCut; now they are part of the prompt.

We'll keep pushing — Kling 3.0's reference-video lip-sync is on the roadmap, as is dynamic time-syncing of generated music to scene cuts. But Vephon 2.0 is where AI video stopped being mute.