Google’s Omni AI Pushes Video Generation Toward Hyper-Realism, but Consistency Remains Elusive

The Shift Toward ‘Anything-to-Anything’
Google is pivoting its generative AI strategy toward a more fluid, multimodal future. At the center of this shift is Omni, a new family of generative models designed to eventually bridge the gap between any input—be it text, image, or audio—and any output. While the long-term vision is a universal translator of media, the current rollout focuses on a specific, high-stakes frontier: high-fidelity video.
The first iteration, Omni Flash, has landed in Google’s AI video generation and editing platform, Flow. For those who have spent time with Google’s previous model, Veo, the jump to Omni isn’t just about higher resolution; it’s about a fundamental change in how the AI handles context and character persistence. In theory, Omni is designed to possess a deeper understanding of real-world physics and object permanence, allowing users to maintain a consistent subject across multiple scenes.
The Struggle for Consistency
In practice, however, the transition from ‘impressive’ to ‘reliable’ remains fraught. Testing the model’s ability to maintain a character (in this case, a specific stuffed toy) reveals a frustrating dichotomy. On one hand, Omni follows prompts significantly more closely than Veo did just five months ago. On the other, it is still prone to the ‘AI jump scare’: those sudden, jarring shifts in geometry or orientation that shatter the illusion of reality.
One attempt to create a playful narrative involving a vacationing plush deer highlighted the model’s erratic logic. While the AI successfully conceptualized a ‘joke’—having the character mistake a jar of honey for sunscreen—the execution was visually chaotic. The honey container morphed unpredictably from a jar to a clear plastic squeeze bottle and back again within the same clip. The sequence ended in a visual collapse, where the model seemed to simply ‘vomit’ various elements of the previous scenes into a final, incoherent frame.
Editing tools have also seen an upgrade. Where Veo often required users to scrap a project and start from scratch to implement a change, Omni’s text-based editing is more responsive. Yet this responsiveness is a double-edged sword. Prompting the model to remove an unwanted detail in one scene can trigger a ‘hallucination’ that adds that very detail to every other scene in the sequence.
The Uncanny Valley of Deepfakes
Where Omni truly becomes unsettling is in its ability to integrate AI-generated elements into real-world footage. By using a neutral selfie video as a seed, the model can place a user in entirely new environments with frightening accuracy. From eating spaghetti to standing before the Eiffel Tower, the results move past the ‘cartoonish’ feel of early generative video and enter the realm of genuine deception.
The tells are there, but they are subtle. The audio of a fork hitting a ceramic bowl sounds slightly synthesized, and background characters may repeat or glitch. However, in a blind test, the realism was enough to fool a spouse of ten years—a high bar for any deepfake. This level of fidelity suggests that we are rapidly approaching a point where social media content can no longer be verified by sight alone.
The Cost of Iteration
This creative power comes with a steep price tag. Omni does not operate on a flat subscription fee but on a credit-based system. Depending on the scene length and complexity, a single generation can cost between 15 and 40 credits. A single round of edits costs another 40.
For users on the $20-per-month AI Pro plan, which provides 1,000 credits, the burn rate is aggressive. After producing roughly 20 clips with a handful of edits, a user can easily find their balance depleted by 85%. For professional creators or those with a specific vision, the cost of the inevitable back-and-forth with the model may make the tool prohibitively expensive for anything beyond short-form experimentation.
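To put the burn rate in concrete terms, the arithmetic can be sketched in a few lines of Python. The credit figures come from the pricing described above; the function name, the constants, and the choice of five edit rounds as a stand-in for "a handful" are illustrative assumptions, not anything Google publishes.

```python
# Figures quoted in the article; names and structure are illustrative only.
MONTHLY_CREDITS = 1000      # AI Pro plan allowance
GEN_COST_RANGE = (15, 40)   # credits per generation, cheapest to priciest
EDIT_COST = 40              # credits per round of edits


def credit_spend(clips: int, edit_rounds: int) -> tuple[int, int]:
    """Return the (lowest, highest) credits spent for the given workload."""
    edits = edit_rounds * EDIT_COST
    low = clips * GEN_COST_RANGE[0] + edits
    high = clips * GEN_COST_RANGE[1] + edits
    return low, high


# Assumption: "a handful of edits" is taken to mean five rounds.
low, high = credit_spend(clips=20, edit_rounds=5)
print(f"{low}-{high} credits, "
      f"{low / MONTHLY_CREDITS:.0%}-{high / MONTHLY_CREDITS:.0%} of the monthly allowance")
```

Under those assumptions, 20 clips and five edit rounds land somewhere between half and all of the monthly allowance, which is consistent with the roughly 85% depletion seen in testing.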