Text-to-Video

Text-to-video: an AI development frame from LEGS

Text-to-video is the use of generative AI models to produce short animated clips directly from a written prompt. In animation production it is used for ideation, mood exploration, and the very first pass of an animatic. Current tools, including Runway, Veo, Sora, and Kling, typically produce clips of four to ten seconds at a time, with shot-to-shot consistency that varies considerably and control that is coarse compared with a hand-keyed sequence. The output is generated, not directed, and what comes back is closer to a moving thumbnail than a delivered shot.

Inside a hybrid production, text-to-video earns its place at the stage where trying out an idea has traditionally cost the most. A director and a writer can generate twenty rough takes on a sequence in an afternoon, narrow them to the strongest three, and bring those three into a creative treatment. The unit cost of an experiment drops far enough that more experiments happen, which usually improves the brief that goes into pre-production. The cost is mainly compute, and the time to feedback is measured in minutes.

On work like LEGS, text-to-video does not appear in the broadcast master. It appears upstream, in the development phase, alongside image-to-video pipelines and traditional sketch animatics. The output informs the boards, the boards inform the shotlist, and the shotlist informs the hand-keyed work that follows. The model is a research partner, not a delivery tool. Used this way, it accelerates the question of what the film is, without changing the answer to how the film is made.

The honest limits are well known and worth naming. Output length caps at a few seconds, character identity drifts shot to shot, hands and faces still misbehave on close-up, and brand-specific look matching is unreliable without further fine-tuning through techniques such as LoRA training. Anchoring the look against a craft principle like staging is the human's job; the model has no opinion on whether a shot reads. We treat the output as reference, never as a delivered shot, and rebuild the hero work inside a traditional pipeline through our hybrid AI animation service.

In production rhythm, text-to-video does the work of a sketch pad and a screening room rolled together. A team can generate, watch, judge, and regenerate inside a single afternoon, which compresses what used to be a multi-day cycle of brief, board, review, revise. The model never tires, so the constraint on iteration becomes attention, not time. The discipline of art direction becomes the binding constraint, because the model will happily produce mediocre work forever if no one decides what good looks like. We pin a master prompt and a style reference at the start of every session so the iteration is convergent rather than wandering.
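The pinning step can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not a description of any studio's tooling: `SessionAnchor`, `Take`, `build_takes`, and `shortlist` are hypothetical names, the style reference is just a string identifier, and no real model API is called. The idea it shows is that the master prompt and style reference are fixed once, each take changes only a short variation clause, and a human score decides which takes survive.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SessionAnchor:
    """Values pinned once per session (hypothetical structure); frozen so they cannot be edited."""
    master_prompt: str    # the subject and look every take must share
    style_reference: str  # assumed to be a path or ID of a style reference image


@dataclass
class Take:
    prompt: str
    style_reference: str
    score: Optional[int] = None  # set by a human reviewer, not by the model


def build_takes(anchor: SessionAnchor, variations: List[str]) -> List[Take]:
    """Compose one prompt per variation; only the variation clause changes."""
    return [
        Take(
            prompt=f"{anchor.master_prompt} {variation.strip()}",
            style_reference=anchor.style_reference,
        )
        for variation in variations
    ]


def shortlist(takes: List[Take], keep: int = 3) -> List[Take]:
    """Return the highest-scored takes; takes nobody has scored are ignored."""
    scored = [t for t in takes if t.score is not None]
    return sorted(scored, key=lambda t: t.score, reverse=True)[:keep]


if __name__ == "__main__":
    anchor = SessionAnchor(
        master_prompt="A paper-cut fox crossing a snowy forest at dusk.",
        style_reference="style_ref_v1.png",  # placeholder identifier
    )
    takes = build_takes(anchor, ["Wide shot, slow push in.", "Low angle, handheld."])
    for take in takes:
        print(take.prompt)
```

Because the anchor is frozen, any change to the look has to be a deliberate new session rather than a quiet mid-session edit, which is one way to keep iteration convergent. The scoring stays with the reviewer, in line with the point that deciding what good looks like is the human's job.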

Myth Labs operates production text-to-video workflows for brand and agency teams who need to test creative concepts at near-broadcast fidelity in days rather than weeks. The wider picture of how artists work alongside these tools is covered in how artists are using AI without losing the craft.

Frequently asked questions

How is text-to-video different from image-to-video?

Text-to-video starts from a prompt only; the model invents the first frame and every frame that follows, which makes for fast ideation but unpredictable look. Image-to-video starts from a fixed reference still, so brand colour, character, and composition stay closer to intent. We use both, picking by where in the pipeline the work sits and how locked the look already is.

Can a text-to-video clip ship as final?

Rarely. Output length, frame consistency, and brand fidelity are all real limits today. For client-facing animatics and concept tests, the fidelity is high enough. For the broadcast master, we rebuild the work in a traditional pipeline, using the text-to-video pass as a locked reference for staging, pacing, and rhythm.

Which models do you reach for first?

We pick by job: Runway and Veo for fast turnaround on quick ideas, Kling when character consistency matters, and Sora-class models where access is granted. The toolkit moves quickly. The pipeline shape (brief, animatic, production, finishing) stays the same.

Who owns the output?

Rights vary by tool and tier, and we track licence terms per model. For brand-safe usage, see the Myth Labs animatics service, which only ships work where the deliverable is cleanly licensed for commercial use.
