Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline


By AI Maestro · May 14, 2026 · 3 min read

Shipped this for the AMD x lablab hackathon. The attached video is one of the actual reels the pipeline produced – one English sentence in, a finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality). ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.

Pipeline (8 stages, all sequential on the same GPU):

  • Director Agent – Qwen3.5-35B-A3B (vLLM + AITER MoE) plans 6 shots from one sentence, returns structured JSON with character bibles, shot prompts, music brief, per-shot voice-over script, narration language
  • Character masters – FLUX.2 [klein] paints one canonical portrait per character. No LoRA training step – reference editing pins identity across shots by construction
  • Per-shot keyframes – FLUX.2 again with reference image. Sub-second per keyframe after warmup
  • Animation – Wan2.2-I2V-A14B, 81 frames @ 16 fps native. FLF2V for cut:false continuation arcs (last frame of shot N anchors first frame of shot N+1)
  • Vision critic – same Qwen3.5-35B reloaded with 10 structured failure labels (character drift, extras invade frame, camera ignored, walking backwards, object morphing, hand/finger artifact, wardrobe drift, neon glow leak, stylized AI look, random intimacy). Bad clips re-render with targeted retry strategies (different seed, FLF2V anchor, prompt simplification) (retry loop sketched after this list)
  • Music – ACE-Step v1 generates a 30s instrumental from Director’s brief
  • Narration – Kokoro-82M, 9 languages. Director picks language to match setting (Tokyo→Japanese, Paris→French, Mumbai→Hindi)
  • Mix – ffmpeg with per-shot VO aligned via adelay (mix sketched after this list)
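
Rough shape of the critic/retry loop – a minimal sketch; render_shot(), critic_review() and FAILURE_RETRIES are illustrative names, not the actual functions in the repo:

```python
# Illustrative sketch of the critic-and-retry loop (placeholder helper names).
FAILURE_RETRIES = {
    "character_drift": {"use_flf2v_anchor": True},
    "camera_ignored":  {"simplify_prompt": True},
    "object_morphing": {"new_seed": True},
    # ...the remaining labels of the 10-label taxonomy each map to one of the
    # three strategies: new seed, FLF2V anchor, prompt simplification
}

def render_with_critic(shot, max_attempts=3):
    attempt = {"seed": shot.seed, "prompt": shot.prompt, "anchor": None}
    clip = None
    for _ in range(max_attempts):
        clip = render_shot(shot, **attempt)        # Wan2.2-I2V pass
        verdict = critic_review(clip, shot)        # Qwen vision critic, structured labels
        if verdict["ok"]:
            return clip
        fix = FAILURE_RETRIES.get(verdict["label"], {"new_seed": True})
        if fix.get("new_seed"):
            attempt["seed"] += 1                   # different seed
        if fix.get("simplify_prompt"):
            attempt["prompt"] = shot.short_prompt  # drop secondary clauses
        if fix.get("use_flf2v_anchor"):
            attempt["anchor"] = shot.prev_last_frame  # FLF2V continuation anchor
    return clip                                    # keep the last attempt if all fail
```

And the mix stage is essentially one ffmpeg call per reel – a sketch with placeholder filenames and shot offsets:

```python
import subprocess

# Illustrative mix: video track + music bed + per-shot VO files, each VO
# shifted to its shot start time (ms) with adelay, then summed with amix.
vo_files = ["vo_01.wav", "vo_02.wav", "vo_03.wav"]   # placeholder names
vo_offsets_ms = [0, 7000, 14000]                     # placeholder shot starts

inputs = ["-i", "reel.mp4", "-i", "music.wav"]
filters, labels = [], []
for i, (path, offset) in enumerate(zip(vo_files, vo_offsets_ms)):
    inputs += ["-i", path]
    filters.append(f"[{i + 2}:a]adelay={offset}|{offset}[vo{i}]")
    labels.append(f"[vo{i}]")
filters.append(f"[1:a]{''.join(labels)}amix=inputs={len(labels) + 1}[aout]")

subprocess.run(
    ["ffmpeg", "-y", *inputs, "-filter_complex", ";".join(filters),
     "-map", "0:v", "-map", "[aout]", "-c:v", "copy", "final_reel.mp4"],
    check=True,
)
```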

Wan 2.2 specifics (the bit this sub will care about; a config sketch follows the list):

  • 1280×720, not the 640×640 default. Costs more but matches what producers want
  • 121 frames at 24 fps was my first attempt – gave temporal rippling. Switched to 81 @ 16 fps native (the distribution Wan was trained on) and it cleaned up
  • flow_shift = 5 for hero shots, 8 for b-roll (upstream wan_i2v_A14B.py defaults)
  • Negative prompt: the verbatim Chinese trained negative from shared_config.py. umT5 was multilingual-pretrained against those exact tokens; the English translation is observably weaker
  • Camera language: ONE camera verb per shot, sentence-case, placed first (“Tracking shot following from behind”). Multiple verbs in one prompt cancel each other out
  • Avoid the word “cinematic” – it triggers Wan’s stylization branch and gives the AI look. Use lens/film tags instead (“Arri Alexa, anamorphic, 35mm film grain”)
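
A minimal sketch of those settings, assuming the diffusers WanImageToVideoPipeline (the project itself drives the upstream Wan2.2 scripts; model id, filenames and the negative-prompt placeholder here are illustrative):

```python
import torch
from diffusers import WanImageToVideoPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image

# Placeholder: paste the verbatim Chinese trained negative from shared_config.py
WAN_CHINESE_NEGATIVE = "..."

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")  # ROCm builds of PyTorch expose the MI300X under the cuda namespace
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=5.0  # 5 for hero shots, 8 for b-roll
)

video = pipe(
    image=load_image("shot_03_keyframe.png"),
    # one camera verb, sentence-case, first; lens/film tags instead of "cinematic"
    prompt="Tracking shot following from behind, Arri Alexa, anamorphic, 35mm film grain",
    negative_prompt=WAN_CHINESE_NEGATIVE,
    height=720, width=1280,
    num_frames=81,                         # native training distribution
).frames[0]
export_to_video(video, "shot_03.mp4", fps=16)
```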

Performance work (a sketch follows the list):

  • ParaAttention FBCache (lossless 2× on Wan2.2)
  • torch.compile on transformer_2 (selective, the dual-expert MoE makes full compile flaky) – another 1.2×
  • AITER MoE acceleration on the Qwen director (vLLM)
  • End-to-end: 25.9 min → 10.4 min per 720p clip on MI300X
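
Roughly what the two Wan-side speedups look like, continuing from the pipe in the sketch above; the threshold and compile mode are illustrative and the repo's exact integration may differ:

```python
import torch
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

# FBCache: reuse a step's transformer output when the first block's residual
# barely changed vs. the previous step (~2x on Wan2.2 at a small threshold).
apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)  # illustrative threshold

# Selective torch.compile: only transformer_2 gets compiled; compiling the
# whole dual-expert MoE graph is flaky, and this alone is ~1.2x.
pipe.transformer_2 = torch.compile(
    pipe.transformer_2, mode="max-autotune-no-cudagraphs"
)
```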

Why a single MI300X: 192 GB HBM3 lets a 35B MoE, 4B diffusion, 14B I2V MoE, 3.5B music, and a TTS share the same card sequentially. Same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.
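
The memory-budget pattern is just load → run → free, one stage at a time; a tiny sketch with hypothetical helper names:

```python
import gc
import torch

def run_stage(load_fn, run_fn, *args):
    """Load one stage's model, run it, free HBM before the next stage loads.
    load_fn/run_fn are hypothetical stand-ins for each stage's loader/runner."""
    model = load_fn()                # e.g. FLUX.2 [klein], Wan2.2-I2V, ACE-Step, Kokoro
    try:
        return run_fn(model, *args)
    finally:
        del model
        gc.collect()
        torch.cuda.empty_cache()     # same call on ROCm builds of PyTorch
```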

Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300

Hugging Face Space (documentation; a like on the Space is appreciated 🙏): https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300

The live demo on the HF Space is temporarily offline while the infra is restored – it should be back within hours. In the meantime, the showcase reels in the repo are real pipeline outputs; no shots were re-edited by a human.

Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic’s failure taxonomy in comments.

Key Takeaways

  • A single AMD Instinct MI300X can run the entire end-to-end cinematic pipeline for a short reel, and the optimization work cut per-clip time from 25.9 min to 10.4 min at 720p.
  • Running the models sequentially on one 192 GB card (Qwen director, FLUX.2, Wan2.2-I2V, ACE-Step, Kokoro) avoids sharding the workload across multiple machines.
  • Future work: further model performance improvements and broader narration language support would make the pipeline useful for a wider range of applications.



Originally published at reddit.com. Curated by AI Maestro.
