Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline


By AI Maestro · May 14, 2026 · 3 min read

Shipped this for the AMD x lablab hackathon. The attached video is one of the actual reels the pipeline produced – one English sentence in, a finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality). ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.

Pipeline (8 stages, all sequential on the same GPU):

  • Director Agent – Qwen3.5-35B-A3B (vLLM + AITER MoE) plans 6 shots from one sentence, returns structured JSON with character bibles, shot prompts, music brief, per-shot voice-over script, narration language
  • Character masters – FLUX.2 [klein] paints one canonical portrait per character. No LoRA training step – reference editing pins identity across shots by construction
  • Per-shot keyframes – FLUX.2 again with reference image. Sub-second per keyframe after warmup
  • Animation – Wan2.2-I2V-A14B, 81 frames @ 16 fps native. FLF2V for cut:false continuation arcs (last frame of shot N anchors first frame of shot N+1)
  • Vision critic – same Qwen3.5-35B reloaded with 10 structured failure labels (character drift, extras invade frame, camera ignored, walking backwards, object morphing, hand/finger artifact, wardrobe drift, neon glow leak, stylized AI look, random intimacy). Bad clips re-render with targeted retry strategies (different seed, FLF2V anchor, prompt simplification) (retry loop sketched after this list)
  • Music – ACE-Step v1 generates a 30s instrumental from Director’s brief
  • Narration – Kokoro-82M, 9 languages. Director picks language to match setting (Tokyo→Japanese, Paris→French, Mumbai→Hindi)
  • Mix – ffmpeg with per-shot VO aligned via adelay (mix sketched after this list)
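
Rough shape of the critic/retry loop – a minimal sketch; render_shot(), critic_review() and FAILURE_RETRIES are illustrative names, not the actual functions in the repo:

```python
# Illustrative sketch of the critic-and-retry loop (placeholder helper names).
FAILURE_RETRIES = {
    "character_drift": {"use_flf2v_anchor": True},
    "camera_ignored":  {"simplify_prompt": True},
    "object_morphing": {"new_seed": True},
    # ...the remaining labels of the 10-label taxonomy each map to one of the
    # three strategies: new seed, FLF2V anchor, prompt simplification
}

def render_with_critic(shot, max_attempts=3):
    attempt = {"seed": shot.seed, "prompt": shot.prompt, "anchor": None}
    clip = None
    for _ in range(max_attempts):
        clip = render_shot(shot, **attempt)        # Wan2.2-I2V pass
        verdict = critic_review(clip, shot)        # Qwen vision critic, structured labels
        if verdict["ok"]:
            return clip
        fix = FAILURE_RETRIES.get(verdict["label"], {"new_seed": True})
        if fix.get("new_seed"):
            attempt["seed"] += 1                   # different seed
        if fix.get("simplify_prompt"):
            attempt["prompt"] = shot.short_prompt  # drop secondary clauses
        if fix.get("use_flf2v_anchor"):
            attempt["anchor"] = shot.prev_last_frame  # FLF2V continuation anchor
    return clip                                    # keep the last attempt if all fail
```

And the mix stage is essentially one ffmpeg call per reel – a sketch with placeholder filenames and shot offsets:

```python
import subprocess

# Illustrative mix: video track + music bed + per-shot VO files, each VO
# shifted to its shot start time (ms) with adelay, then summed with amix.
vo_files = ["vo_01.wav", "vo_02.wav", "vo_03.wav"]   # placeholder names
vo_offsets_ms = [0, 7000, 14000]                     # placeholder shot starts

inputs = ["-i", "reel.mp4", "-i", "music.wav"]
filters, labels = [], []
for i, (path, offset) in enumerate(zip(vo_files, vo_offsets_ms)):
    inputs += ["-i", path]
    filters.append(f"[{i + 2}:a]adelay={offset}|{offset}[vo{i}]")
    labels.append(f"[vo{i}]")
filters.append(f"[1:a]{''.join(labels)}amix=inputs={len(labels) + 1}[aout]")

subprocess.run(
    ["ffmpeg", "-y", *inputs, "-filter_complex", ";".join(filters),
     "-map", "0:v", "-map", "[aout]", "-c:v", "copy", "final_reel.mp4"],
    check=True,
)
```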

Wan 2.2 specifics (the bit this sub will care about; a config sketch follows the list):

  • 1280×720, not the 640×640 default. Costs more but matches what producers want
  • 121 frames at 24 fps was my first attempt – gave temporal rippling. Switched to 81 @ 16 fps native (the distribution Wan was trained on) and it cleaned up
  • flow_shift = 5 for hero shots, 8 for b-roll (upstream wan_i2v_A14B.py defaults)
  • Negative prompt: the verbatim Chinese trained negative from shared_config.py. umT5 was multilingual-pretrained against those exact tokens; the English translation is observably weaker
  • Camera language: ONE camera verb per shot, sentence-case, placed first (“Tracking shot following from behind”). Multiple verbs in one prompt cancel each other out
  • Avoid the word “cinematic” – it triggers Wan’s stylization branch and gives the AI look. Use lens/film tags instead (“Arri Alexa, anamorphic, 35mm film grain”)
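
A minimal sketch of those settings, assuming the diffusers WanImageToVideoPipeline (the project itself drives the upstream Wan2.2 scripts; model id, filenames and the negative-prompt placeholder here are illustrative):

```python
import torch
from diffusers import WanImageToVideoPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image

# Placeholder: paste the verbatim Chinese trained negative from shared_config.py
WAN_CHINESE_NEGATIVE = "..."

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")  # ROCm builds of PyTorch expose the MI300X under the cuda namespace
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=5.0  # 5 for hero shots, 8 for b-roll
)

video = pipe(
    image=load_image("shot_03_keyframe.png"),
    # one camera verb, sentence-case, first; lens/film tags instead of "cinematic"
    prompt="Tracking shot following from behind, Arri Alexa, anamorphic, 35mm film grain",
    negative_prompt=WAN_CHINESE_NEGATIVE,
    height=720, width=1280,
    num_frames=81,                         # native training distribution
).frames[0]
export_to_video(video, "shot_03.mp4", fps=16)
```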

Performance work (a sketch follows the list):

  • ParaAttention FBCache (lossless 2× on Wan2.2)
  • torch.compile on transformer_2 (selective, the dual-expert MoE makes full compile flaky) – another 1.2×
  • AITER MoE acceleration on the Qwen director (vLLM)
  • End-to-end: 25.9 min → 10.4 min per 720p clip on MI300X
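
Roughly what the two Wan-side speedups look like, continuing from the pipe in the sketch above; the threshold and compile mode are illustrative and the repo's exact integration may differ:

```python
import torch
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

# FBCache: reuse a step's transformer output when the first block's residual
# barely changed vs. the previous step (~2x on Wan2.2 at a small threshold).
apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)  # illustrative threshold

# Selective torch.compile: only transformer_2 gets compiled; compiling the
# whole dual-expert MoE graph is flaky, and this alone is ~1.2x.
pipe.transformer_2 = torch.compile(
    pipe.transformer_2, mode="max-autotune-no-cudagraphs"
)
```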

Why a single MI300X: 192 GB HBM3 lets a 35B MoE, 4B diffusion, 14B I2V MoE, 3.5B music, and a TTS share the same card sequentially. Same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.
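
The memory-budget pattern is just load → run → free, one stage at a time; a tiny sketch with hypothetical helper names:

```python
import gc
import torch

def run_stage(load_fn, run_fn, *args):
    """Load one stage's model, run it, free HBM before the next stage loads.
    load_fn/run_fn are hypothetical stand-ins for each stage's loader/runner."""
    model = load_fn()                # e.g. FLUX.2 [klein], Wan2.2-I2V, ACE-Step, Kokoro
    try:
        return run_fn(model, *args)
    finally:
        del model
        gc.collect()
        torch.cuda.empty_cache()     # same call on ROCm builds of PyTorch
```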

Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300

Hugging Face Space (documentation; a like on the Space is appreciated 🙏): https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300

The live demo on the HF Space is temporarily offline while the infra is restored – it should be back within hours. In the meantime, the showcase reels in the repo are real pipeline outputs; no shots were re-edited by a human.

Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic’s failure taxonomy in comments.

Key Takeaways

  • A single AMD Instinct MI300X can run the entire end-to-end cinematic pipeline for a short reel, and the optimization work cut per-clip time from 25.9 min to 10.4 min at 720p.
  • Running the models sequentially on one 192 GB card (Qwen director, FLUX.2, Wan2.2-I2V, ACE-Step, Kokoro) avoids sharding the workload across multiple machines.
  • Future work: further model performance improvements and broader narration language support would make the pipeline useful for a wider range of applications.



Originally published at reddit.com. Curated by AI Maestro.
