Shipped this for the AMD x lablab hackathon. The attached video is one of the actual reels the pipeline produced: one English sentence in, a finished mp4 with characters, story, music, and voice-over out (a fast demo video, not the best quality). Roughly 45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT licensed. The pipeline runs in 8 stages, all sequential on the same GPU.
**Wan 2.2 specifics** (the bit this sub will care about; a hedged config sketch follows at the end of this post):

- 1280×720, not the 640×640 default. Costs more but matches what producers want.
- 121 frames at 24 fps was my first attempt and gave temporal rippling. Switching to the native 81 frames @ 16 fps (the distribution Wan was trained on) cleaned it up.
- flow_shift = 5 for hero shots, 8 for b-roll (the upstream wan_i2v_A14B.py defaults).
- Negative prompt: the verbatim Chinese trained negative from shared_config.py. umT5 was multilingual-pretrained against those exact tokens; the English translation is observably weaker.
- Camera language: ONE camera verb per shot, sentence-case, placed first ("Tracking shot following from behind"). Multiple verbs in one prompt cancel each other out.
- Avoid the word "cinematic": it triggers Wan's stylization branch and gives the AI look. Use lens/film tags instead ("Arri Alexa, anamorphic, 35mm film grain").

**Performance work** (sketches of each optimization also follow below):

- ParaAttention FBCache (lossless 2× on Wan2.2)
- torch.compile on transformer_2 (selective; the dual-expert MoE makes a full compile flaky), another 1.2×
- AITER MoE acceleration on the Qwen director (vLLM)
- End-to-end: 25.9 min → 10.4 min per 720p clip on the MI300X

**Why a single MI300X:** 192 GB of HBM3 lets a 35B MoE, a 4B diffusion model, a 14B I2V MoE, a 3.5B music model, and a TTS share the same card sequentially. The same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.

Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300

Hugging Face (documentation, like the Space 🙏): https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300

The live demo on the HF Space is temporarily offline while infra restores; it should be back within hours. In the meantime, the showcase reels in the repo are real pipeline outputs with no human re-edited shots. Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic's failure taxonomy in the comments.
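For concreteness, here is a minimal sketch of those Wan 2.2 settings expressed through the diffusers WanImageToVideoPipeline. The model id, keyframe path, prompt text, and the (truncated) negative-prompt string are illustrative assumptions; the repo drives Wan through its own wan_i2v_A14B.py rather than this exact API.

```python
# Minimal sketch of the sampling settings above, assuming the diffusers
# Wan 2.2 I2V pipeline. Model id, paths, and prompts are illustrative.
import torch
from diffusers import UniPCMultistepScheduler, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# flow_shift = 5 for hero shots; use 8 for b-roll
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=5.0
)

# Stock Wan Chinese negative prompt (truncated here); umT5 saw these
# exact tokens in pretraining, so an English translation scores weaker
negative = "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量"

frames = pipe(
    image=load_image("keyframe.png"),  # FLUX.2 [klein] character keyframe
    # ONE camera verb, sentence-case, placed first; lens/film tags
    # instead of the word "cinematic"
    prompt=(
        "Tracking shot following from behind. A lone rider crests a dune "
        "at dusk. Arri Alexa, anamorphic, 35mm film grain"
    ),
    negative_prompt=negative,
    height=720,
    width=1280,     # 1280x720, not the 640x640 default
    num_frames=81,  # native training distribution, avoids temporal rippling
).frames[0]

export_to_video(frames, "shot.mp4", fps=16)  # 81 frames @ 16 fps native
```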
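A sketch of how the two Wan-side speedups typically attach to a diffusers-style pipeline. The residual_diff_threshold value and the compile mode are assumptions about the tuning, not values taken from the repo.

```python
# Hedged sketch: FBCache plus selective torch.compile on the pipe above.
import torch
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

# FBCache: when the first transformer block's residual barely changes
# between denoising steps, reuse the cached output of the remaining
# blocks instead of recomputing them (~2x, effectively lossless on Wan2.2)
apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)  # threshold assumed

# Compile only transformer_2 (one of the two experts); compiling the full
# dual-expert MoE was flaky, and this selective pass gave another ~1.2x
pipe.transformer_2 = torch.compile(
    pipe.transformer_2, mode="max-autotune-no-cudagraphs"
)
```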
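For the Qwen director stage, AITER kernels are enabled through vLLM's ROCm environment switches. The variable names below exist in recent vLLM ROCm builds but should be checked against your version; the model id and prompt are placeholders, not the repo's actual director.

```python
# Hedged sketch: AITER-accelerated MoE inference for the director stage.
import os

# Must be set before vLLM initializes; names per recent vLLM ROCm builds
os.environ["VLLM_ROCM_USE_AITER"] = "1"      # master switch for AITER kernels
os.environ["VLLM_ROCM_USE_AITER_MOE"] = "1"  # fused-MoE path

from vllm import LLM, SamplingParams

# Placeholder model id; the post's director is a 35B MoE Qwen
director = LLM(model="Qwen/Qwen3-30B-A3B")
out = director.generate(
    ["Expand this logline into an 8-shot storyboard with camera verbs: ..."],
    SamplingParams(max_tokens=1024, temperature=0.7),
)
print(out[0].outputs[0].text)
```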
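Finally, the single-card claim rests on a simple residency pattern: each stage loads its model, runs, and frees VRAM before the next stage starts. A generic sketch of that pattern, with hypothetical stage functions:

```python
# Hedged sketch of sequential model residency on one 192 GB card.
import gc
import torch

def run_stage(load_fn, run_fn, *args):
    """Load one stage's model, run it, and free VRAM for the next stage."""
    model = load_fn()         # director / keyframes / I2V / music / TTS
    result = run_fn(model, *args)
    del model                 # drop the weights...
    gc.collect()
    torch.cuda.empty_cache()  # ...and return VRAM (works on ROCm as well)
    return result
```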
Key Takeaways
- A single AMD Instinct MI300X can run the entire end-to-end cinematic pipeline, and the optimization work cut per-clip time from 25.9 to 10.4 minutes for a 720p clip.
- Running the models (Qwen director, FLUX.2 [klein], Wan2.2-I2V, music, TTS) sequentially on one 192 GB card avoids sharding the workload across multiple machines.
- Future work could focus on improving model performance and expanding narration language support, making the pipeline more versatile across applications.
Originally published at reddit.com. Curated by AI Maestro.

![Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline](https://ai-maestro.online/wp-content/uploads/2026/05/built-an-open-source-one-prompt-to-cinematic-reel-pipeline-o-1024x576.jpg)


