**What Happened:**
A Reddit user, /u/DiscipleofDeceit666, shared a custom build that bypasses an assert statement in the stock release of AMD Radeon RDNA2 GPUs. This workaround enables Flash Attention for the ROCM LLaMA cpp model, significantly improving its performance from 30 tokens per second (tok/s) to around 70-80 tok/s.
**Why It Matters:**
This discovery highlights a critical issue with the current state of AI models on AMD hardware, specifically regarding the `max_blocks_per_sm` assertion. The user found that by enabling Flash Attention through this custom build, they could achieve substantial performance gains. This finding is important for both researchers and developers working in the field of AI, as it reveals potential limitations in the current setup and suggests areas where improvements are needed.
– The workaround bypasses a known issue causing crashes when attempting to run Flash Attention on AMD GPUs.
– It demonstrates that significant optimizations can be made by custom-builds tailored to specific hardware configurations.
– This insight could lead to more robust and efficient AI models running on AMD RDNA2 GPUs, potentially addressing bottlenecks and enhancing overall performance.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




