A recent fix for MTP (Multi-Token Prediction) models in LLaMA addresses a persistent VRAM leak. The destroy() function did not free the resources allocated for the speculative decoder, the draft context, and the draft model, so that memory stayed allocated across sleep cycles.
- The fix explicitly resets these resources (spec, ctx_dft, and model_dft) in destroy() before resetting the main model initialization, so the GPU memory they hold is released on every sleep cycle.
- Before the fix, VRAM usage grew gradually over time, eventually leading to out-of-memory errors and server crashes.
- The fix was implemented in PR #23461 in the LLaMA GitHub repository, in response to reports from users running MTP models for extended periods.
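
The cleanup order described above can be illustrated with a small, self-contained C++ sketch. This is not the code from the PR: the `gpu_resource` type, the `server_state` struct, the byte sizes, and the global counter are all hypothetical stand-ins chosen for illustration, and only the names spec, ctx_dft, and model_dft come from the description above. It assumes each resource is owned by a `std::unique_ptr`, so calling `reset()` releases it immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in for VRAM accounting (assumption: a real server would query the GPU backend).
static std::size_t g_vram_in_use = 0;

// Hypothetical GPU-backed resource: "allocates" on construction, frees on destruction.
struct gpu_resource {
    std::size_t bytes;
    explicit gpu_resource(std::size_t n) : bytes(n) { g_vram_in_use += n; }
    ~gpu_resource() { g_vram_in_use -= bytes; }
    gpu_resource(const gpu_resource &) = delete;
    gpu_resource & operator=(const gpu_resource &) = delete;
};

// Hypothetical server state; field names mirror those mentioned in the article.
struct server_state {
    std::unique_ptr<gpu_resource> model_main; // main model
    std::unique_ptr<gpu_resource> model_dft;  // draft model
    std::unique_ptr<gpu_resource> ctx_dft;    // draft context
    std::unique_ptr<gpu_resource> spec;       // speculative decoder

    // Wake up: allocate everything (sizes are arbitrary illustration values).
    void load() {
        model_main = std::make_unique<gpu_resource>(800);
        model_dft  = std::make_unique<gpu_resource>(100);
        ctx_dft    = std::make_unique<gpu_resource>(50);
        spec       = std::make_unique<gpu_resource>(10);
    }

    // Behaviour before the fix, as described: only the main model is released.
    void destroy_main_only() {
        model_main.reset();
    }

    // Behaviour after the fix: release the speculative resources first,
    // then the main model.
    void destroy() {
        spec.reset();
        ctx_dft.reset();
        model_dft.reset();
        model_main.reset();
    }
};

// Returns the bytes still allocated while "asleep" after one load/destroy cycle.
std::size_t vram_while_asleep(bool use_fix) {
    server_state s;
    s.load();
    if (use_fix) {
        s.destroy();
    } else {
        s.destroy_main_only();
    }
    const std::size_t asleep = g_vram_in_use;
    s.destroy();                 // full cleanup so the counter returns to zero
    assert(g_vram_in_use == 0);
    return asleep;
}
```

In this toy accounting, `vram_while_asleep(false)` returns 160 (the draft model, draft context, and speculative decoder are still resident), while `vram_while_asleep(true)` returns 0.
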
### Takeaways
- **Resource Management**: Every resource tied to speculative decoding (the speculative decoder, the draft context, and the draft model) must be released during sleep cycles, or VRAM leaks.
- **Testing and Feedback**: The issue was identified through user feedback and testing. Continuous monitoring and reporting are essential for keeping long-running AI models stable.
- **Community Collaboration**: Open-source projects benefit from community contributions such as PRs that improve code quality and functionality.