MoE Experience on Snapdragon 8 Elite Android Devices
I recently acquired an Honor Magic 7 Pro with a Snapdragon 8 Elite processor and 24GB of RAM. Running large language models (LLMs) on it has been both rewarding and challenging, and the device shows significant potential for local AI models.
Current State of Affairs
Hexagon NPU and OpenCL GPU support are rolling out rapidly, but the fastest prompt processing and token generation remain CPU-based. The CPU generates more heat than the NPU or GPU alternatives, but it is still the quickest option available.
RAM Constraints
No Android device offers 32GB of physical RAM; phones advertised with that much rely on a virtual memory extension, which doesn't work for large language models (LLMs). The best configuration you can get is therefore 24GB of RAM. This limits your options significantly, but it is still a substantial step up from standard smartphones.
Recommended Models
I’ve tested several MoE (Mixture of Experts) models and found that they offer a good balance of speed, quality, and size. Here are some recommendations:
- Qwen3.6/3.5-35b-A3B: My preferred choice due to its performance.
- Qwen3-30b-a3b-2507: Offers better overall capabilities without running into memory issues.
- Gemma-4-a4b-26b, LFM-24b-a2b, GPT-OSS-20B: These models are popular for their intelligence and speed. However, I recommend avoiding GPT-OSS, as it is censored and too restrictive.
- LFM-24b-a2b: One of the fastest and smallest models available, with remarkable performance for its size.
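The model names above encode their size: a tag like "30b" is the total parameter count in billions, and a tag like "a3b" is the number of parameters active per token. A minimal sketch of reading those tags follows. The helper name `parse_param_counts` and the regular expressions are illustrative assumptions, not part of any library, and a name without an "a<N>b" tag does not prove the model is dense.

```python
import re
from typing import Optional, Tuple

# Assumed naming convention: "<N>b" is the total parameter count in billions,
# "a<N>b" is the active parameter count per token. Not every release follows it.
_ACTIVE = re.compile(r"(?<![a-z0-9.])a(\d+(?:\.\d+)?)b(?![a-z0-9])")
_TOTAL = re.compile(r"(?<![a-z0-9.])(\d+(?:\.\d+)?)b(?![a-z0-9])")


def parse_param_counts(name: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (total_billions, active_billions); None where the name has no such tag."""
    lower = name.lower()
    total = _TOTAL.search(lower)
    active = _ACTIVE.search(lower)
    return (
        float(total.group(1)) if total else None,
        float(active.group(1)) if active else None,
    )


if __name__ == "__main__":
    for model in ("Qwen3-30b-a3b-2507", "Gemma-4-a4b-26b", "LFM-24b-a2b", "Phi-4-14b"):
        print(model, parse_param_counts(model))
```

The ratio of active to total parameters is a rough hint at why a 30B MoE model can generate tokens faster than a 14B dense one: far fewer weights are read per token.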
Prompt Processing Speed
Prompt processing speed varies widely between models. To give you an idea: a Q4_K_M quant on the 24GB device reaches about 55 prompt processing (PP) tokens per second, while Phi-4-14b, a dense model, manages around 13 PP tokens per second in the same configuration.
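To put those two rates in perspective, here is a small worked calculation of how long a prompt takes to process before the first token appears. The 2000-token prompt length and the function name `prompt_seconds` are assumptions chosen for illustration; only the rates of roughly 55 and 13 PP tokens per second come from the figures above.

```python
def prompt_seconds(prompt_tokens: int, pp_tokens_per_second: float) -> float:
    """Time to process a prompt, assuming a constant prompt processing rate."""
    if pp_tokens_per_second <= 0:
        raise ValueError("pp_tokens_per_second must be positive")
    return prompt_tokens / pp_tokens_per_second


if __name__ == "__main__":
    prompt_tokens = 2000  # hypothetical prompt length
    for label, rate in (("about 55 PP tok/s", 55.0), ("about 13 PP tok/s (Phi-4-14b)", 13.0)):
        seconds = prompt_seconds(prompt_tokens, rate)
        print(f"{label}: {seconds:.0f} s for a {prompt_tokens}-token prompt")
```

At these rates the same prompt takes roughly 36 seconds versus roughly 154 seconds, so the gap matters most for long contexts. Real rates usually drop as the context grows, so treat this as a lower bound.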
Future Improvements and Recommendations
To get the best performance, I recommend keeping your models below 75% of your total system RAM. Dense models (e.g., 14B) are notably slower than larger MoE models (e.g., 20-30B), because an MoE model only activates a few billion parameters per token. I'm also interested in exploring more A2B and A1B models of up to 30B total parameters. If you have specific questions or want me to test a certain model or configuration, let me know.
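The 75% guideline can be checked before downloading a model. The sketch below estimates a quantized model's size from its parameter count and compares it with the RAM budget. The figure of about 4.8 bits per weight for Q4_K_M is a rough assumption, as are the function names; the estimate ignores the KV cache and runtime overhead, so the actual file size is the better input when you have it.

```python
GIB = 2**30


def estimate_model_gib(total_params_b: float, bits_per_weight: float) -> float:
    """Rough weight size in GiB for a model with total_params_b billion parameters."""
    total_bytes = total_params_b * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def fits_ram_budget(model_gib: float, total_ram_gib: float, max_fraction: float = 0.75) -> bool:
    """True if the model stays within max_fraction of total system RAM."""
    return model_gib <= total_ram_gib * max_fraction


if __name__ == "__main__":
    ram_gib = 24.0            # device RAM from the article
    assumed_bpw = 4.8         # assumption: approximate Q4_K_M bits per weight
    size = estimate_model_gib(30, assumed_bpw)
    print(f"30B model at ~{assumed_bpw} bpw: {size:.1f} GiB, "
          f"budget {ram_gib * 0.75:.1f} GiB, fits: {fits_ram_budget(size, ram_gib)}")
```

Under these assumptions a 30B model lands just under the 18 GiB budget of a 24GB device, which leaves little room for long contexts.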
Key Takeaways
- Use MoE models like Qwen3 and Gemma for the best performance.
- Keep model sizes below 75% of total system RAM.
- Avoid GPT-OSS due to its restrictive nature.
- Explore more A2B and A1B models of up to 30B total parameters for even better performance.
Note: The provided test results are illustrative; actual values may vary based on the specific model, quantization level, and system configuration.
Originally published at reddit.com. Curated by AI Maestro.