Llama.cpp VS LiteRT on a custom Xiaomi 12 Pro 24/7 Server (V2 Redesign)

https://preview.redd.it/sm4ysgdw1w2h1.png?width=1376&format=png&auto=webp&s=3705932403919814fbf2008a1cba189d17e0591e

Thanks everyone for the advice on my previous post (24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/Gemma4)). You really inspired me, and I completely redesigned the cooling and power supply for this setup.

What’s new:

Cooling: Installed a copper heatsink with a fan on the back. On the front, I removed the screen and mounted the device directly onto an aluminum plate with 2 fans using a thermal pad. The cooling now turns on at 40°C and shuts off at 35°C.
Power Supply: Built a custom, fully safe PSU. I took apart the battery and wired the PSU directly to the battery’s BMS via a capacitor. Added 2 fuses (input/output), a crowbar circuit at 4.3V to protect the phone, and a backup fan for the PSU itself (though after a week of testing, I barely needed it since it doesn’t get that hot).
Housing: 3D-printed a custom case, built a stand out of aluminum extrusions, and routed an external power button.
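
For anyone curious about the fan logic: the 40°C on / 35°C off behavior is a plain two-threshold (hysteresis) switch. Below is a minimal Python sketch of that idea. The function name and the sample temperatures are made up for illustration, and this is not the code actually running on the build, just the principle.

```python
def fan_should_run(temp_c: float, fan_on: bool,
                   on_at: float = 40.0, off_at: float = 35.0) -> bool:
    """Return the new fan state for a two-threshold (hysteresis) switch.

    Hypothetical helper: the 40/35 thresholds come from the post,
    everything else is an illustrative assumption.
    """
    if temp_c >= on_at:
        return True      # hot enough: fans on
    if temp_c <= off_at:
        return False     # cooled down: fans off
    return fan_on        # between 35 and 40: keep the previous state


if __name__ == "__main__":
    state = False
    # Made-up temperature readings, not measured data.
    for t in [33, 38, 40, 41, 38, 36, 35, 34, 39]:
        state = fan_should_run(t, state)
        print(f"{t} C -> fan {'on' if state else 'off'}")
```

The 5°C gap is what keeps the fans from rapidly toggling when the temperature hovers around a single threshold.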

Here is how it looks now:

https://preview.redd.it/z17nqy6w2w2h1.jpg?width=3072&format=pjpg&auto=webp&s=09c02d18e53d2771383ae85f35796150ed8b91d8

https://reddit.com/link/1tlgxms/video/ul2iivua3w2h1/player

https://reddit.com/link/1tlgxms/video/xiuyt9wk3w2h1/player

Benchmarks (gemma-4-E4B):
(Prompt: “Write 2000 words IT essay”)

Llama.cpp

https://reddit.com/link/1tlgxms/video/v0t8t5n54w2h1/player

Speed: Prompt: 30.6 t/s | Generation: 5.7 t/s
The CPU load is pretty "gentle," and the PSU shows a lower amp draw.
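
To put the generation speed in perspective, here is a rough back-of-the-envelope estimate of how long a 2000-word answer takes at 5.7 t/s. The 1.3 tokens-per-word ratio is an assumption (a common rule of thumb for English text, not something measured here), and the function name is hypothetical.

```python
def generation_seconds(words: int, tokens_per_second: float,
                       tokens_per_word: float = 1.3) -> float:
    """Estimate wall-clock generation time for a text of `words` words.

    tokens_per_word is an assumed average, not a measured value.
    """
    tokens = words * tokens_per_word
    return tokens / tokens_per_second


if __name__ == "__main__":
    secs = generation_seconds(2000, 5.7)  # 5.7 t/s is the measured generation speed
    print(f"about {secs:.0f} s ({secs / 60:.1f} min)")
```

Under that assumption, the essay works out to roughly 2600 tokens, or about seven and a half minutes of generation.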

https://preview.redd.it/l0wnc1xo4w2h1.jpg?width=2937&format=pjpg&auto=webp&s=d426d9edb9e3801e0a9a487aa4cc729aa7da4dcd

LiteRT (by Google)

https://reddit.com/link/1tlgxms/video/1cbz7rk85w2h1/player

https://preview.redd.it/dh7lc91d5w2h1.png?width=1804&format=png&auto=webp&s=5aacb2bdbcd135e79cfe20afda44009a3896ce83

Slightly faster generation, but it maxes out the CPUs, and the amp draw is noticeably higher.

https://preview.redd.it/avfhuxlg5w2h1.jpg?width=2693&format=pjpg&auto=webp&s=3f5e143df4f192225e84e10738c7673f6394b948

GPU Struggles

I tried running LiteRT on the GPU, but unfortunately, Google AI Edge hasn’t released an APK for my Snapdragon 8 Gen 1. Swapping library files from the Qualcomm site didn’t work either. I also tried running a Vulkan build of llama.cpp but ran into issues. I’ll post updated benchmarks once I manage to get it working.

Conclusion

If anyone asks if it was worth it: If you have a powerful spare phone lying around and want a great DIY project, definitely yes. But if you just need an LLM server and don’t want the hassle, you’re better off just buying a Mini PC.

Thanks again to this sub for the inspiration. I wouldn’t have committed to such a massive rebuild without your feedback!

Key Takeaways

On the custom Xiaomi 12 Pro server (Snapdragon 8 Gen 1), LiteRT generated slightly faster than Llama.cpp, which measured 30.6 t/s for prompt processing and 5.7 t/s for generation.
That small speed gain came at a cost: LiteRT maxed out the CPUs and drew noticeably more current, while Llama.cpp kept the CPU load and amp draw lower.
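Running LiteRT on the GPU did not work: Google AI Edge has not released an APK for the Snapdragon 8 Gen 1, and swapping in library files from the Qualcomm site did not help. A Vulkan build of llama.cpp also ran into issues.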

Note: The above key takeaways are based on the benchmarks provided and may not reflect all possible scenarios.
