Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup


By AI Maestro · May 15, 2026 · 2 min read


I’ve been closely observing the edge AI ecosystem for a long time now. It’s an area where I believe there is immense potential, and where I truly see AI becoming more useful for everyday tasks.

Background and Problem

Recently, I started experimenting with local AI solutions. Even for the smaller Gemma 3 variants, memory usage was unacceptable on my flagship Samsung phone: the app would sometimes crash when the OS ran low on memory, and the device got noticeably hot.

I had been running Gemma 3 through a React Native bridge to llama.cpp, which used around 4-5 GB of memory for every inference, and even the idle model held about 1 GB until I released it. Keeping the model resident while idle meant steadily growing memory pressure, which would eventually crash the app.
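For context, a React Native bridge to llama.cpp looks roughly like this sketch. It assumes the llama.rn binding (the actual bridge isn't named above), and the model path and options are illustrative, not my exact setup:

```ts
import { initLlama, LlamaContext } from 'llama.rn';

let context: LlamaContext | null = null;

// Loading maps several GB of weights into memory; this is where the
// 4-5 GB peak came from.
async function loadGemma3(): Promise<LlamaContext> {
  context = await initLlama({
    model: 'file:///data/local/models/gemma-3.gguf', // hypothetical path
    n_ctx: 2048,
  });
  return context;
}

export async function generate(prompt: string): Promise<string> {
  const ctx = context ?? (await loadGemma3());
  const { text } = await ctx.completion({ prompt, n_predict: 256 });
  return text;
}

// Without an explicit release, the idle model kept holding ~1 GB.
export async function releaseModel(): Promise<void> {
  await context?.release();
  context = null;
}
```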

Gemma 4 and LiteRT-LM

When I tried Gemma 4 through the AI Edge Gallery, two things stood out:

  • The speed difference between CPU and GPU is enormous.
  • The model’s response was quick, and memory jumps were barely noticeable on my phone.

This led me to explore LiteRT-LM. It seems highly optimized for edge AI tasks.

Implementing LiteRT-LM

To get it working, I had to write native modules for both Android and iOS (in Objective-C, since a Swift API isn't available yet).
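LiteRT-LM has no React Native binding, so JavaScript only sees whatever those native modules register with the bridge. A minimal sketch of the TypeScript side, assuming a hypothetical native module named LiteRtLm (the real method names are whatever your Kotlin and Objective-C code exposes):

```ts
import { NativeModules } from 'react-native';

// Hypothetical interface; the real one is defined by the native modules.
interface LiteRtLmModule {
  loadModel(path: string, preferGpu: boolean): Promise<void>;
  generate(prompt: string, maxTokens: number): Promise<string>;
  release(): Promise<void>;
}

export const LiteRtLm = NativeModules.LiteRtLm as LiteRtLmModule;

// The native side hands the call to LiteRT-LM, which picks its backend
// (GPU where available) when the model is loaded.
export async function askModel(prompt: string): Promise<string> {
  return LiteRtLm.generate(prompt, 256);
}
```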

The memory footprint is around 1.5GB to 2GB. The oldest phone I tested this on was an iPhone 13 Pro Max.

Challenges

I don't like the fact that you need to release the model to recover its memory while it's idle. Reloading isn't too expensive once the runtime has chosen its preferred backend, but startup could still be faster for users.
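One way to work around this is to release the model whenever the app is backgrounded and lazily reload it on the next request. A sketch using React Native's AppState and the hypothetical LiteRtLm wrapper from above:

```ts
import { AppState } from 'react-native';
import { LiteRtLm } from './liteRtLm'; // hypothetical wrapper from the sketch above

let loaded = false;

async function ensureLoaded(): Promise<void> {
  if (!loaded) {
    // Pays the startup cost (including backend selection) on first use.
    await LiteRtLm.loadModel('gemma-4.litertlm', /* preferGpu */ true); // illustrative path
    loaded = true;
  }
}

export async function generate(prompt: string): Promise<string> {
  await ensureLoaded();
  return LiteRtLm.generate(prompt, 256);
}

// Free the 1.5-2 GB footprint as soon as the app is backgrounded.
AppState.addEventListener('change', async (state) => {
  if (state === 'background' && loaded) {
    await LiteRtLm.release();
    loaded = false;
  }
});
```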

Current Use Case

I have a strength-tracking mobile app where I use this model for:

  • Routine generation
  • Performance checks during workouts
  • Follow-up suggestions after workouts

Each inference call takes about 2-4 seconds on GPU. Adding one or two more calls (or running them on CPU) extends the total to around 3-6 seconds.
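The follow-up flow, for example, chains two calls, so the latencies add up. An illustrative sketch using the generate wrapper from above (the prompts are made up for the example):

```ts
import { generate } from './inference'; // the wrapper sketched earlier

// Two chained calls: a performance check, then a follow-up suggestion.
// On GPU, the total lands in the 3-6 second range described above.
export async function postWorkoutFollowUp(workoutLog: string): Promise<string> {
  const start = Date.now();
  const assessment = await generate(`Assess this workout:\n${workoutLog}`);
  const suggestion = await generate(
    `Based on this assessment, suggest adjustments for the next session:\n${assessment}`
  );
  console.log(`total inference time: ${Date.now() - start} ms`);
  return suggestion;
}
```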

Next Steps

  • Image recognition for exercises, though Gemma has proven challenging in this area; perhaps with good prompting we can achieve something useful.
  • On-the-spot workout generation.

To sum up: I’ve had a great experience with the model and framework. I hope they continue to release updates with smaller models!

Key Takeaways

  • LiteRT-LM is highly optimized for edge AI tasks.
  • Gemma 4 offers significant improvements in memory and performance compared to previous versions.
  • The native module development process can be challenging but necessary for optimal integration on various platforms.
  • Image recognition remains a challenge, but careful prompting may make it workable.

Originally published at reddit.com. Curated by AI Maestro.
