Are the rich-RAM / poor-GPU people wrong here?



Hello Guys,

I know everyone has their own definition of local models, but for me there are two reasonable types. One is a dense model that fits in around 24GB or 32GB of GPU memory, for the more ‘reasonable’ GPU-rich folks. The other is a Mixture-of-Experts (MoE) model with about 100 billion parameters, which can be run with hybrid offloading on a system with up to 128GB of RAM. That makes a powerful local model accessible and affordable even for those without a high-end GPU who still want it for tasks like tool calling.
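As a rough sanity check on those numbers, here is a back-of-envelope Python sketch of how much memory a ~122B-parameter model needs at common GGUF quantization levels; the bits-per-weight values are approximations I am assuming, not measurements of any particular model.

```python
# Back-of-envelope RAM check for a ~122B-parameter model at common GGUF
# quantization levels. Bits-per-weight figures are rough assumptions for
# typical quants, not measurements of any specific model.

QUANT_BITS = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the weights alone, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for quant, bits in QUANT_BITS.items():
    size = weight_gb(122, bits)
    verdict = "fits" if size < 128 else "does not fit"
    print(f"{quant}: ~{size:.0f} GB weights -> {verdict} in 128GB "
          "(before KV cache and OS overhead)")
```

At Q4 the weights come in around 74GB, which leaves headroom for context; only Q8 overshoots 128GB.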

  • This MoE approach is currently the closest thing rich-RAM people have to a good alternative when they lack GPU resources.
  • For example, Qwen 3.5, with around 122 billion parameters, can be run on a system with 128GB of RAM, making it an attractive option for anyone after a powerful local model (a minimal offloading sketch follows this list).
  • Since there is no higher version like Qwen 3.6, this seems to be the best MoE model available in that parameter range, and rich-RAM folks may indeed have fewer choices than GPU-rich users.
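For the “run it on 128GB of RAM” part, here is a minimal hybrid-offloading sketch using llama-cpp-python, whose `n_gpu_layers` option splits a GGUF model between whatever small GPU you have and system RAM. The model path and layer count are placeholders, not recommendations for any specific model.

```python
# Minimal hybrid-offloading sketch with llama-cpp-python: put a handful of
# layers on a small GPU and serve the rest from system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="models/moe-122b-q4_k_m.gguf",  # hypothetical GGUF file
    n_gpu_layers=8,   # offload only as many layers as your VRAM can hold
    n_ctx=8192,       # context length; raise only if RAM headroom allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Call a weather tool for Berlin."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```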

Someone who recently bought an MSI Strix Halo machine before the RAM apocalypse might find limited use cases for a system with 128GB of RAM. The main benefit is being able to load multiple models efficiently, which can be done through model swapping or a runtime that supports it (see the sketch below).
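As for the model-swapping point, here is a minimal sketch of that pattern, again with llama-cpp-python; the file paths and the swapper class are hypothetical, and runtimes such as Ollama do this kind of on-demand loading and unloading for you.

```python
# Minimal model-swapping sketch: keep only one model resident at a time so
# several large models can share the same 128GB of RAM. Paths are hypothetical.
import gc
from llama_cpp import Llama

class ModelSwapper:
    """Loads a model on demand, freeing the previously loaded one first."""

    def __init__(self) -> None:
        self._llm: Llama | None = None
        self._path: str | None = None

    def get(self, path: str) -> Llama:
        if path != self._path:
            # Drop the old model and reclaim its RAM before loading the new one.
            self._llm = None
            gc.collect()
            self._llm = Llama(model_path=path, n_gpu_layers=0, verbose=False)
            self._path = path
        return self._llm

swapper = ModelSwapper()
coder = swapper.get("models/coder-q4.gguf")  # loads the coder model
chat = swapper.get("models/chat-q4.gguf")    # frees it, then loads the chat model
```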


Originally published at reddit.com. Curated by AI Maestro.
