Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4

“`html

A user shared details about running a large-scale language model, specifically the minimax m2.7 q8_0 128k variant on two NVIDIA RTX 3090 GPUs with 256GB of DDR4 RAM.

The CPU is an old 10900x processor, used as a secondary component for this task.
Context length is set to 128k tokens, and the model operates without quantization on its key-value cache.
Accuracy is prioritized over speed; the user aims for usable performance rather than high throughput.

The model’s execution speed is relatively slow at approximately 50 token-per-second (TPS) per process and around 10 TPS for generating text. Despite this, it’s considered usable for tasks like coding agent workflows.

Some users are running similar models on low-end hardware, which might be of interest to the community.
The user seeks recommendations for other models within their constraints and suggestions for further optimizations.
They also mention wishing they could access a draft model for MTP (Model Training Pipeline) as it was declined for this size class.

“`

Source Read original →