“`html
A user shared details about running a large-scale language model, specifically the minimax m2.7 q8_0 128k variant on two NVIDIA RTX 3090 GPUs with 256GB of DDR4 RAM.
- The CPU is an old 10900x processor, used as a secondary component for this task.
- Context length is set to 128k tokens, and the model operates without quantization on its key-value cache.
- Accuracy is prioritized over speed; the user aims for usable performance rather than high throughput.
The model’s execution speed is relatively slow at approximately 50 token-per-second (TPS) per process and around 10 TPS for generating text. Despite this, it’s considered usable for tasks like coding agent workflows.
- Some users are running similar models on low-end hardware, which might be of interest to the community.
- The user seeks recommendations for other models within their constraints and suggestions for further optimizations.
- They also mention wishing they could access a draft model for MTP (Model Training Pipeline) as it was declined for this size class.
“`
Originally published at reddit.com. Curated by AI Maestro.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.



![Would a new result in pre-print be considered by reviewers? [D]](https://ai-maestro.online/wp-content/uploads/2026/05/would-a-new-result-in-pre-print-be-considered-by-reviewers-d-768x768.jpg)
