“`html
A British user has shared their experience with a large-scale language model, specifically the minimax m2.7 q8_0 variant running on two RTX 3090 GPUs and 256GB of DDR4 RAM. The primary focus is using this configuration to achieve a context length of 128k tokens, which they describe as unquantized with a kv cache at the Q8_0 quantization level.
- The model runs at a very slow pace, clocking in around 50 queries per second (QPS) for processing and about 10 QPS for text generation tasks. This is due to running on relatively low-end hardware, specifically an Intel i9-10900X CPU.
- The user highlights the model’s accuracy as their primary concern over speed, noting that they are looking for a balance where performance isn’t “literally all day.”
- They seek recommendations on other models within their constraints and suggestions for optimization techniques. The absence of a draft model for Model-Tooling-Pipeline (MTP) deployment is also mentioned as an issue.
“`
### Takeaways
– This setup demonstrates the current limits of large language models with modest hardware.
– Accuracy remains a priority even at slower processing speeds.
– Users are seeking advice on alternative models and optimizations within their resource constraints.
Originally published at reddit.com. Curated by AI Maestro.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




