Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4

A British user has shared their experience running a large language model, specifically the MiniMax M2.7 Q8_0 variant, on two RTX 3090 GPUs and 256 GB of DDR4 RAM. The main goal of this configuration is a context length of 128k tokens, which they describe as unquantized with a KV cache at the Q8_0 quantization level.
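
To get a feel for why this configuration leans so heavily on system RAM, here is a rough memory-budget sketch. It relies on two facts: an RTX 3090 has 24 GB of VRAM, and the ggml Q8_0 format stores 32 weights per block as 32 int8 values plus a 2-byte scale (34 bytes, or 8.5 bits per weight). The parameter count below is a hypothetical placeholder, not a figure from the post, and the estimate ignores the KV cache, activations, and any tensors kept at higher precision.

```python
# Rough weight-memory estimate for a Q8_0 model versus the available memory.
# ASSUMPTION: HYPOTHETICAL_PARAMS is an illustrative value, not taken from the source.

Q8_0_BLOCK_WEIGHTS = 32   # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 34     # 32 int8 values + one 2-byte fp16 scale
GIB = 2**30


def q8_0_weight_bytes(n_params: int) -> int:
    """Bytes needed to store n_params weights in Q8_0 blocks (rounded up to whole blocks)."""
    blocks = -(-n_params // Q8_0_BLOCK_WEIGHTS)  # ceiling division
    return blocks * Q8_0_BLOCK_BYTES


def memory_budget_gib(n_gpus: int, vram_per_gpu_gib: int, system_ram_gib: int) -> int:
    """Total VRAM plus system RAM, in GiB."""
    return n_gpus * vram_per_gpu_gib + system_ram_gib


HYPOTHETICAL_PARAMS = 200_000_000_000  # placeholder parameter count (assumption)

weights_gib = q8_0_weight_bytes(HYPOTHETICAL_PARAMS) / GIB
budget_gib = memory_budget_gib(n_gpus=2, vram_per_gpu_gib=24, system_ram_gib=256)

print(f"Q8_0 weights: {weights_gib:.1f} GiB")
print(f"VRAM + RAM:   {budget_gib} GiB (of which only 48 GiB is VRAM)")
print(f"Headroom:     {budget_gib - weights_gib:.1f} GiB for KV cache, OS, and buffers")
```

Swapping in the real parameter count shows how much of the model must live outside the 48 GB of VRAM, which is usually what dominates speed on a setup like this.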

The model runs slowly: roughly 50 tokens per second for prompt processing and about 10 tokens per second for text generation. The user attributes this to the relatively modest hardware, specifically an Intel i9-10900X CPU.
The user prioritizes accuracy over speed, but still wants a setup where a run doesn’t take “literally all day.”
They ask for recommendations on other models that fit these constraints and for optimization techniques. The absence of a draft model for multi-token prediction (MTP) is also mentioned as an issue.
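
To put the "literally all day" concern in numbers, the sketch below converts the quoted rates into wall-clock time. It reads the reported figures as tokens per second (about 50 for prompt processing, about 10 for generation), takes "128k" to mean 131,072 tokens, and uses a hypothetical 2,000-token reply; all three are assumptions rather than details confirmed by the post.

```python
# Back-of-the-envelope timing for one full-context request.
# ASSUMPTIONS: rates are tokens/second, 128k = 131072 tokens, reply length is hypothetical.


def seconds_to_process(n_tokens: int, tokens_per_second: float) -> float:
    """Time in seconds to push n_tokens through at a fixed rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return n_tokens / tokens_per_second


def format_duration(seconds: float) -> str:
    """Render seconds as 'Hh MMm SSs', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


PROMPT_TOKENS_PER_SECOND = 50      # quoted prompt-processing rate
GENERATION_TOKENS_PER_SECOND = 10  # quoted generation rate
FULL_CONTEXT_TOKENS = 128 * 1024   # assumed meaning of "128k"
REPLY_TOKENS = 2_000               # hypothetical reply length

prompt_time = seconds_to_process(FULL_CONTEXT_TOKENS, PROMPT_TOKENS_PER_SECOND)
reply_time = seconds_to_process(REPLY_TOKENS, GENERATION_TOKENS_PER_SECOND)

print("Prompt ingest:", format_duration(prompt_time))
print("Reply:        ", format_duration(reply_time))
print("Total:        ", format_duration(prompt_time + reply_time))
```

Under these assumptions, ingesting a completely full context takes well over half an hour before the first output token appears, so any technique that avoids reprocessing the prompt matters more than raw generation speed.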

### Takeaways
- This setup shows the practical limits of running a large language model at long context on consumer-grade hardware.
- Accuracy remains the priority, even at slow processing speeds.
- The user is seeking advice on alternative models and optimizations within their resource constraints.
