I've seen a Reddit user share their setup for running Qwen, a large language model from Alibaba Cloud, with Hermes Agent. The model runs in Docker on an NVIDIA DGX Spark, with vLLM as the inference backend.
- The user describes the Qwen configuration as aggressive: it is tuned for throughput and high memory utilization.
- They list specific model parameters, such as the attention backend set to FlashInfer for efficient attention computation.
- The setup is meant to handle long-context interactions while staying stable and fast.
The main point of the post is to get feedback from people running similar setups: how the configuration could be improved, and whether any problems the user runs into are ones others have already faced.
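
To make the kind of configuration described above more concrete, here is a minimal sketch of how such a vLLM launch command could be assembled. It is not the Reddit user's actual configuration: the numeric values are hypothetical placeholders, the `Qwen/` repository prefix is an assumption, and `build_vllm_command` is a helper invented for illustration. The sketch also assumes the FlashInfer backend is selected through the `VLLM_ATTENTION_BACKEND` environment variable, which some vLLM versions read; check the documentation for the version you run.

```python
import shlex


def build_vllm_command(model, gpu_memory_utilization, max_model_len):
    """Return the argv list for a `vllm serve` launch (illustrative helper)."""
    # vLLM expects the memory fraction to be between 0 and 1.
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]


# Assumption: attention backend chosen via environment variable.
env = {"VLLM_ATTENTION_BACKEND": "FLASHINFER"}

# Hypothetical values; the post summarized here does not state them.
cmd = build_vllm_command("Qwen/Qwen3.6-35B-A3B-FP8", 0.90, 131072)

env_prefix = " ".join(f"{key}={value}" for key, value in env.items())
print(env_prefix, shlex.join(cmd))
```

Building the command as a list and joining it only for display keeps each setting easy to compare when asking others for feedback on a single parameter.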
### Takeaways
- The user is looking for feedback on an aggressive Qwen3.6-35B-A3B-FP8 setup with Hermes Agent.
- The setup tunes performance parameters such as memory utilization and the attention backend.
- Readers running similar setups are invited to share their experiences or suggestions for improvement.




