Poker with Language Models
I made several language models play Texas Hold’em poker against each other. The smallest model won two of the five tournaments by playing aggressively: it always raised and never folded.
- Models used: Liquid lfm2.5 (1.2B, local via LM Studio), Qwen3 (1.7B, local via LM Studio), Claude Haiku 4.5 (Anthropic), GPT-OSS (120B, Fireworks), MiniMax M2 (230B, Fireworks), Kimi K2 (~1T, Fireworks).
- The smallest model won because it was too dumb to fold when its cards were bad.
- A larger model understood the game better and folded correctly, but still lost because aggressive play drained its chips over time.
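The two playing styles described above can be contrasted with a small rule-based sketch. This is only an illustration, not how the models actually decide: the real players are LLMs responding to prompts. The function names, the `win_prob` input, and the 0.7 raise threshold are all assumptions made for this example. The cautious player uses the standard pot-odds rule, which says a call breaks even when the win probability equals `to_call / (pot + to_call)`.

```python
from enum import Enum


class Action(Enum):
    FOLD = "fold"
    CALL = "call"
    RAISE = "raise"


def always_raise(win_prob: float, to_call: int, pot: int) -> Action:
    """Hypothetical stand-in for the smallest model: it ignores its cards."""
    return Action.RAISE


def pot_odds_player(win_prob: float, to_call: int, pot: int) -> Action:
    """Hypothetical stand-in for a cautious model that folds weak hands."""
    raise_threshold = 0.7  # assumed cutoff for a "strong" hand
    if to_call > 0:
        # Break-even win probability for calling a bet of `to_call`.
        required = to_call / (pot + to_call)
        if win_prob < required:
            return Action.FOLD
    # CALL doubles as "check" when there is nothing to call.
    return Action.RAISE if win_prob >= raise_threshold else Action.CALL


# Same weak hand, same bet: one player folds, the other keeps raising.
weak_hand = dict(win_prob=0.05, to_call=100, pot=150)
print(always_raise(**weak_hand).value)
print(pot_odds_player(**weak_hand).value)
```

Against opponents who fold correctly, the always-raise style collects many pots uncontested, which is one plausible reading of why it did well here.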
I built this from scratch in pure Python without any dependencies, using a custom agent framework called Hive. It supports various LLM providers, including LM Studio, Ollama, Anthropic, OpenAI, Fireworks, and Groq. Models can be given different personas to simulate varied behaviors.
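To make the provider and persona idea concrete, here is a minimal dependency-free sketch. It is not Hive's actual API: `Provider`, `Player`, `decide`, and `stub_provider` are hypothetical names invented for this example. It assumes a provider can be modeled as a function that takes a system prompt and a user prompt and returns text.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: any backend (local or hosted) can be wrapped as a function
# mapping (system_prompt, user_prompt) to the model's text reply.
Provider = Callable[[str, str], str]

ACTIONS = ("fold", "call", "raise")


@dataclass
class Player:
    name: str
    persona: str
    provider: Provider

    def decide(self, game_state: str) -> str:
        """Ask the model for an action; fall back to 'fold' on bad output."""
        system = (
            "You are playing Texas Hold'em. "
            f"Persona: {self.persona}. "
            "Reply with one word: fold, call, or raise."
        )
        reply = self.provider(system, game_state).strip().lower()
        words = reply.split()
        first = words[0].strip(".,!") if words else ""
        return first if first in ACTIONS else "fold"


def stub_provider(system_prompt: str, user_prompt: str) -> str:
    """Offline stand-in for a real model call, used only for this demo."""
    return "Raise!"


maniac = Player(name="tiny", persona="reckless maniac", provider=stub_provider)
print(maniac.decide("Hole cards: 7c 2d. Pot: 150. To call: 100."))
```

Keeping the provider as a plain callable means the same `Player` can be pointed at a local or hosted model without changing the game engine, and swapping the `persona` string is enough to try a different behavior.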
I have run 5 tournaments so far, with the smallest model winning twice. If you want to test your own model or persona, let me know! I’ll run more tournaments and share the results on this thread.
Key Takeaways
- The smallest language model won two of five tournaments, largely because it lacked the strategic caution to fold.
- This setup is a proof-of-concept for how LLMs can be tested in poker scenarios without needing complex infrastructure.
- I’m open to feedback on the framework and engine code if anyone wants to review it. The core components are solid but still evolving.
Code, engine, and tournament results available here.




