How close can a small, fast reinforcement-learning agent get to perfect Gin Rummy —
and which training ideas actually make it stronger? We built the whole framework to find out.
📊 Full HTML report · 📄 PDF paper · 🎮 Play the web game

| Number | What it measures |
|---|---|
| 34% | best agent vs the perfect player |
| <2% | how often the perfect player gins |
| 100+ | controlled experiments |
| 62× | faster LLM serving (scratch vs NFS) |

Gin Rummy needs both short-horizon arithmetic (counting deadwood) and long-horizon planning (forming melds), and you never see the opponent's hand. Training an RL agent hits the opponent bottleneck: it is only as good as who it practises against. So we built a fast RL player, a provably optimal opponent to grade everyone honestly, and a distributed system to put an LLM in the game — then ran 100+ experiments to see what truly helps.
Our best agent climbed from the old champion's ~30% to 34% against perfect play — through a systematic search, not luck.
We built a perfect Gin Rummy player (exact meld solving, no learning) to use as an honest yardstick. It beats every trained agent 70–99% of the time. The surprise: it gins under 2% of games — it wins by knocking early with low deadwood, not by chasing gin. That single fact reframed every reward experiment we ran.
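To make "exact meld solving" concrete, here is a minimal sketch of a brute-force minimum-deadwood solver. It is not the repository's `GoldStandardAgent`; the card encoding (rank character followed by suit character) and the function names are assumptions made for this illustration. It enumerates every set and run in the hand, then searches over disjoint combinations for the one that removes the most points.

```python
from itertools import combinations

RANKS = "A23456789TJQK"  # ace is low in Gin Rummy
SUITS = "CDHS"

def card_value(card):
    """Deadwood points: ace = 1, face cards = 10, others face value."""
    return min(RANKS.index(card[0]) + 1, 10)

def candidate_melds(hand):
    """List every set (3-4 of a rank) and run (3+ consecutive, same suit)."""
    cards = set(hand)
    melds = []
    for rank in RANKS:
        same_rank = [rank + s for s in SUITS if rank + s in cards]
        for size in (3, 4):
            melds.extend(frozenset(c) for c in combinations(same_rank, size))
    for suit in SUITS:
        for start in range(len(RANKS)):
            for end in range(start + 2, len(RANKS)):
                run = [RANKS[i] + suit for i in range(start, end + 1)]
                if all(c in cards for c in run):
                    melds.append(frozenset(run))
    return melds

def min_deadwood(hand):
    """Exact minimum deadwood via exhaustive search over disjoint melds."""
    melds = candidate_melds(hand)

    def best_melded(i, used):
        # Largest point total that disjoint melds from melds[i:] can cover.
        if i == len(melds):
            return 0
        best = best_melded(i + 1, used)  # option 1: skip this meld
        if not (melds[i] & used):        # option 2: take it if it is disjoint
            value = sum(card_value(c) for c in melds[i])
            best = max(best, value + best_melded(i + 1, used | melds[i]))
        return best

    return sum(card_value(c) for c in hand) - best_melded(0, frozenset())

hand = ["5C", "5D", "5H", "6H", "7H", "KS", "2D", "2S", "2C", "9D"]
print(min_deadwood(hand))
```

For this hand the solver prefers the run 5H-6H-7H over the set of fives, because the run removes 18 points instead of 15, leaving 29 deadwood. Under standard rules a player may knock once deadwood is 10 or less, which is the early, low-risk exit the perfect player favours.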
We benchmarked nearly every reasonable way to make the agent stronger on one metric — win-rate against the perfect player — and, for each, found why it lands where it does.
| Idea | Verdict | Why |
|---|---|---|
| Keep the best checkpoint | ✅ helps | training drifts past its peak; saving the best recovers 2–3 pts for free |
| Warm-start from the champion | ✅ helps | start strong, then specialise |
| TRPO over PPO | ✅ helps | safer policy steps suit sparse, shifting self-play |
| Reward knocking, not gin | ✅ helps | copies the optimal low-risk style |
| Rising opponent curriculum | ✅ helps | always a fair-but-harder challenge |
| Paying 3× more for gin | ➖️ no effect | the agent refuses the bad habit at any bribe |
| Learned state embeddings | ❌ hurts | a frozen bottleneck discards useful detail |
| Imitation learning (DAgger) | ❌ fails | copies moves, not the reasoning behind them |
| Dense short-term rewards | ❌ fails | myopia — greedy for points, blind to winning |
| Live LLM-in-the-loop | ❌ infeasible | strong (beat our agent 3–2) but ~9–27 s/move — far too slow |
💡 The clean result: we tried to bribe the agent into ginning by paying three times more for a gin than a knock. Across 30 controlled runs it still gins under 1% of the time — just like the perfect player, it discovers that chasing gin loses. You cannot pay a policy into a bad habit.
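One way to see why the bribe can fail is a back-of-the-envelope break-even calculation. The sketch below is an illustration, not the reward code used in the experiments. It assumes a win by knocking pays +1, a win by gin pays `gin_multiplier`, every loss pays -1, and draws are ignored. Under those assumptions a pure gin-chasing policy with win probability q only beats a knocking policy with win probability p when q(m + 1) > 2p.

```python
def expected_return(win_prob, win_reward, loss_reward=-1.0):
    """Expected episode reward for a policy that wins with probability win_prob."""
    return win_prob * win_reward + (1.0 - win_prob) * loss_reward

def break_even_gin_winrate(knock_winrate, gin_multiplier=3.0):
    """Win rate a gin-chasing policy needs to match a knocking policy.

    Solves q * m - (1 - q) = p * 1 - (1 - p) for q, giving q = 2p / (m + 1).
    """
    return 2.0 * knock_winrate / (gin_multiplier + 1.0)

p = 0.5  # hypothetical win rate of the knocking policy
q = break_even_gin_winrate(p, gin_multiplier=3.0)
print(q, expected_return(p, 1.0), expected_return(q, 3.0))
```

With a 3× multiplier the threshold is q = p / 2: chasing gin pays off only if it keeps at least half of the knocking policy's win rate. If going for gin costs more wins than that, the larger payout does not compensate.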
A single RL run fires tens of thousands of opponent queries; at 0.5–3 s per 7B call a naive loop takes hours. We decouple inference from training with a master/worker stack:
env subprocess ─▶ Master (CPU, FastAPI) ─▶ suit-symmetry cache ──(hit)──▶ return
(per-step query) Ollama-compatible API │ miss
│ round-robin ▼
┌───────────────┼───────────────┐
▼ ▼ ▼
GPU worker GPU worker … GPU worker (1 GPU each, Qwen2.5-7B)
self-registers in a shared-filesystem registry; master health-checks + balances
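The suit-symmetry cache in the diagram relies on Gin Rummy treating all four suits identically, so two states that differ only by a relabelling of suits can share one LLM answer. The sketch below shows one possible way to build such a cache. The state (hand plus top discard), the action (a card to discard), and all names are assumptions for illustration; the repository's cache may key on a richer state.

```python
from itertools import permutations

SUITS = "CDHS"

def canonical_key(hand, discard_top):
    """Return the smallest relabelled state over all 24 suit permutations,
    together with the suit mapping that produced it."""
    best = None
    for perm in permutations(SUITS):
        mapping = dict(zip(SUITS, perm))
        key = (
            tuple(sorted(c[0] + mapping[c[1]] for c in hand)),
            discard_top[0] + mapping[discard_top[1]],
        )
        if best is None or key < best[0]:
            best = (key, mapping)
    return best

class SuitSymmetryCache:
    """Stores answers in canonical suits and maps them back on lookup."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_query(self, hand, discard_top, query_fn):
        key, mapping = canonical_key(hand, discard_top)
        inverse = {v: k for k, v in mapping.items()}
        if key in self._store:
            self.hits += 1
            canon_card = self._store[key]
        else:
            self.misses += 1
            real_card = query_fn(hand, discard_top)  # the slow LLM call
            canon_card = real_card[0] + mapping[real_card[1]]
            self._store[key] = canon_card
        # Translate the canonical answer back into this state's real suits.
        return canon_card[0] + inverse[canon_card[1]]

calls = []

def fake_llm(hand, discard_top):
    """Stand-in for the model: discard the highest-ranked card."""
    calls.append(1)
    return max(hand, key=lambda c: "A23456789TJQK".index(c[0]))

cache = SuitSymmetryCache()
first = cache.get_or_query(["5C", "6C", "7C", "KD"], "2H", fake_llm)
second = cache.get_or_query(["5S", "6S", "7S", "KH"], "2C", fake_llm)
print(first, second, len(calls))
```

The second state is the first with its suits relabelled, so it is served from the cache and the stand-in model is called only once. Storing the answer in canonical suits and translating it back through the inverse mapping is what keeps the returned card inside the caller's actual hand.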
The training pipeline is game-agnostic. Point it at any PettingZoo game, or your own environment, and it trains a masked PPO/TRPO agent through an opponent curriculum, keeps the best checkpoint, and grades it. The only thing that changes between games is the environment.
from pettingzoo.classic import connect_four_v3
from coev import CoevConfig, train
# same pipeline that trained the Gin Rummy agents, now on Connect Four
train(CoevConfig(env_fn=connect_four_v3.env, env_id="connect_four",
algo="trpo", total_steps=2_000_000))
Add seed_models to give it prior agents to practise against, a
benchmark_agent to grade against an expert, and a reward_transform to
shape the reward. See coev/ and
coev/examples/ (Connect Four and Gin Rummy).
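The "masked" part of a masked policy means illegal moves are excluded before an action is chosen. The following is a minimal pure-Python sketch of greedy masked action selection, assuming the environment supplies a 0/1 legality mask alongside the observation (as PettingZoo classic games do). The function name is hypothetical and this is not the code in `coev/`.

```python
def masked_argmax(logits, legal_mask):
    """Return the index of the highest-scoring action that is legal."""
    best_idx = None
    for i, (logit, legal) in enumerate(zip(logits, legal_mask)):
        if legal and (best_idx is None or logit > logits[best_idx]):
            best_idx = i
    if best_idx is None:
        raise ValueError("no legal action available")
    return best_idx

# Action 1 has the highest score but is illegal, so action 0 is chosen.
print(masked_argmax([2.0, 5.0, 1.0], [1, 0, 1]))
```

During training the same mask is typically applied to the policy's logits before sampling, so probability mass is never spent on moves the game would reject.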
A no-install browser game with a curated 5-rung ladder, easiest to perfect:
| Opponent | Strength | What it is |
|---|---|---|
| 🎲 Rookie | easiest | random legal moves — a warm-up |
| 🤖 Self-Play Champion | strong | our earlier best (~30% vs gold) |
| 🃏 Curriculum Ace | strongest | the final agent — 34% vs the perfect player |
| 🛡️ League Tactician | strongest | a close second (PFSP-trained) |
| 🏆 Gold Standard | perfect | the hand-coded expert — the wall everyone hits |
| Path | What |
|---|---|
| coev/ | the universal pipeline: masked policy, opponent curriculum, and trainer for any PettingZoo AEC game (+ examples) |
| ppo_train.py, gym_wrapper.py | the original Gin-Rummy-specific masked PPO/TRPO policy + wrapper |
| agents/ | GoldStandardAgent (optimal benchmark), PPOAgent (masked-argmax), Random, LLM agents |
| sweep/ | the experiment families: gold benchmark, algorithm (PPO vs TRPO), representation, the curriculum sweep + the keep-best/warm-start harness |
| llm/, slurm/ | distributed LLM master/worker/cache + the SLURM jobs (incl. the self-sustaining sweep watchdog) |
| game/ | zero-dependency human-vs-agent web client |
| paper/, docs/ | the paper (main.tex), figure + report generators, and the full HTML report |
# 0) Train on any game with the universal pipeline
python -m coev.examples.connect_four # any PettingZoo game, no game-specific code
python -m coev.examples.gin_rummy # same pipeline + a gold benchmark and reward shaping
# 1) Play the heroes (web game)
python game/server.py --host 127.0.0.1 --port 8000 # open http://127.0.0.1:8000
# 2) The gold-standard benchmark
python sweep/bench_gold.py
# 3) The final sweeps (SLURM array + self-sustaining watchdog)
python sweep/curriculum_configs.py && sbatch --array=0-29%10 slurm/curriculum.slurm
python sweep/phase7_configs.py && sbatch --array=0-8%6 --export=ALL,CFG_DIR=phase7_cfgs slurm/curriculum.slurm
# 4) Regenerate every figure + the HTML report from saved JSON
python paper/make_figures.py && python paper/make_report_html.py
Nima Kelidari · Mahdi Salmani · Mohammadsaeed Haghi
University of Southern California
Built on action-masked PPO/TRPO, PettingZoo/RLCard, Stable-Baselines3, and Qwen2.5.
See the full report for the whole story.


