Adversarial Co-Evolution of RL and LLM Agents in Gin Rummy

How close can a small, fast reinforcement-learning agent get to perfect Gin Rummy —
and which training ideas actually make it stronger? We built the whole framework to find out.

📊 Full HTML report · 📄 PDF paper · 🎮 Play the web game

34% · best agent vs the perfect player
<2% · how often the perfect player gins
100+ · controlled experiments
62× · faster LLM serving (scratch vs NFS)

Gin Rummy needs both short-horizon arithmetic (counting deadwood) and long-horizon planning (forming melds), and you never see the opponent's hand. Training an RL agent runs into the opponent bottleneck: the agent is only as good as the opponents it practises against. So we built a fast RL player, a provably optimal opponent to grade everyone honestly, and a distributed system to put an LLM in the game, then ran 100+ experiments to see what actually helps.

Figure: Win-rate vs the perfect player across the project. Our best agent climbed from the old champion's ~30% to 34% against perfect play through a systematic search, not luck.

The gold standard — and the surprise it revealed

We built a perfect Gin Rummy player (exact meld solving, no learning) to use as an honest yardstick. It beats every trained agent 70–99% of the time. The surprise: it gins under 2% of games — it wins by knocking early with low deadwood, not by chasing gin. That single fact reframed every reward experiment we ran.
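To make "exact meld solving" concrete, here is a minimal sketch of a brute-force deadwood solver. It is not the repository's `GoldStandardAgent`; the card notation (rank character followed by suit character, `T` for ten) and the function names are assumptions made for this illustration. It enumerates every set and run in a hand, then searches all disjoint combinations for the one that leaves the fewest deadwood points.

```python
from itertools import combinations

RANKS = "A23456789TJQK"  # assumed notation: "5S" = five of spades, "TD" = ten of diamonds


def value(card):
    """Deadwood points: ace 1, number cards face value, T/J/Q/K 10."""
    return min(RANKS.index(card[0]) + 1, 10)


def candidate_melds(hand):
    """Every set (3-4 of one rank) and run (3+ consecutive cards of one suit)."""
    melds = []
    by_rank = {}
    for card in hand:
        by_rank.setdefault(card[0], []).append(card)
    for cards in by_rank.values():
        for size in (3, 4):
            melds += [frozenset(m) for m in combinations(cards, size)]
    for suit in "SHDC":
        present = {RANKS.index(c[0]) for c in hand if c[1] == suit}
        for start in sorted(present):
            run = [start]
            while run[-1] + 1 in present:
                run.append(run[-1] + 1)
                if len(run) >= 3:
                    melds.append(frozenset(RANKS[i] + suit for i in run))
    return melds


def min_deadwood(hand):
    """Exhaustively pick disjoint melds to minimise the leftover points."""
    melds = candidate_melds(hand)

    def search(remaining, start):
        best = sum(value(c) for c in remaining)  # option: meld nothing more
        for i in range(start, len(melds)):
            if melds[i] <= remaining:  # meld only uses cards still unassigned
                best = min(best, search(remaining - melds[i], i + 1))
        return best

    return search(frozenset(hand), 0)


hand = ["5S", "5H", "5D", "6D", "7D", "8D", "KC", "2H", "2S", "AC"]
print(min_deadwood(hand))  # 15: keep 5-5-5 and 6-7-8 of diamonds, leaving K, 2, 2, A
```

Under the common rule that a player may knock with 10 or fewer deadwood points, a knock check is then `min_deadwood(hand) <= 10`. The search is exponential in the number of candidate melds, which is acceptable for 10- or 11-card hands.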

Everything we tried, ranked vs the perfect player

We benchmarked nearly every reasonable way to make the agent stronger on one metric — win-rate against the perfect player — and, for each, found why it lands where it does.

Figure: Every regime ranked vs the perfect player.

Idea	Verdict	Why
Keep the best checkpoint	✅ helps	training drifts past its peak; saving the best recovers 2–3 pts for free
Warm-start from the champion	✅ helps	start strong, then specialise
TRPO over PPO	✅ helps	safer policy steps suit sparse, shifting self-play
Reward knocking, not gin	✅ helps	copies the optimal low-risk style
Rising opponent curriculum	✅ helps	always a fair-but-harder challenge
Paying 3× more for gin	➖️ no effect	the agent refuses the bad habit at any bribe
Learned state embeddings	❌ hurts	a frozen bottleneck discards useful detail
Imitation learning (DAgger)	❌ fails	copies moves, not the reasoning behind them
Dense short-term rewards	❌ fails	myopia — greedy for points, blind to winning
Live LLM-in-the-loop	❌ infeasible	strong (beat our agent 3–2) but ~9–27 s/move — far too slow
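The "keep the best checkpoint" row can be sketched in a few lines. This is an illustration of the idea only, not the repository's harness: the class name is hypothetical and the evaluation numbers are made up for the example. The point is that the agent is scored periodically and a copy of its state is kept whenever the score beats the previous best, so later drift cannot erase the peak.

```python
import copy


class BestCheckpoint:
    """Remember the highest-scoring snapshot seen so far (hypothetical helper)."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_step = None
        self.best_state = None

    def update(self, step, score, state):
        """Store a deep copy of `state` if `score` is a new best; return True if stored."""
        if score > self.best_score:
            self.best_score = score
            self.best_step = step
            self.best_state = copy.deepcopy(state)
            return True
        return False


# Made-up evaluation curve that peaks and then drifts down.
evaluations = [(100_000, 0.28), (200_000, 0.33), (300_000, 0.31)]

keeper = BestCheckpoint()
for step, win_rate in evaluations:
    # In a real run `state` would be the policy weights at this step.
    keeper.update(step, win_rate, {"step": step})

print(keeper.best_step, keeper.best_score)  # 200000 0.33
```

The final policy here would score 0.31, while the kept snapshot scores 0.33, which is the kind of gap the table row refers to.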

💡 The clean result: we tried to bribe the agent into ginning by paying three times more for a gin than a knock. Across 30 controlled runs it still gins under 1% of the time — just like the perfect player, it discovers that chasing gin loses. You cannot pay a policy into a bad habit.
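The bribe experiment amounts to changing one number in a terminal-only reward. The sketch below shows that shape under stated assumptions: the outcome labels, the function name, and the base magnitudes of +1 and -1 are all hypothetical and are not taken from the repository.

```python
def make_outcome_reward(gin_multiplier=1.0):
    """Build a terminal-only reward function (hypothetical labels and magnitudes).

    A knock win pays 1.0, a gin win pays 1.0 * gin_multiplier,
    a loss pays -1.0, and a drawn hand pays 0.0.
    """
    table = {
        "knock_win": 1.0,
        "gin_win": 1.0 * gin_multiplier,
        "loss": -1.0,
        "draw": 0.0,
    }

    def reward(outcome):
        return table[outcome]

    return reward


baseline = make_outcome_reward()                   # gin pays the same as a knock
bribed = make_outcome_reward(gin_multiplier=3.0)   # gin pays three times a knock

print(baseline("gin_win"), bribed("gin_win"))  # 1.0 3.0
```

Because the reward arrives only at the end of a hand, the policy still has to win to collect it, which is one way to read why a larger gin payout did not change behaviour.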

The framework we built

The gold standard is used for scoring only; it never trains the agent.

A single RL run fires tens of thousands of opponent queries; at 0.5–3 s per 7B call a naive loop takes hours. We decouple inference from training with a master/worker stack:

   env subprocess  ─▶  Master (CPU, FastAPI)  ─▶  suit-symmetry cache  ──(hit)──▶ return
   (per-step query)    Ollama-compatible API            │ miss
                              │ round-robin              ▼
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
          GPU worker      GPU worker  …   GPU worker      (1 GPU each, Qwen2.5-7B)
          self-registers in a shared-filesystem registry; master health-checks + balances
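The suit-symmetry cache relies on the fact that relabelling suits consistently does not change a Gin Rummy position. One way to exploit this, sketched here under assumptions, is to map every state to a canonical key by trying all 24 suit permutations and keeping the smallest result. The card notation, the function names, and the choice of which card groups form the state are hypothetical; the actual cache in `llm/` may work differently.

```python
from itertools import permutations

SUITS = "SHDC"


def canonical_key(*card_groups):
    """Key shared by all states that differ only by a consistent suit relabelling."""
    best = None
    for perm in permutations(SUITS):
        relabel = dict(zip(SUITS, perm))
        key = tuple(
            tuple(sorted(card[0] + relabel[card[1]] for card in group))
            for group in card_groups
        )
        if best is None or key < best:
            best = key
    return best


class SymmetryCache:
    """Memoise a slow query on the canonical key (hypothetical helper)."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, hand, top_discard, compute):
        key = canonical_key(hand, top_discard)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = compute(hand, top_discard)
        return self.store[key]


cache = SymmetryCache()
slow_query = lambda hand, top: len(hand)  # stand-in for an LLM call

cache.get_or_compute(["5S", "6S", "KH"], ["2D"], slow_query)
cache.get_or_compute(["5H", "6H", "KD"], ["2C"], slow_query)  # same position, suits relabelled
print(cache.hits, cache.misses)  # 1 1
```

The stand-in query returns a suit-independent value. A cached answer that names a specific card would also need the permutation stored alongside it so the answer can be relabelled back into the caller's suits.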

⚠️ Infra finding: loading a 7B worker from home NFS runs at ~11 MB/s (~28 min — blows the health-check timeout). Staging weights on scratch/BeeGFS cuts it to ~27 s (62×) — mandatory at scale (~32 queries/s with 14 workers).
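A few lines of arithmetic confirm that the figures in the note above are mutually consistent. The weight size is not stated in the source; the value computed here is only what the rounded throughput and load time imply, and the per-worker rate is a simple average.

```python
nfs_seconds = 28 * 60        # ~28 min load from home NFS
scratch_seconds = 27         # ~27 s load from scratch/BeeGFS

speedup = nfs_seconds / scratch_seconds
implied_gb = 11 * nfs_seconds / 1000   # ~11 MB/s sustained for the whole NFS load
per_worker_qps = 32 / 14               # ~32 queries/s spread over 14 workers

print(f"speedup       ~{speedup:.0f}x")          # ~62x
print(f"implied size  ~{implied_gb:.1f} GB")     # ~18.5 GB
print(f"per worker    ~{per_worker_qps:.1f} queries/s")  # ~2.3 queries/s
```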

It is a universal pipeline, not just a Gin Rummy script

The training pipeline is game-agnostic. Point it at any PettingZoo game, or your own environment, and it trains a masked PPO/TRPO agent through an opponent curriculum, keeps the best checkpoint, and grades it. The only thing that changes between games is the environment.

from pettingzoo.classic import connect_four_v3
from coev import CoevConfig, train

# same pipeline that trained the Gin Rummy agents, now on Connect Four
train(CoevConfig(env_fn=connect_four_v3.env, env_id="connect_four",
                 algo="trpo", total_steps=2_000_000))

Add seed_models to give it prior agents to practise against, a benchmark_agent to grade against an expert, and a reward_transform to shape the reward. See coev/ and coev/examples/ (Connect Four and Gin Rummy).
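The "masked" part of a masked PPO/TRPO agent means that illegal actions are excluded before an action is chosen. A minimal sketch of that step, in plain Python and independent of the `coev` package, is given here; the function names are hypothetical. Greedy play takes the best-scoring legal action, and sampling uses a softmax that assigns zero probability to illegal ones.

```python
import math


def masked_argmax(logits, legal_mask):
    """Index of the highest logit among legal actions only."""
    legal = [i for i, ok in enumerate(legal_mask) if ok]
    if not legal:
        raise ValueError("no legal action available")
    return max(legal, key=lambda i: logits[i])


def masked_softmax(logits, legal_mask):
    """Softmax over legal actions; illegal actions get probability 0."""
    legal = [i for i, ok in enumerate(legal_mask) if ok]
    if not legal:
        raise ValueError("no legal action available")
    top = max(logits[i] for i in legal)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, legal_mask)]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 5.0, 1.0]
legal = [True, False, True]   # action 1 scores highest but is illegal

print(masked_argmax(logits, legal))   # 0
print(masked_softmax(logits, legal))  # action 1 has probability 0.0
```

PettingZoo classic environments expose the legal-move vector as `observation["action_mask"]`, which is the kind of input `legal_mask` stands for here.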

Play the heroes 🎮

A no-install browser game with a curated 5-rung ladder, running from the easiest opponent up to perfect play:

Opponent	Strength	What it is
🎲 Rookie	easiest	random legal moves — a warm-up
🤖 Self-Play Champion	strong	our earlier best (~30% vs gold)
🃏 Curriculum Ace	strongest	the final agent — 34% vs the perfect player
🛡️ League Tactician	strongest	a close second (PFSP-trained)
🏆 Gold Standard	perfect	the hand-coded expert — the wall everyone hits
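PFSP (prioritized fictitious self-play) picks training opponents in proportion to how hard they currently are for the learner. The sketch below uses one commonly cited weighting, (1 - win rate) raised to a power; whether the League Tactician was trained with exactly this form is not stated here, and the opponent names and win rates are made up for the example.

```python
import random


def pfsp_weights(win_rates, power=2.0):
    """Normalised sampling weights that favour opponents the learner rarely beats."""
    raw = {name: (1.0 - wr) ** power for name, wr in win_rates.items()}
    total = sum(raw.values())
    if total == 0:  # learner beats everyone every time: fall back to uniform
        return {name: 1.0 / len(raw) for name in raw}
    return {name: w / total for name, w in raw.items()}


def sample_opponent(win_rates, rng=random, power=2.0):
    """Draw one opponent name according to the PFSP weights."""
    weights = pfsp_weights(win_rates, power)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Made-up win rates of the learner against each opponent in its pool.
pool = {"rookie": 0.9, "old_champion": 0.5}

print(pfsp_weights(pool))
print(sample_opponent(pool, rng=random.Random(0)))
```

With these numbers the harder opponent receives about 96% of the sampling weight, so practice time concentrates where the learner is still losing.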

Repository layout

Path	What
`coev/`	the universal pipeline: masked policy, opponent curriculum, and trainer for any PettingZoo AEC game (+ examples)
`ppo_train.py`, `gym_wrapper.py`	the original Gin-Rummy-specific masked PPO/TRPO policy + wrapper
`agents/`	`GoldStandardAgent` (optimal benchmark), `PPOAgent` (masked-argmax), Random, LLM agents
`sweep/`	the experiment families: gold benchmark, algorithm (PPO vs TRPO), representation, the curriculum sweep + the keep-best/warm-start harness
`llm/`, `slurm/`	distributed LLM master/worker/cache + the SLURM jobs (incl. the self-sustaining sweep watchdog)
`game/`	zero-dependency human-vs-agent web client
`paper/`, `docs/`	the paper (`main.tex`), figure + report generators, and the full HTML report

Quickstart

# 0) Train on any game with the universal pipeline
python -m coev.examples.connect_four     # any PettingZoo game, no game-specific code
python -m coev.examples.gin_rummy        # same pipeline + a gold benchmark and reward shaping

# 1) Play the heroes (web game)
python game/server.py --host 127.0.0.1 --port 8000      # open http://127.0.0.1:8000

# 2) The gold-standard benchmark
python sweep/bench_gold.py

# 3) The final sweeps (SLURM array + self-sustaining watchdog)
python sweep/curriculum_configs.py && sbatch --array=0-29%10 slurm/curriculum.slurm
python sweep/phase7_configs.py    && sbatch --array=0-8%6 --export=ALL,CFG_DIR=phase7_cfgs slurm/curriculum.slurm

# 4) Regenerate every figure + the HTML report from saved JSON
python paper/make_figures.py && python paper/make_report_html.py

⚠️ Load LLM worker weights from scratch/BeeGFS, not home NFS. Every figure regenerates from measured JSON results under sweep/.

Nima Kelidari · Mahdi Salmani · Mohammadsaeed Haghi
University of Southern California

Built on action-masked PPO/TRPO, PettingZoo/RLCard, Stable-Baselines3, and Qwen2.5. See the full report for the whole story.
