Nikelroid/adversarial-coevolution

Adversarial Co-Evolution of RL and LLM Agents in Gin Rummy

How close can a small, fast reinforcement-learning agent get to perfect Gin Rummy —
and which training ideas actually make it stronger? We built the whole framework to find out.

📊 Full HTML report  ·  📄 PDF paper  ·  🎮 Play the web game


34%: best agent vs the perfect player
<2%: how often the perfect player gins
100+: controlled experiments
62×: faster LLM serving (scratch vs NFS)

Gin Rummy needs both short-horizon arithmetic (counting deadwood) and long-horizon planning (forming melds), and you never see the opponent's hand. Training an RL agent hits the opponent bottleneck: it is only as good as who it practises against. So we built a fast RL player, a provably optimal opponent to grade everyone honestly, and a distributed system to put an LLM in the game — then ran 100+ experiments to see what truly helps.

[Figure: Win-rate vs the perfect player across the project]
Our best agent climbed from the old champion's ~30% to 34% against perfect play through a systematic search, not luck.

The gold standard — and the surprise it revealed

We built a perfect Gin Rummy player (exact meld solving, no learning) to use as an honest yardstick. It beats every trained agent 70–99% of the time. The surprise: it gins in under 2% of games. It wins by knocking early with low deadwood, not by chasing gin. That single fact reframed every reward experiment we ran.

[Figure: Gold beats everyone but rarely gins]
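
To make "exact meld solving" concrete, here is a minimal sketch of an exhaustive deadwood search. It is not the repository's GoldStandardAgent. It assumes standard Gin Rummy scoring (ace = 1, face cards = 10, aces low in runs) and a two-character card encoding such as "7H"; the function names are hypothetical.

```python
from itertools import combinations

RANKS = "A23456789TJQK"   # assumed encoding: rank character, ace low
SUITS = "SHDC"            # assumed encoding: suit character


def card_value(card):
    """Deadwood points: ace 1, face cards 10, everything else its pip value."""
    return min(RANKS.index(card[0]) + 1, 10)


def all_melds(hand):
    """Enumerate every set (3-4 of a rank) and every run (3+ in suit) in the hand."""
    cards = set(hand)
    melds = []
    for rank in RANKS:
        same_rank = [rank + s for s in SUITS if rank + s in cards]
        for size in (3, 4):
            melds += [frozenset(c) for c in combinations(same_rank, size)]
    for suit in SUITS:
        for start in range(len(RANKS)):
            run = []
            for rank in RANKS[start:]:
                if rank + suit not in cards:
                    break
                run.append(rank + suit)
                if len(run) >= 3:
                    melds.append(frozenset(run))
    return melds


def min_deadwood(hand):
    """Try every combination of disjoint melds and return the lowest deadwood total."""
    melds = all_melds(hand)

    def best(remaining, idx):
        # Option 1: stop melding, so every remaining card counts as deadwood.
        result = sum(card_value(c) for c in remaining)
        # Option 2: lay down one more meld that is still fully available.
        for i in range(idx, len(melds)):
            if melds[i] <= remaining:
                result = min(result, best(remaining - melds[i], i + 1))
        return result

    return best(frozenset(hand), 0)
```

For example, `min_deadwood(["AS", "2S", "3S", "7H", "7D", "7C", "KS", "QS", "JS", "9D"])` returns 9, because only the 9 of diamonds is left unmelded. The search matters when melds overlap: in `["7S", "7H", "7D", "5S", "6S", "8S", "KD"]` the spade run 5-6-7-8 beats the set of sevens, leaving 24 points rather than 29.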

Everything we tried, ranked vs the perfect player

We benchmarked nearly every reasonable way to make the agent stronger on one metric — win-rate against the perfect player — and, for each, found why it lands where it does.

[Figure: Every regime ranked vs the perfect player]

| Idea | Verdict | Why |
| --- | --- | --- |
| Keep the best checkpoint | ✅ helps | training drifts past its peak; saving the best recovers 2–3 pts for free |
| Warm-start from the champion | ✅ helps | start strong, then specialise |
| TRPO over PPO | ✅ helps | safer policy steps suit sparse, shifting self-play |
| Reward knocking, not gin | ✅ helps | copies the optimal low-risk style |
| Rising opponent curriculum | ✅ helps | always a fair-but-harder challenge |
| Paying 3× more for gin | ➖ no effect | the agent refuses the bad habit at any bribe |
| Learned state embeddings | ❌ hurts | a frozen bottleneck discards useful detail |
| Imitation learning (DAgger) | ❌ fails | copies moves, not the reasoning behind them |
| Dense short-term rewards | ❌ fails | myopia: greedy for points, blind to winning |
| Live LLM-in-the-loop | ❌ infeasible | strong (beat our agent 3–2) but ~9–27 s/move, far too slow |
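
The "keep the best checkpoint" idea from the table can be sketched in a few lines. This is an illustration of the technique, not the harness in sweep/: the class name is hypothetical, and the win-rates in the usage note are invented for the example.

```python
import copy


class KeepBest:
    """Remember the best-scoring snapshot seen during training."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_state = None
        self.best_step = None

    def update(self, step, score, state):
        """Store a copy of `state` if `score` beats the best so far.

        Returns True when a new best was recorded.
        """
        if score > self.best_score:
            self.best_score = score
            # Deep copy so later training updates cannot mutate the snapshot.
            self.best_state = copy.deepcopy(state)
            self.best_step = step
            return True
        return False


# Illustrative evaluation history (made-up numbers): the run peaks, then drifts.
tracker = KeepBest()
weights = {"w": [0.0]}
for step, win_rate in [(100_000, 0.28), (200_000, 0.33), (300_000, 0.31)]:
    weights["w"][0] = float(step)          # stand-in for a training update
    tracker.update(step, win_rate, weights)
```

After the loop, `tracker.best_step` is 200000 and `tracker.best_state` still holds the weights from that step, even though training continued and the final evaluation was lower.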

💡 The clean result: we tried to bribe the agent into ginning by paying three times more for a gin than for a knock. Across 30 controlled runs it still gins less than 1% of the time. Just like the perfect player, it discovers that chasing gin loses. You cannot pay a policy into a bad habit.
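
As a rough sketch of what "paying three times more for a gin than a knock" means as a reward scheme, the function below assigns a terminal reward per game outcome. The outcome labels, the base values of +1 and -1, and the function name are assumptions made for this example; the repository's actual reward code may differ.

```python
def terminal_reward(outcome, gin_multiplier=1.0):
    """Map a finished game's outcome to a scalar reward for the learning agent.

    `outcome` is one of the hypothetical labels below. With gin_multiplier=3.0,
    a gin win pays three times as much as a knock win.
    """
    rewards = {
        "knock_win": 1.0,
        "gin_win": 1.0 * gin_multiplier,
        "draw": 0.0,
        "loss": -1.0,
    }
    if outcome not in rewards:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return rewards[outcome]
```

Calling `terminal_reward("gin_win", gin_multiplier=3.0)` gives 3.0 while `terminal_reward("knock_win", gin_multiplier=3.0)` stays at 1.0, which is the size of the bribe the agent ignored.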


The framework we built

[Figure: System overview]
The gold standard is used for scoring only; it never trains the agent.

A single RL run fires tens of thousands of opponent queries. At 0.5–3 s per call to a 7B model, a naive sequential loop takes hours. We decouple inference from training with a master/worker stack:

   env subprocess  ─▶  Master (CPU, FastAPI)  ─▶  suit-symmetry cache  ──(hit)──▶ return
   (per-step query)    Ollama-compatible API            │ miss
                              │ round-robin              ▼
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
          GPU worker      GPU worker  …   GPU worker      (1 GPU each, Qwen2.5-7B)
          self-registers in a shared-filesystem registry; master health-checks + balances
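
The suit-symmetry cache in the diagram relies on the fact that Gin Rummy's rules do not care which suit is which, so states that differ only by a relabeling of suits can share one cached answer. The sketch below shows one way to do that; it is not the code in llm/. It assumes a simplified state (hand plus top discard), a two-character card encoding such as "7H", and a backend whose answer is a single card. All names are hypothetical.

```python
from itertools import permutations

SUITS = "SHDC"  # assumed suit characters


def relabel(cards, mapping):
    """Rename suits according to `mapping` and return a sorted tuple of cards."""
    return tuple(sorted(c[0] + mapping[c[1]] for c in cards))


def canonicalize(hand, top_discard):
    """Return (key, mapping): the smallest relabeling over all 24 suit permutations."""
    best = None
    for perm in permutations(SUITS):
        mapping = dict(zip(SUITS, perm))
        key = (relabel(hand, mapping), top_discard[0] + mapping[top_discard[1]])
        if best is None or key < best[0]:
            best = (key, mapping)
    return best


class SymmetryCache:
    """Cache backend answers under a suit-independent key."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, hand, top_discard, compute):
        key, mapping = canonicalize(hand, top_discard)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            # Only a miss reaches the slow backend, and it sees the canonical state.
            self._store[key] = compute(*key)
        # Translate the canonical answer back into the caller's suits.
        inverse = {new: old for old, new in mapping.items()}
        card = self._store[key]
        return card[0] + inverse[card[1]]
```

With a stand-in backend such as `lambda hand, discard: hand[0]`, querying `("AS", "2S", "3H")` with discard `"KD"` and then `("AH", "2H", "3C")` with discard `"KS"` produces one miss followed by one hit, and each caller gets a card from its own hand back. A real state would also need the discard history and any known opponent cards in the key.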

⚠️ Infra finding: loading a 7B worker's weights from home NFS runs at ~11 MB/s and takes ~28 min, which exceeds the health-check timeout. Staging the weights on scratch/BeeGFS cuts the load to ~27 s (62× faster). This is mandatory at scale (~32 queries/s with 14 workers).
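
The figures quoted here can be cross-checked with a little arithmetic. The constants below come from the text; the query count of 50,000 is an assumption standing in for "tens of thousands", and the implied weight size is a back-of-envelope number, not a measurement.

```python
# Figures quoted in the text.
NFS_RATE_MB_S = 11           # ~11 MB/s from home NFS
NFS_LOAD_S = 28 * 60         # ~28 min
SCRATCH_LOAD_S = 27          # ~27 s from scratch/BeeGFS
CLUSTER_QPS = 32             # ~32 queries/s
WORKERS = 14
CALL_LATENCY_S = (0.5, 3.0)  # per-call latency range for the 7B model

# Assumption for illustration only.
ASSUMED_QUERIES = 50_000

# Load-time speedup from staging weights on scratch.
speedup = NFS_LOAD_S / SCRATCH_LOAD_S

# Data volume implied by the NFS rate and load time (decimal GB).
implied_size_gb = NFS_RATE_MB_S * NFS_LOAD_S / 1000

# Throughput each worker must sustain to reach the cluster rate.
per_worker_qps = CLUSTER_QPS / WORKERS

# One-at-a-time loop versus the distributed stack, ignoring cache hits.
sequential_hours = tuple(ASSUMED_QUERIES * s / 3600 for s in CALL_LATENCY_S)
distributed_minutes = ASSUMED_QUERIES / CLUSTER_QPS / 60

print(f"speedup: {speedup:.0f}x, implied size: {implied_size_gb:.1f} GB")
print(f"per worker: {per_worker_qps:.2f} queries/s")
print(f"sequential: {sequential_hours[0]:.1f}-{sequential_hours[1]:.1f} h, "
      f"distributed: {distributed_minutes:.0f} min")
```

Under these assumptions the 62× figure follows directly from 1680 s / 27 s, and a run that would take roughly 7 to 42 hours sequentially fits in about 26 minutes at the quoted cluster rate.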


It is a universal pipeline, not just a Gin Rummy script

The training pipeline is game-agnostic. Point it at any PettingZoo game, or your own environment, and it trains a masked PPO/TRPO agent through an opponent curriculum, keeps the best checkpoint, and grades it. The only thing that changes between games is the environment.

from pettingzoo.classic import connect_four_v3
from coev import CoevConfig, train

# same pipeline that trained the Gin Rummy agents, now on Connect Four
train(CoevConfig(env_fn=connect_four_v3.env, env_id="connect_four",
                 algo="trpo", total_steps=2_000_000))

Add seed_models to give it prior agents to practise against, a benchmark_agent to grade against an expert, and a reward_transform to shape the reward. See coev/ and coev/examples/ (Connect Four and Gin Rummy).
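
The "masked" part of the masked PPO/TRPO policy is the piece that keeps the agent from ever choosing an illegal move. The sketch below shows the general technique in plain Python, not the coev implementation: logits for illegal actions are pushed to negative infinity before the softmax, and greedy evaluation takes the argmax over legal actions only. The function names are hypothetical.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    masked = [x if legal else -math.inf for x, legal in zip(logits, mask)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("no legal action available")
    # Subtract the peak for numerical stability; exp(-inf) evaluates to 0.0.
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def masked_argmax(logits, mask):
    """Greedy action choice restricted to legal actions."""
    legal = [i for i, ok in enumerate(mask) if ok]
    if not legal:
        raise ValueError("no legal action available")
    return max(legal, key=lambda i: logits[i])


# Action 1 has the highest logit but is illegal, so it must never be picked.
logits = [2.0, 5.0, 1.0]
mask = [1, 0, 1]
probs = masked_softmax(logits, mask)
choice = masked_argmax(logits, mask)
```

Here `choice` is 0 and `probs[1]` is exactly 0.0. During training the agent samples from `probs`; a masked-argmax agent uses the greedy choice at evaluation time.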


Play the heroes 🎮

A no-install browser game with a curated 5-rung ladder, easiest to perfect:

| Opponent | Strength | What it is |
| --- | --- | --- |
| 🎲 Rookie | easiest | random legal moves, a warm-up |
| 🤖 Self-Play Champion | strong | our earlier best (~30% vs gold) |
| 🃏 Curriculum Ace | strongest | the final agent, 34% vs the perfect player |
| 🛡️ League Tactician | strongest | a close second (PFSP-trained) |
| 🏆 Gold Standard | perfect | the hand-coded expert, the wall everyone hits |

Repository layout

| Path | What |
| --- | --- |
| coev/ | the universal pipeline: masked policy, opponent curriculum, and trainer for any PettingZoo AEC game (+ examples) |
| ppo_train.py, gym_wrapper.py | the original Gin-Rummy-specific masked PPO/TRPO policy + wrapper |
| agents/ | GoldStandardAgent (optimal benchmark), PPOAgent (masked-argmax), Random, LLM agents |
| sweep/ | the experiment families: gold benchmark, algorithm (PPO vs TRPO), representation, the curriculum sweep + the keep-best/warm-start harness |
| llm/, slurm/ | distributed LLM master/worker/cache + the SLURM jobs (incl. the self-sustaining sweep watchdog) |
| game/ | zero-dependency human-vs-agent web client |
| paper/, docs/ | the paper (main.tex), figure + report generators, and the full HTML report |

Quickstart

# 0) Train on any game with the universal pipeline
python -m coev.examples.connect_four     # any PettingZoo game, no game-specific code
python -m coev.examples.gin_rummy        # same pipeline + a gold benchmark and reward shaping

# 1) Play the heroes (web game)
python game/server.py --host 127.0.0.1 --port 8000      # open http://127.0.0.1:8000

# 2) The gold-standard benchmark
python sweep/bench_gold.py

# 3) The final sweeps (SLURM array + self-sustaining watchdog)
python sweep/curriculum_configs.py && sbatch --array=0-29%10 slurm/curriculum.slurm
python sweep/phase7_configs.py    && sbatch --array=0-8%6 --export=ALL,CFG_DIR=phase7_cfgs slurm/curriculum.slurm

# 4) Regenerate every figure + the HTML report from saved JSON
python paper/make_figures.py && python paper/make_report_html.py

⚠️ Load LLM worker weights from scratch/BeeGFS, not home NFS. Every figure regenerates from measured JSON results under sweep/.


Nima Kelidari · Mahdi Salmani · Mohammadsaeed Haghi
University of Southern California

Built on action-masked PPO/TRPO, PettingZoo/RLCard, Stable-Baselines3, and Qwen2.5.
See the full report for the whole story.

About

Adversarial Co-Evolution of RL and LLM Agents: A framework for training high-performance PPO agents against Large Language Models in Gin Rummy, utilizing curriculum learning and knowledge distillation.
