This guide covers how to add new LLM models to the evaluation framework.
The framework uses OpenRouter for LLM access, supporting 100+ models from various providers.
Check available models at: https://openrouter.ai/models
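You can also list model IDs programmatically via OpenRouter's public models endpoint; a small sketch (the `data`/`id` response shape is assumed to match OpenRouter's published schema):

```python
import requests

# Fetch the live model list from OpenRouter's public API.
resp = requests.get("https://openrouter.ai/api/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # e.g. "anthropic/claude-3.5-sonnet"
```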
Simply specify the model ID:
```bash
python run_game.py --model <provider>/<model-name>
```

Examples:

```bash
# Anthropic
python run_game.py --model anthropic/claude-3.5-sonnet
python run_game.py --model anthropic/claude-3-opus

# OpenAI
python run_game.py --model openai/gpt-4-turbo
python run_game.py --model openai/gpt-4o

# Meta
python run_game.py --model meta-llama/llama-3.1-70b-instruct
python run_game.py --model meta-llama/llama-3.1-405b-instruct

# Google
python run_game.py --model google/gemini-pro-1.5

# Mistral
python run_game.py --model mistralai/mistral-large
```
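These commands assume an OpenRouter API key is available to the framework. A typical setup, assuming the key is read from the conventional OPENROUTER_API_KEY environment variable (check config/openrouter_config.py for how credentials are actually loaded):

```bash
# The variable name follows OpenRouter convention; verify how this
# framework actually loads credentials before relying on it.
export OPENROUTER_API_KEY="sk-or-..."
python run_game.py --model anthropic/claude-3.5-sonnet
```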
To configure per-model settings, edit config/openrouter_config.py:

```python
MODEL_CONFIGS = {
    "deepseek/deepseek-v3.2-exp": {
        "max_tokens": 1024,
        "temperature": 0.7,
        "cost_per_1k_input": 0.0001,
        "cost_per_1k_output": 0.0002,
    },
    "anthropic/claude-3.5-sonnet": {
        "max_tokens": 1024,
        "temperature": 0.7,
        "cost_per_1k_input": 0.003,
        "cost_per_1k_output": 0.015,
    },
}
```

Set the default model in the same file:

```python
DEFAULT_MODEL = "deepseek/deepseek-v3.2-exp"
```

Typical game costs (7 players, ~50 API calls per game):
| Model | Cost per Game |
|---|---|
| deepseek-v3.2-exp | ~$0.02 |
| claude-3.5-sonnet | ~$0.50 |
| gpt-4-turbo | ~$0.40 |
| llama-3.1-70b | ~$0.15 |
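To sanity-check these figures against the MODEL_CONFIGS pricing above, a rough estimator; the per-call token counts are illustrative assumptions, not measured values:

```python
from config.openrouter_config import MODEL_CONFIGS  # module path from this repo

# Rough per-game cost estimate from the MODEL_CONFIGS pricing.
# ~50 calls per game; token counts per call are assumptions.
CALLS_PER_GAME = 50
AVG_INPUT_TOKENS = 1500
AVG_OUTPUT_TOKENS = 300

def estimate_game_cost(config: dict) -> float:
    per_call = (
        AVG_INPUT_TOKENS / 1000 * config["cost_per_1k_input"]
        + AVG_OUTPUT_TOKENS / 1000 * config["cost_per_1k_output"]
    )
    return CALLS_PER_GAME * per_call

for model_id, cfg in MODEL_CONFIGS.items():
    print(f"{model_id}: ~${estimate_game_cost(cfg):.2f} per game")
```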
Enable cost logging:

```bash
python run_game.py --enable-db-logging
```

Query costs:

```sql
SELECT model, SUM(cost) AS total_cost, COUNT(*) AS requests
FROM api_requests
GROUP BY model;
```
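If the logger writes to SQLite, the same query can be run from Python with the standard library; the database filename here is a placeholder assumption:

```python
import sqlite3

# Run the cost query against the logging database.
# "eval_logs.db" is an assumption; point it at wherever
# --enable-db-logging writes its file.
conn = sqlite3.connect("eval_logs.db")
rows = conn.execute(
    """
    SELECT model, SUM(cost) AS total_cost, COUNT(*) AS requests
    FROM api_requests
    GROUP BY model
    """
).fetchall()
for model, total_cost, requests in rows:
    print(f"{model}: ${total_cost:.2f} across {requests} API requests")
conn.close()
```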
for model in "deepseek/deepseek-v3.2-exp" "anthropic/claude-3.5-sonnet"; do
python run_game.py --batch --games 50 --players 7 \
--enable-db-logging \
--batch-tag "comparison" \
--model "$model"
donefrom analysis import compare_win_rates, calculate_elo_ratings
# Compare two models
result = compare_win_rates(games_df, 'model_a', 'model_b')
# Calculate Elo ratings across all models
elo_ratings = calculate_elo_ratings(games_df)Extend agents/openrouter_client.py:
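Both helpers take a games_df of logged results. Building it depends on your logging setup; a minimal sketch, assuming a SQLite database with a games table and a batch_tag column (the filename, table, and column names are all assumptions):

```python
import sqlite3
import pandas as pd

# Load logged games into a DataFrame for the analysis helpers.
# Adjust the path and schema to match your logging database.
conn = sqlite3.connect("eval_logs.db")
games_df = pd.read_sql_query(
    "SELECT * FROM games WHERE batch_tag = 'comparison'",
    conn,
)
conn.close()
```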
Extend agents/openrouter_client.py:

```python
class CustomModelClient(OpenRouterClient):
    def __init__(self, api_key: str, base_url: str):
        super().__init__(api_key)
        self.base_url = base_url

    async def get_decision(self, prompt: str) -> str:
        # Implement the API call for your custom provider here.
        raise NotImplementedError
```
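The base class internals aren't shown in this guide, so the following is only a sketch of one possible get_decision, assuming the custom endpoint speaks an OpenAI-compatible chat completions protocol and that OpenRouterClient stores the key as self.api_key (both assumptions), using aiohttp:

```python
import aiohttp

class CustomModelClient(OpenRouterClient):
    def __init__(self, api_key: str, base_url: str):
        super().__init__(api_key)
        self.base_url = base_url

    async def get_decision(self, prompt: str) -> str:
        # Assumes an OpenAI-compatible chat completions endpoint;
        # adjust the path, payload, and model name for your provider.
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.base_url}/chat/completions",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json={
                    "model": "my-custom-model",  # placeholder model name
                    "messages": [{"role": "user", "content": prompt}],
                },
            ) as resp:
                resp.raise_for_status()
                data = await resp.json()
                return data["choices"][0]["message"]["content"]
```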
If requests fail, check OpenRouter status: https://openrouter.ai/status

If you hit rate limits, add delays between requests:

```python
# In config/openrouter_config.py
RATE_LIMIT_DELAY = 1.0  # seconds between requests
```

Some models do not reliably return the expected JSON format. The framework handles this with fallback parsing (a typical approach is sketched below).
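The framework's actual fallback parser isn't reproduced here; a common pattern is to try strict JSON first, then extract an embedded JSON object from surrounding prose:

```python
import json
import re

def parse_decision(raw: str) -> dict:
    # Try strict JSON first.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fallback: pull the first {...} span out of surrounding text,
    # e.g. markdown fences or explanations some models add.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    raise ValueError(f"Could not parse decision from: {raw!r}")
```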