LLM Agent Evaluation & Reliability Harness

"Evaluation Over Hype" — Treat the agent as a production system. Focus on measurable, reproducible evaluation rather than impressive demos.

A modular, type-safe Python framework for evaluating tool-using LLM agents on analytical tasks against a local SQLite database. Designed for reliability engineering, not just proof-of-concept loops.


Philosophy

Most agent demos optimize for the happy path. This harness optimizes for truth.

  • Every task has a gold standard.
  • Every failure is categorized (SQL_ERROR, HALLUCINATION, INCOMPLETE, TOOL_MISUSE, MAX_STEPS, UNKNOWN).
  • Every run produces structured, queryable traces.
  • The system is provider-agnostic (LiteLLM) and database-isolated (local SQLite).

If an agent cannot reliably answer "What was total revenue last week?" with correct SQL, it does not matter how fluent its chain-of-thought appears.


Current State

Component      | Status      | Description
demo.db        | ✅ Ready    | 500 users, 2000 sessions, 15k events, 800 orders with injected anomalies & trends
Tasks          | 4 / 10      | 2 SQL/KPI, 1 Anomaly, 1 Trend (all passing with current agent)
Tools          | ✅ Done     | sql_query, get_schema, python_calculator
Agent (ReAct)  | ✅ LiteLLM  | Multi-step Thought → Action → Observation → FINAL_ANSWER
Scoring        | ✅ Done     | task_success, tool_correctness, step_count + Failure Analysis
Runner         | 🔜 Planned  | Batch evaluation → results.jsonl

Technical Stack

  • Language: Python 3.11+
  • LLM Interface: LiteLLM (OpenAI, Anthropic, Groq, etc.)
  • Database: Local SQLite (data/demo.db)
  • Core Tools: sql_query, get_schema, python_calculator
  • Output: Agent traces + ScoreResult per task
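
To make the three core tools concrete, here is a minimal sketch of what they might look like using only the standard library. The signatures, the return shapes, and the read-only guard are assumptions for illustration; the actual code in src/harness/tools.py may differ.

```python
import ast
import operator
import sqlite3
from typing import Any


def sql_query(conn: sqlite3.Connection, query: str, max_rows: int = 50) -> dict[str, Any]:
    """Run a query and return rows, or an error payload instead of raising."""
    # Naive read-only guard (assumption): accept only SELECT/WITH statements.
    if not query.lstrip().lower().startswith(("select", "with")):
        return {"ok": False, "error": "only SELECT/WITH queries are allowed"}
    try:
        cur = conn.execute(query)
        columns = [d[0] for d in cur.description]
        return {"ok": True, "columns": columns, "rows": cur.fetchmany(max_rows)}
    except sqlite3.Error as exc:
        # The error text goes back to the agent as an observation.
        return {"ok": False, "error": str(exc)}


def get_schema(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Map each table name to a list of 'column TYPE' strings."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema: dict[str, list[str]] = {}
    for (name,) in tables:
        cols = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        schema[name] = [f"{c[1]} {c[2]}" for c in cols]
    return schema


# Arithmetic operators the calculator accepts; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def python_calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")

    return float(ev(ast.parse(expression, mode="eval").body))
```

Returning errors as data rather than raising keeps a bad query inside the agent loop, where it can be observed and later classified.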

Project Structure

llm-agent-eval-framework/
├── README.md
├── requirements.txt
├── src/harness/
│   ├── agent.py          # LiteLLM ReAct loop
│   ├── tools.py          # SQL execution, schema introspection, calculator
│   ├── tasks.py          # Task dataclass + JSON loader
│   └── scoring.py        # Metrics + Failure Analysis
├── data/
│   ├── init_db.py        # Database generator (reproducible seed)
│   └── demo.db
├── tasks/                # Evaluation tasks (JSON)
│   ├── task_001_sql_kpi.json
│   ├── task_002_sql_kpi.json
│   ├── task_003_anomaly.json
│   └── task_004_trend.json
├── runner.py             # Main evaluation orchestrator (planned)
└── results/              # Structured evaluation traces (planned)

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Configure API keys (never commit .env)

Create a .env file in the project root (already gitignored):

LITELLM_MODEL=gpt-4o-mini
OPENAI_API_KEY=your_key_here

Use the provider key that matches your model (e.g. ANTHROPIC_API_KEY, GROQ_API_KEY).
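
A small pre-flight check can catch a mismatched model and key before any tokens are spent. The sketch below is a hypothetical helper, not part of the repository. Its prefix rules (for example, treating model names that start with "gpt" as OpenAI) are a simplifying assumption and do not reproduce LiteLLM's own provider routing.

```python
import os
from typing import Mapping

# Environment variable expected for each provider.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
}


def required_key(model: str) -> str:
    """Guess which API key variable a model name needs (simplified rules)."""
    if "/" in model:
        provider = model.split("/", 1)[0]  # e.g. "groq/<model-name>"
    elif model.startswith("gpt"):
        provider = "openai"
    elif model.startswith("claude"):
        provider = "anthropic"
    else:
        raise ValueError(f"cannot infer provider for model {model!r}")
    if provider not in PROVIDER_KEYS:
        raise ValueError(f"unknown provider {provider!r}")
    return PROVIDER_KEYS[provider]


def check_env(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (model, key_name) or raise if the configuration is incomplete."""
    model = env.get("LITELLM_MODEL")
    if not model:
        raise RuntimeError("LITELLM_MODEL is not set")
    key_name = required_key(model)
    if not env.get(key_name):
        raise RuntimeError(f"{key_name} is required for model {model!r}")
    return model, key_name
```

Passing the environment in as a mapping makes the check testable without touching the real process environment.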

3. Generate the database

python data/init_db.py

4. Run evaluation on all tasks

python test_all_tasks.py

5. Run a single task + score

python test_scoring.py

How the Agent Works

At each step, the LLM responds in a fixed format:

Thought: <reasoning>
Action: <get_schema | sql_query | python_calculator | FINAL_ANSWER>
Action Input: <tool input or numeric final answer>
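
One way to parse this format is a single regular expression, sketched below. The names ParsedStep and parse_step are hypothetical, and the parser in agent.py may handle malformed output differently.

```python
import re
from dataclasses import dataclass

VALID_ACTIONS = {"get_schema", "sql_query", "python_calculator", "FINAL_ANSWER"}

# DOTALL lets the thought and the action input span several lines.
_STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"Action:\s*(?P<action>\S+)\s*"
    r"Action Input:\s*(?P<input>.*)",
    re.DOTALL,
)


@dataclass
class ParsedStep:
    thought: str
    action: str
    action_input: str


def parse_step(text: str) -> ParsedStep:
    """Extract thought, action, and action input; raise ValueError if malformed."""
    match = _STEP_RE.search(text)
    if match is None:
        raise ValueError("response does not follow the Thought/Action/Action Input format")
    action = match.group("action")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return ParsedStep(
        thought=match.group("thought").strip(),
        action=action,
        action_input=match.group("input").strip(),
    )
```

Rejecting unknown actions at parse time gives the harness a clear place to record a tool-misuse style failure instead of silently skipping the step.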

The harness:

  1. Parses the response
  2. Executes the tool (if not FINAL_ANSWER)
  3. Appends the observation to the conversation
  4. Repeats until FINAL_ANSWER or max_steps

scoring.py compares final_answer to the task gold standard (with tolerance) and records failure categories when checks fail.


Roadmap

  1. runner.py — run all tasks, write results/results.jsonl, print summary
  2. Remaining 6 tasks — complete the 10-task benchmark set
  3. Structured logging — JSON logs for every thought/action/observation
  4. Model comparison — evaluate multiple LLMs on the same task suite

Contributing

  • Type hints everywhere
  • Clear separation: agent logic, tools, tasks, scoring
  • Reproducible data generation (random.seed(42) in init_db.py)
  • Never commit secrets (.env is gitignored)
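
To show what reproducible data generation means in practice, here is a minimal sketch of a seeded generator. It uses a local random.Random(seed) instance rather than the global random.seed(42) call mentioned above, and the orders table layout and value ranges are assumptions, not the schema produced by init_db.py.

```python
import random
import sqlite3


def build_orders(seed: int = 42, n_orders: int = 800) -> list[tuple[int, float]]:
    """Generate (user_id, amount) rows deterministically from the seed."""
    rng = random.Random(seed)  # local generator: no hidden global state
    return [
        (rng.randint(1, 500), round(rng.uniform(5.0, 200.0), 2))
        for _ in range(n_orders)
    ]


def init_db(path: str = ":memory:", seed: int = 42) -> sqlite3.Connection:
    """Create an orders table (hypothetical layout) and fill it with seeded rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (user_id, amount) VALUES (?, ?)", build_orders(seed)
    )
    conn.commit()
    return conn
```

With a fixed seed, two contributors who regenerate the database get identical rows, so gold answers in the task files stay valid.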

License

MIT — Use it to build reliable agents, not just impressive ones.


Author: Hincal Topcuoglu — hincal@topcuoglu.me

Evaluation is not the enemy of creativity. It is the only way to know whether your creativity actually works.
