"Evaluation Over Hype" — Treat the agent as a production system. Focus on measurable, reproducible evaluation rather than impressive demos.
A modular, type-safe Python framework for evaluating tool-using LLM agents on analytical tasks against a local SQLite database. Designed for reliability engineering, not just proof-of-concept loops.
Most agent demos optimize for the happy path. This harness optimizes for truth.
- Every task has a gold standard.
- Every failure is categorized (
SQL_ERROR,HALLUCINATION,INCOMPLETE,TOOL_MISUSE,MAX_STEPS,UNKNOWN). - Every run produces structured, queryable traces.
- The system is provider-agnostic (LiteLLM) and database-isolated (local SQLite).
If an agent cannot reliably answer "What was total revenue last week?" with correct SQL, it does not matter how fluent its chain-of-thought appears.
| Component | Status | Description |
|---|---|---|
demo.db |
✅ Ready | 500 users, 2000 sessions, 15k events, 800 orders with injected anomalies & trends |
| Tasks | 4 / 10 | 2 SQL/KPI, 1 Anomaly, 1 Trend (all passing with current agent) |
| Tools | ✅ Done | sql_query, get_schema, python_calculator |
| Agent (ReAct) | ✅ LiteLLM | Multi-step Thought → Action → Observation → FINAL_ANSWER |
| Scoring | ✅ Done | task_success, tool_correctness, step_count + Failure Analysis |
| Runner | 🔜 Planned | Batch evaluation → results.jsonl |
- Language: Python 3.11+
- LLM Interface: LiteLLM (OpenAI, Anthropic, Groq, etc.)
- Database: Local SQLite (
data/demo.db) - Core Tools:
sql_query,get_schema,python_calculator - Output: Agent traces +
ScoreResultper task
llm-agent-eval-framework/
├── README.md
├── requirements.txt
├── src/harness/
│ ├── agent.py # LiteLLM ReAct loop
│ ├── tools.py # SQL execution, schema introspection, calculator
│ ├── tasks.py # Task dataclass + JSON loader
│ └── scoring.py # Metrics + Failure Analysis
├── data/
│ ├── init_db.py # Database generator (reproducible seed)
│ └── demo.db
├── tasks/ # Evaluation tasks (JSON)
│ ├── task_001_sql_kpi.json
│ ├── task_002_sql_kpi.json
│ ├── task_003_anomaly.json
│ └── task_004_trend.json
├── runner.py # Main evaluation orchestrator (planned)
└── results/ # Structured evaluation traces (planned)
pip install -r requirements.txtCreate a .env file in the project root (already gitignored):
LITELLM_MODEL=gpt-4o-mini
OPENAI_API_KEY=your_key_hereUse the provider key that matches your model (e.g. ANTHROPIC_API_KEY, GROQ_API_KEY).
python data/init_db.pypython test_all_tasks.pypython test_scoring.pyEach step, the LLM responds in a fixed format:
Thought: <reasoning>
Action: <get_schema | sql_query | python_calculator | FINAL_ANSWER>
Action Input: <tool input or numeric final answer>
The harness:
- Parses the response
- Executes the tool (if not
FINAL_ANSWER) - Appends the observation to the conversation
- Repeats until
FINAL_ANSWERormax_steps
scoring.py compares final_answer to the task gold standard (with tolerance) and records failure categories when checks fail.
runner.py— run all tasks, writeresults/results.jsonl, print summary- Remaining 6 tasks — complete the 10-task benchmark set
- Structured logging — JSON logs for every thought/action/observation
- Model comparison — evaluate multiple LLMs on the same task suite
- Type hints everywhere
- Clear separation: agent logic, tools, tasks, scoring
- Reproducible data generation (
random.seed(42)ininit_db.py) - Never commit secrets (
.envis gitignored)
MIT — Use it to build reliable agents, not just impressive ones.
Author: Hincal Topcuoglu — hincal@topcuoglu.me
Evaluation is not the enemy of creativity. It is the only way to know whether your creativity actually works.