LLM Agent Evaluation & Reliability Harness

"Evaluation Over Hype" — Treat the agent as a production system. Focus on measurable, reproducible evaluation rather than impressive demos.

A modular, type-safe Python framework for evaluating tool-using LLM agents on analytical tasks against a local SQLite database. Designed for reliability engineering, not just proof-of-concept loops.

Philosophy

Most agent demos optimize for the happy path. This harness optimizes for truth.

Every task has a gold standard.
Every failure is categorized (SQL_ERROR, HALLUCINATION, INCOMPLETE, TOOL_MISUSE, MAX_STEPS, UNKNOWN).
Every run produces structured, queryable traces.
The system is provider-agnostic (LiteLLM) and database-isolated (local SQLite).

If an agent cannot reliably answer "What was total revenue last week?" with correct SQL, it does not matter how fluent its chain-of-thought appears.

Current State

Component	Status	Description
`demo.db`	✅ Ready	500 users, 2000 sessions, 15k events, 800 orders with injected anomalies & trends
Tasks	4 / 10	2 SQL/KPI, 1 Anomaly, 1 Trend (all passing with current agent)
Tools	✅ Done	`sql_query`, `get_schema`, `python_calculator`
Agent (ReAct)	✅ LiteLLM	Multi-step Thought → Action → Observation → `FINAL_ANSWER`
Scoring	✅ Done	`task_success`, `tool_correctness`, `step_count` + Failure Analysis
Runner	🔜 Planned	Batch evaluation → `results.jsonl`

Technical Stack

Language: Python 3.11+
LLM Interface: LiteLLM (OpenAI, Anthropic, Groq, etc.)
Database: Local SQLite (data/demo.db)
Core Tools: sql_query, get_schema, python_calculator
Output: Agent traces + ScoreResult per task

Project Structure

llm-agent-eval-framework/
├── README.md
├── requirements.txt
├── src/harness/
│   ├── agent.py          # LiteLLM ReAct loop
│   ├── tools.py          # SQL execution, schema introspection, calculator
│   ├── tasks.py          # Task dataclass + JSON loader
│   └── scoring.py        # Metrics + Failure Analysis
├── data/
│   ├── init_db.py        # Database generator (reproducible seed)
│   └── demo.db
├── tasks/                # Evaluation tasks (JSON)
│   ├── task_001_sql_kpi.json
│   ├── task_002_sql_kpi.json
│   ├── task_003_anomaly.json
│   └── task_004_trend.json
├── runner.py             # Main evaluation orchestrator (planned)
└── results/              # Structured evaluation traces (planned)

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Configure API keys (never commit `.env`)

Create a .env file in the project root (already gitignored):

LITELLM_MODEL=gpt-4o-mini
OPENAI_API_KEY=your_key_here

Use the provider key that matches your model (e.g. ANTHROPIC_API_KEY, GROQ_API_KEY).

3. Generate the database

python data/init_db.py

4. Run evaluation on all tasks

python test_all_tasks.py

5. Run a single task + score

python test_scoring.py

How the Agent Works

Each step, the LLM responds in a fixed format:

Thought: <reasoning>
Action: <get_schema | sql_query | python_calculator | FINAL_ANSWER>
Action Input: <tool input or numeric final answer>

The harness:

Parses the response
Executes the tool (if not FINAL_ANSWER)
Appends the observation to the conversation
Repeats until FINAL_ANSWER or max_steps

scoring.py compares final_answer to the task gold standard (with tolerance) and records failure categories when checks fail.

Roadmap

runner.py — run all tasks, write results/results.jsonl, print summary
Remaining 6 tasks — complete the 10-task benchmark set
Structured logging — JSON logs for every thought/action/observation
Model comparison — evaluate multiple LLMs on the same task suite

Contributing

Type hints everywhere
Clear separation: agent logic, tools, tasks, scoring
Reproducible data generation (random.seed(42) in init_db.py)
Never commit secrets (.env is gitignored)

License

MIT — Use it to build reliable agents, not just impressive ones.

Author: Hincal Topcuoglu — hincal@topcuoglu.me

Evaluation is not the enemy of creativity. It is the only way to know whether your creativity actually works.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Agent Evaluation & Reliability Harness

Philosophy

Current State

Technical Stack

Project Structure

Quick Start

1. Install dependencies

2. Configure API keys (never commit `.env`)

3. Generate the database

4. Run evaluation on all tasks

5. Run a single task + score

How the Agent Works

Roadmap

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
src/harness		src/harness
tasks		tasks
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

LLM Agent Evaluation & Reliability Harness

Philosophy

Current State

Technical Stack

Project Structure

Quick Start

1. Install dependencies

2. Configure API keys (never commit .env)

3. Generate the database

4. Run evaluation on all tasks

5. Run a single task + score

How the Agent Works

Roadmap

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

2. Configure API keys (never commit `.env`)

Packages