Agentic RAG Audit Harness

See how content moves through an agentic-RAG pipeline. Stage by stage, decision by decision, on your own hardware. Companion to the iPullRank article "Beyond RAG: Why Every AI Search Platform Is Now Agentic — and What That Means for Your Content."

This is a local, fully observable version of the kind of agentic-RAG loop that powers ChatGPT Deep Research, Gemini Deep Research, Perplexity Pro, and AI Mode. You feed it a query plus your brand domain; it runs the query through a five-stage agentic pipeline (planner → router → retriever → synthesizer → critic) and produces a diagnostic that shows where your content was considered, how it was ranked, where it fell out, and what the agent picked instead.

It runs on Google's Gemma 4 via Ollama, finishes a single query in roughly two minutes on a workstation GPU, and writes every intermediate decision to a structured trace so you can read what the agent did instead of guessing.

Status

A working diagnostic harness. Not an ML distillation tool in the formal sense — see Extending toward distillation for what would need to be added to make it one. What it currently delivers is per-query visibility into the agentic loop: the planner's sub-queries, the router's tool choices, the retriever's candidate pool, the reranker's head-to-head verdicts, the critic's grade, and a clear funnel showing where brand content survived or got eliminated.

What it tells you

After a run, you get eight diagnostic sections in the terminal plus a JSON trace and a log file.

1. Plan & routing — the sub-queries the planner generated from your original query, the intent it classified them as, and the tool the router chose for each.

2. Retrieval funnel — every URL the agent saw, with SerpAPI rank, host, passage chunks, whether it entered the reranker pool, its pairwise win-loss-tie record, and its citation position. Sorted so cited URLs are on top, then surviving-but-uncited, then eliminated. Brand hosts highlighted.

3. Pairwise decisions — every head-to-head comparison the reranker made, with the agent's stated reasoning. This is where you see why one piece of content beat another.

4. Brand journey — a single-pane walkthrough for your --brand-domain: which sub-queries surfaced it, SerpAPI rank, chunks produced, whether it made the reranker pool, its pairwise record, and final citation status. When content falls out, the diagnostic explains the stage and gives plain-English How to compete recommendations rooted in real-world RAG mechanics (first-stage retrieval scoring, cross-encoder reranking, LLM-as-judge selection).

5. Critic — the agent's self-grading: sufficient, contradictions found, freshness concern, source-diversity concern, plus any follow-up queries it requested. When reflection is enabled, follow-ups drive another retrieval+synthesis cycle.

6. Pipeline timing — per-stage breakdown so you know which stage dominates the budget.

7. Final answer — the synthesized markdown answer with inline [n] citations.

8. Citations — deduped, numbered list aligned with the answer's [n] markers.
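The ordering used by the retrieval funnel (cited first, then surviving-but-uncited, then eliminated) can be sketched as a sort key. This is a minimal illustration, not the harness's own code; the record fields `url`, `in_pool`, and `citation_position` are hypothetical names for what a funnel row might hold.

```python
from typing import Optional, TypedDict


class FunnelRow(TypedDict):
    url: str
    in_pool: bool                      # assumed field: entered the reranker pool
    citation_position: Optional[int]   # assumed field: 1-based [n] marker, None if uncited


def funnel_stage(row: FunnelRow) -> str:
    """Classify a URL as cited, survived (in pool but uncited), or eliminated."""
    if row["citation_position"] is not None:
        return "cited"
    return "survived" if row["in_pool"] else "eliminated"


def sort_funnel(rows: list[FunnelRow]) -> list[FunnelRow]:
    """Cited rows first (by citation position), then survivors, then eliminated."""
    stage_order = {"cited": 0, "survived": 1, "eliminated": 2}
    return sorted(
        rows,
        key=lambda r: (stage_order[funnel_stage(r)], r["citation_position"] or 0),
    )
```

Because `sorted` is stable, rows within the same stage keep their incoming order (for example, SerpAPI rank).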

Quick start

Prerequisites

Python 3.10+ (3.13 works; install scrapling deps below)
Ollama installed and running
A workstation GPU (8GB+ VRAM for gemma4:e2b; more for larger variants). CPU also works but will be slow.
A SerpAPI key for retrieval (free tier)

Install

git clone <your-fork>
cd agentic-rag-distillation
pip install -r requirements.txt
playwright install                  # only if you want the playwright fetcher
ollama pull gemma4:e2b              # ~7.2 GB
cp .env.example .env                # then fill in SERPAPI_KEY and BRAND_DOMAIN

Configure Ollama for this workload

In your system environment variables (Windows: System → Edit environment variables for your account; macOS/Linux: shell rc file):

OLLAMA_CONTEXT_LENGTH=8192          # 2048 default truncates everything
OLLAMA_NUM_PARALLEL=4               # serve multiple agent calls concurrently
OLLAMA_FLASH_ATTENTION=1            # meaningful speedup on most GPUs
OLLAMA_KV_CACHE_TYPE=q8_0           # quantize KV cache, recovers VRAM

Restart Ollama after setting these. Verify with ollama ps while a run is in progress — you should see 100% GPU and a CONTEXT value matching OLLAMA_CONTEXT_LENGTH.

Run a single query

python -m examples.run_audit \
    --query "what is relevance engineering?" \
    --brand-domain "ipullrank.com" \
    --trace-out traces/run-001.json

A typical run on gemma4:e2b finishes in 60–120 seconds. You'll see tqdm progress bars during retrieval and pairwise rerank, then the eight diagnostic sections, then a final summary line pointing to the trace JSON and the log file.

Inspect a trace

The trace JSON has every node's inputs, outputs, and timing. For a richer view:

streamlit run examples/view_trace.py -- traces/run-001.json
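For a quick look without Streamlit, a trace can also be opened with the standard library. The sketch below makes no assumption about the trace schema: it only reports the top-level keys it finds and how large each one is. The function names are illustrative, not part of the harness.

```python
import json
from pathlib import Path


def load_trace(path: str) -> dict:
    """Read a trace JSON file such as traces/run-001.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def describe_trace(trace: dict) -> dict[str, str]:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in trace.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} items"
        else:
            summary[key] = type(value).__name__
    return summary
```

Calling `describe_trace(load_trace("traces/run-001.json"))` gives a one-screen map of the file before you drill into any single node.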

Roll up across a query set

After running a batch of queries (each writes its own trace JSON into traces/):

python -m examples.view_program \
    --trace-dir traces/ \
    --brand-domain "ipullrank.com"

This computes five of the six measurement metrics from docs/measurement.md (sub-query coverage, retrieval-to-citation ratio, reflection survival, tool-call inclusion, and stage-failure rate by stage), plus a per-query funnel table. The sixth metric, bridge-entity centrality, is not implemented yet.

Configuration

All config is environment-driven via .env. Defaults are tuned for a single workstation GPU running gemma4:e2b. Adjust depending on hardware and how thorough you need the audit to be.

Models (Ollama)

GEMMA_BACKEND=ollama
GEMMA_PLANNER_MODEL=google/gemma-4-E2B-it       # maps to gemma4:e2b
GEMMA_ROUTER_MODEL=google/gemma-4-E2B-it
GEMMA_SYNTHESIZER_MODEL=google/gemma-4-E2B-it
GEMMA_CRITIC_MODEL=google/gemma-4-E2B-it

Available variants on Ollama: gemma4:e2b (7.2GB), gemma4:e4b (9.6GB), gemma4:26b MoE (18GB), gemma4:31b dense (20GB), gemma4:31b-cloud (Ollama-hosted). Upgrading the planner and critic models to gemma4:e4b is the single highest-leverage quality improvement if you have the VRAM.

Agent loop sizing

AGENT_MAX_SUBQUERIES=5              # how many ways the planner decomposes the query
AGENT_RETRIEVER_TOP_K=5              # URLs fetched per sub-query
AGENT_PAIRWISE_TOP_K=6              # candidates entering the reranker
AGENT_MAX_REFLECTION_LOOPS=2        # critic-driven follow-up cycles

Bump AGENT_PAIRWISE_TOP_K to 8–10 for wider competition (adds ~20s per increase on gemma4:e2b). Raise AGENT_MAX_REFLECTION_LOOPS to 3 for production-grade thoroughness (~+30s per loop).
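As a rough illustration of how these four variables interact, the sketch below reads them from an environment mapping with the defaults shown above. It is not the harness's config.py; the `LoopConfig` class and `load_loop_config` function are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LoopConfig:
    max_subqueries: int
    retriever_top_k: int
    pairwise_top_k: int
    max_reflection_loops: int

    @property
    def max_urls_per_pass(self) -> int:
        """Upper bound on URLs fetched in one retrieval pass, before deduplication."""
        return self.max_subqueries * self.retriever_top_k


def load_loop_config(env: Optional[Mapping[str, str]] = None) -> LoopConfig:
    """Read the AGENT_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return LoopConfig(
        max_subqueries=int(env.get("AGENT_MAX_SUBQUERIES", "5")),
        retriever_top_k=int(env.get("AGENT_RETRIEVER_TOP_K", "5")),
        pairwise_top_k=int(env.get("AGENT_PAIRWISE_TOP_K", "6")),
        max_reflection_loops=int(env.get("AGENT_MAX_REFLECTION_LOOPS", "2")),
    )
```

With the defaults, one retrieval pass fetches at most 5 × 5 = 25 URLs, of which 6 passages enter the reranker.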

Retrieval

SERPAPI_KEY=your_key                # required
SCRAPLING_FETCHER=basic             # basic | playwright | stealth
RETRIEVER_FETCH_WORKERS=8           # parallel URL fetches

basic is fastest; move to playwright or stealth only if you see 403/429 in the log. Stealth requires python -m scrapling install to fetch the Camoufox browser.

Chunking

LangExtract (LLM-based structured chunking) is off by default because at e2b/e4b model sizes the JSON parse rate is unreliable and it can take minutes per article. Heuristic paragraph chunking is used instead. To turn LangExtract on (only worth it with a 26B+ model):

LANGEXTRACT_ENABLED=1
LANGEXTRACT_MODEL_ID=gemma4:26b
LANGEXTRACT_MAX_CHAR_BUFFER=3000
LANGEXTRACT_WORKERS=1               # keep at 1 — retriever serializes anyway
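For comparison, heuristic paragraph chunking of the kind used by default can be sketched in a few lines. This is an assumption-laden illustration, not the code in chunker.py: the `Chunk` class, the `chunk_paragraphs` function, and the 1,200-character limit are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    url: str
    text: str
    score: float = 0.0  # heuristic chunks carry no relevance score


def chunk_paragraphs(url: str, body: str, max_chars: int = 1200) -> list[Chunk]:
    """Split on blank lines, then pack paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    chunks: list[Chunk] = []
    buffer = ""
    for para in paragraphs:
        candidate = f"{buffer}\n\n{para}" if buffer else para
        if buffer and len(candidate) > max_chars:
            # Adding this paragraph would overflow: flush and start a new chunk.
            chunks.append(Chunk(url, buffer))
            buffer = para
        else:
            buffer = candidate
    if buffer:
        chunks.append(Chunk(url, buffer))
    return chunks
```

No model call is involved, which is why this path takes milliseconds per article instead of minutes.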

How the pipeline works

                user query + brand domain
                          │
                          ▼
                ┌──────────────────┐
                │     PLANNER      │  decomposes into sub-queries
                └────────┬─────────┘
                         │
                         ▼
                ┌──────────────────┐
                │     ROUTER       │  picks a tool per sub-query
                └─┬──────┬──────┬──┘
                  │      │      │
                  ▼      ▼      ▼
                  Phase 1: parallel I/O
                  ┌───────────────────┐
                  │  SerpAPI seed     │
                  │  → Scrapling      │   (or trafilatura.fetch_url
                  │  → Trafilatura    │    if scrapling unavailable)
                  └────────┬──────────┘
                           │
                           ▼
                  Phase 2: serial GPU
                  ┌───────────────────┐
                  │  Heuristic chunker│  (LangExtract opt-in)
                  └────────┬──────────┘
                           │
                           ▼
                ┌──────────────────┐
                │   SYNTHESIZER    │  diverse seed → pairwise rerank →
                └────────┬─────────┘  unique-URL citation composition
                         │
                         ▼
                ┌──────────────────┐
                │      CRITIC      │  line-format grading; can trigger
                └─┬──────────────┬─┘  reflection loop (follow-up retrieval)
                  │              │
              reflection         ship
                  │              │
                  └──────────────┴────► final answer + trace + diagnostics

Two design choices worth knowing about:

Retrieval is two-phase. Phase 1 (HTML fetch + body extract) runs across 8 parallel workers; it is pure network I/O, so it is safe to push. Phase 2 (chunking) runs serially because the GPU is the shared bottleneck, and parallel callers just queue and time out together.
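The two-phase shape can be sketched with a thread pool for the fetches and a plain loop for the chunking. This is a minimal sketch under stated assumptions, not the code in retriever.py: `two_phase_retrieve` is a hypothetical name, and the `fetch` and `chunk` callables stand in for whatever fetcher and chunker are configured.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def two_phase_retrieve(
    urls: list[str],
    fetch: Callable[[str], str],
    chunk: Callable[[str, str], list[str]],
    fetch_workers: int = 8,
) -> dict[str, list[str]]:
    """Fetch all URLs concurrently, then chunk the bodies one at a time."""
    # Phase 1: network-bound work, so many threads can wait on I/O at once.
    with ThreadPoolExecutor(max_workers=fetch_workers) as pool:
        bodies = list(pool.map(fetch, urls))  # map preserves input order
    # Phase 2: strictly serial, so a GPU-backed chunker never competes with itself.
    return {url: chunk(url, body) for url, body in zip(urls, bodies)}
```

A production version would also catch per-URL fetch errors so that one failed request does not abort the whole pass.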

The pairwise seed is URL-diversified. Heuristic chunks all have score=0, so a naive top-k-by-score selection would pack the pool with whichever URL was processed first. synthesizer._build_diverse_seed round-robins across unique URLs so every source gets at least one passage in the pool before any URL doubles up. This is what makes the brand-journey diagnostic meaningful — competing content actually competes.
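The round-robin idea can be illustrated as follows. This is a sketch of the technique as described above, not the actual body of synthesizer._build_diverse_seed; the `(url, text)` tuple representation and the function name are assumptions.

```python
def build_diverse_seed(
    passages: list[tuple[str, str]], pool_size: int
) -> list[tuple[str, str]]:
    """Pick up to pool_size (url, text) passages, cycling across unique URLs.

    Every URL contributes its first passage before any URL contributes a second.
    """
    by_url: dict[str, list[str]] = {}
    for url, text in passages:  # dicts keep insertion order, so URL order is stable
        by_url.setdefault(url, []).append(text)

    seed: list[tuple[str, str]] = []
    depth = 0
    while len(seed) < pool_size:
        added = False
        for url, texts in by_url.items():
            if depth < len(texts):
                seed.append((url, texts[depth]))
                added = True
                if len(seed) == pool_size:
                    return seed
        if not added:  # every URL is exhausted
            break
        depth += 1
    return seed
```

With passages from URL `a` (three chunks), `b`, and `c` (one chunk each) and a pool of four, a naive slice would take three chunks from `a` and one from `b`; the round-robin takes one from each URL and then a second from `a`.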

Extending toward distillation

The current harness is a diagnostic tool: it tells you, per query, what the agent decided. To turn it into a real agentic-RAG distillation program that calibrates against production systems and aggregates measurement over time, here are the missing pieces in priority order. All of them consume the trace JSON that's already being written — no changes needed to the agent code itself.

1. Bridge-entity centrality (the one metric still missing)

compute_metrics() in src/distill/audit.py now implements five of the six metrics from docs/measurement.md: sub-query coverage, retrieval-to-citation ratio, reflection survival, tool-call inclusion, stage-failure rate. Bridge-entity centrality is the holdout — it needs:

Entity extraction over each sub-query (spaCy NER, or LangExtract with LANGEXTRACT_ENABLED=1)
A topical graph — entities become nodes, co-occurrence within sub-queries become edges
Centrality scoring — does brand content appear at high-betweenness nodes (entities that bridge multiple sub-graphs), or only at peripheral ones?

Use NetworkX or a similar library. The trace already captures sub-queries and entity-tagged passages when LangExtract is enabled.
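The graph and centrality steps can be sketched without any dependency, assuming entity extraction has already produced one list of entity strings per sub-query (that input format is an assumption). The betweenness function below is a plain implementation of Brandes' algorithm for unweighted, undirected graphs; in practice NetworkX's `betweenness_centrality` would replace it.

```python
from collections import deque
from itertools import combinations


def build_cooccurrence_graph(subquery_entities: list[list[str]]) -> dict[str, set[str]]:
    """Entities are nodes; two entities share an edge if they co-occur in a sub-query."""
    graph: dict[str, set[str]] = {}
    for entities in subquery_entities:
        unique = sorted(set(entities))
        for entity in unique:
            graph.setdefault(entity, set())
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness(graph: dict[str, set[str]]) -> dict[str, float]:
    """Unnormalized betweenness centrality via Brandes' algorithm."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Forward pass: BFS that counts shortest paths (sigma) from the source.
        order: list[str] = []
        preds: dict[str, list[str]] = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward pass: accumulate each node's share of shortest paths.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}
```

Entities with a high score bridge otherwise separate sub-queries; the remaining step is to check which of those entities appear in passages from the brand domain.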

2. Batch runner

examples/run_batch.py (doesn't exist yet) — takes a YAML query set like examples/example_query_set.yaml, runs each query, writes a trace per query into a directory. A few hours of work; unlocks (3).
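One way such a runner might start is by turning a list of queries into one run_audit invocation each, reusing the flags shown in the quick start. The sketch below assumes the query set has already been loaded into a plain list of strings (the YAML schema is not specified here), and `build_audit_commands` is a hypothetical helper name.

```python
import sys
from pathlib import Path


def build_audit_commands(
    queries: list[str], brand_domain: str, trace_dir: str = "traces"
) -> list[list[str]]:
    """Build one `python -m examples.run_audit` argv list per query."""
    commands = []
    for index, query in enumerate(queries, start=1):
        trace_path = Path(trace_dir) / f"run-{index:03d}.json"
        commands.append([
            sys.executable, "-m", "examples.run_audit",
            "--query", query,
            "--brand-domain", brand_domain,
            "--trace-out", trace_path.as_posix(),
        ])
    return commands
```

Each command can then be executed in turn with `subprocess.run(cmd, check=True)`; running them sequentially keeps a single GPU from being oversubscribed.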

3. Cross-query rollup viewer

examples/view_program.py — equivalent of run_audit.py's output but over a directory of traces. Shows the six metrics, per-host citation share, stage-failure heatmap, query-coverage drilldown. This is what makes the harness a program rather than a one-off.

4. Real production diff

compare_to_production.py exists but audit.diff() uses keyword-overlap (_share_keywords with a 0.5 threshold) for sub-query alignment. To meaningfully compare against a Deep Research run, swap that for embedding cosine similarity using the same sentence-transformers model the vector retriever uses. Also add per-stage diff — tool choice, top-k retrieval overlap, ranked passage overlap — not just sub-queries and final citations.
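The swap from keyword overlap to embedding similarity could look roughly like this. The sketch assumes the sub-query embeddings have already been computed elsewhere and are passed in as plain lists of floats; the function names and the 0.7 threshold are illustrative assumptions, not values taken from audit.diff().

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align_subqueries(
    local_vecs: list[list[float]],
    prod_vecs: list[list[float]],
    threshold: float = 0.7,  # assumed cut-off; tune against real traces
) -> dict[int, Optional[int]]:
    """Map each local sub-query index to its best production match, or None."""
    alignment: dict[int, Optional[int]] = {}
    for i, local in enumerate(local_vecs):
        best_index, best_sim = None, threshold
        for j, prod in enumerate(prod_vecs):
            sim = cosine(local, prod)
            if sim >= best_sim:
                best_index, best_sim = j, sim
        alignment[i] = best_index
    return alignment
```

Unlike keyword overlap, this can align paraphrased sub-queries that share no surface tokens, at the cost of depending on the embedding model's quality.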

5. Production trace ingestion ergonomics

The README in v1 described copy-pasting Deep Research plans into a YAML template. The template exists at examples/production-template.yaml but there's no tooling. A small Streamlit form that captures the sub-queries and citations as you paste, validates against the schema, and writes the YAML would make calibration practical instead of theoretical.

6. Classical ML distillation (the named-after technique)

If you actually want to do teacher→student distillation — train a smaller model on traces from a larger one — the path is:

Run the same query set against gemma4:31b (teacher) and capture the per-stage prompts and outputs from the trace
Export (prompt, response) pairs in a format compatible with a fine-tuning library (e.g., axolotl, unsloth, trl)
Fine-tune a smaller student (e.g., a 2B model not in the Gemma 4 family) on those pairs
Evaluate the student against the teacher on held-out queries using the metrics from (1)
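The export step might be sketched as below. The trace schema here is an assumption made purely for illustration: a top-level `nodes` list whose entries carry `name`, `prompt`, and `output` strings. Adjust the keys to whatever the real trace contains; the JSONL layout is a common input shape for fine-tuning libraries but should be checked against the one you pick.

```python
import json


def export_pairs(trace: dict) -> list[dict]:
    """Collect one {"stage", "prompt", "response"} record per traced model call."""
    pairs = []
    for node in trace.get("nodes", []):  # "nodes" is an assumed key
        prompt, output = node.get("prompt"), node.get("output")
        # Skip nodes that did not call the model or failed to log both sides.
        if isinstance(prompt, str) and isinstance(output, str):
            pairs.append({
                "stage": node.get("name", "unknown"),
                "prompt": prompt,
                "response": output,
            })
    return pairs


def to_jsonl(pairs: list[dict]) -> str:
    """Serialize the pairs as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)
```

Keeping the stage name on each record makes it possible to fine-tune or evaluate per role (planner, router, critic) instead of on a mixed bag.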

This is significantly more work than (1)–(5) and crosses into MLOps territory. But if your goal is to ship a distilled-for-your-domain student model rather than to instrument the production gap, this is the path.

Performance notes

GPU detection. The first thing OllamaGemma.__post_init__ does is hit /api/ps and log whether gemma4:e2b is GPU-resident, CPU-split, or CPU-only. If it says CPU-only, runs will take 30–60× longer. Fix Ollama's GPU detection before tuning anything else.
Context window. The single most common cause of slow runs is Ollama's default 2,048-token context. Set OLLAMA_CONTEXT_LENGTH=8192 on the server and OLLAMA_NUM_CTX=8192 in .env. Long prompts otherwise get truncated and produce garbage output the parsers reject.
Scrapling deps. curl_cffi and playwright are in requirements.txt but playwright install is a separate step that fetches the browser binaries. If both are missing, the harness silently falls back to trafilatura's urllib fetcher — this works but gets blocked by ~10–20% of modern sites. The log file shows which path each fetch used.
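The GPU-detection check described above can be reproduced by hand. The sketch below is not OllamaGemma.__post_init__; it is a standalone illustration that queries Ollama's /api/ps endpoint on the default port 11434 and compares each loaded model's `size` with its `size_vram`. The function names and the three labels are assumptions chosen for the example.

```python
import json
from urllib.request import urlopen


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Return the JSON payload of Ollama's /api/ps (currently loaded models)."""
    with urlopen(f"{base_url}/api/ps", timeout=5) as response:
        return json.load(response)


def classify_residency(ps_payload: dict, model_name: str) -> str:
    """Label a loaded model as gpu-resident, cpu-split, cpu-only, or not-loaded."""
    for model in ps_payload.get("models", []):
        if model.get("name") == model_name:
            size = model.get("size", 0)
            vram = model.get("size_vram", 0)
            if vram <= 0:
                return "cpu-only"
            # Fully resident when every byte of the model sits in VRAM.
            return "gpu-resident" if vram >= size else "cpu-split"
    return "not-loaded"
```

Run `classify_residency(fetch_ps(), "gemma4:e2b")` while an audit is in progress; a model only appears in the payload while Ollama has it loaded.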

Project layout

agentic-rag-distillation/
├── README.md                       (this file)
├── requirements.txt
├── pyproject.toml
├── .env / .env.example
├── src/distill/                    (package kept as "distill" for import compatibility)
│   ├── config.py                   env-driven configuration
│   ├── graph.py                    five-node agentic loop + reflection
│   ├── planner.py                  sub-query decomposition
│   ├── router.py                   tool selection (multi-strategy parser)
│   ├── retriever.py                two-phase web retrieval
│   ├── fetcher.py                  scrapling + trafilatura wrapper
│   ├── chunker.py                  heuristic + LangExtract chunking
│   ├── synthesizer.py              diverse seed + pairwise rerank + citation
│   ├── critic.py                   line-format grading + follow-up queries
│   ├── models.py                   Ollama / local / hosted Gemma wrappers
│   ├── trace.py                    structured trace logging
│   └── audit.py                    production diff + compute_metrics() (five of six metrics)
├── prompts/                        prompt templates (currently informational)
├── examples/
│   ├── run_audit.py                main entry point — single query
│   ├── view_program.py             roll up the six metrics across a trace directory
│   ├── compare_to_production.py    diff against production Deep Research YAML
│   ├── view_trace.py               Streamlit single-trace viewer
│   ├── example_query_set.yaml      sample query set
│   └── production-template.yaml    Deep Research capture template
├── docs/
│   ├── architecture.md
│   ├── calibration.md
│   └── measurement.md              the six metrics to compute
└── traces/                         output directory (auto-created)

Caveats

The local agent is not the production system. Calibrate against visible Deep Research plans before treating the trace as ground truth.
The retriever uses SerpAPI for seed URLs. Brand-content visibility depends on what shows up in those results.
Pairwise verdicts on gemma4:e2b are directional, not authoritative. The reranker quality goes up materially with gemma4:e4b or gemma4:26b.
The critic occasionally still produces output that doesn't conform — robust parsing recovers from most cases, but the more sophisticated the agent loop you want, the more you'll want a larger model for the critic role specifically.

License

Apache 2.0. Use it, modify it, ship it.

Contact

iPullRank — ipullrank.com/contact
