See how content moves through an agentic-RAG pipeline. Stage by stage, decision by decision, on your own hardware. Companion to the iPullRank article "Beyond RAG: Why Every AI Search Platform Is Now Agentic — and What That Means for Your Content."
This is a local, fully observable version of the kind of agentic-RAG loop that powers ChatGPT Deep Research, Gemini Deep Research, Perplexity Pro, and AI Mode. You feed it a query plus your brand domain; it runs the query through a five-stage agentic pipeline (planner → router → retriever → synthesizer → critic) and produces a diagnostic that shows where your content was considered, how it was ranked, where it fell out, and what the agent picked instead.
It runs on Google's Gemma 4 via Ollama, finishes a single query in roughly two minutes on a workstation GPU, and writes every intermediate decision to a structured trace so you can read what the agent did instead of guessing.
A working diagnostic harness. Not an ML distillation tool in the formal sense — see Extending toward distillation for what would need to be added to make it one. What it currently delivers is per-query visibility into the agentic loop: the planner's sub-queries, the router's tool choices, the retriever's candidate pool, the reranker's head-to-head verdicts, the critic's grade, and a clear funnel showing where brand content survived or got eliminated.
After a run, you get eight diagnostic sections in the terminal plus a JSON trace and a log file.
1. Plan & routing — the sub-queries the planner generated from your original query, the intent it classified them as, and the tool the router chose for each.
2. Retrieval funnel — every URL the agent saw, with SerpAPI rank, host, passage chunks, whether it entered the reranker pool, its pairwise win-loss-tie record, and its citation position. Sorted so cited URLs are on top, then surviving-but-uncited, then eliminated. Brand hosts highlighted.
3. Pairwise decisions — every head-to-head comparison the reranker made, with the agent's stated reasoning. This is where you see why one piece of content beat another.
4. Brand journey — a single-pane walkthrough for your --brand-domain: which sub-queries surfaced it, SerpAPI rank, chunks produced, whether it made the reranker pool, its pairwise record, and final citation status. When content falls out, the diagnostic explains the stage and gives plain-English How to compete recommendations rooted in real-world RAG mechanics (first-stage retrieval scoring, cross-encoder reranking, LLM-as-judge selection).
5. Critic — the agent's self-grading: sufficient, contradictions found, freshness concern, source-diversity concern, plus any follow-up queries it requested. When reflection is enabled, follow-ups drive another retrieval+synthesis cycle.
6. Pipeline timing — per-stage breakdown so you know which stage dominates the budget.
7. Final answer — the synthesized markdown answer with inline [n] citations.
8. Citations — deduped, numbered list aligned with the answer's [n] markers.
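The funnel ordering described in section 2 (cited first, then surviving-but-uncited, then eliminated) can be sketched as a sort key. This is a minimal illustration, not the harness's own code: the FunnelRow fields and the reading of "surviving" as "entered the reranker pool" are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunnelRow:
    # Hypothetical row shape; the real trace may name these fields differently.
    url: str
    in_rerank_pool: bool              # did the URL enter the pairwise reranker?
    citation_position: Optional[int]  # 1-based [n] marker, None if uncited


def funnel_sort_key(row: FunnelRow) -> tuple:
    """Cited rows first (by citation position), then uncited survivors, then eliminated."""
    if row.citation_position is not None:
        return (0, row.citation_position)
    if row.in_rerank_pool:
        return (1, 0)
    return (2, 0)


rows = [
    FunnelRow("https://a.example/post", in_rerank_pool=False, citation_position=None),
    FunnelRow("https://b.example/guide", in_rerank_pool=True, citation_position=None),
    FunnelRow("https://c.example/faq", in_rerank_pool=True, citation_position=2),
    FunnelRow("https://d.example/docs", in_rerank_pool=True, citation_position=1),
]
ordered = sorted(rows, key=funnel_sort_key)
```

Because Python's sort is stable, rows that tie within a tier keep their incoming order, for example SerpAPI rank if the rows arrive sorted that way.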
- Python 3.10+ (3.13 works; install scrapling deps below)
- Ollama installed and running (download)
- A workstation GPU (8GB+ VRAM for gemma4:e2b; more for larger variants). CPU also works but will be slow.
- A SerpAPI key for retrieval (free tier)
git clone <your-fork>
cd agentic-rag-distillation
pip install -r requirements.txt
playwright install # only if you want the playwright fetcher
ollama pull gemma4:e2b # ~7.2 GB
cp .env.example .env # then fill in SERPAPI_KEY and BRAND_DOMAIN
In your system environment variables (Windows: System → Edit environment variables for your account; macOS/Linux: shell rc file):
OLLAMA_CONTEXT_LENGTH=8192 # 2048 default truncates everything
OLLAMA_NUM_PARALLEL=4 # serve multiple agent calls concurrently
OLLAMA_FLASH_ATTENTION=1 # meaningful speedup on most GPUs
OLLAMA_KV_CACHE_TYPE=q8_0 # quantize KV cache, recovers VRAM
Restart Ollama after setting these. Verify with ollama ps while a run is in progress — you should see 100% GPU and a CONTEXT value matching OLLAMA_CONTEXT_LENGTH.
python -m examples.run_audit \
--query "what is relevance engineering?" \
--brand-domain "ipullrank.com" \
--trace-out traces/run-001.json
A typical run on gemma4:e2b finishes in 60–120 seconds. You'll see tqdm progress bars during retrieval and pairwise rerank, then the eight diagnostic sections, then a final summary line pointing to the trace JSON and the log file.
The trace JSON has every node's inputs, outputs, and timing. For a richer view:
streamlit run examples/view_trace.py -- traces/run-001.json
After running a batch of queries (each writes its own trace JSON into traces/):
python -m examples.view_program \
--trace-dir traces/ \
--brand-domain "ipullrank.com"
This computes five of the six measurement metrics from docs/measurement.md (sub-query coverage, retrieval-to-citation ratio, reflection survival, tool-call inclusion, and stage-failure rate by stage; bridge-entity centrality is not yet implemented) plus a per-query funnel table.
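To give a feel for what a cross-trace metric looks like, here is a rough sketch of a retrieval-to-citation ratio. Everything about it is an assumption: the trace keys (retrieved_urls, cited_urls) are hypothetical, and the definition used here (brand URLs cited divided by brand URLs retrieved) may differ from the one in docs/measurement.md.

```python
from urllib.parse import urlparse


def is_brand(url: str, brand_domain: str) -> bool:
    """True when the URL's host is the brand domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    brand = brand_domain.lower()
    return host == brand or host.endswith("." + brand)


def retrieval_to_citation_ratio(traces, brand_domain):
    """Assumed definition: brand URLs cited / brand URLs retrieved, over all traces."""
    retrieved = cited = 0
    for trace in traces:
        # "retrieved_urls" and "cited_urls" are hypothetical trace keys.
        retrieved += sum(is_brand(u, brand_domain) for u in trace["retrieved_urls"])
        cited += sum(is_brand(u, brand_domain) for u in trace["cited_urls"])
    return cited / retrieved if retrieved else None


sample_traces = [
    {
        "retrieved_urls": [
            "https://ipullrank.com/a",
            "https://www.ipullrank.com/b",
            "https://other.example/c",
        ],
        "cited_urls": ["https://ipullrank.com/a"],
    },
    {
        "retrieved_urls": ["https://ipullrank.com/d", "https://notipullrank.com/x"],
        "cited_urls": [],
    },
]
ratio = retrieval_to_citation_ratio(sample_traces, "ipullrank.com")
```

The suffix check is deliberately anchored on a leading dot so that a lookalike host such as notipullrank.com does not count as brand content.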
All config is environment-driven via .env. Defaults are tuned for a single workstation GPU running gemma4:e2b. Adjust depending on hardware and how thorough you need the audit to be.
GEMMA_BACKEND=ollama
GEMMA_PLANNER_MODEL=google/gemma-4-E2B-it # maps to gemma4:e2b
GEMMA_ROUTER_MODEL=google/gemma-4-E2B-it
GEMMA_SYNTHESIZER_MODEL=google/gemma-4-E2B-it
GEMMA_CRITIC_MODEL=google/gemma-4-E2B-it
Available variants on Ollama: gemma4:e2b (7.2GB), gemma4:e4b (9.6GB), gemma4:26b MoE (18GB), gemma4:31b dense (20GB), gemma4:31b-cloud (Ollama-hosted). Upgrading the planner and critic models to gemma4:e4b is the single highest-leverage quality improvement if you have the VRAM.
AGENT_MAX_SUBQUERIES=5 # how many ways the planner decomposes the query
AGENT_RETRIEVER_TOP_K=5 # URLs fetched per sub-query
AGENT_PAIRWISE_TOP_K=6 # candidates entering the reranker
AGENT_MAX_REFLECTION_LOOPS=2 # critic-driven follow-up cycles
Bump AGENT_PAIRWISE_TOP_K to 8–10 for wider competition (adds 20s per increase on gemma4:e2b). Raise AGENT_MAX_REFLECTION_LOOPS to 3 for production-grade thoroughness (+30s per loop).
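For readers wiring these knobs into their own scripts, one way to read them with the documented defaults is sketched below. The AgentKnobs class and load_knobs function are hypothetical names for illustration; src/distill/config.py may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AgentKnobs:
    # Defaults mirror the values listed above.
    max_subqueries: int = 5
    retriever_top_k: int = 5
    pairwise_top_k: int = 6
    max_reflection_loops: int = 2


def load_knobs(env: Mapping[str, str] = os.environ) -> AgentKnobs:
    """Read the AGENT_* variables, falling back to defaults when unset or empty."""
    defaults = AgentKnobs()

    def read(name: str, default: int) -> int:
        raw = env.get(name, "").strip()
        return int(raw) if raw else default

    return AgentKnobs(
        max_subqueries=read("AGENT_MAX_SUBQUERIES", defaults.max_subqueries),
        retriever_top_k=read("AGENT_RETRIEVER_TOP_K", defaults.retriever_top_k),
        pairwise_top_k=read("AGENT_PAIRWISE_TOP_K", defaults.pairwise_top_k),
        max_reflection_loops=read(
            "AGENT_MAX_REFLECTION_LOOPS", defaults.max_reflection_loops
        ),
    )


wider = load_knobs({"AGENT_PAIRWISE_TOP_K": "8"})
```

Passing the environment as a mapping keeps the loader easy to test without touching the real process environment.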
SERPAPI_KEY=your_key # required
SCRAPLING_FETCHER=basic # basic | playwright | stealth
RETRIEVER_FETCH_WORKERS=8 # parallel URL fetches
basic is fastest; move to playwright or stealth only if you see 403/429 in the log. Stealth requires python -m scrapling install to fetch the Camoufox browser.
LangExtract (LLM-based structured chunking) is off by default because at e2b/e4b model sizes the JSON parse rate is unreliable and it can take minutes per article. Heuristic paragraph chunking is used instead. To turn LangExtract on (only worth it with a 26B+ model):
LANGEXTRACT_ENABLED=1
LANGEXTRACT_MODEL_ID=gemma4:26b
LANGEXTRACT_MAX_CHAR_BUFFER=3000
LANGEXTRACT_WORKERS=1 # keep at 1 — retriever serializes anyway
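As a rough idea of what heuristic paragraph chunking can look like, the sketch below greedily packs blank-line-separated paragraphs into chunks up to a size limit. It is an assumption-laden illustration, not the code in chunker.py; the function name and the max_chars value are hypothetical.

```python
import re


def chunk_paragraphs(text: str, max_chars: int = 1200) -> list:
    """Greedily pack blank-line-separated paragraphs into chunks of <= max_chars.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 50 + "\n\n" + "b" * 50 + "\n\n" + "c" * 50
chunks = chunk_paragraphs(sample, max_chars=110)
```

No model call is involved, which is why this path is fast and never fails to parse.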
user query + brand domain
         │
         ▼
┌──────────────────┐
│     PLANNER      │  decomposes into sub-queries
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│      ROUTER      │  picks a tool per sub-query
└─┬──────┬──────┬──┘
  │      │      │
  ▼      ▼      ▼
Phase 1: parallel I/O
┌───────────────────┐
│  SerpAPI seed     │
│  → Scrapling      │  (or trafilatura.fetch_url
│  → Trafilatura    │   if scrapling unavailable)
└────────┬──────────┘
         │
         ▼
Phase 2: serial GPU
┌───────────────────┐
│ Heuristic chunker │  (LangExtract opt-in)
└────────┬──────────┘
         │
         ▼
┌──────────────────┐
│   SYNTHESIZER    │  diverse seed → pairwise rerank →
└────────┬─────────┘  unique-URL citation composition
         │
         ▼
┌──────────────────┐
│      CRITIC      │  line-format grading; can trigger
└─┬──────────────┬─┘  reflection loop (follow-up retrieval)
  │              │
reflection      ship
  │              │
  └──────────────┴────► final answer + trace + diagnostics
Two design choices worth knowing about:
Retrieval is two-phase. Phase 1 (HTML fetch + body extract) runs across 8 parallel workers — pure network I/O, safe to push. Phase 2 (chunking) runs serially because the GPU is the resource and parallel callers just queue and time out together.
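The two-phase split can be sketched as follows. This is a simplified illustration under stated assumptions: fetch and chunk are placeholder callables supplied by the caller (fetch is assumed to return None on failure), and the function name is hypothetical rather than taken from retriever.py.

```python
from concurrent.futures import ThreadPoolExecutor


def two_phase_retrieve(urls, fetch, chunk, workers=8):
    """Phase 1 fetches in parallel threads; phase 2 chunks one document at a time."""
    # Phase 1: network-bound work, so threads overlap the waiting time.
    # pool.map returns results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = list(pool.map(fetch, urls))

    # Phase 2: deliberately serial, so a GPU-backed chunker sees one caller at a time.
    passages = []
    for url, body in zip(urls, bodies):
        if body:  # skip failed fetches
            passages.extend((url, piece) for piece in chunk(body))
    return passages


def fake_fetch(url):
    # Stand-in for an HTTP fetch plus body extraction.
    return None if "bad" in url else f"body of {url}"


def fake_chunk(body):
    # Stand-in for the chunker.
    return [body.upper()]


result = two_phase_retrieve(["u1", "bad", "u2"], fake_fetch, fake_chunk)
```

Keeping the second loop serial costs little when chunking is heuristic and avoids queue pile-ups when it is model-backed.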
The pairwise seed is URL-diversified. Heuristic chunks all have score=0, so a naive top-k-by-score selection would pack the pool with whichever URL was processed first. synthesizer._build_diverse_seed round-robins across unique URLs so every source gets at least one passage in the pool before any URL doubles up. This is what makes the brand-journey diagnostic meaningful — competing content actually competes.
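A minimal sketch of round-robin seeding, assuming passages arrive as (url, text) pairs in processing order. The function name round_robin_seed is hypothetical and this is not the body of synthesizer._build_diverse_seed, only an illustration of the idea it describes.

```python
def round_robin_seed(passages, k):
    """Pick up to k passages, taking one per URL per round before any URL doubles up."""
    by_url = {}
    for url, text in passages:
        by_url.setdefault(url, []).append(text)  # dicts preserve first-seen URL order

    seed = []
    depth = 0  # index of the passage taken from each URL in the current round
    while len(seed) < k:
        added = False
        for url, texts in by_url.items():
            if depth < len(texts):
                seed.append((url, texts[depth]))
                added = True
                if len(seed) == k:
                    break
        if not added:  # every URL is exhausted
            break
        depth += 1
    return seed


passages = [("a", "a1"), ("a", "a2"), ("a", "a3"), ("b", "b1"), ("c", "c1")]
seed = round_robin_seed(passages, k=4)
naive = passages[:4]  # what "first k" selection would pick when all scores tie
```

With these inputs the naive slice contains three passages from URL a and none from c, while the round-robin seed covers all three URLs before a gets its second slot.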
The current harness is a diagnostic tool: it tells you, per query, what the agent decided. To turn it into a real agentic-RAG distillation program that calibrates against production systems and aggregates measurement over time, here are the missing pieces in priority order. All of them consume the trace JSON that's already being written — no changes needed to the agent code itself.
compute_metrics() in src/distill/audit.py now implements five of the six metrics from docs/measurement.md: sub-query coverage, retrieval-to-citation ratio, reflection survival, tool-call inclusion, stage-failure rate. Bridge-entity centrality is the holdout — it needs:
- Entity extraction over each sub-query (spaCy NER, or LangExtract with LANGEXTRACT_ENABLED=1)
- A topical graph: entities become nodes, co-occurrences within sub-queries become edges
- Centrality scoring — does brand content appear at high-betweenness nodes (entities that bridge multiple sub-graphs), or only at peripheral ones?
Use NetworkX or a similar library. The trace already captures sub-queries and entity-tagged passages when LangExtract is enabled.
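To make the graph and centrality steps concrete, here is a dependency-free sketch. It assumes entity extraction has already produced one list of entity strings per sub-query, and it swaps in a small hand-written Brandes betweenness routine so the example runs without extra installs; NetworkX's betweenness_centrality would be the more natural choice in practice. The function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def cooccurrence_graph(entities_per_subquery):
    """Entities are nodes; two entities share an edge if they co-occur in a sub-query."""
    adj = {}
    for entities in entities_per_subquery:
        unique = sorted(set(entities))
        for entity in unique:
            adj.setdefault(entity, set())
        for a, b in combinations(unique, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def betweenness(adj):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    score = dict.fromkeys(adj, 0.0)
    for source in adj:
        order = []
        preds = {v: [] for v in adj}
        paths = dict.fromkeys(adj, 0.0)
        paths[source] = 1.0
        dist = dict.fromkeys(adj, -1)
        dist[source] = 0
        queue = deque([source])
        while queue:  # BFS, counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while order:  # accumulate dependencies, farthest nodes first
            w = order.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return {v: s / 2 for v, s in score.items()}  # each pair was counted twice


graph = cooccurrence_graph(
    [["relevance engineering", "SEO"], ["SEO", "RAG"], ["RAG", "reranking"]]
)
centrality = betweenness(graph)
```

In this toy chain, SEO and RAG are the bridge entities; the remaining step would be checking whether brand passages are attached to such nodes or only to the endpoints.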
examples/run_batch.py (doesn't exist yet): takes a YAML query set like examples/example_query_set.yaml, runs each query, and writes a trace per query into a directory. A few hours of work; unlocks the cross-trace roll-up in examples/view_program.py.
examples/view_program.py — equivalent of run_audit.py's output but over a directory of traces. Shows the six metrics, per-host citation share, stage-failure heatmap, query-coverage drilldown. This is what makes the harness a program rather than a one-off.
compare_to_production.py exists but audit.diff() uses keyword-overlap (_share_keywords with a 0.5 threshold) for sub-query alignment. To meaningfully compare against a Deep Research run, swap that for embedding cosine similarity using the same sentence-transformers model the vector retriever uses. Also add per-stage diff — tool choice, top-k retrieval overlap, ranked passage overlap — not just sub-queries and final citations.
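The embedding-based alignment could look roughly like the sketch below. It assumes an embed callable that maps a string to a vector (in practice that might be the encode method of a sentence-transformers model); the function names and the 0.7 threshold are hypothetical, and the toy vectors exist only to make the example runnable.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_subqueries(local, production, embed, threshold=0.7):
    """Map each local sub-query to its closest production sub-query, or None."""
    production_vecs = [(q, embed(q)) for q in production]
    matches = {}
    for query in local:
        query_vec = embed(query)
        best, best_score = None, -1.0
        for candidate, candidate_vec in production_vecs:
            score = cosine(query_vec, candidate_vec)
            if score > best_score:
                best, best_score = candidate, score
        matches[query] = (best if best_score >= threshold else None, best_score)
    return matches


# Toy embedding table standing in for a real model.
toy_vectors = {
    "what is relevance engineering": [1.0, 0.0],
    "define relevance engineering": [1.0, 0.1],
    "serpapi pricing": [0.0, 1.0],
}
matches = align_subqueries(
    ["what is relevance engineering"],
    ["serpapi pricing", "define relevance engineering"],
    toy_vectors.get,
)
unmatched = align_subqueries(
    ["serpapi pricing"], ["what is relevance engineering"], toy_vectors.get
)
```

Returning the score alongside the match makes it easier to tune the threshold against a handful of hand-labeled pairs.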
The README in v1 described copy-pasting Deep Research plans into a YAML template. The template exists at examples/production-template.yaml but there's no tooling. A small Streamlit form that captures the sub-queries and citations as you paste, validates against the schema, and writes the YAML would make calibration practical instead of theoretical.
If you actually want to do teacher→student distillation — train a smaller model on traces from a larger one — the path is:
- Run the same query set against gemma4:31b (teacher) and capture the per-stage prompts and outputs from the trace
- Export (prompt, response) pairs in a format compatible with a fine-tuning library (e.g., axolotl, unsloth, trl)
- Fine-tune a smaller student (e.g., a 2B model not in the Gemma 4 family) on those pairs
- Evaluate the student against the teacher on held-out queries using the metrics from docs/measurement.md
This is significantly more work than the items above; it crosses into MLOps territory. But if your goal is to ship a distilled-for-your-domain student model rather than to instrument the production gap, this is the path.
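The export step might be sketched like this. The trace layout is an assumption: the example supposes each trace has a "nodes" list whose entries carry "stage", "prompt", and "response" keys, which may not match what trace.py actually writes. The function names are hypothetical, and the JSONL shape would still need adapting to whichever fine-tuning library is used.

```python
import json
from pathlib import Path


def iter_pairs(trace, stages=None):
    """Yield prompt/response records from one trace, optionally filtered by stage."""
    for node in trace.get("nodes", []):  # "nodes" is an assumed key
        prompt = node.get("prompt")
        response = node.get("response")
        if not prompt or not response:
            continue
        if stages and node.get("stage") not in stages:
            continue
        yield {"stage": node.get("stage"), "prompt": prompt, "response": response}


def export_jsonl(trace_dir, out_path, stages=None):
    """Write one JSON object per line for every pair found under trace_dir."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(trace_dir).glob("*.json")):
            trace = json.loads(path.read_text(encoding="utf-8"))
            for pair in iter_pairs(trace, stages):
                out.write(json.dumps(pair, ensure_ascii=False) + "\n")
                count += 1
    return count


sample_trace = {
    "nodes": [
        {"stage": "planner", "prompt": "Decompose: q", "response": "1. a\n2. b"},
        {"stage": "retriever", "prompt": None, "response": None},
        {"stage": "critic", "prompt": "Grade: ...", "response": "sufficient: yes"},
    ]
}
planner_pairs = list(iter_pairs(sample_trace, stages={"planner"}))
all_pairs = list(iter_pairs(sample_trace))
```

Filtering by stage lets you distill one role at a time, for example only the planner or only the critic.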
- GPU detection. The first thing OllamaGemma.__post_init__ does is hit /api/ps and log whether gemma4:e2b is GPU-resident, CPU-split, or CPU-only. If it says CPU-only, runs will take 30–60× longer. Fix Ollama's GPU detection before tuning anything else.
- Context window. The single most common cause of slow runs is Ollama's default 2,048-token context. Set OLLAMA_CONTEXT_LENGTH=8192 on the server and OLLAMA_NUM_CTX=8192 in .env. Long prompts otherwise get truncated and produce garbage output the parsers reject.
- Scrapling deps. curl_cffi and playwright are in requirements.txt but playwright install is a separate step that fetches the browser binaries. If both are missing, the harness silently falls back to trafilatura's urllib fetcher; this works but gets blocked by ~10–20% of modern sites. The log file shows which path each fetch used.
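If you want to check GPU residency yourself, outside the harness, something like the following may help. It assumes Ollama is listening on its default port (11434) and relies on the size and size_vram fields that /api/ps reports per loaded model; the function names and the classification thresholds are illustrative and not the logic in OllamaGemma.__post_init__.

```python
import json
import urllib.request


def classify_residency(size: int, size_vram: int) -> str:
    """Label a loaded model by how much of it sits in VRAM."""
    if size_vram <= 0:
        return "CPU-only"
    if size_vram >= size:
        return "GPU-resident"
    return "CPU-split"


def loaded_models(base_url: str = "http://localhost:11434") -> dict:
    """Return {model name: residency label} for models Ollama currently has loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        payload = json.load(resp)
    return {
        model["name"]: classify_residency(
            model.get("size", 0), model.get("size_vram", 0)
        )
        for model in payload.get("models", [])
    }
```

Call loaded_models() while a run is in progress; a model only appears in the /api/ps output while it is loaded.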
agentic-rag-distillation/
├── README.md                       (this file)
├── requirements.txt
├── pyproject.toml
├── .env / .env.example
├── src/distill/                    (package kept as "distill" for import compatibility)
│   ├── config.py                   env-driven configuration
│   ├── graph.py                    five-node agentic loop + reflection
│   ├── planner.py                  sub-query decomposition
│   ├── router.py                   tool selection (multi-strategy parser)
│   ├── retriever.py                two-phase web retrieval
│   ├── fetcher.py                  scrapling + trafilatura wrapper
│   ├── chunker.py                  heuristic + LangExtract chunking
│   ├── synthesizer.py              diverse seed + pairwise rerank + citation
│   ├── critic.py                   line-format grading + follow-up queries
│   ├── models.py                   Ollama / local / hosted Gemma wrappers
│   ├── trace.py                    structured trace logging
│   └── audit.py                    production diff + metrics (five of six implemented)
├── prompts/                        prompt templates (currently informational)
├── examples/
│   ├── run_audit.py                main entry point, single query
│   ├── view_program.py             roll up the six metrics across a trace directory
│   ├── compare_to_production.py    diff against production Deep Research YAML
│   ├── view_trace.py               Streamlit single-trace viewer
│   ├── example_query_set.yaml      sample query set
│   └── production-template.yaml    Deep Research capture template
├── docs/
│   ├── architecture.md
│   ├── calibration.md
│   └── measurement.md              the six metrics to compute
└── traces/                         output directory (auto-created)
- The local agent is not the production system. Calibrate against visible Deep Research plans before treating the trace as ground truth.
- The retriever uses SerpAPI for seed URLs. Brand-content visibility depends on what shows up in those results.
- Pairwise verdicts on gemma4:e2b are directional, not authoritative. The reranker quality goes up materially with gemma4:e4b or gemma4:26b.
- The critic occasionally still produces output that doesn't conform to the expected line format. Robust parsing recovers from most cases, but the more sophisticated the agent loop you want, the more you'll want a larger model for the critic role specifically.
Apache 2.0. Use it, modify it, ship it.
iPullRank — ipullrank.com/contact