Evaluation pipeline for comparing HITL (Human-In-The-Loop), HOTL (Human-On-The-Loop), and autonomous agentic AI runs on single-cell multiome (scRNA-seq + scATAC-seq) analysis tasks across three cancer datasets (BRCA, EAC, liposarcoma). Agent execution traces are logged to Weights & Biases, then downloaded, annotated, reviewed by both humans and an LLM judge, and analyzed.
The repo is organized into two sections matching the paper structure:
- `01_interactions/`: characterize what happens during human↔AI sessions; compare HITL vs HOTL vs autonomous patterns.
- `02_survey/`: analyze how human-experts and an AI-expert (LLM judge) graded the blinded final analyst outputs.
| Condition | Description |
|---|---|
| HITL | Human-In-The-Loop — human analyst actively steers the agent during execution |
| HOTL | Human-On-The-Loop — human monitors but intervenes less frequently |
| Autonomous (M3A) | Fully unsupervised agent runs with no human involvement. Stored under W&B user sjohri20, tagged AI-BRCA, AI-EAC, AI-liposarcoma. These appear in 01_interactions/whitelist.csv under the autonomous column and serve as the no-human baseline. |
1_download_logs.py
Downloads all runs from the W&B project vanallenlab-agentic-ai/agentic-ai-pilot-final. For each run, saves the execution log (out.txt) and any logged tables (as CSVs) under wandb_runs/<username>/<experiment_type>/<run_name>/. Skips runs whose output folder already exists.
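The output layout and skip rule described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names `run_output_dir` and `claim_output_dir` are hypothetical, and the W&B download calls themselves are left out.

```python
from pathlib import Path


def run_output_dir(root: Path, username: str, experiment_type: str, run_name: str) -> Path:
    """Build wandb_runs/<username>/<experiment_type>/<run_name>/ under root."""
    return root / "wandb_runs" / username / experiment_type / run_name


def claim_output_dir(out_dir: Path) -> bool:
    """Return True if the run still needs downloading, creating its folder.

    Returns False when the folder already exists, mirroring the skip rule.
    (Hypothetical helper; the real script may order these steps differently.)
    """
    if out_dir.exists():
        return False
    out_dir.mkdir(parents=True)
    return True
```

One consequence of a folder-existence check like this is that a run whose download failed midway would also be skipped on the next invocation, so deleting the partial folder before re-running is a reasonable precaution.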
2_convert_user_interactions.py
For runs missing a native user_interactions.csv, reconstructs it from execution.csv by extracting USER_QUESTION and USER_FEEDBACK rows into converted_user_interactions.csv.
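A rough sketch of this reconstruction step, using only the standard library. The column name `event_type` is an assumption made for illustration; the real `execution.csv` schema may store the `USER_QUESTION` / `USER_FEEDBACK` marker under a different column.

```python
import csv
from pathlib import Path

USER_EVENT_TYPES = {"USER_QUESTION", "USER_FEEDBACK"}


def convert_user_interactions(
    execution_csv: Path,
    out_csv: Path,
    type_column: str = "event_type",  # hypothetical column name
) -> int:
    """Copy USER_QUESTION / USER_FEEDBACK rows into out_csv; return rows written."""
    with execution_csv.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        # Keep only rows whose type marks a human question or feedback.
        rows = [row for row in reader if row.get(type_column) in USER_EVENT_TYPES]

    with out_csv.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The output keeps the same columns as the input, so downstream code that reads `converted_user_interactions.csv` sees the original row content unchanged.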
Three annotation scripts, each using the Anthropic Batches API, share common polling/retry machinery via llm_batch_utils.py. Outputs go to results/per_action/.
- `3a_annotate_with_action_category.py`: classifies each execution and UI row with a semantic action category (Claude Sonnet). Supports single-file, whitelist batch, and Batch API modes.
- `3b_two_pass_task_labels.py`: two-pass task labeller (Claude Sonnet). Pass 1 labels rows from intent+tool alone; a neighbor-fill step propagates labels across `unclear` runs; Pass 2 re-classifies remaining `unclear` rows with full task definitions. Propagates labels to matching UI CSVs.
- `3c_annotate_with_reasoning_category.py`: classifies five qualitative reasoning columns per agent action (Claude Haiku). An `--alternatives` mode separately classifies `alternatives_considered` into the action-category vocabulary.
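One plausible reading of the neighbor-fill step between the two labelling passes is sketched below. It assumes that a consecutive run of `unclear` rows is filled only when the labelled rows on both sides agree; the actual rule in `3b_two_pass_task_labels.py` may differ, and the task names in the usage example are hypothetical.

```python
from typing import List

UNCLEAR = "unclear"


def neighbor_fill(labels: List[str]) -> List[str]:
    """Fill each run of 'unclear' labels bounded by the same label on both sides."""
    filled = list(labels)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] != UNCLEAR:
            i += 1
            continue
        start = i
        while i < n and filled[i] == UNCLEAR:
            i += 1
        # The unclear run covers [start, i); inspect the label on either side.
        left = filled[start - 1] if start > 0 else None
        right = filled[i] if i < n else None
        if left is not None and left == right:
            filled[start:i] = [left] * (i - start)
    return filled


# Hypothetical task names, for illustration only.
print(neighbor_fill(["qc", "unclear", "unclear", "qc", "unclear", "clustering"]))
```

Under this assumption, rows left as `unclear` (for example, a run sitting between two different tasks) are exactly the ones a second pass would need to re-classify.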
- `4_figures.py`: action and reasoning classification figures: fig4b (run length boxplot), fig4c (sub-task effort allocation dumbbell), fig4d (action mix bars + alternatives dot plot), sup4a (action category × task heatmap), sup4b (alternative co-occurrence matrix).
- `6_figures.py`: human steering and UI interaction figures: fig5ab (steering proportion dumbbell + UI-category boxplot), fig5c (HOTL vs HITL per UI category), sup5a (UI category × task heatmap).
Both scripts write PNGs and source-data TSVs to results/figs/.
Two parallel expert pools graded the same blinded analyst outputs against the same instrument:
- Human-expert (n=274 responses): single-cell biologists, computational biologists, and combined-background reviewers, collected via Google Form.
- AI-expert (n=348 responses): Claude Opus 4.6 as an LLM judge, via the Anthropic Batches API.
2_survey_prepare_summaries_human.py
Two-pass pipeline per run: (1) programmatically compresses execution.csv into a structured trace (removes references to human involvement to blind reviewers); (2) calls Claude Sonnet to generate a narrative summary. A cross-summary standardization pass ensures HITL and HOTL summaries are indistinguishable in style and length. Outputs go to survey_assets/summary_pairwise_freeze/.
3_survey_prepare_dataset_agent.py
Parses survey_assets/survey_questions.md into a flat survey_assets/survey_questions.csv keyed by (scope, dataset, file, question_id). Per-file questions (Q1–Q8, Q11) expand one row per dataset × file; ranking questions (Q9–Q10) expand one row per dataset.
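The expansion rule can be illustrated with a short sketch. The scope values `per_file` and `ranking`, the empty `file` field for ranking rows, and the file letters in the usage example are assumptions for illustration, not the script's confirmed schema.

```python
from itertools import product
from typing import Dict, List, Tuple

PER_FILE_QUESTIONS = [f"Q{i}" for i in range(1, 9)] + ["Q11"]  # Q1-Q8, Q11
RANKING_QUESTIONS = ["Q9", "Q10"]

Row = Tuple[str, str, str, str]  # (scope, dataset, file, question_id)


def expand_questions(files_by_dataset: Dict[str, List[str]]) -> List[Row]:
    """Expand the question list into one flat row per key tuple."""
    rows: List[Row] = []
    for dataset, files in files_by_dataset.items():
        # Per-file questions: one row per dataset x file x question.
        for file_name, qid in product(files, PER_FILE_QUESTIONS):
            rows.append(("per_file", dataset, file_name, qid))
        # Ranking questions: one row per dataset, with no file attached.
        for qid in RANKING_QUESTIONS:
            rows.append(("ranking", dataset, "", qid))
    return rows


# Hypothetical blinded file letters per dataset.
rows = expand_questions({"BRCA": ["A", "B"], "EAC": ["A"]})
print(len(rows))
```

With two files for one dataset and one file for another, this yields 2 × 9 + 2 and 1 × 9 + 2 rows respectively, which makes the row count easy to check against the parsed questions file.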
4_survey_run_agent.py
Reads survey_questions.csv and answers each question using Claude Opus 4.6 via the Anthropic Batches API. Uses question-type-specific system prompts (task quality, credibility, novelty, HITL/HOTL classifier, comparative ranking, open-ended). Outputs results/survey_results.json and results/survey_results.csv.
- `5_figures.py`: six-panel Figure 6 (fig6a–6f); individual panel PNGs and source-data TSVs are written to `results/figs/`.
- `build_survey_trace_key.py`: joins survey ratings (Q1–Q8, per blinded output letter) with the corresponding per-action trace files from `01_interactions/results/per_action/`. Output: `survey_trace_key.csv`.
.
├── README.md
├── LICENSE
├── tasks/                                   # per-dataset task prompts (shared)
├── build_survey_trace_key.py                # joins survey ratings with trace files
├── survey_trace_key.csv
│
├── 01_interactions/                         # Section 1: interaction analysis
│   ├── 1_download_logs.py
│   ├── 2_convert_user_interactions.py
│   ├── 3a_annotate_with_action_category.py
│   ├── 3b_two_pass_task_labels.py
│   ├── 3c_annotate_with_reasoning_category.py
│   ├── 4_figures.py
│   ├── 6_figures.py
│   ├── llm_batch_utils.py                   # shared batch API utilities
│   ├── whitelist.csv                        # run manifest
│   ├── fig_utils/                           # shared plotting helpers
│   └── results/
│       └── figs/                            # figures + source-data TSVs
│
└── 02_survey/                               # Section 2: survey grading
    ├── 2_survey_prepare_summaries_human.py
    ├── 3_survey_prepare_dataset_agent.py
    ├── 4_survey_run_agent.py
    ├── 5_figures.py
    ├── survey_assets/
    │   ├── survey_questions.{md,pdf,csv}
    │   └── blinded_review_files/            # anonymized analyst outputs for review
    └── results/
        └── figs/                            # figures 6a–6f + source-data TSVs
- `anthropic` (Claude API: Sonnet for summarization/annotation, Haiku for reasoning classification, Opus 4.6 for judging)
- `wandb`
- `pandas`, `numpy`, `scipy`, `matplotlib`
- Standard library: `argparse`, `csv`, `json`, `pathlib`, `re`
pip install anthropic wandb pandas numpy scipy matplotlib
export ANTHROPIC_API_KEY=...
wandb login