M3A Evals — Human-AI Interaction Study

Evaluation pipeline for comparing HITL (Human-In-The-Loop), HOTL (Human-On-The-Loop), and autonomous agentic AI runs on single-cell multiome (scRNA-seq + scATAC-seq) analysis tasks across three cancer datasets (BRCA, EAC, liposarcoma). Agent execution traces are logged to Weights & Biases, then downloaded, annotated, reviewed by both humans and an LLM judge, and analyzed.

The repo is organized into two sections matching the paper structure:

01_interactions/ — characterize what happens during human↔AI sessions; compare HITL vs HOTL vs autonomous patterns.
02_survey/ — analyze how human-experts and an AI-expert (LLM judge) graded the blinded final analyst outputs.

Run types

Condition	Description
HITL	Human-In-The-Loop — human analyst actively steers the agent during execution
HOTL	Human-On-The-Loop — human monitors but intervenes less frequently
Autonomous (M3A)	Fully unsupervised agent runs with no human involvement. Stored under W&B user `sjohri20`, tagged `AI-BRCA`, `AI-EAC`, `AI-liposarcoma`. These appear in `01_interactions/whitelist.csv` under the `autonomous` column and serve as the no-human baseline.

Section 1 — Interaction analysis (`01_interactions/`)

1. Download logs

1_download_logs.py

Downloads all runs from the W&B project vanallenlab-agentic-ai/agentic-ai-pilot-final. For each run, saves the execution log (out.txt) and any logged tables (as CSVs) under wandb_runs/<username>/<experiment_type>/<run_name>/. Skips runs whose output folder already exists.

2. Reconstruct user interactions

2_convert_user_interactions.py

For runs missing a native user_interactions.csv, reconstructs it from execution.csv by extracting USER_QUESTION and USER_FEEDBACK rows into converted_user_interactions.csv.

3. Annotate execution traces

Three annotation scripts, each using the Anthropic Batch API. All share common polling/retry machinery via llm_batch_utils.py. Outputs go to results/per_action/.

3a_annotate_with_action_category.py — classifies each execution and UI row with a semantic action category (Claude Sonnet). Supports single-file, whitelist batch, and Batch API modes.
3b_two_pass_task_labels.py — two-pass task labeller (Claude Sonnet): Pass 1 labels rows from intent+tool alone; a neighbor-fill step propagates labels across unclear runs; Pass 2 re-classifies remaining unclear rows with full task definitions. Propagates labels to matching UI CSVs.
3c_annotate_with_reasoning_category.py — classifies five qualitative reasoning columns per agent action (Claude Haiku). An --alternatives mode separately classifies alternatives_considered into the action-category vocabulary.

4. Figures

4_figures.py — action and reasoning classification figures: fig4b (run length boxplot), fig4c (sub-task effort allocation dumbbell), fig4d (action mix bars + alternatives dot plot), sup4a (action category × task heatmap), sup4b (alternative co-occurrence matrix).
6_figures.py — human steering and UI interaction figures: fig5ab (steering proportion dumbbell + UI-category boxplot), fig5c (HOTL vs HITL per UI category), sup5a (UI category × task heatmap).

Both scripts write PNGs and source-data TSVs to results/figs/.

Section 2 — Survey grading (`02_survey/`)

Two parallel expert pools graded the same blinded analyst outputs against the same instrument:

Human-expert (n=274 responses): single-cell biologists, computational biologists, and combined-background reviewers, collected via Google Form.
AI-expert (n=348 responses): Claude Opus 4.6 as an LLM judge, via the Anthropic Batches API.

1. Prepare blinded summaries

2_survey_prepare_summaries_human.py

Two-pass pipeline per run: (1) programmatically compresses execution.csv into a structured trace (removes references to human involvement to blind reviewers); (2) calls Claude Sonnet to generate a narrative summary. A cross-summary standardization pass ensures HITL and HOTL summaries are indistinguishable in style and length. Outputs go to survey_assets/summary_pairwise_freeze/.

2. Prepare survey dataset for the agent judge

3_survey_prepare_dataset_agent.py

Parses survey_assets/survey_questions.md into a flat survey_assets/survey_questions.csv keyed by (scope, dataset, file, question_id). Per-file questions (Q1–Q8, Q11) expand one row per dataset × file; ranking questions (Q9–Q10) expand one row per dataset.

3. Run the LLM judge

4_survey_run_agent.py

Reads survey_questions.csv and answers each question using Claude Opus 4.6 via the Anthropic Batches API. Uses question-type-specific system prompts (task quality, credibility, novelty, HITL/HOTL classifier, comparative ranking, open-ended). Outputs results/survey_results.json and results/survey_results.csv.

4. Survey figures

5_figures.py — six-panel Figure 6 (fig6a–6f): individual panel PNGs and source-data TSVs written to results/figs/.

Cross-section utility

build_survey_trace_key.py — joins survey ratings (Q1–Q8, per blinded output letter) with the corresponding per-action trace files from 01_interactions/results/per_action/. Output: survey_trace_key.csv.

Directory structure

.
├── README.md
├── LICENSE
├── tasks/                                  # per-dataset task prompts (shared)
├── build_survey_trace_key.py               # joins survey ratings with trace files
├── survey_trace_key.csv
│
├── 01_interactions/                        # Section 1 — interaction analysis
│   ├── 1_download_logs.py
│   ├── 2_convert_user_interactions.py
│   ├── 3a_annotate_with_action_category.py
│   ├── 3b_two_pass_task_labels.py
│   ├── 3c_annotate_with_reasoning_category.py
│   ├── 4_figures.py
│   ├── 6_figures.py
│   ├── llm_batch_utils.py                  # shared batch API utilities
│   ├── whitelist.csv                       # run manifest
│   ├── fig_utils/                          # shared plotting helpers
│   └── results/
│       └── figs/                           # figures + source-data TSVs
│
└── 02_survey/                              # Section 2 — survey grading
    ├── 2_survey_prepare_summaries_human.py
    ├── 3_survey_prepare_dataset_agent.py
    ├── 4_survey_run_agent.py
    ├── 5_figures.py
    ├── survey_assets/
    │   ├── survey_questions.{md,pdf,csv}
    │   └── blinded_review_files/           # anonymized analyst outputs for review
    └── results/
        └── figs/                           # figures 6a–6f + source-data TSVs

Dependencies

anthropic (Claude API — Sonnet for summarization/annotation, Haiku for reasoning classification, Opus 4.6 for judging)
wandb
pandas, numpy, scipy, matplotlib
Standard library: argparse, csv, json, pathlib, re

Setup

pip install anthropic wandb pandas numpy scipy matplotlib
export ANTHROPIC_API_KEY=...
wandb login

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

M3A Evals — Human-AI Interaction Study

Run types

Section 1 — Interaction analysis (`01_interactions/`)

1. Download logs

2. Reconstruct user interactions

3. Annotate execution traces

4. Figures

Section 2 — Survey grading (`02_survey/`)

1. Prepare blinded summaries

2. Prepare survey dataset for the agent judge

3. Run the LLM judge

4. Survey figures

Cross-section utility

Directory structure

Dependencies

Setup

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
01_interactions		01_interactions
02_survey		02_survey
tasks		tasks
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

M3A Evals — Human-AI Interaction Study

Run types

Section 1 — Interaction analysis (01_interactions/)

1. Download logs

2. Reconstruct user interactions

3. Annotate execution traces

4. Figures

Section 2 — Survey grading (02_survey/)

1. Prepare blinded summaries

2. Prepare survey dataset for the agent judge

3. Run the LLM judge

4. Survey figures

Cross-section utility

Directory structure

Dependencies

Setup

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Section 1 — Interaction analysis (`01_interactions/`)

Section 2 — Survey grading (`02_survey/`)

Packages