Skip to content

vanallenlab/agentic-ai-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

M3A Evals — Human-AI Interaction Study

Evaluation pipeline for comparing HITL (Human-In-The-Loop), HOTL (Human-On-The-Loop), and autonomous agentic AI runs on single-cell multiome (scRNA-seq + scATAC-seq) analysis tasks across three cancer datasets (BRCA, EAC, liposarcoma). Agent execution traces are logged to Weights & Biases, then downloaded, annotated, reviewed by both humans and an LLM judge, and analyzed.

The repo is organized into two sections matching the paper structure:

  • 01_interactions/ — characterize what happens during human↔AI sessions; compare HITL vs HOTL vs autonomous patterns.
  • 02_survey/ — analyze how human-experts and an AI-expert (LLM judge) graded the blinded final analyst outputs.

Run types

Condition Description
HITL Human-In-The-Loop — human analyst actively steers the agent during execution
HOTL Human-On-The-Loop — human monitors but intervenes less frequently
Autonomous (M3A) Fully unsupervised agent runs with no human involvement. Stored under W&B user sjohri20, tagged AI-BRCA, AI-EAC, AI-liposarcoma. These appear in 01_interactions/whitelist.csv under the autonomous column and serve as the no-human baseline.

Section 1 — Interaction analysis (01_interactions/)

1. Download logs

  • 1_download_logs.py

Downloads all runs from the W&B project vanallenlab-agentic-ai/agentic-ai-pilot-final. For each run, saves the execution log (out.txt) and any logged tables (as CSVs) under wandb_runs/<username>/<experiment_type>/<run_name>/. Skips runs whose output folder already exists.

2. Reconstruct user interactions

  • 2_convert_user_interactions.py

For runs missing a native user_interactions.csv, reconstructs it from execution.csv by extracting USER_QUESTION and USER_FEEDBACK rows into converted_user_interactions.csv.

3. Annotate execution traces

Three annotation scripts, each using the Anthropic Batch API. All share common polling/retry machinery via llm_batch_utils.py. Outputs go to results/per_action/.

  • 3a_annotate_with_action_category.py — classifies each execution and UI row with a semantic action category (Claude Sonnet). Supports single-file, whitelist batch, and Batch API modes.
  • 3b_two_pass_task_labels.py — two-pass task labeller (Claude Sonnet): Pass 1 labels rows from intent+tool alone; a neighbor-fill step propagates labels across unclear runs; Pass 2 re-classifies remaining unclear rows with full task definitions. Propagates labels to matching UI CSVs.
  • 3c_annotate_with_reasoning_category.py — classifies five qualitative reasoning columns per agent action (Claude Haiku). An --alternatives mode separately classifies alternatives_considered into the action-category vocabulary.

4. Figures

  • 4_figures.py — action and reasoning classification figures: fig4b (run length boxplot), fig4c (sub-task effort allocation dumbbell), fig4d (action mix bars + alternatives dot plot), sup4a (action category × task heatmap), sup4b (alternative co-occurrence matrix).
  • 6_figures.py — human steering and UI interaction figures: fig5ab (steering proportion dumbbell + UI-category boxplot), fig5c (HOTL vs HITL per UI category), sup5a (UI category × task heatmap).

Both scripts write PNGs and source-data TSVs to results/figs/.


Section 2 — Survey grading (02_survey/)

Two parallel expert pools graded the same blinded analyst outputs against the same instrument:

  • Human-expert (n=274 responses): single-cell biologists, computational biologists, and combined-background reviewers, collected via Google Form.
  • AI-expert (n=348 responses): Claude Opus 4.6 as an LLM judge, via the Anthropic Batches API.

1. Prepare blinded summaries

  • 2_survey_prepare_summaries_human.py

Two-pass pipeline per run: (1) programmatically compresses execution.csv into a structured trace (removes references to human involvement to blind reviewers); (2) calls Claude Sonnet to generate a narrative summary. A cross-summary standardization pass ensures HITL and HOTL summaries are indistinguishable in style and length. Outputs go to survey_assets/summary_pairwise_freeze/.

2. Prepare survey dataset for the agent judge

  • 3_survey_prepare_dataset_agent.py

Parses survey_assets/survey_questions.md into a flat survey_assets/survey_questions.csv keyed by (scope, dataset, file, question_id). Per-file questions (Q1–Q8, Q11) expand one row per dataset × file; ranking questions (Q9–Q10) expand one row per dataset.

3. Run the LLM judge

  • 4_survey_run_agent.py

Reads survey_questions.csv and answers each question using Claude Opus 4.6 via the Anthropic Batches API. Uses question-type-specific system prompts (task quality, credibility, novelty, HITL/HOTL classifier, comparative ranking, open-ended). Outputs results/survey_results.json and results/survey_results.csv.

4. Survey figures

  • 5_figures.py — six-panel Figure 6 (fig6a–6f): individual panel PNGs and source-data TSVs written to results/figs/.

Cross-section utility

  • build_survey_trace_key.py — joins survey ratings (Q1–Q8, per blinded output letter) with the corresponding per-action trace files from 01_interactions/results/per_action/. Output: survey_trace_key.csv.

Directory structure

.
├── README.md
├── LICENSE
├── tasks/                                  # per-dataset task prompts (shared)
├── build_survey_trace_key.py               # joins survey ratings with trace files
├── survey_trace_key.csv
│
├── 01_interactions/                        # Section 1 — interaction analysis
│   ├── 1_download_logs.py
│   ├── 2_convert_user_interactions.py
│   ├── 3a_annotate_with_action_category.py
│   ├── 3b_two_pass_task_labels.py
│   ├── 3c_annotate_with_reasoning_category.py
│   ├── 4_figures.py
│   ├── 6_figures.py
│   ├── llm_batch_utils.py                  # shared batch API utilities
│   ├── whitelist.csv                       # run manifest
│   ├── fig_utils/                          # shared plotting helpers
│   └── results/
│       └── figs/                           # figures + source-data TSVs
│
└── 02_survey/                              # Section 2 — survey grading
    ├── 2_survey_prepare_summaries_human.py
    ├── 3_survey_prepare_dataset_agent.py
    ├── 4_survey_run_agent.py
    ├── 5_figures.py
    ├── survey_assets/
    │   ├── survey_questions.{md,pdf,csv}
    │   └── blinded_review_files/           # anonymized analyst outputs for review
    └── results/
        └── figs/                           # figures 6a–6f + source-data TSVs

Dependencies

  • anthropic (Claude API — Sonnet for summarization/annotation, Haiku for reasoning classification, Opus 4.6 for judging)
  • wandb
  • pandas, numpy, scipy, matplotlib
  • Standard library: argparse, csv, json, pathlib, re

Setup

pip install anthropic wandb pandas numpy scipy matplotlib
export ANTHROPIC_API_KEY=...
wandb login

About

Analysis for M3A paper

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages