Thank you for your interest in contributing! ResearchClawBench welcomes contributions in three main areas:
- New Research Tasks — Expand our benchmark with tasks from new domains or papers
- New Agents — Add support for additional AI coding agents
- Bug Fixes & Features — Improve the evaluation framework itself
Each task is a curated research challenge derived from a real published paper. To contribute a task, create a directory under `tasks/` following the naming convention `{Domain}_{NNN}` (e.g., `Biology_000`).
```
tasks/YourDomain_000/
├── task_info.json        # Task description + data file manifest
├── data/                 # Input datasets (read-only for agents)
│   ├── dataset1.csv
│   └── dataset2.json
├── related_work/         # Reference papers (PDF)
│   ├── paper_000.pdf
│   ├── paper_001.pdf
│   └── ...
└── target_study/         # Ground truth for evaluation
    ├── paper.pdf         # The original published paper
    ├── checklist.json    # Expert-annotated evaluation checklist
    └── images/           # Target figures referenced in checklist
        ├── figure1.png
        └── ...
```
`task_info.json` describes the task and its input datasets:

```json
{
  "task": "A detailed description of the research task the agent should complete...",
  "data": [
    {
      "name": "Dataset Name",
      "path": "./data/dataset1.csv",
      "type": "CSV",
      "description": "What this dataset contains and how it should be used."
    }
  ]
}
```

Requirements:
- `task`: A clear research task description grounded in the provided workspace. The agent receives this task text together with the workspace contents.
- `data[].path`: Must start with `./data/` (workspace-relative path).
- `data[].type`: File format (e.g., CSV, JSON, TXT, PDF, HDF5).
The evaluation checklist defines what the judge scores. Each item represents a key finding or analysis from the original paper:
```json
[
  {
    "type": "text",
    "content": "Description of the expected finding or analysis...",
    "path": null,
    "keywords": [
      "Technical keyword 1 the judge should verify",
      "Technical keyword 2..."
    ],
    "weight": 0.3
  },
  {
    "type": "image",
    "content": "Description of the expected figure...",
    "path": "images/figure1.png",
    "keywords": [
      "Visual element 1 to verify",
      "Visual element 2..."
    ],
    "weight": 0.2
  }
]
```

Item types:

- `text`: Methodology, findings, or analysis that should appear in the report
- `image`: A figure the agent should generate, compared against the target image
Guidelines:
- Weights should sum to 1.0 across all items
- Keywords should be specific and technical, not generic
- Each item should correspond to a distinct, verifiable contribution of the paper
- Include a mix of text and image items where appropriate
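The guidelines above can be checked mechanically before submission. Here is a minimal sketch; the `check_checklist` helper is hypothetical (not part of the framework) and only validates the structural rules stated here:

```python
def check_checklist(items, tol=0.01):
    """Return a list of problems found in a parsed checklist.json structure.

    Illustrative helper: verifies that weights sum to ~1.0, that image
    items carry a path, and that every item has keywords for the judge.
    """
    problems = []
    total = sum(item.get("weight", 0) for item in items)
    if abs(total - 1.0) > tol:
        problems.append(f"weights sum to {total:.2f}, expected ~1.0")
    for i, item in enumerate(items):
        if item.get("type") == "image" and not item.get("path"):
            problems.append(f"item {i}: image item is missing 'path'")
        if not item.get("keywords"):
            problems.append(f"item {i}: no keywords for the judge to verify")
    return problems
```

Running it on the example checklist above would flag the weights (0.3 + 0.2 = 0.5), which is expected: the example shows only two of a full checklist's items.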
Reference PDFs should be named `paper_000.pdf`, `paper_001.pdf`, etc. These are reference papers the agent can read for context. Duplicate files (identical content) are not allowed.
Before submitting:

- `task_info.json` is valid JSON with all `data[].path` starting with `./data/`
- All referenced data files exist in `data/`
- `related_work/` contains at least one reference paper
- `target_study/checklist.json` is valid JSON with weights summing to ~1.0
- `target_study/paper.pdf` is the original published paper
- Image items in the checklist have corresponding files in `target_study/images/`
- A human researcher can reproduce the paper's key results from the provided workspace and instructions
Agent configuration is stored in `evaluation/agents.json`. Adding a new agent requires the following steps.

Edit `evaluation/agents.json` to add your agent:
```json
{
  "my_agent": {
    "label": "My Agent",
    "icon": "A",
    "logo": "/static/logos/my_agent.svg",
    "cmd": "my-agent run -m <PROMPT> -w <WORKSPACE>"
  }
}
```

Fields:

- `label`: Display name in the UI
- `icon`: Single-character fallback icon (used when the logo is unavailable)
- `logo`: Path to an SVG logo file (place it in `evaluation/static/logos/`)
- `cmd`: Shell command to run the agent, with placeholders:
  - `<PROMPT>` — Replaced with the prompt content. For `-p`-style flags (file path), replaced with `"path"`. For other flags, replaced with `"$(cat 'path')"` to pass the file content.
  - `<WORKSPACE>` — Replaced with the absolute workspace directory path (optional).
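As a rough sketch of how these placeholders might expand (illustrative only; the framework's actual substitution logic may differ, and the `build_cmd` helper here is hypothetical):

```python
def build_cmd(cmd_template, prompt_path, workspace):
    """Expand <PROMPT> and <WORKSPACE> in an agents.json cmd string.

    Illustrative sketch of the substitution rules described above:
    a -p style flag gets the prompt file path; other flags get the
    file content via shell command substitution.
    """
    if "-p <PROMPT>" in cmd_template:
        # File-path style flag: pass the path itself.
        cmd = cmd_template.replace("<PROMPT>", f'"{prompt_path}"')
    else:
        # Other flags: pass the file content.
        cmd = cmd_template.replace("<PROMPT>", f"\"$(cat '{prompt_path}')\"")
    return cmd.replace("<WORKSPACE>", workspace)
```

For the example entry above, `my-agent run -m <PROMPT> -w <WORKSPACE>` would expand the `-m` flag to the `$(cat ...)` form, since `-m` takes content rather than a file path.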
Place an SVG logo file at `evaluation/static/logos/my_agent.svg`. Recommended size: 16-20px square, monochrome or simple colors.
```shell
python -m evaluation
# Select your agent in the UI → Start Run → verify output streams correctly
```

Your agent must:

- Accept a prompt/instruction (via file path or stdin)
- Work within a given directory (cwd is set to the workspace)
- Write output to stdout (captured as `_agent_output.jsonl`)
- Be fully autonomous — no interactive prompts or confirmation dialogs
- Generate `report/report.md` and `report/images/` as deliverables
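A toy agent that satisfies these requirements might look like the sketch below. It is purely illustrative (useful for smoke-testing the harness, not a real agent):

```python
#!/usr/bin/env python3
"""Toy agent: reads a prompt, streams JSONL to stdout, writes deliverables."""
import json
import sys
from pathlib import Path

def run(prompt_text):
    # The harness sets cwd to the workspace, so relative paths land there.
    report_dir = Path("report")
    (report_dir / "images").mkdir(parents=True, exist_ok=True)
    (report_dir / "report.md").write_text("# Report\n\n(analysis goes here)\n")
    # Progress goes to stdout as JSONL; captured as _agent_output.jsonl.
    print(json.dumps({"event": "start", "prompt_chars": len(prompt_text)}))
    print(json.dumps({"event": "done", "deliverable": "report/report.md"}))

if __name__ == "__main__":
    # Accept the prompt via a file-path argument, falling back to stdin.
    text = Path(sys.argv[1]).read_text() if len(sys.argv) > 1 else sys.stdin.read()
    run(text)
```

Note that it never prompts for input, which keeps it fully autonomous as required.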
| Agent | Repository | Notes |
|---|---|---|
| Claude Code | Anthropic | Stream-JSON output |
| Codex CLI | OpenAI | Full-auto mode |
| OpenClaw | OpenClaw | Self-hosted, 3600s timeout |
| Nanobot | HKUDS/nanobot | Lightweight, reliable tool execution |
```shell
git clone https://github.com/InternScience/ResearchClawBench.git
cd ResearchClawBench
pip install -r evaluation/requirements.txt
cp evaluation/.env.example evaluation/.env
# Edit .env with your API credentials
python -m evaluation
```

```
evaluation/
├── server.py             # Flask API + SSE streaming
├── run_task.py           # Workspace setup + agent subprocess
├── score.py              # LLM scoring engine (dual-mode rubric)
├── config.py             # Configuration loader
├── agents.json           # Agent presets
├── instructions_tmpl.py  # Prompt template
├── utils.py              # Utilities (file tree, path safety)
├── static/app.js         # Frontend (single file)
├── static/style.css      # Styles (4 themes)
└── templates/index.html  # HTML shell
```
- Fork the repository and create a feature branch
- Make changes — keep PRs focused on a single concern
- Test locally — verify the UI works, agents can run, scoring produces valid results
- Submit PR with a clear description of what changed and why
- Python: Follow existing patterns. No additional linters required.
- JavaScript: Single-file `app.js`, vanilla JS (no frameworks). Use `esc()` for user content to prevent XSS.
- CSS: Use CSS variables (`var(--text)`, `var(--accent)`, etc.) for theme compatibility. Test in all 4 themes.
- STATIC_MODE: `app.js` serves both the local Flask UI and the GitHub Pages static site. Code guarded by `if (STATIC_MODE)` only runs on GitHub Pages.
- Stale guards: All async functions that write to the DOM must check whether the current task/run has changed before rendering (see the `_selectEpoch` and `isStale()` patterns).
- File tree limits: `build_file_tree()` supports `max_per_dir` and `max_depth` to prevent browser freezes from large workspaces.
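The file-tree limiting idea can be sketched as follows. This is not the repo's actual `build_file_tree()` implementation, just an illustration of how `max_per_dir` and `max_depth` bound the output:

```python
from pathlib import Path

def file_tree(root, max_per_dir=50, max_depth=4, depth=0):
    """Depth- and fanout-limited directory listing (illustrative sketch)."""
    root = Path(root)
    if depth >= max_depth:
        return []  # stop descending: keeps huge workspaces bounded
    entries = sorted(root.iterdir(), key=lambda p: p.name)
    tree = []
    for entry in entries[:max_per_dir]:
        node = {"name": entry.name, "dir": entry.is_dir()}
        if entry.is_dir():
            node["children"] = file_tree(entry, max_per_dir, max_depth, depth + 1)
        tree.append(node)
    if len(entries) > max_per_dir:
        # Summarize the overflow instead of rendering every entry.
        tree.append({"name": f"... {len(entries) - max_per_dir} more", "dir": False})
    return tree
```

Both limits cap the node count the frontend has to render, which is what prevents the browser freezes mentioned above.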
Be respectful, constructive, and collaborative. We're all working toward the same goal: advancing AI's ability to do real science.
- Open a GitHub Issue
- Email: xu_wanghan@sjtu.edu.cn