Problem Statement

Autonomous research agents are developing rapidly, and systems capable of multi-step reasoning, coding, and tool use are increasingly impressive. However, there is still no benchmark that rigorously evaluates whether these agents can truly perform end-to-end scientific research.
Most existing benchmarks focus on:

- knowledge recall,
- reasoning tasks, or
- code generation,

but they do not capture the full research workflow: from raw data understanding, through analysis, to producing paper-level conclusions.
As a result, it is still unclear:

- whether current agents can genuinely reproduce scientific findings,
- how different research agents compare under a unified setting, and
- what gaps remain between current systems and real-world research capability.
Proposed Solution
We would like to suggest trying ResearchClawBench, a benchmark designed specifically for evaluating autonomous research agents.
It introduces a two-stage evaluation framework:
Stage 1 — Autonomous Research
The agent is given raw datasets, task instructions, and references, and must independently perform data analysis, coding, visualization, and report writing.
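To make the Stage 1 setup concrete, a task bundle in such a framework might look like the following. This is only an illustrative sketch; every field name here is hypothetical and not ResearchClawBench's actual schema.

```python
# Hypothetical sketch of a Stage 1 task bundle: the raw data, instructions,
# and references handed to the agent. All field names are illustrative.
task = {
    "task_id": "example-task",
    "datasets": ["data/raw/measurements.csv"],           # raw input files
    "instructions": "Analyze the dataset and write a report "
                    "summarizing the main findings.",
    "references": ["background_reading.pdf"],            # context material
    "deliverables": ["report.md", "figures/", "code/"],  # expected outputs
}

# The agent would receive this bundle and work autonomously: load the data,
# run analysis code, produce visualizations, and write the final report.
print(sorted(task.keys()))
```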
Stage 2 — Paper-level Evaluation
The generated report is compared against a real published paper using expert-designed checklists (rubrics) and an LLM-based judge.
The scoring is calibrated such that:

- a score of ~50 corresponds to reproducing the original paper (Re-Discovery), and
- higher scores indicate surpassing the original work (New Discovery).
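The calibration above can be sketched as a small scoring function. The split of rubric items into "re-discovery" and "new-discovery" groups and the 50/50 weighting below are assumptions made for illustration, not ResearchClawBench's actual formula.

```python
# Minimal sketch of calibrated rubric scoring. The item split and the
# 50/50 weighting are illustrative assumptions, not the benchmark's formula.

def calibrated_score(rediscovery_passed: int, rediscovery_total: int,
                     novelty_passed: int, novelty_total: int) -> float:
    """Map rubric results to a 0-100 score where ~50 means the report
    fully reproduces the original paper's findings (Re-Discovery)."""
    rediscovery = 50.0 * rediscovery_passed / rediscovery_total
    novelty = 50.0 * novelty_passed / novelty_total if novelty_total else 0.0
    return rediscovery + novelty

# A report that reproduces every original finding but adds nothing new:
print(calibrated_score(10, 10, 0, 5))   # 50.0 -> Re-Discovery
# A report that also satisfies some novelty criteria scores higher:
print(calibrated_score(10, 10, 2, 5))   # 70.0 -> New Discovery
```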
The benchmark includes:

- 40 tasks across 10 scientific domains,
- real datasets and reproducible setups,
- fine-grained evaluation grounded in expert annotations, and
- support for multiple agents and easy integration of custom systems.
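As a rough illustration of what "easy integration of custom systems" could mean, a benchmark harness might only require agents to implement a small adapter interface. The class and method names below are hypothetical, not ResearchClawBench's real API.

```python
# Hedged sketch of a custom-agent adapter a benchmark harness might call.
# ResearchAgent and its run() signature are hypothetical names.
from typing import Protocol

class ResearchAgent(Protocol):
    def run(self, datasets: list[str], instructions: str,
            references: list[str]) -> str:
        """Execute the research task and return the report text."""
        ...

class EchoBaselineAgent:
    """Trivial baseline: restates the instructions as its 'report'."""
    def run(self, datasets, instructions, references):
        return f"Report for {len(datasets)} dataset(s): {instructions}"

agent = EchoBaselineAgent()
print(agent.run(["data.csv"], "Summarize the data.", []))
```

A structural interface like this would let any system, from a single LLM call to a full multi-step agent, plug into the same evaluation loop.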
We believe this setup may provide a more direct way to evaluate and demonstrate the research capabilities of agents. If relevant, it would be interesting to see how your system performs on such a benchmark.
Links:
ResearchClawBench.mp4
Alternatives Considered
No response
Feature Area
AI / Chat / Agent
Additional Context
No response