PaperPilot is an open-source research intelligence system that automatically extracts datasets, metrics, figures, tables, claims, and reproducibility signals from academic papers.
It combines rule-based NLP pipelines with LLM-assisted extraction to produce structured, explainable, confidence-labeled outputs for researchers, students, and data scientists who want to understand papers faster and more reliably.
📄 PDF Analysis
- Extract sections, claims, figures, tables, and experimental evidence from research papers.
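As a rough illustration of section extraction, the sketch below splits already-extracted plain text on heading lines. The heading vocabulary, the `SECTION_RE` pattern, and the `split_sections` name are assumptions made for this example, not PaperPilot's actual parser.

```python
import re

# Hypothetical heading pattern: optional numbering, then a common section name
# alone on its line. Real papers use far more varied headings.
SECTION_RE = re.compile(
    r"^(?:\d+\.?\s+)?"
    r"(abstract|introduction|related work|methods?|experiments?|results|discussion|conclusion)"
    r"[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text: str) -> dict[str, str]:
    """Return {lowercased heading: body text} for each recognised heading."""
    matches = list(SECTION_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[match.end():end].strip()
    return sections


sample = (
    "Abstract\nWe study X.\n"
    "1 Introduction\nPrior work...\n"
    "2 Experiments\nWe train on CIFAR-10.\n"
)
sections = split_sections(sample)
```

Text that precedes the first recognised heading is dropped in this sketch; a fuller version would likely keep it as front matter.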
📊 Dataset Discovery
- Identify primary and secondary datasets used in the paper
- Confidence scoring and role labeling (training, evaluation, reference)
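One way dataset detection with role labels and confidence scores could work is sketched below. The dataset lexicon, the cue words, and the confidence formula are all hypothetical placeholders chosen for illustration; they are not taken from PaperPilot's `datasets.py`.

```python
import re
from dataclasses import dataclass

# Hypothetical lexicon and cue words; a real system would need much larger lists.
KNOWN_DATASETS = ["ImageNet", "CIFAR-10", "COCO", "SQuAD"]
ROLE_CUES = {
    "training": ("train", "fine-tune"),
    "evaluation": ("evaluate", "test", "benchmark"),
}


@dataclass
class DatasetMention:
    name: str
    role: str          # "training", "evaluation", or "reference"
    confidence: float  # illustrative heuristic in [0, 1]


def find_datasets(text: str) -> list[DatasetMention]:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    mentions = []
    for name in KNOWN_DATASETS:
        hits = [s for s in sentences if name.lower() in s.lower()]
        if not hits:
            continue
        # Default to "reference" unless a cue word appears near the mention.
        role = "reference"
        for candidate, cues in ROLE_CUES.items():
            if any(cue in s.lower() for s in hits for cue in cues):
                role = candidate
                break
        # Assumed scoring: more mentions and an explicit role raise confidence.
        score = 0.4 + 0.2 * len(hits) + (0.2 if role != "reference" else 0.0)
        mentions.append(DatasetMention(name, role, round(min(1.0, score), 2)))
    return mentions


text = "We train our model on ImageNet. We evaluate on COCO. Unlike SQuAD, our task is visual."
mentions = find_datasets(text)
```

Cue matching here is plain substring search, so a word like "latest" would trigger the "test" cue; word-boundary matching would be a sensible refinement.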
📈 Metrics & Results Extraction
- Detect reported metrics (accuracy, AP, F1, etc.)
- Link metrics to experiments and datasets (when possible)
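The following sketch shows how metric detection and dataset linking might be approached with a regular expression. The metric list, the `METRIC_RE` pattern, and the same-sentence linking rule are assumptions for illustration only.

```python
import re

# Hypothetical pattern: a metric name, optional connective words, then a number.
METRIC_RE = re.compile(
    r"\b(?P<metric>accuracy|F1|mAP|AP|BLEU)\b"
    r"(?:\s*(?:score|of|is|was|=|:))*"
    r"\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<pct>%)?",
    re.IGNORECASE,
)


def extract_metrics(text, datasets=("ImageNet", "COCO")):
    """Find metric values and link each to a dataset named in the same sentence."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Linking is best-effort: None when no known dataset shares the sentence.
        dataset = next((d for d in datasets if d.lower() in sentence.lower()), None)
        for match in METRIC_RE.finditer(sentence):
            results.append({
                "metric": match.group("metric"),
                "value": float(match.group("value")),
                "unit": "%" if match.group("pct") else None,
                "dataset": dataset,
            })
    return results


text = "Our model reaches an accuracy of 94.2% on ImageNet. On COCO, AP = 37.4. The F1 score is 0.91."
results = extract_metrics(text)
```

The third result has no dataset attached, which mirrors the "when possible" caveat: a link is only recorded when the text supports it.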
🧪 Reproducibility Signals
- Dataset availability
- Metric definitions
- Baseline comparisons
- Code release detection
- Tabular results presence
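The five signals above could be checked with simple patterns and combined into a score, as in this minimal sketch. The patterns, the equal weighting, and the sample text (including its repository URL) are hypothetical; PaperPilot's own scoring may differ.

```python
import re

# One assumed pattern per signal; each is a crude textual proxy.
SIGNAL_PATTERNS = {
    "dataset_availability": r"dataset (?:is|are) (?:publicly )?available",
    "metric_definitions": r"is defined as|we define|computed as",
    "baseline_comparisons": r"\bbaselines?\b|compared (?:to|with|against)",
    "code_release": r"github\.com|we release (?:our|the) code",
    "tabular_results": r"\bTable\s+\d+",
}


def reproducibility_signals(text: str) -> dict[str, bool]:
    """Report which signals have at least one textual match."""
    return {
        name: bool(re.search(pattern, text, re.IGNORECASE))
        for name, pattern in SIGNAL_PATTERNS.items()
    }


def reproducibility_score(signals: dict[str, bool]) -> float:
    """Fraction of signals present (equal weights, an assumption)."""
    return sum(signals.values()) / len(signals)


text = (
    "Code is available at github.com/example/repo. "
    "Table 2 compares our method against three baselines."
)
signals = reproducibility_signals(text)
score = reproducibility_score(signals)
```

Returning the per-signal booleans alongside the score keeps the result explainable: a reader can see which checks passed rather than only a number.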
🤖 Hybrid Intelligence
- Deterministic rule-based extraction for reliability
- LLM-assisted reasoning for ambiguous or implicit information
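A hybrid extractor can try the deterministic rule first and consult an LLM only when the rule finds nothing. The sketch below assumes the LLM is any callable that takes a prompt string and returns a string or `None`; the function names, the prompt, and the confidence labels are hypothetical.

```python
import re
from typing import Callable, Optional

LLM = Callable[[str], Optional[str]]  # assumed interface for a pluggable model


def rule_based_code_url(text: str) -> Optional[str]:
    """Deterministic check: look for an explicit GitHub repository URL."""
    match = re.search(r"https?://github\.com/[\w.-]+/[\w.-]+", text)
    # Strip a trailing period in case the URL ends a sentence.
    return match.group(0).rstrip(".") if match else None


def extract_code_url(text: str, llm: Optional[LLM] = None) -> dict:
    url = rule_based_code_url(text)
    if url is not None:
        return {"value": url, "source": "rule", "confidence": "high"}
    if llm is not None:
        # Fall back to the model only for ambiguous or implicit cases.
        answer = llm(f"Reply with the code URL mentioned in this text, if any:\n\n{text}")
        if answer:
            return {"value": answer, "source": "llm", "confidence": "low"}
    return {"value": None, "source": "none", "confidence": "n/a"}


by_rule = extract_code_url("See https://github.com/example/repo for code")
by_llm = extract_code_url(
    "Our implementation will be released.",
    llm=lambda prompt: "https://example.org/code",  # stand-in for a real model
)
```

Recording the `source` of each value lets downstream consumers treat rule-derived and model-derived results differently.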
📦 Structured Outputs
- Machine-readable JSON outputs
- Export-friendly for downstream tools (Dash, notebooks, pipelines)
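To give a feel for what a machine-readable output could look like, here is a small sketch using dataclasses and the standard `json` module. The `Finding` and `PaperReport` classes and their field names are illustrative assumptions, not PaperPilot's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    kind: str          # e.g. "dataset", "metric", "claim"
    value: str
    confidence: float
    evidence: str      # sentence the finding was extracted from


@dataclass
class PaperReport:
    title: str
    findings: list[Finding] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so the result is plain JSON.
        return json.dumps(asdict(self), indent=2)


report = PaperReport(title="Example Paper")
report.findings.append(
    Finding("dataset", "ImageNet", 0.8, "We train our model on ImageNet.")
)
payload = report.to_json()
```

Keeping the evidence sentence next to each value is one way to make outputs explainable when they are loaded into a notebook or dashboard.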
🖥️ Interactive UI
- Upload PDFs
- Inspect extracted datasets, figures, claims, and plots
- Human-in-the-loop validation ready
Reading research papers is slow and error-prone. Important details like:
- Which dataset was actually used?
- Which metric matters?
- Is this result reproducible?
- What evidence supports the claim?
are often scattered across sections, tables, and figures.
PaperPilot turns papers into structured evidence.
PDF Upload
↓
Document Parsing & Sectioning
↓
Rule-Based NLP Extraction
↓
LLM-Assisted Reasoning
↓
Confidence & Reproducibility Scoring
↓
Structured Outputs + UI Visualization
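The flow above can be read as a chain of stages that each take and return a shared context. The sketch below assumes that shape; every stage body is a trivial placeholder, and none of the names are taken from `pipeline.py`.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def parse_document(ctx: dict) -> dict:
    # Placeholder sectioning: treat blank-line-separated blocks as sections.
    ctx["sections"] = [s for s in ctx["raw_text"].split("\n\n") if s.strip()]
    return ctx


def rule_based_extraction(ctx: dict) -> dict:
    ctx["has_table"] = any("Table" in s for s in ctx["sections"])
    return ctx


def llm_assisted_reasoning(ctx: dict) -> dict:
    # Placeholder: passes the context through unchanged when no LLM is configured.
    return ctx


def scoring(ctx: dict) -> dict:
    ctx["reproducibility_score"] = 1.0 if ctx["has_table"] else 0.0
    return ctx


def run_pipeline(ctx: dict, stages: list[Stage]) -> dict:
    """Run stages in order, recording each stage name for traceability."""
    for stage in stages:
        ctx = stage(ctx)
        ctx.setdefault("trace", []).append(stage.__name__)
    return ctx


STAGES = [parse_document, rule_based_extraction, llm_assisted_reasoning, scoring]
result = run_pipeline({"raw_text": "Intro text.\n\nTable 1 shows results."}, STAGES)
```

Because each stage has the same signature, the LLM step can be swapped or omitted without touching the others, which fits the optional, pluggable role the LLM plays here.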
- Python
- Streamlit (UI)
- Rule-based NLP (regex, heuristics)
- LLMs (optional / pluggable)
- PDF parsing (PyMuPDF / PDFMiner)
- Data visualization (matplotlib / plotly)
paperpilot/
├── core/
│   ├── parser.py          # PDF parsing & section splitting
│   ├── pipeline.py        # End-to-end extraction pipeline
│   ├── datasets.py        # Dataset detection & matching
│   ├── metrics.py         # Metric extraction logic
│   ├── figures.py         # Figure & table detection
│   └── reproducibility.py
├── frontend/
│   └── app.py             # Streamlit UI
├── examples/
├── outputs/
└── README.md
1. Clone the repository
git clone https://github.com/yourusername/paperpilot.git
cd paperpilot
2. Install dependencies
pip install -r requirements.txt
3. Run the app
streamlit run frontend/app.py