Aakash0440/PaperPilot

🚀 PaperPilot — Research Intelligence for Scientific Papers

PaperPilot is an open-source research intelligence system that automatically extracts datasets, metrics, figures, tables, claims, and reproducibility signals from academic papers.

It combines rule-based NLP pipelines with LLM-assisted extraction, producing structured, explainable, and confidence-labeled outputs — designed for researchers, students, and data scientists who want to understand papers faster and more reliably.


✨ Key Features

  • 📄 PDF Analysis

    • Extract sections, claims, figures, tables, and experimental evidence from research papers.
  • 📊 Dataset Discovery

    • Identify primary and secondary datasets used in the paper
    • Confidence scoring and role labeling (training, evaluation, reference)
  • 📈 Metrics & Results Extraction

    • Detect reported metrics (accuracy, AP, F1, etc.)
    • Link metrics to experiments and datasets (when possible)
  • 🧪 Reproducibility Signals

    • Dataset availability
    • Metric definitions
    • Baseline comparisons
    • Code release detection
    • Tabular results presence
  • 🤖 Hybrid Intelligence

    • Deterministic rule-based extraction for reliability
    • LLM-assisted reasoning for ambiguous or implicit information
  • 📦 Structured Outputs

    • Machine-readable JSON outputs
    • Export-friendly for downstream tools (Dash, notebooks, pipelines)
  • 🖥️ Interactive UI

    • Upload PDFs
    • Inspect extracted datasets, figures, claims, and plots
    • Ready for human-in-the-loop validation
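
To make the rule-based extraction and the confidence-labeled JSON output more concrete, here is a minimal sketch of regex-based metric detection. The pattern, the `extract_metrics` function, the field names, and the confidence rule are all assumptions made for illustration; they are not taken from PaperPilot's code.

```python
import json
import re

# Hypothetical pattern: a metric name, an optional linking word, then a number.
METRIC_PATTERN = re.compile(
    r"\b(?P<name>accuracy|F1|AP)\b(?:\s+score)?\s*(?P<link>of|=|:|is)?\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<percent>%)?",
    re.IGNORECASE,
)


def extract_metrics(text):
    """Return one record per metric mention found in `text`."""
    records = []
    for match in METRIC_PATTERN.finditer(text):
        records.append({
            "metric": match.group("name").lower(),
            "value": float(match.group("value")),
            "unit": "percent" if match.group("percent") else "raw",
            # Assumed rule: an explicit linking word ("of", "=", ...) is
            # treated as stronger evidence than a bare "name number" pair.
            "confidence": "high" if match.group("link") else "low",
            "evidence": match.group(0).strip(),
            "source": "rule",
        })
    return records


if __name__ == "__main__":
    sample = "Our model reaches an accuracy of 92.4% and an F1 score of 0.87."
    print(json.dumps(extract_metrics(sample), indent=2))
```

Keeping the matched text in an `evidence` field is one way to keep the output explainable: a reviewer can check each extracted value against the sentence it came from.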

🧠 Why PaperPilot?

Reading research papers is slow and error-prone. The answers to basic questions such as:

  • Which dataset was actually used?
  • Which metric matters?
  • Is this result reproducible?
  • What evidence supports the claim?

are often scattered across sections, tables, and figures.

PaperPilot turns papers into structured evidence.


🏗️ Architecture Overview

PDF Upload
    ↓
Document Parsing & Sectioning
    ↓
Rule-Based NLP Extraction
    ↓
LLM-Assisted Reasoning
    ↓
Confidence & Reproducibility Scoring
    ↓
Structured Outputs + UI Visualization
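
The flow above can be sketched as a chain of stages that pass a shared state dictionary along. This is only an illustration under stated assumptions: the stage names, the keyword heuristics for the five reproducibility signals, and the "fraction of signals present" score are hypothetical and may differ from what `pipeline.py` and `reproducibility.py` actually do. Plain text stands in for a parsed PDF.

```python
import re
from typing import Callable, Dict, List, Optional

Stage = Callable[[Dict], Dict]


def parse_document(state: Dict) -> Dict:
    # Stand-in for PDF parsing and sectioning: split plain text into paragraphs.
    state["paragraphs"] = [p.strip() for p in state["text"].split("\n\n") if p.strip()]
    return state


def rule_based_signals(state: Dict) -> Dict:
    # Assumed keyword heuristics for the five reproducibility signals.
    text = " ".join(state["paragraphs"]).lower()
    state["signals"] = {
        "dataset_availability": "publicly available" in text,
        "metric_definitions": "is defined as" in text,
        "baseline_comparisons": "baseline" in text,
        "code_release": "github.com" in text,
        "tabular_results": re.search(r"\btable\s+\d+", text) is not None,
    }
    return state


def make_llm_stage(llm: Optional[Callable[[str], str]] = None) -> Stage:
    # The LLM is optional and pluggable: with no callable supplied,
    # the returned stage leaves the state unchanged.
    def stage(state: Dict) -> Dict:
        if llm is not None:
            state["llm_notes"] = llm(state["text"])
        return state
    return stage


def score_reproducibility(state: Dict) -> Dict:
    # Assumed score: fraction of signals detected, between 0 and 1.
    signals = state["signals"]
    state["reproducibility_score"] = sum(signals.values()) / len(signals)
    return state


def run_pipeline(text: str, stages: List[Stage]) -> Dict:
    state: Dict = {"text": text}
    for stage in stages:
        state = stage(state)
    return state


if __name__ == "__main__":
    stages = [parse_document, rule_based_signals, make_llm_stage(), score_reproducibility]
    result = run_pipeline("We compare against a baseline in Table 1.", stages)
    print(result["signals"], result["reproducibility_score"])
```

Running the deterministic stages first means the pipeline still produces signals and a score when no LLM is configured.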


🛠️ Tech Stack

  • Python
  • Streamlit (UI)
  • Rule-based NLP (regex, heuristics)
  • LLMs (optional / pluggable)
  • PDF parsing (PyMuPDF / PDFMiner)
  • Data visualization (matplotlib / plotly)

📂 Project Structure

paperpilot/
├── core/
│   ├── parser.py            # PDF parsing & section splitting
│   ├── pipeline.py          # End-to-end extraction pipeline
│   ├── datasets.py          # Dataset detection & matching
│   ├── metrics.py           # Metric extraction logic
│   ├── figures.py           # Figure & table detection
│   └── reproducibility.py
├── frontend/
│   └── app.py               # Streamlit UI
├── examples/
├── outputs/
└── README.md
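
As a rough idea of the section splitting that `parser.py` is described as handling, the sketch below groups lines of already-extracted text under common section headings. The heading list, the regex, and the `split_sections` name are assumptions for illustration only; the real parser may work quite differently.

```python
import re

# Hypothetical heading pattern: optional numbering ("1", "2.", "3.1")
# followed by one of a few common section titles on a line of its own.
HEADING = re.compile(
    r"^(?:\d+(?:\.\d+)*\.?\s+)?"
    r"(Abstract|Introduction|Related Work|Methods?|Experiments|Results|Conclusion)\s*$",
    re.IGNORECASE,
)


def split_sections(text):
    """Map lower-cased section titles to the text that follows each heading."""
    sections = {}
    current = "preamble"  # text that appears before the first recognised heading
    for line in text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


if __name__ == "__main__":
    paper = "Title\nAbstract\nWe study X.\n1 Introduction\nPapers are long."
    for name, body in split_sections(paper).items():
        print(f"[{name}] {body}")
```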


🚀 Getting Started

1. Clone the repository

git clone https://github.com/yourusername/paperpilot.git
cd paperpilot

2. Install dependencies

pip install -r requirements.txt

3. Run the app

streamlit run frontend/app.py
