Prospectus Risk Extraction

Turning unstructured biotech IPO prospectuses into structured risk-factor data — with a rule-based baseline, a learned line classifier, and an honest evaluation harness.

SEC prospectuses bury a company's risk profile in a long, inconsistently formatted "Risk Factors" section: dozens of individually titled risks, rendered differently in every filing (bold headings in one, ALL-CAPS noun phrases in another, plain wrapped sentences in a third). This project parses those PDFs, extracts each risk factor as a structured (title, body, word_count) record, and scores the result against hand-labeled gold data.

Why this project is interesting

Real, messy data. 15 biotech IPO prospectuses spanning ~20 years of filing conventions — no two formatted the same way.
A baseline and a learned model. A transparent rule-based pipeline sets the bar; a scikit-learn line classifier is trained to beat it, first on weak labels distilled from the heuristic and then on hand-annotated gold labels.
Evaluation done right. Document-grouped cross-validation (no line from a training filing leaks into its own test fold), a frozen held-out split, and metrics that report the document-level number the product actually cares about — not just line accuracy.
End-to-end and reproducible. PDF in → structured risks out → CSV / Excel / JSON reports, all from pip install -e . and a single CLI command.

Results

Line-level classification (heading / body / subheading / skip), evaluated with 5-fold document-grouped cross-validation over 15 filings (~13.7k labeled lines):

Metric	Score
Overall accuracy	0.97
Heading F1	0.91
Macro F1	0.83
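
To make "document-grouped" concrete, here is a minimal standard-library sketch of a grouped split: every line of a filing lands on the same side of each fold. The function name, the greedy balancing rule, and the toy document IDs are illustrative assumptions, not the project's code; scikit-learn's GroupKFold gives the same no-leak guarantee.

```python
from collections import defaultdict

def grouped_folds(doc_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no document spans both sides."""
    # Collect the line indices that belong to each document.
    by_doc = defaultdict(list)
    for i, doc in enumerate(doc_ids):
        by_doc[doc].append(i)

    # Greedy balancing (an assumption): place the largest documents first,
    # each into the fold that currently holds the fewest lines.
    folds = [[] for _ in range(n_folds)]
    for doc in sorted(by_doc, key=lambda d: len(by_doc[d]), reverse=True):
        min(folds, key=len).extend(by_doc[doc])

    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train_idx), sorted(test_idx)

# Toy data: 15 hypothetical filings with 3 to 17 labeled lines each.
doc_ids = [f"doc{d:02d}" for d in range(15) for _ in range(3 + d)]
splits = list(grouped_folds(doc_ids, n_folds=5))

for train_idx, test_idx in splits:
    train_docs = {doc_ids[i] for i in train_idx}
    test_docs = {doc_ids[i] for i in test_idx}
    assert train_docs.isdisjoint(test_docs)  # no filing leaks across the split
```

A plain shuffled split would put neighboring lines of one filing in both train and test, which tends to inflate line-level scores because lines within a filing share fonts and phrasing.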

Risk-factor segmentation (reconstructing whole risks from predicted lines, scored against gold):

Metric	Score
Mean title recall@1 (fuzzy ≥ 0.85)	0.96
Risk-count MAE	2.4 risks / doc
Risk-count bias	+2.1 (slight over-splitting)
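
The segmentation metrics above can be sketched as follows. This is one plausible reading, not the project's scorer: it assumes the fuzzy score is difflib's SequenceMatcher ratio on lowercased, whitespace-normalized titles, and that a gold title counts as recalled when its best-matching predicted title reaches the 0.85 threshold. All function names are hypothetical.

```python
from difflib import SequenceMatcher

def _norm(text):
    """Lowercase and collapse whitespace so line wrapping does not affect the score."""
    return " ".join(text.lower().split())

def similarity(a, b):
    """Fuzzy similarity in [0, 1] (assumed metric: difflib ratio)."""
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()

def title_recall(gold_titles, pred_titles, threshold=0.85):
    """Fraction of gold titles whose best-matching prediction clears the threshold."""
    if not gold_titles:
        return 1.0
    hits = sum(
        1
        for gold in gold_titles
        if max((similarity(gold, pred) for pred in pred_titles), default=0.0) >= threshold
    )
    return hits / len(gold_titles)

def count_errors(gold_counts, pred_counts):
    """Return (MAE, bias) over per-document risk counts; positive bias means over-splitting."""
    diffs = [pred - gold for gold, pred in zip(gold_counts, pred_counts)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    bias = sum(diffs) / len(diffs)
    return mae, bias

# Toy example: the third prediction is an over-split fragment of the second risk.
gold = ["We have a history of losses.",
        "Our product candidates may fail in clinical trials."]
pred = ["We have a history of losses",
        "Our product candidates may fail in clinical trials.",
        "in clinical trials."]
recall = title_recall(gold, pred)
mae, bias = count_errors([10, 20], [12, 19])  # two documents
```

Reporting bias next to MAE is what separates over-splitting from under-splitting: the two are equal in magnitude only when every document errs in the same direction.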

The honest take: heading detection generalizes well (heading F1 0.91) and title recall is 96%; the remaining gap is a handful of over-split risks on the hardest plain-text filings, analyzed per document in docs/evaluation_and_baseline_reporting.md.

How it works

PDF ──▶ layout parsing ──▶ locate "Risk Factors" ──▶ segment risks ──▶ structured output
        (spans→lines→        (regex + typography      ┌─ heuristic: 3-pass cascade
         paragraphs,          scoring)                 └─ learned:   line classifier
         font/gap features)                                          (layout + TF-IDF → LogReg)

1. Layout-aware parsing (analyzer.py) — extracts text spans with font, weight, size, and position via PyMuPDF, then groups them into lines and paragraphs and learns each document's own body style so detection isn't tied to absolute font sizes.
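
As a rough illustration of "learning each document's own body style", the sketch below picks the (font, size) pair that covers the most characters and then describes every span relative to it. The span dictionaries mirror the keys PyMuPDF returns from page.get_text("dict") (font, size, flags, text), but the function names and the exact features are assumptions rather than what analyzer.py does.

```python
from collections import Counter

BOLD_FLAG = 16  # PyMuPDF sets bit 4 of a span's "flags" for bold text

def dominant_body_style(spans):
    """Return the (font, rounded size) pair covering the most characters.

    Assumes `spans` is non-empty.
    """
    weight = Counter()
    for span in spans:
        key = (span["font"], round(span["size"], 1))
        weight[key] += len(span["text"].strip())
    return weight.most_common(1)[0][0]

def relative_features(span, body_style):
    """Describe a span relative to the document's own body style."""
    body_font, body_size = body_style
    return {
        "size_ratio": round(span["size"] / body_size, 2),
        "is_bold": bool(span["flags"] & BOLD_FLAG),
        "same_font": span["font"] == body_font,
    }

# Toy spans shaped like PyMuPDF output (values are made up).
spans = [
    {"font": "Times-Roman", "size": 10.0, "flags": 4,
     "text": "We have incurred significant losses since our inception."},
    {"font": "Times-Bold", "size": 10.0, "flags": 20,
     "text": "We have a history of losses."},
    {"font": "Times-Bold", "size": 14.0, "flags": 20,
     "text": "RISK FACTORS"},
    {"font": "Times-Roman", "size": 10.0, "flags": 4,
     "text": "Our product candidates are at an early stage of development."},
]
body_style = dominant_body_style(spans)
heading_features = relative_features(spans[2], body_style)
```

Because the features are ratios and comparisons against the document's own baseline, a 14 pt heading in a 10 pt filing and a 12.6 pt heading in a 9 pt filing look the same to downstream rules.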

2. Section localization — finds the Risk Factors heading and the next major section using regex patterns scored by typography (bold, caps, font size) and table-of-contents disambiguation.
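
A minimal sketch of regex matching scored by typography, with a simple table-of-contents filter. The patterns, the weights, and the line dictionary shape (text, bold, size) are illustrative assumptions; the project's actual patterns and scoring are not shown in this README.

```python
import re

HEADING_RE = re.compile(r"^\s*risk\s+factors\s*$", re.IGNORECASE)
# Table-of-contents entries usually end in dot leaders and/or a page number.
TOC_RE = re.compile(r"^\s*risk\s+factors[\s.]*\d+\s*$", re.IGNORECASE)

def score_candidate(line, body_size):
    """Score one line as the start of the Risk Factors section (higher is better)."""
    text = line["text"]
    if TOC_RE.match(text):
        return -1.0  # looks like a TOC entry, not the section itself
    if not HEADING_RE.match(text):
        return 0.0
    score = 1.0                                        # the regex matched
    score += 1.0 if line["bold"] else 0.0              # bold heading
    score += 0.5 if text.strip().isupper() else 0.0    # ALL-CAPS heading
    score += max(0.0, line["size"] / body_size - 1.0)  # larger than body text
    return score

def locate_heading(lines, body_size):
    """Return the index of the best-scoring candidate, or None if nothing matches."""
    best_idx, best_score = None, 0.0
    for i, line in enumerate(lines):
        score = score_candidate(line, body_size)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy lines: a TOC entry, a cross-reference, the real heading, and body text.
lines = [
    {"text": "Risk Factors ........ 12", "bold": False, "size": 10.0},
    {"text": 'See "Risk Factors" beginning on page 12.', "bold": False, "size": 10.0},
    {"text": "RISK FACTORS", "bold": True, "size": 14.0},
    {"text": "Investing in our common stock involves a high degree of risk.",
     "bold": False, "size": 10.0},
]
heading_idx = locate_heading(lines, body_size=10.0)
```

The end of the section can be found the same way by scoring candidate headings for the next major section that appear after the located index.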

3a. Heuristic segmentation (baseline) — a three-pass cascade that adapts to a document's formatting: font-style headings → indentation levels → text-pattern fallback, so it degrades gracefully from cleanly-formatted to plain-text PDFs.
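
The cascade idea can be sketched as a list of detectors tried in order, each one accepted only if it finds a plausible number of headings. The three detectors below are deliberately simplified stand-ins with hypothetical names, thresholds, and line fields (bold, x0, gap_before); they show the fall-through control flow, not the project's actual rules.

```python
import re
from collections import Counter

def by_font_style(lines):
    """Pass 1: bold lines are heading candidates."""
    return [i for i, line in enumerate(lines) if line["bold"]]

def by_indentation(lines):
    """Pass 2: lines whose left edge differs from the most common (body) edge."""
    body_x = Counter(round(line["x0"]) for line in lines).most_common(1)[0][0]
    return [i for i, line in enumerate(lines) if round(line["x0"]) != body_x]

SENTENCE_HEADING = re.compile(r"^(We|Our|If|The)\b.*\.$")

def by_text_pattern(lines):
    """Pass 3: short one-sentence lines that follow a vertical gap."""
    return [
        i for i, line in enumerate(lines)
        if line["gap_before"]
        and len(line["text"].split()) <= 30
        and SENTENCE_HEADING.match(line["text"])
    ]

def segment_headings(lines, min_headings=2):
    """Try each pass in order; accept the first that finds a plausible set of headings."""
    passes = [("font_style", by_font_style),
              ("indentation", by_indentation),
              ("text_pattern", by_text_pattern)]
    for name, detector in passes:
        found = detector(lines)
        # Plausible: at least min_headings, but fewer than half of all lines.
        if min_headings <= len(found) < len(lines) / 2:
            return name, found
    return "none", []

# Toy plain-text filing: no bold, uniform indentation, so passes 1 and 2 find nothing.
lines = [
    {"text": "We have a history of net losses.", "bold": False, "x0": 72.0, "gap_before": True},
    {"text": "We have incurred losses in each year since inception", "bold": False, "x0": 72.0, "gap_before": False},
    {"text": "and expect to incur further losses.", "bold": False, "x0": 72.0, "gap_before": False},
    {"text": "Our product candidates may fail in clinical trials.", "bold": False, "x0": 72.0, "gap_before": True},
    {"text": "Clinical failure can occur at any stage of testing", "bold": False, "x0": 72.0, "gap_before": False},
    {"text": "and results are inherently uncertain.", "bold": False, "x0": 72.0, "gap_before": False},
]
method, heading_idx = segment_headings(lines)
```

Returning the name of the pass that fired is a cheap way to report, per document, which formatting regime the baseline fell back to.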

3b. Learned segmentation (ml/) — a scikit-learn pipeline (standardized layout features + TF-IDF over line text → balanced logistic regression) classifies each line, then risks are reconstructed by collapsing consecutive heading runs and attaching their following body. Trainable on weak heuristic-distilled labels or hand-annotated gold.

Quickstart

python -m venv .venv && . .venv/bin/activate
pip install -e ".[ml]"        # omit [ml] for just the heuristic pipeline

Heuristic pipeline — extract risks and generate CSV/Excel/JSON reports:

# Single PDF → printed risks
prospectus-extract-risks "data/sample_pdfs/1A. Aegerion Pharmaceuticals.pdf"

# Whole folder → summary + detailed + JSON + Excel reports
prospectus-analyze data/sample_pdfs -o artifacts/reports/results.csv

Learned pipeline — build a dataset, train + evaluate, predict on new filings:

prospectus-ml-train   data/sample_pdfs --labels gold   # grouped-CV eval + saves model
prospectus-ml-predict data/sample_pdfs --eval          # score segmentation vs gold
prospectus-ml-predict "data/sample_pdfs/Aduro Biotech.pdf"   # predict on one PDF → JSON

Run the tests with pytest.

Repository layout

Path	Contents
src/prospectus_risk_extraction/	Core package: `analyzer.py`, `extract_risks.py`, and the `ml/` training/eval/predict pipeline
data/sample_pdfs/	15 biotech IPO prospectus PDFs
data/labels/	Hand-annotated gold risk factors + train/val/test splits
docs/	Annotation guidelines and evaluation methodology
tests/	pytest suite (feature/label consistency, word counting, imports)
artifacts/	Generated reports, models, and datasets

Tech stack

Python · PyMuPDF · pdfplumber · scikit-learn · pandas / NumPy · openpyxl · pytest
