Reviewer-revision cleanup: reproducibility (CV, benchmarking, demo, tests, CI, container) by tmyates · Pull Request #3 · biomedicalinformaticsgroup/LitDD_mining

tmyates · 2026-06-23T13:41:39Z

Addresses the Reviewer 2 Section C (code reproducibility) requests for the NCOMMS revision, plus the prior cleanup-branch work. main is unchanged; this is a reviewable branch to merge at the end of the revision.

What's here (10 commits)

Pinned env — requirements.txt now uses tested ==X.Y.* pins; added environment.yml (R2-R2).
Containers — containers/litdd.def (Apptainer/Singularity) + Dockerfile on CUDA 12.1 (R2-R3).
CI — .github/workflows/ci.yml: uv-based ruff lint + CPU unit tests on every push (R2-S2).
Tests — tests/test_llm_map.py, tests/test_crossencode.py (deterministic logic; GPU imports made lazy) + existing final_data_clean test; 22 pass, no GPU (R2-S1).
Missing stages added — hpo_annotations/ (cadmus → fulltext → HPO), visualisation/ (ce_tsne.py, datamap_plot.py), annotate_pubmed/bert_predict_vllm.py, plus cross_validation/ and benchmarking/ (R2-R1).
Packaging/docs — pyproject.toml, .gitignore; README reconciled to the shipped layout, uv install, single-node multi-GPU note (R2-R4/S4).

Full text (cadmus) and large ontology/embedding artefacts are gitignored and regenerable; the full text is not redistributed (publisher permissions). A release will be tagged to match the manuscript/Zenodo version on acceptance (R2-S3).

🤖 Generated with Claude Code

Pin requirements.txt to tested minor versions; add environment.yml, Apptainer/Docker containers, pyproject.toml (ruff/pytest config), and a uv-based GitHub Actions CI (ruff + CPU unit tests). Ignore large pipeline/ontology artefacts, notebooks, and the local revision/ area. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Group-stratified 80/20 split, conservative fine-tuning, and 5-fold StratifiedGroupKFold hyperparameter search on the training set only (BERT and cross-encoder). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Evaluate every baseline with the same CV-on-train + refit + held-out-test protocol as LitDD-BERT (fixes the previously anomalous ~15% baseline F1 from an untrained classification head). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Document cross-encoder scoring; make torch/vllm/sentence-transformers imports lazy so deterministic helpers are CPU-importable for unit tests; add bert_predict_vllm.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
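The lazy-import pattern this commit describes can be sketched as below. This is an illustration under stated assumptions, not the repository's code: `top_k` and `score_pairs` are hypothetical names, and the only library call used is `sentence_transformers.CrossEncoder` with its `predict` method.

```python
from typing import List, Sequence, Tuple


def top_k(scored: Sequence[Tuple[str, float]], k: int = 5) -> List[Tuple[str, float]]:
    """Deterministic helper: importable and testable without torch installed."""
    # Sort by descending score; break ties on the label so output is stable.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


def score_pairs(model_name: str, pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """GPU-side helper: the heavy import runs only when this is called."""
    from sentence_transformers import CrossEncoder  # lazy import (assumed layout)

    model = CrossEncoder(model_name)
    return [float(score) for score in model.predict(list(pairs))]
```

Because the heavy import sits inside `score_pairs`, a CPU-only test run can import the module and exercise `top_k` without the GPU stack being installed.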

…lisation run_cadmus.py -> get_fulltext_df.py -> extract_hpo.py (weighted + unweighted HPO profiles per G2P disease; full text fetched via cadmus, not redistributed). ce_tsne.py + datamap_plot.py reproduce the MONDO-labelled datamap figure (converted from the exploratory notebook). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

run_pipeline.sh --demo exercises the full training pipeline end-to-end on 100 rows with tiny CPU models (~5 min); --full runs the real models. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fixture-backed test for final_data_clean.py plus unit tests for llm_map.py (prompt/answer parsing, sharding) and crossencode.py (top-5 selection, G2P LGMDE builder). 22 tests, no GPU required. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

uv-based installation, container usage, single-node multi-GPU note, and the actual shipped pipeline layout (HPO, visualisation, tests, CI). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

build_fixtures.py and build_demo_data.py now read the reference clean_pipeline location from the LITDD_REF_DIR env var instead of an absolute dev path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
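Reading the reference location from the environment might look like the sketch below. Only the variable name `LITDD_REF_DIR` comes from the commit message; the helper name `reference_dir` and the choice to fail loudly when the variable is unset are assumptions.

```python
import os
from pathlib import Path


def reference_dir(env_var: str = "LITDD_REF_DIR") -> Path:
    """Return the reference clean_pipeline directory named by env_var."""
    value = os.environ.get(env_var)
    if not value:
        # Assumed behaviour: refuse to guess rather than fall back to a dev path.
        raise RuntimeError(f"Set {env_var} to the reference clean_pipeline directory")
    return Path(value).expanduser()
```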

Blinded stratified sampler (sample_audit.py: confidence/recency/disease-volume/gene-multiplicity strata, ~500 + 100 overlap), training-label IAA sampler (sample_trainlabel_iaa.py), cascade funnel (cascade_funnel.py), and scorer (score_audit.py: per-stratum precision + Wilson CIs, implied FP count, error categories, Cohen's kappa for both IAA exercises). Outputs are gitignored (annotator-facing, contain abstracts). Stats unit-tested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
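For readers unfamiliar with the two statistics named here, a minimal standard-library sketch follows. It uses the textbook formulas for the Wilson score interval and Cohen's kappa; the function names are hypothetical and nothing here is taken from score_audit.py.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

Per-stratum precision would then be `correct / total` within each stratum, with `wilson_interval(correct, total)` as its interval.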

final_traintest_dataset.py derives gene/g2p_id from g2p_lgmde so --group_col gene (gene-held-out) and --group_col g2p_id (disease-held-out) give splits where no gene/disease appears in both train and test — stronger generalisation tests than a TIAB-level split. datasets/sklearn imports made lazy; helper unit-tested; README updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
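The property being claimed, that no gene or disease appears on both sides of the split, can be illustrated with a small standard-library sketch. It is deliberately simpler than the stratified, group-aware split the PR describes: it ignores label balance, and `group_holdout_split` is a hypothetical name.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def group_holdout_split(
    groups: Sequence[Hashable], test_frac: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Assign whole groups to the test set until it holds ~test_frac of rows."""
    counts = Counter(groups)
    order = sorted(counts, key=str)        # fixed order before shuffling
    random.Random(seed).shuffle(order)     # reproducible for a given seed
    target = test_frac * len(groups)
    test_groups, n_test = set(), 0
    for group in order:
        if n_test >= target:
            break
        test_groups.add(group)
        n_test += counts[group]
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx
```

Passing the per-row gene gives a gene-held-out split, and passing the per-row G2P ID gives a disease-held-out split, mirroring the two `--group_col` choices.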

…3.1) --cutoff_year reports precision for records at/after the LLM knowledge cutoff vs before, reusing the audit's per-record year; similar precision either side argues against memorisation inflating results. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

--cutoff_year/--min_post_cutoff swap in post-cutoff (>=2024) records (default >=80/500) so the post- vs pre-cutoff precision comparison is adequately powered. Worksheet regenerated (gitignored): 500 units, 80 post-cutoff, strata balanced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
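The pre- versus post-cutoff comparison reduces to two precision figures over the audited records. A minimal sketch, assuming each audited record is a `(year, is_correct)` pair (the record shape and the function name are hypothetical):

```python
from typing import Dict, Iterable, Tuple


def precision_by_cutoff(
    records: Iterable[Tuple[int, bool]], cutoff_year: int = 2024
) -> Dict[str, Dict[str, float]]:
    """Precision for records before the cutoff vs at/after it."""
    tallies = {"pre": [0, 0], "post": [0, 0]}  # [correct, total]
    for year, is_correct in records:
        key = "post" if year >= cutoff_year else "pre"
        tallies[key][0] += int(is_correct)
        tallies[key][1] += 1
    return {
        key: {"n": total, "precision": correct / total if total else float("nan")}
        for key, (correct, total) in tallies.items()
    }
```

A small post-cutoff `n` makes that precision estimate noisy, which is why the sampler guarantees a minimum number of post-cutoff records.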

DeepSeek-R1-Distill-Qwen-14B is Qwen2.5-14B base (knowledge cutoff ~Dec 2023); distillation adds reasoning not knowledge, so 2024+ is post-cutoff (2025 = margin past DeepSeek-R1's 2024-07). Rationale baked into --cutoff_year help + README; deployed corpus has 6,616 (9.6%) >=2024 and 2,983 (4.3%) >=2025. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ing (R2-C1/R3.4) Quantifies how many score-passing mappings the gene-in-TIAB filter drops, and --no_gene_check produces the relaxed corpus (gene filter off) for the recall-recovery comparison the reviewer asks for. Unit-tested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…2-C1) Use pipeline_df_complete (raw LLM mappings) not the already-filtered map CSV. Real funnel: 782,230 BERT+ -> 335,684 LLM-mapped -> 153,627 score>=0.9 -> 76,480 deployed. Gene-mention filter drops 86,338 score-passing mappings (56.2%) -> gene_dropped_worksheet for the recall-cost audit (R3.4). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
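The quoted percentage can be checked directly from the counts in this commit message. The sketch below only restates those numbers; the variable and function names are hypothetical, and it does not attempt to reconcile the stages against one another.

```python
from typing import Dict, List, Tuple

# Counts copied from the commit message above.
STAGES: List[Tuple[str, int]] = [
    ("BERT+", 782_230),
    ("LLM-mapped", 335_684),
    ("score>=0.9", 153_627),
    ("deployed", 76_480),
]
GENE_DROPPED = 86_338  # score-passing mappings removed by the gene-mention filter


def share_of_first(stages: List[Tuple[str, int]]) -> Dict[str, float]:
    """Each stage's count as a fraction of the first stage."""
    first = stages[0][1]
    return {name: count / first for name, count in stages}


shares = share_of_first(STAGES)
dropped_pct = round(100 * GENE_DROPPED / dict(STAGES)["score>=0.9"], 1)
```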

Queries PubTator3 per PMID for gene annotations and writes gene2pubtator3 format consumable by final_data_clean.py, so the gene-mention filter can use fresh per-abstract annotations instead of the bulk download (which has coverage gaps that inflate the filter's attrition). Batched, with retry/back-off (robust to transient NCBI errors), optional API key, and resume. Verified live. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

I/O-bound thread pool (per-thread sessions, locked writes, resume-safe); 6 workers cut the ~129k-PMID score-passing fetch from ~2.7h to ~30 min at ~1 CPU / <1GB RAM. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
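The combination described in these two commits (batched fetch, retry with back-off, locked writes, resume) can be sketched with the standard library alone. This is not the repository's fetcher: `fetch` is a caller-supplied stand-in for the real HTTP request, `OSError` stands in for transient network errors, and per-thread sessions are omitted for brevity.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Set, Tuple

Batch = Tuple[str, ...]  # one batch of PMIDs


def with_retry(fn: Callable[[Batch], List[str]], batch: Batch,
               retries: int = 4, base_delay: float = 1.0) -> List[str]:
    """Call fn(batch); after each failure sleep base_delay * 2**attempt."""
    for attempt in range(retries):
        try:
            return fn(batch)
        except OSError:  # stand-in for transient network/HTTP errors
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be >= 1")


def fetch_all(batches: Sequence[Batch], fetch: Callable[[Batch], List[str]],
              done: Set[Batch], out: List[str],
              workers: int = 6, base_delay: float = 1.0) -> None:
    """Fetch every batch not already in `done`, appending rows to `out`."""
    lock = threading.Lock()

    def work(batch: Batch) -> None:
        if batch in done:  # resume: skip batches finished in an earlier run
            return
        rows = with_retry(fetch, batch, base_delay=base_delay)
        with lock:  # locked writes: one thread appends at a time
            out.extend(rows)
            done.add(batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, batches))  # list() surfaces any worker exception
```

Threads suit this workload because each worker spends most of its time waiting on the network, so the interpreter lock is rarely contended.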

… non-reportable Per-disease (G2P ID) micro+macro PMID-retrieval recall vs HPOA + pre-mined G2P (case-report-weighted, in-scope), restricted to Aug-2025 DDG2P disorders and >1980 (--min_year). Deployed recall: premined 0.65, HPOA 0.62 (0.72 with gene filter relaxed), combined 0.63->0.69; in-corpus 0.74->0.81. Fixed HPOA multi-PMID reference parsing. ClinGen computed but excluded from the paper (web-scraped + functional/mechanistic out-of-scope evidence; see NOTES.md). fetch_pmid_meta.py adds year/pubtype via esummary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
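Micro and macro recall differ only in where the averaging happens. A minimal sketch, assuming truth and mined PMIDs are held as sets keyed by disease ID (the data shapes and the function name are hypothetical):

```python
from typing import Dict, Set, Tuple


def recall_micro_macro(
    truth: Dict[str, Set[str]], mined: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Return (micro, macro) PMID-retrieval recall over diseases in `truth`."""
    hits = total = 0
    per_disease = []
    for disease, pmids in truth.items():
        if not pmids:
            continue  # a disease with no truth PMIDs has undefined recall
        found = len(pmids & mined.get(disease, set()))
        hits += found
        total += len(pmids)
        per_disease.append(found / len(pmids))
    micro = hits / total if total else float("nan")
    macro = sum(per_disease) / len(per_disease) if per_disease else float("nan")
    return micro, macro
```

Micro recall pools all truth PMIDs, so heavily curated diseases dominate; macro recall averages per-disease recall, so every disease counts equally.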

…ix (R3.4/R2-C1) Reconcile the recall harness to manuscript Table 6:
- Restrict the recall universe (all sources incl. G2P) to leaf MONDOs, i.e. single gene-diseases, dropping broad grouping terms (e.g. dilated cardiomyopathy, skeletal dysplasia) whose curated literature spans many genes. Uses MONDO obographs (--mondo_json); supersedes the multi-g2p flag. Excluded entries -> excluded_grouping_mondos.csv.
- ClinGen = case-level genetic_evidence_* only (Reference(PMID) column), matching the manuscript's "case level evidence"; experimental/functional tables excluded.
- Exclude train/test PMIDs (--exclude_pmids); premined uses `publications` only, never `additional mined publications`.
- Fix mined_deployed: split ';'-separated multi-G2P-ID cells (8% of rows); they were treated as one key, undercounting deployed recall by ~8%.

deployed now reproduces Table 6 (premined 0.68/0.72, HPOA 0.69/0.72, ClinGen 0.71/0.72) and ~= relaxed, so the gene-mention filter costs only ~1-4 recall points.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
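The multi-ID fix amounts to splitting each cell before keying on it. A minimal sketch, assuming rows arrive as `(pmid, g2p_id_cell)` pairs; the function name and the ID strings in the usage note are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def pmids_by_g2p_id(rows: Iterable[Tuple[str, str]], sep: str = ";") -> Dict[str, Set[str]]:
    """Map each G2P ID to its PMIDs, splitting cells that hold several IDs."""
    by_id: Dict[str, Set[str]] = defaultdict(set)
    for pmid, cell in rows:
        for g2p_id in cell.split(sep):
            g2p_id = g2p_id.strip()
            if g2p_id:  # ignore empty fragments such as a trailing separator
                by_id[g2p_id].add(pmid)
    return dict(by_id)
```

Without the split, a cell such as `"ID_A;ID_B"` becomes a single key that matches neither disease, which is how the undercount described above arises.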

Categorise every deployed-corpus recall miss (not_in_corpus / mapped_other / below_score / gene_filtered / llm_no_match) and tag each by NCBI publication type (in-scope vs review/editorial/etc.), to show whether the recall gap is a corpus-coverage boundary or a model failure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…R3.4)
- Rename the miss category not_in_corpus -> litdd_bert_negative and the recall scope in_corpus -> bert_positive. LitDD's BERT runs over all of PubMed (gene seeding was only for building the train/test corpus), so these papers were BERT-classified negative, not "out of corpus".
- Add bert_negative_gene_check.py: for each BERT-negative miss, test whether the G2P gene (symbol or previous/alias symbol) is mentioned in the title+abstract. ~29% mention no gene at all: candidate papers with no molecular confirmation (phenotype described, causative gene unpublished) that the pipeline excludes by design. PubTator over full text refines this (the accurate, alias/name-aware test).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
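A title+abstract gene-mention test of the kind described can be approximated with a boundary-aware regular expression. This sketch is an assumption about the approach, not bert_negative_gene_check.py itself: `mentions_gene` is a hypothetical name and matching is case-sensitive by choice, since many gene symbols collide with ordinary words when case is ignored.

```python
import re
from typing import Iterable


def mentions_gene(text: str, symbols: Iterable[str]) -> bool:
    """True if any symbol (current, previous, or alias) occurs as a whole token."""
    for symbol in symbols:
        if not symbol:
            continue
        # Reject matches embedded in a longer alphanumeric token (e.g. SCN1AB).
        pattern = r"(?<![A-Za-z0-9])" + re.escape(symbol) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text):
            return True
    return False
```

A string match like this misses gene names written out in full, which is the gap the commit notes PubTator annotations are meant to close.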

…uth (R3.4) The mined corpus excludes non-English papers and review compendia, so the truth sets must too for a fair recall comparison:
- fetch_pmid_meta.py: also capture language + journal/booktitle (GeneReviews and StatPearls are NCBI Bookshelf chapters with empty `source`, named in `booktitle`); checkpoint incrementally so transient NCBI 5xx don't lose progress.
- build_truthsets.py --exclude_meta: drop non-English (lang != eng) and GeneReviews/StatPearls truth PMIDs. Reports counts (31 GeneReviews, 2 StatPearls, 19 non-English) for the results text.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
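The exclusion rule can be written as a single predicate over per-PMID metadata. In this sketch the keys `lang` and `booktitle` follow the wording of the commit message, but the record shape, the function name, and the treatment of a missing language as English are all assumptions.

```python
from typing import Mapping

EXCLUDED_BOOKS = ("genereviews", "statpearls")


def keep_truth_pmid(meta: Mapping[str, str]) -> bool:
    """Keep a truth PMID only if it is English and not a review compendium."""
    if (meta.get("lang") or "eng") != "eng":
        return False  # non-English papers are not in the mined corpus
    booktitle = (meta.get("booktitle") or "").lower()
    return not any(name in booktitle for name in EXCLUDED_BOOKS)
```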

…a exclusions (R3.4) NOTES.md: record the non-English/GeneReviews/StatPearls truth exclusions and the BERT-negative no-molecular-confirmation analysis (945/1155 misses name the gene; 200 name none -> design exclusion lifting deployed recall 0.66/0.71 -> 0.68/0.72). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tmyates and others added 24 commits June 23, 2026 13:23

Add CPU demo and master pipeline runner

5446c69


Update README: uv install, shipped layout, reproducibility

18b33f2


tmyates wants to merge 24 commits into main from cleanup/reviewer-fixes
