Reviewer-revision cleanup: reproducibility (CV, benchmarking, demo, tests, CI, container)#3

Open
tmyates wants to merge 24 commits into main from cleanup/reviewer-fixes

@tmyates (Collaborator) commented Jun 23, 2026

Addresses the Reviewer 2 Section C (code reproducibility) requests for the NCOMMS revision, plus the prior cleanup-branch work. main is unchanged; this is a reviewable branch to merge at the end of the revision.

What's here (10 commits)

  • Pinned env: requirements.txt now uses tested ==X.Y.* pins; added environment.yml (R2-R2).
  • Containers: containers/litdd.def (Apptainer/Singularity) + Dockerfile on CUDA 12.1 (R2-R3).
  • CI: .github/workflows/ci.yml: uv-based ruff lint + CPU unit tests on every push (R2-S2).
  • Tests: tests/test_llm_map.py, tests/test_crossencode.py (deterministic logic; GPU imports made lazy) + existing final_data_clean test; 22 pass, no GPU (R2-S1).
  • Missing stages added: hpo_annotations/ (cadmus → fulltext → HPO), visualisation/ (ce_tsne.py, datamap_plot.py), annotate_pubmed/bert_predict_vllm.py, plus cross_validation/ and benchmarking/ (R2-R1).
  • Packaging/docs: pyproject.toml, .gitignore; README reconciled to the shipped layout, uv install, single-node multi-GPU note (R2-R4/S4).

Full text (cadmus) and large ontology/embedding artefacts are gitignored and regenerable; the full text is not redistributed (publisher permissions). A release will be tagged to match the manuscript/Zenodo version on acceptance (R2-S3).

🤖 Generated with Claude Code

tmyates and others added 24 commits June 23, 2026 13:23
Pin requirements.txt to tested minor versions; add environment.yml, Apptainer/Docker containers, pyproject.toml (ruff/pytest config), and a uv-based GitHub Actions CI (ruff + CPU unit tests). Ignore large pipeline/ontology artefacts, notebooks, and the local revision/ area.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Group-stratified 80/20 split, conservative fine-tuning, and 5-fold StratifiedGroupKFold hyperparameter search on the training set only (BERT and cross-encoder).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Evaluate every baseline with the same CV-on-train + refit + held-out-test protocol as LitDD-BERT (fixes the previously anomalous ~15% baseline F1 from an untrained classification head).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document cross-encoder scoring; make torch/vllm/sentence-transformers imports lazy so deterministic helpers are CPU-importable for unit tests; add bert_predict_vllm.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
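
The lazy-import pattern described in the commit above can be illustrated roughly as follows. This is a minimal sketch, not the code in crossencode.py: the function names `top_k_indices` and `load_cross_encoder` are hypothetical, and the only assumption about the real module is that its deterministic helpers do not need a GPU library at import time.

```python
from typing import Sequence


def top_k_indices(scores: Sequence[float], k: int = 5) -> list[int]:
    """Deterministic helper: indices of the k highest scores.

    sorted() is stable, so tied scores keep their input order.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:k]


def load_cross_encoder(model_name: str):
    """GPU-dependent loader (hypothetical name).

    The heavy import is deferred to call time, so importing this module
    for unit tests does not require sentence-transformers or torch.
    """
    from sentence_transformers import CrossEncoder  # deferred import

    return CrossEncoder(model_name)
```

With this layout a CPU-only test can import the module and exercise `top_k_indices` without the GPU stack installed; only a call to `load_cross_encoder` triggers the heavy import.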
…lisation

run_cadmus.py -> get_fulltext_df.py -> extract_hpo.py (weighted + unweighted HPO profiles per G2P disease; full text fetched via cadmus, not redistributed). ce_tsne.py + datamap_plot.py reproduce the MONDO-labelled datamap figure (converted from the exploratory notebook).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run_pipeline.sh --demo exercises the full training pipeline end-to-end on 100 rows with tiny CPU models (~5 min); --full runs the real models.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixture-backed test for final_data_clean.py plus unit tests for llm_map.py (prompt/answer parsing, sharding) and crossencode.py (top-5 selection, G2P LGMDE builder). 22 tests, no GPU required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
uv-based installation, container usage, single-node multi-GPU note, and the actual shipped pipeline layout (HPO, visualisation, tests, CI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
build_fixtures.py and build_demo_data.py now read the reference clean_pipeline location from the LITDD_REF_DIR env var instead of an absolute dev path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
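
Reading the reference location from LITDD_REF_DIR, as the commit above describes, might look like the sketch below. The helper name `reference_dir` and the choice to fail loudly when the variable is unset are assumptions for illustration; the actual scripts may fall back to a default instead.

```python
import os
from pathlib import Path


def reference_dir(env_var: str = "LITDD_REF_DIR") -> Path:
    """Resolve the reference clean_pipeline directory from the environment.

    Raising when the variable is missing is an assumed behaviour; it avoids
    silently falling back to a developer-specific absolute path.
    """
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"Set {env_var} to the reference clean_pipeline directory")
    return Path(value).expanduser()
```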
Blinded stratified sampler (sample_audit.py: confidence/recency/disease-volume/gene-multiplicity strata, ~500 + 100 overlap), training-label IAA sampler (sample_trainlabel_iaa.py), cascade funnel (cascade_funnel.py), and scorer (score_audit.py: per-stratum precision + Wilson CIs, implied FP count, error categories, Cohen's kappa for both IAA exercises). Outputs are gitignored (annotator-facing, contain abstracts). Stats unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
final_traintest_dataset.py derives gene/g2p_id from g2p_lgmde so --group_col gene (gene-held-out) and --group_col g2p_id (disease-held-out) give splits where no gene/disease appears in both train and test — stronger generalisation tests than a TIAB-level split. datasets/sklearn imports made lazy; helper unit-tested; README updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
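
The property the commit above relies on, that no gene or disease appears on both sides of the split, can be sketched with the standard library alone. This is an illustration of a group-disjoint split, not the logic of final_traintest_dataset.py: `group_holdout_split` is a hypothetical name, and unlike the group-stratified split mentioned elsewhere in this PR it does not balance labels.

```python
import random
from collections import defaultdict


def group_holdout_split(groups, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) such that no group value spans both sides."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    keys = sorted(by_group)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(keys)

    target = test_frac * len(groups)
    train_idx, test_idx = [], []
    for g in keys:
        # Whole groups go to the test side until it reaches the target size,
        # so the realised test fraction can overshoot slightly.
        if len(test_idx) < target:
            test_idx.extend(by_group[g])
        else:
            train_idx.extend(by_group[g])
    return sorted(train_idx), sorted(test_idx)
```

Passing the per-row gene symbols as `groups` gives a gene-held-out split; passing G2P IDs gives a disease-held-out split.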
…3.1)

--cutoff_year reports precision for records at/after the LLM knowledge cutoff vs before, reusing the audit's per-record year; similar precision either side argues against memorisation inflating results.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
--cutoff_year/--min_post_cutoff swap in post-cutoff (>=2024) records (default >=80/500) so the post- vs pre-cutoff precision comparison is adequately powered. Worksheet regenerated (gitignored): 500 units, 80 post-cutoff, strata balanced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DeepSeek-R1-Distill-Qwen-14B is Qwen2.5-14B base (knowledge cutoff ~Dec 2023); distillation adds reasoning not knowledge, so 2024+ is post-cutoff (2025 = margin past DeepSeek-R1's 2024-07). Rationale baked into --cutoff_year help + README; deployed corpus has 6,616 (9.6%) >=2024 and 2,983 (4.3%) >=2025.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing (R2-C1/R3.4)

Quantifies how many score-passing mappings the gene-in-TIAB filter drops, and --no_gene_check produces the relaxed corpus (gene filter off) for the recall-recovery comparison the reviewer asks for. Unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2-C1)

Use pipeline_df_complete (raw LLM mappings) not the already-filtered map CSV. Real funnel: 782,230 BERT+ -> 335,684 LLM-mapped -> 153,627 score>=0.9 -> 76,480 deployed. Gene-mention filter drops 86,338 score-passing mappings (56.2%) -> gene_dropped_worksheet for the recall-cost audit (R3.4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Queries PubTator3 per PMID for gene annotations and writes gene2pubtator3 format consumable by final_data_clean.py, so the gene-mention filter can use fresh per-abstract annotations instead of the bulk download (which has coverage gaps that inflate the filter's attrition). Batched, with retry/back-off (robust to transient NCBI errors), optional API key, and resume. Verified live.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
I/O-bound thread pool (per-thread sessions, locked writes, resume-safe); 6 workers cut the ~129k-PMID score-passing fetch from ~2.7h to ~30 min at ~1 CPU / <1GB RAM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
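
The batching, retry/back-off, resume, and thread-pool ideas from the two fetcher commits above fit together roughly as in this sketch. It is an assumption-laden illustration: the network call is injected as a `fetch` callable, so no real PubTator3 endpoint or response format is implied; the names `with_retry` and `fetch_all` are hypothetical; and the per-thread sessions and locked writes mentioned in the commit are omitted.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(fetch, batch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(batch), retrying with exponential back-off on any exception."""
    for attempt in range(retries):
        try:
            return fetch(batch)
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


def fetch_all(pmids, fetch, batch_size=100, workers=6, done=frozenset(), **retry_kw):
    """Fetch results for PMIDs not already in `done`, in batches, across threads.

    `fetch` must accept a list of PMIDs and return a dict keyed by PMID.
    """
    todo = [p for p in pmids if p not in done]  # resume: skip finished PMIDs
    batches = [todo[i:i + batch_size] for i in range(0, len(todo), batch_size)]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields in submission order; merging happens in this thread,
        # so the results dict needs no lock.
        for batch_result in pool.map(lambda b: with_retry(fetch, b, **retry_kw), batches):
            results.update(batch_result)
    return results
```

Threads suit this workload because each worker spends most of its time waiting on the network, which is consistent with the low CPU and memory figures reported in the commit.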
… non-reportable

Per-disease (G2P ID) micro+macro PMID-retrieval recall vs HPOA + pre-mined G2P (case-report-weighted, in-scope), restricted to Aug-2025 DDG2P disorders and >1980 (--min_year). Deployed recall: premined 0.65, HPOA 0.62 (0.72 with gene filter relaxed), combined 0.63->0.69; in-corpus 0.74->0.81. Fixed HPOA multi-PMID reference parsing. ClinGen computed but excluded from the paper (web-scraped + functional/mechanistic out-of-scope evidence; see NOTES.md). fetch_pmid_meta.py adds year/pubtype via esummary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
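
As a reminder of how micro and macro recall differ when computed per disease, here is a small sketch. It assumes truth and retrieved PMIDs are held as sets keyed by G2P ID; `recall_micro_macro` is a hypothetical name and the recall harness in this PR may organise its data differently.

```python
def recall_micro_macro(truth, retrieved):
    """truth / retrieved: dict mapping disease ID -> set of PMIDs.

    Micro recall pools all truth PMIDs; macro recall averages per-disease recall,
    so diseases with few truth papers weigh as much as heavily studied ones.
    """
    hits = {
        d: len(pmids & retrieved.get(d, set()))
        for d, pmids in truth.items()
        if pmids  # diseases with no truth PMIDs have undefined recall; skip them
    }
    if not hits:
        raise ValueError("no disease has any truth PMIDs")
    micro = sum(hits.values()) / sum(len(truth[d]) for d in hits)
    macro = sum(hits[d] / len(truth[d]) for d in hits) / len(hits)
    return micro, macro
```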
…ix (R3.4/R2-C1)

Reconcile the recall harness to manuscript Table 6:
- Restrict the recall universe (all sources incl. G2P) to leaf MONDOs — single
  gene-diseases, dropping broad grouping terms (e.g. dilated cardiomyopathy,
  skeletal dysplasia) whose curated literature spans many genes. Uses MONDO
  obographs (--mondo_json); supersedes the multi-g2p flag. Excluded entries ->
  excluded_grouping_mondos.csv.
- ClinGen = case-level genetic_evidence_* only (Reference(PMID) column), matching
  the manuscript's "case level evidence"; experimental/functional tables excluded.
- Exclude train/test PMIDs (--exclude_pmids); premined uses `publications` only,
  never `additional mined publications`.
- Fix mined_deployed: split ';'-separated multi-G2P-ID cells (8% of rows) — they
  were treated as one key, undercounting deployed recall by ~8%.

deployed now reproduces Table 6 (premined 0.68/0.72, HPOA 0.69/0.72, ClinGen
0.71/0.72) and ~= relaxed, so the gene-mention filter costs only ~1-4 recall points.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
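
The multi-G2P-ID fix in the commit above amounts to splitting each ';'-separated cell before keying on it. A minimal sketch, assuming rows arrive as (cell, PMID) pairs; the function name and the example IDs in the usage note are hypothetical:

```python
from collections import defaultdict


def index_by_g2p_id(rows, sep=";"):
    """Map each individual G2P ID to its set of PMIDs.

    rows: iterable of (g2p_ids_cell, pmid), where the cell may hold several
    IDs joined by `sep`. Treating the whole cell as one key would hide these
    PMIDs from every per-disease lookup.
    """
    index = defaultdict(set)
    for cell, pmid in rows:
        for g2p_id in cell.split(sep):
            g2p_id = g2p_id.strip()
            if g2p_id:  # ignore empty fragments such as a trailing separator
                index[g2p_id].add(pmid)
    return dict(index)
```

For example, a row whose cell is "G2P00001;G2P00002" contributes its PMID to both diseases rather than to a single composite key.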
Categorise every deployed-corpus recall miss (not_in_corpus / mapped_other /
below_score / gene_filtered / llm_no_match) and tag each by NCBI publication
type (in-scope vs review/editorial/etc.), to show whether the recall gap is a
corpus-coverage boundary or a model failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…R3.4)

- Rename the miss category not_in_corpus -> litdd_bert_negative and the recall
  scope in_corpus -> bert_positive. LitDD's BERT runs over all of PubMed (gene
  seeding was only for building the train/test corpus), so these papers were
  BERT-classified negative, not "out of corpus".
- Add bert_negative_gene_check.py: for each BERT-negative miss, test whether the
  G2P gene (symbol or previous/alias symbol) is mentioned in the title+abstract.
  ~29% mention no gene at all — candidate papers with no molecular confirmation
  (phenotype described, causative gene unpublished) that the pipeline excludes by
  design. PubTator over full text refines this (the accurate, alias/name-aware test).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
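
A title+abstract gene-mention test of the kind described above could be approximated as below. This is a sketch under stated assumptions, not the logic of bert_negative_gene_check.py: matching is case-sensitive on whole alphanumeric tokens, and the caller supplies the current symbol together with any previous or alias symbols.

```python
import re


def mentions_gene(text, symbols):
    """True if any symbol occurs in text as a whole token.

    The lookarounds stop a symbol matching inside a longer alphanumeric token,
    while still allowing hyphenated forms such as "KMT2A-related".
    """
    for sym in symbols:
        pattern = r"(?<![A-Za-z0-9])" + re.escape(sym) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text):
            return True
    return False
```

Plain string matching of this kind cannot recognise gene names written out in full, which is why the commit notes that PubTator over full text is the more accurate test.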
…uth (R3.4)

The mined corpus excludes non-English papers and review compendia, so the truth
sets must too for a fair recall comparison:
- fetch_pmid_meta.py: also capture language + journal/booktitle (GeneReviews and
  StatPearls are NCBI Bookshelf chapters with empty `source`, named in `booktitle`);
  checkpoint incrementally so transient NCBI 5xx don't lose progress.
- build_truthsets.py --exclude_meta: drop non-English (lang != eng) and
  GeneReviews/StatPearls truth PMIDs. Reports counts (31 GeneReviews, 2 StatPearls,
  19 non-English) for the results text.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…a exclusions (R3.4)

NOTES.md: record the non-English/GeneReviews/StatPearls truth exclusions and the
BERT-negative no-molecular-confirmation analysis (945/1155 misses name the gene;
200 name none -> design exclusion lifting deployed recall 0.66/0.71 -> 0.68/0.72).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>