Skip to content

Fix integer overflow in LDA partition bookkeeping for large corpora#227

Open
sldyns wants to merge 1 commit into
bab2min:mainfrom
sldyns:fix/lda-partition-overflow
Open

Fix integer overflow in LDA partition bookkeeping for large corpora#227
sldyns wants to merge 1 commit into
bab2min:mainfrom
sldyns:fix/lda-partition-overflow

Conversation

@sldyns
Copy link
Copy Markdown

@sldyns sldyns commented Apr 16, 2026

Summary

This fixes integer overflow in the LDA multi-worker partition path for very large corpora.

Root cause

LDAModel::updatePartition() accumulated total token counts with a literal 0, which makes std::accumulate sum in 32-bit integer space before assigning back to a wider type.

For corpora whose total token count exceeds INT32_MAX, this can overflow and corrupt partition bookkeeping in the PARTITION path.

Changes

  • use uint64_t for the partition total-count accumulator
  • use uint64_t for the partition cumulative-count accumulator
  • use uint64_t for totCf in the term-weighting path

Notes

This change is intended to preserve behavior for normal corpora while avoiding overflow for large-token inputs.

@sldyns
Copy link
Copy Markdown
Author

sldyns commented Apr 20, 2026

I took a closer look at the failures.

Aside from macOS Intel + Python 3.9, the remaining failures all appear to be the same flaky reproducibility test (test_DTModel_reproducibility_NONE). The topic words and ordering are unchanged, and the diffs are only tiny floating-point differences (~1e-9), so this looks like an overly strict exact-equality assertion rather than a regression from this patch.

The macOS Intel + Python 3.9 failure is different (test_corpus_transform) and looks like a separate metadata/Unicode issue in the DMR/document binding path. This PR only changes src/TopicModel/LDAModel.hpp, so I do not think that failure is caused by this patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant