Fix integer overflow in LDA partition bookkeeping for large corpora by sldyns · Pull Request #227 · bab2min/tomotopy

sldyns · 2026-04-16T05:40:29Z

Summary

This fixes integer overflow in the LDA multi-worker partition path for very large corpora.

Root cause

LDAModel::updatePartition() accumulated total token counts with a literal 0, which makes std::accumulate sum in 32-bit integer space before assigning back to a wider type.

For corpora whose total token count exceeds INT32_MAX, this can overflow and corrupt partition bookkeeping in the PARTITION path.

Changes

use uint64_t for the partition total-count accumulator
use uint64_t for the partition cumulative-count accumulator
use uint64_t for totCf in the term-weighting path

Notes

This change is intended to preserve behavior for normal corpora while avoiding overflow for large-token inputs.

sldyns · 2026-04-20T07:48:39Z

I took a closer look at the failures.

Aside from macOS Intel + Python 3.9, the remaining failures all appear to be the same flaky reproducibility test (test_DTModel_reproducibility_NONE). The topic words and ordering are unchanged, and the diffs are only tiny floating-point differences (~1e-9), so this looks like an overly strict exact-equality assertion rather than a regression from this patch.

The macOS Intel + Python 3.9 failure is different (test_corpus_transform) and looks like a separate metadata/Unicode issue in the DMR/document binding path. This PR only changes src/TopicModel/LDAModel.hpp, so I do not think that failure is caused by this patch.

fix: avoid integer overflow in LDA partition bookkeeping

566c29a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix integer overflow in LDA partition bookkeeping for large corpora#227

Fix integer overflow in LDA partition bookkeeping for large corpora#227
sldyns wants to merge 1 commit into
bab2min:mainfrom
sldyns:fix/lda-partition-overflow

sldyns commented Apr 16, 2026

Uh oh!

sldyns commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

sldyns commented Apr 16, 2026

Summary

Root cause

Changes

Notes

Uh oh!

sldyns commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant