Speed up simulation hot path (histories + in-place pool update) #50

Open
ythat wants to merge 1 commit into CausalInference:master from ythat:perf-refactor

ythat commented Jun 24, 2026

Hi! Thank you for developing and maintaining the gfoRmula package.

It has been very helpful both at work and in the summer courses at CAUSALab. 😊

With the help of Claude, I'm proposing some changes that improve performance while producing output identical to the original package's.

The short version: across every scenario I tested, the point estimates and bootstrap CIs are identical under all.equal() to those from current master. The PR touches only two files and adds no dependencies (still pure R on data.table).

What changed

R/histories.R

The history helpers (lagged, lagavg, cumavg, visit_sum) were rebuilding the same time-slice subset several times and constructing a full-length get(id_name) %in% current_ids filter once per history variable per time step. The refactor caches each time slice's row indices once and aligns source rows to the current rows by id with match(), then writes by reference with data.table::set(). The first-creation cumulative-average branches swap the repeated filtered tapply() calls for grouped means via split()/vapply() with id-indexed assignment. This relies on the existing invariant that the pool has one row per id per time in a consistent order.
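To illustrate the pattern described above, here is a minimal sketch on a toy pool. It is not the code in R/histories.R: the column names (id, t0, L1), the output columns (lag1_L1, cumavg_L1), and the rows_by_time cache are assumptions made up for this example. It assumes one row per id per time, as the PR does.

```r
library(data.table)

# Toy pool (hypothetical columns): one row per id per time, consistent order.
pool <- data.table(
  id = rep(c("a", "b", "c"), each = 3),
  t0 = rep(0:2, times = 3),
  L1 = c(1, 2, 3, 10, 20, 30, 100, 200, 300)
)

# Cache each time slice's row indices once, instead of re-filtering per variable.
rows_by_time <- split(seq_len(nrow(pool)), pool$t0)

t <- 2
cur_rows  <- rows_by_time[[as.character(t)]]
prev_rows <- rows_by_time[[as.character(t - 1)]]

# Align previous-time rows to the current rows by id with match().
src <- prev_rows[match(pool$id[cur_rows], pool$id[prev_rows])]

# Lagged value: create the column first, then write by reference with set().
set(pool, j = "lag1_L1", value = NA_real_)
set(pool, i = cur_rows, j = "lag1_L1", value = pool$L1[src])

# Cumulative average through time t: grouped means via split()/vapply(),
# assigned back by id name.
hist_rows <- unlist(rows_by_time[as.character(0:t)], use.names = FALSE)
means <- vapply(split(pool$L1[hist_rows], pool$id[hist_rows]), mean, numeric(1))
set(pool, j = "cumavg_L1", value = NA_real_)
set(pool, i = cur_rows, j = "cumavg_L1",
    value = unname(means[as.character(pool$id[cur_rows])]))
```

Indexing the grouped means by id name rather than by position keeps the assignment correct even if the ids in the current slice come in a different order than the groups returned by split().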

R/simulate.R

Two small changes:

  • The mutated time-t slice was written back with pool[pool[[time_name]] == t] <- newdf, a row-subassignment that round-trips the whole table. It's now updated in place at the precomputed row index with set() (creating any new columns first).
  • The survival product-limit update was re-subsetting the previous time slice three times (five with a competing event) to read prodp0/prodd0/poprisk. It now grabs that slice once and reuses it; the arithmetic is untouched.
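The first change can be sketched as follows. This is an illustration on a toy table, not the code in R/simulate.R: the pool layout, the column names (Y, Py), and the newdf contents are assumptions for the example.

```r
library(data.table)

# Toy pool (hypothetical columns): 3 ids, 2 time points.
pool <- data.table(id = rep(1:3, each = 2), t0 = rep(0:1, times = 3), Y = 0)

t <- 1
row_idx <- which(pool$t0 == t)  # row index for time t, computed once

# newdf stands in for the mutated time-t slice: one changed column, one new one.
newdf <- pool[row_idx]
newdf[, Y := c(0.1, 0.2, 0.3)]
newdf[, Py := c(0.5, 0.6, 0.7)]

# Create any new columns first, filled with an NA of the matching type.
for (col in setdiff(names(newdf), names(pool))) {
  set(pool, j = col, value = newdf[[col]][NA_integer_])
}

# Then overwrite only the time-t rows, by reference.
for (col in names(newdf)) {
  set(pool, i = row_idx, j = col, value = newdf[[col]])
}
```

Because set() modifies the table by reference, rows outside row_idx are never copied or rewritten, which is where the saving over whole-table row subassignment comes from.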

Benchmark

The benchmark uses the bundled basicdata_nocomp example (the documented gformula_survival spec; 2,500 ids, 7 time points), so it is fully reproducible without any external data. The same call was run on master and on this branch. Machine: Apple Silicon, R 4.3.2, data.table 1.17.8.

| Configuration | master | this PR | speedup | results |
| --- | --- | --- | --- | --- |
| Point estimate (nsamples = 0) | 0.45s | 0.24s | 1.86× | ✅ identical |
| Bootstrap 100, sequential | 35.61s | 18.69s | 1.91× | ✅ identical |
| Bootstrap 100, parallel (16 cores)¹ | 12.59s | 10.73s | 1.17× | ✅ identical |

¹ On a dataset this small the parallel run is dominated by fixed cluster-setup overhead (16 cores give the baseline only ~2.8× over its own serial run), so the per-simulation savings show up less in wall-clock time here. They scale up with the number of time points, covariates, and bootstrap samples. For instance, on a larger analysis I work with (~7k subjects, 48 time points, 10 time-varying covariates) the bootstrap speedup was around 2.3×, also with identical results.
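A comparison of this kind could be scripted roughly as below. The helpers time_one and compare_runs are hypothetical names invented for this sketch, and run_old/run_new are stand-in functions; in a real check each would wrap the same gformula call (with the same seed) against master and against this branch.

```r
# Hypothetical helper: time a zero-argument function and keep its result.
time_one <- function(f) {
  elapsed <- system.time(res <- f())[["elapsed"]]
  list(elapsed = elapsed, result = res)
}

# Hypothetical helper: run both versions, report timings and result equality.
compare_runs <- function(run_old, run_new) {
  old <- time_one(run_old)
  new <- time_one(run_new)
  list(
    old_seconds = old$elapsed,
    new_seconds = new$elapsed,
    same_result = isTRUE(all.equal(old$result, new$result))
  )
}

# Stand-ins for "same call on master" and "same call on this branch".
x <- as.numeric(1:1e5)
run_old <- function() {
  total <- 0
  for (v in x) total <- total + v
  total
}
run_new <- function() sum(x)

cmp <- compare_runs(run_old, run_new)
```

Here cmp$same_result reports whether the two results agree under all.equal(), and the two elapsed times give the wall-clock comparison.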

I also spot-checked that output stays identical across integer / character / factor / non-sequential ids, the continuous_eof outcome type, and a competing-event survival analysis.

I hope this is helpful 🙂

Cache time-slice row indices and align source rows by id in the history
helpers (lagged/lagavg/cumavg/visit_sum) instead of rebuilding subsets and
full-length %in% filters per history variable. Update the time-t slice in
simulate() by reference rather than via row subassignment, and fetch the
previous time slice once in the survival update. Results are unchanged
(all.equal-identical to before); ~1.8x faster on the bundled survival example.