Speed up simulation hot path (histories + in-place pool update) by ythat · Pull Request #50 · CausalInference/gfoRmula

ythat · 2026-06-24T08:08:31Z

Hi! Thank you for developing and maintaining the gfoRmula package.

It has been very helpful both at work and in the summer courses at CAUSALab. 😊

With the help of Claude, I'm proposing some changes that improve performance while producing output identical to that of the original package.

The short version: Across every scenario I tested, the point estimates and bootstrap CIs come out all.equal()-identical to the current master. It touches only two files and adds no dependencies (still pure R on data.table).

What changed

R/histories.R

The history helpers (lagged, lagavg, cumavg, visit_sum) were rebuilding the same time-slice subset several times and constructing a full-length get(id_name) %in% current_ids filter once per history variable per time step. The refactor caches each time slice's row indices once and aligns source rows to the current rows by id with match(), then writes by reference with data.table::set(). The first-creation cumulative-average branches swap the repeated filtered tapply() calls for grouped means via split()/vapply() with id-indexed assignment. This relies on the existing invariant that the pool has one row per id per time in a consistent order.
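As a rough illustration of the caching and alignment idea described above, here is a minimal sketch on a toy pool. It is not the code from R/histories.R: the helper name `add_lag1`, the `slice_rows` cache, and the toy columns are assumptions made up for this example, and it handles only a lag-1 value.

```r
library(data.table)

# Toy pool: one row per id per time, mirroring the invariant noted above.
pool <- data.table(
  id = rep(c("a", "b", "c"), each = 3),
  t0 = rep(0:2, times = 3),
  L1 = c(1, 2, 3, 10, 20, 30, 100, 200, 300)
)

# Cache the row indices of every time slice once, keyed by time value.
slice_rows <- split(seq_len(nrow(pool)), pool$t0)

# Hypothetical helper: write the lag-1 value of `var` into the time-t rows.
add_lag1 <- function(pool, slice_rows, var, t, id_name = "id") {
  cur  <- slice_rows[[as.character(t)]]
  prev <- slice_rows[[as.character(t - 1)]]
  out  <- paste0("lag1_", var)
  # Align previous-time rows to the current rows by id, not by position.
  src <- prev[match(pool[[id_name]][cur], pool[[id_name]][prev])]
  # Create the target column first if needed, then assign by reference.
  if (!out %in% names(pool)) set(pool, j = out, value = NA_real_)
  set(pool, i = cur, j = out, value = pool[[var]][src])
  invisible(pool)
}

add_lag1(pool, slice_rows, "L1", t = 2)
lag_at_2 <- pool$lag1_L1[pool$t0 == 2]
```

In this sketch the time-2 rows receive the time-1 values of `L1` for the same ids, and rows at other times keep `NA`. Because `match()` aligns by id, an id that is missing from the previous slice would get `NA` rather than shifting the other values.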

R/simulate.R

Two small changes:

The mutated time-t slice was written back with pool[pool[[time_name]] == t] <- newdf, a row-subassignment that round-trips the whole table. It's now updated in place at the precomputed row index with set() (creating any new columns first).
The survival product-limit update was re-subsetting the previous time slice three times (five with a competing event) to read prodp0/prodd0/poprisk. It now grabs that slice once and reuses it; the arithmetic is untouched.
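The in-place slice update can be sketched roughly as follows. This is a simplified illustration under assumed names (`rows_t`, a toy `newdf`, toy columns `A` and `Y`), not the actual code in R/simulate.R, and it leaves out the survival product-limit step entirely.

```r
library(data.table)

# Toy pool with two time points per id.
pool <- data.table(
  id = rep(1:3, each = 2),
  t0 = rep(0:1, times = 3),
  A  = c(0, 0, 1, 0, 0, 0)
)

t <- 1
rows_t <- which(pool$t0 == t)   # row index of the time-t slice, computed once

# Stand-in for the mutated time-t slice produced by a simulation step.
newdf <- pool[rows_t]
newdf[, A := c(1, 1, 0)]
newdf[, Y := c(0.1, 0.2, 0.3)]  # a column that does not yet exist in pool

# Create any new columns in pool first, filled with a typed NA.
for (col in setdiff(names(newdf), names(pool))) {
  set(pool, j = col, value = newdf[[col]][NA_integer_])
}

# Write the slice back by reference at the precomputed rows,
# instead of pool[pool$t0 == t] <- newdf.
for (col in names(newdf)) {
  set(pool, i = rows_t, j = col, value = newdf[[col]])
}
```

Only the time-t rows are touched here, so rows at other times keep their existing values and get `NA` in the newly created column.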

Benchmark

The benchmark uses the bundled basicdata_nocomp example (the documented gformula_survival spec; 2,500 ids, 7 time points), so it is fully reproducible without any external data. The same call was run on master and on this branch. Machine: Apple Silicon, R 4.3.2, data.table 1.17.8.

Configuration	master	this PR	speedup	results
Point estimate (`nsamples = 0`)	0.45s	0.24s	1.86×	✅ identical
Bootstrap 100, sequential	35.61s	18.69s	1.91×	✅ identical
Bootstrap 100, parallel (16 cores)¹	12.59s	10.73s	1.17×	✅ identical

¹ On a dataset this small the parallel run is dominated by fixed cluster-setup overhead (16 cores give the baseline only ~2.8× over its own serial run), so the per-simulation savings show up less in wall-clock time here. They scale up with the number of time points, covariates, and bootstrap samples. For instance, on a larger analysis I work with (~7k subjects, 48 time points, 10 time-varying covariates) the bootstrap speedup was around 2.3×, also with identical results.

I also spot-checked that output stays identical across integer / character / factor / non-sequential ids, the continuous_eof outcome type, and a competing-event survival analysis.
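For a sense of what such a spot check can look like, here is a small base-R sketch that compares a filter-based lag lookup against a `match()`-based one across several id types. The functions `lag_filter` and `lag_match`, the toy data, and the id values are all assumptions for illustration; this is not the test that was run against the package.

```r
# Reference: full-length %in% filter, relying on row order within the slice.
lag_filter <- function(df, t) {
  cur_ids <- df$id[df$t0 == t]
  df$x[df$t0 == t - 1 & df$id %in% cur_ids]
}

# Candidate: slice row indices aligned by id with match().
lag_match <- function(df, t) {
  cur  <- which(df$t0 == t)
  prev <- which(df$t0 == t - 1)
  df$x[prev[match(df$id[cur], df$id[prev])]]
}

# Toy data: three non-sequential ids, three time points each.
make_df <- function(ids) {
  data.frame(id = rep(ids, each = 3), t0 = rep(0:2, times = 3),
             x = as.numeric(1:9))
}

base_ids <- c(7L, 3L, 42L)
id_variants <- list(
  integer   = base_ids,
  character = as.character(base_ids),
  factor    = factor(base_ids)
)

# One TRUE/FALSE per id type: do the two lookups agree at t = 2?
checks <- vapply(id_variants, function(ids) {
  df <- make_df(ids)
  isTRUE(all.equal(lag_filter(df, 2), lag_match(df, 2)))
}, logical(1))
```

If every entry of `checks` is `TRUE`, the two lookups agree for that id type on this toy data. That says nothing about the full package, where the end-to-end comparison is the one reported above.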

I hope this is helpful 🙂

Cache time-slice row indices and align source rows by id in the history helpers (lagged/lagavg/cumavg/visit_sum) instead of rebuilding subsets and full-length %in% filters per history variable. Update the time-t slice in simulate() by reference rather than via row subassignment, and fetch the previous time slice once in the survival update. Results are unchanged (all.equal-identical to before); ~1.8x faster on the bundled survival example.

ythat wants to merge 1 commit into CausalInference:master from ythat:perf-refactor
