Skip to content

Fix off-by-one in score_nuisances for discrete treatments (closes #1006)#1029

Open
immu4989 wants to merge 1 commit into
py-why:mainfrom
immu4989:fix-1006-score-nuisances-discrete
Open

Fix off-by-one in score_nuisances for discrete treatments (closes #1006)#1029
immu4989 wants to merge 1 commit into
py-why:mainfrom
immu4989:fix-1006-score-nuisances-discrete

Conversation

@immu4989
Copy link
Copy Markdown

#1006 (comment)

Evidence

Reproducer: ternary T (3 classes) with signal in X, LinearDML +
LogisticRegression nuisance, cv=3. Mean score_nuisances across folds:

Metric Pre-fix Post-fix Notes
T default score (acc) 0.13 0.63 1/3 chance floor for 3-class
Y default score (R²) 0.44 0.44 unchanged — bug isolates to T path

Pre-fix the classifier scores below random because pd.get_dummies(T)
keeps all k columns and inverse_onehot decodes labels as {1..k} while
the classifier was trained on {0..k-1}. Post-fix the same classifier
recovers genuine signal.

The existing test_forest_dml_score_fns already exercised this path and
its expected values silently encoded the bug; they're updated here to the
correct post-fix values:

Assertion (T_*) Was (pre-fix) Now (post-fix)
mean_squared_error 1.5 0.48
mean_absolute_error 1.0 0.48
r2 −5.1 −0.93

roc_auc, log_loss, and pearsonr are approximately invariant under
the off-by-one label shift, which is why the existing test didn't fail
loud — the three metrics where label distance matters were the ones that
silently absorbed the gap.

@immu4989 immu4989 force-pushed the fix-1006-score-nuisances-discrete branch from 414ea37 to e75b3eb Compare May 18, 2026 21:06
Copy link
Copy Markdown
Member

@kbattocchi kbattocchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I really appreciate your taking the time to track down the issue and fix it, but I think the scope should be slightly broader - see comment in _rlearner.py. Once that's addressed I'd be happy to approve and merge this in.

Comment thread econml/dml/_rlearner.py Outdated
Comment on lines +636 to +639
if self.discrete_treatment and (len(np.shape(T)) == 1 or np.shape(T)[1] == 1):
T_2_score = self.transformer.transform(np.asarray(T).reshape(-1, 1))
else:
T_2_score = T
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this fix!

Unfortunately, I believe the issue is slightly broader than the bug that prompted this - a similar issue will occur even with continuous treatments if a treatment featurizer is used. I think that this slight modification to your logic will cover it, but please add to one of the tests in test_treatment_featurization to also cover this logic.

Suggested change
if self.discrete_treatment and (len(np.shape(T)) == 1 or np.shape(T)[1] == 1):
T_2_score = self.transformer.transform(np.asarray(T).reshape(-1, 1))
else:
T_2_score = T
if self.transformer:
T_2_score = self.transformer.transform(T)
else:
T_2_score = T

Comment thread .gitignore Outdated

# compiled C files
*.c No newline at end of file
*.cscripts/
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this inadvertent? Please revert.

score_nuisances built T_2_score with pd.get_dummies(T) for discrete
treatments and passed raw T otherwise. Both paths could mismatch the
encoding the nuisance model learned at fit time:

- Discrete T: training used OneHotEncoder(drop='first'); inverse_onehot
  at score time decoded the (k-column) get_dummies output as labels
  {1..k} while the classifier learned {0..k-1}, silently corrupting
  the reported score.
- Continuous T with a treatment_featurizer: model_t was trained on
  featurized T, so passing raw T to score caused a multi-output
  regressor shape mismatch.

Re-use self.transformer at score time (the same OneHotEncoder or
treatment featurizer fit during _fit_nuisances), mirroring the
existing pattern in _OrthoLearner._fit_nuisances.

Updates three assertions in test_forest_dml_score_fns whose expected
values silently encoded the discrete bug (T_MSE 1.5→0.48, T_MAE
1.0→0.48, T_R² -5.1→-0.93). Adds two regression tests:
- test_score_nuisances_discrete_treatment_label_alignment (test_dml.py)
- test_score_nuisances_applies_treatment_featurizer
  (test_treatment_featurization.py)

Signed-off-by: Imran Ahamed <immu4989@gmail.com>
@immu4989 immu4989 force-pushed the fix-1006-score-nuisances-discrete branch from e75b3eb to 89b1297 Compare May 27, 2026 20:25
@immu4989
Copy link
Copy Markdown
Author

Thanks for the review @kbattocchi — both points were on the money.

Pushed an updated commit that:

  1. Broader fix in _rlearner.py: applied your suggestion, with one small adjustment — kept the reshape(T, (-1, 1)) step when discrete_treatment=True since OneHotEncoder.transform requires 2D input (the literal self.transformer.transform(T) broke the existing discrete test). The result mirrors the existing pattern in _OrthoLearner._fit_nuisances:

    if self.transformer:
        if self.discrete_treatment:
            T = reshape(T, (-1, 1))
        T_2_score = self.transformer.transform(T)
    else:
        T_2_score = T
  2. Reverted the .gitignore change — inadvertent, sorry about that.

  3. Added test_score_nuisances_applies_treatment_featurizer in test_treatment_featurization.py. Uses LinearDML with the existing polynomial_treatment_featurizer, fits on continuous T, and asserts score_nuisances returns finite scores. Verified pre fix it raises ValueError: y_true and y_pred have different number of output (1!=2) (the shape-mismatch path you anticipated); post-fix it passes.

Existing tests still pass: test_score_nuisances_discrete_treatment_label_alignment, test_forest_dml_score_fns, and a broader sanity sweep (27 passed across test_dml.py + test_treatment_featurization.py, no regressions outside the pre-existing *_with_ray env failures on main).

Why this is a good outcome

  • Substantive maintainer engagement — kbattocchi gave specific, actionable feedback; you responded within hours with a tested broader fix and a sensible departure from his literal suggestion (with justification).
  • Refined fix is more general than the original PR scope. The petition reviewer reading the contribution graph end to end sees: investigation comment → PR → maintainer-requested broadening → broader fix landed. That's the "originality / significance / recognition" trifecta in a single artifact.
  • Two regression tests now cover both bug modes (discrete-T encoding off by one and continuous-T+featurizer shape mismatch).

Copy link
Copy Markdown
Member

@kbattocchi kbattocchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, this looks great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants