feat: enhance meaning units and utility#156
Conversation
Greptile SummaryThis PR enhances the meaning-unit pipeline with a two-level
Confidence Score: 5/5Safe to merge. The importance weighting flows correctly through extraction, QA generation, and scoring with a safe flat-mean fallback when data is missing. All code paths for the new importance field are correctly wired: the enum is declared, carried through both LLM prompt stages, stored in the schemas, looked up by ID in _compute_metrics_columns, and passed correctly to compute_utility_score. The required_columns decorator was updated to include COL_QUALITY_QA, preventing silent missing-data bugs. Tests cover the weighted, mismatch, and None fallback cases. The only observation is a prompt-level ambiguity about how strictly to score critical units with abstracted answers, which does not affect code correctness. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["LLM: Meaning Unit Extraction\n(MeaningUnitsSchema)"] -->|"units with importance:\ncritical | important"| B["Serialize to JSON\n(_serialize_meaning_units)"]
B --> C["LLM: QA Generation\n(QualityQAPairsSchema)"]
C -->|"items with importance\ncarried through"| D["LLM: Quality Re-answer\n(QualityAnswersSchema)"]
D --> E["LLM: Quality Compare\n(QACompareResultsSchema)\ngrading rules now importance-aware"]
E -->|"ids + scores"| F["_compute_metrics_columns"]
C -->|"importance lookup\nby item.id"| F
F -->|"importance_levels list"| G["compute_utility_score\ncritical=2.0x, important=1.0x\nweighted mean"]
G --> H["COL_UTILITY_SCORE"]
Reviews (6): Last reviewed commit: "Update tests/engine/test_evaluate.py" | Re-trigger Greptile |
|
We will never have the legacy problem greptile is referring to. There are couple small changes needed by the sensitivity PR that are duplicated in this PR, so that both will merge smoothly: Removing reference to "sensitiive_attribute" and adding checks in validate_protection_consistency for combined_risk_level. |
|
I think this looks good, but to be sure, could you merge/rebase latest main into this PR? I did that locally and the diff becomes much clearer: the duplicated sensitivity-disposition changes are absorbed from main, and this PR is left with the intended quality/importance changes. But I wanna make sure that propagates correctly. Also one small test ask: could we add direct coverage for the new weighted utility behavior in |
8111523 to
99ed115
Compare
lipikaramaswamy
left a comment
There was a problem hiding this comment.
Looks good. The rebase makes the diff much cleaner: the sensitivity-disposition changes are coming from main, and this PR is focused on meaning-unit importance / weighted utility.
Thanks for adding the weighted utility tests.
Q/non-blocking: do we still want compute_utility_score to fall back to a flat mean when importance_levels is missing or length-mismatched? Since importance is now required and this is going into a new release (0.2.0), a mismatch could arguably fail loudly instead of silently using the old scoring behavior.
Co-authored-by: lipikaramaswamy <31832945+lipikaramaswamy@users.noreply.github.com>
Changes include:
-- Meaning units should preserve only the minimum semantic detail
necessary to retain the text's functional utility.
-- <segmentation_rules> updated (emphasizing one idea per unit)
-- Each meaning unit now has an importance. We later use this to weight quality_qa failures. Eventually we may use it if we repair quality_qa. We may also use it if we implement rare semantic bundles.