feat: add anonymous run-level telemetry (#155)
Conversation
Emit one anonymizer_event per Anonymizer.run() / Anonymizer.preview() call, matching the schema in nemo-telemetry's anonymous_events.json. Telemetry is opt-out via NEMO_TELEMETRY_ENABLED=false, AnonymizerConfig(emit_telemetry=False), or the --no-emit-telemetry CLI flag. DD's session_ids are tagged with the anonymizer- prefix so anonymizer-driven LLM calls are filterable in dashboards. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile Summary

This PR adds anonymous run-level telemetry to NeMo Anonymizer, emitting one event per `Anonymizer.run()` / `Anonymizer.preview()` call.
Confidence Score: 5/5

Safe to merge — telemetry is fully isolated from the main pipeline, all exceptions are swallowed, and CI has emission disabled. The telemetry path is additive and wrapped in best-effort guards throughout; there is no change to core anonymization logic. The two flagged items are data-quality suggestions for the emitted payload, not pipeline correctness issues or blockers. Test coverage is thorough across opt-out paths, transformation types, and HTTP send semantics. No files require special attention for merge safety.
Sequence Diagram

```mermaid
sequenceDiagram
    participant U as Caller
    participant A as Anonymizer.run()/preview()
    participant P as _run_internal()
    participant T as _maybe_emit_telemetry()
    participant TH as TelemetryHandler
    participant E as Telemetry Endpoint
    U->>A: run(config, data)
    A->>A: read_input() to input_df, t_start
    A->>P: _run_internal()
    alt Success
        P-->>A: AnonymizerResult
        A->>A: status = COMPLETED
    else Exception
        P-->>A: raises Exception
        A->>A: status = ERROR or CANCELED
    end
    A->>T: finally _maybe_emit_telemetry(status, result)
    T->>T: check config.emit_telemetry and _telemetry_enabled()
    alt Opt-out active
        T-->>A: return (no-op)
    else Telemetry enabled
        T->>T: _build_telemetry_event (tiktoken, model resolution)
        T->>TH: with TelemetryHandler as h, h.enqueue(event)
        TH->>TH: flush via asyncio.run(_flush_events())
        TH->>E: POST /v1.1/events/json
        alt 2xx or 4xx
            E-->>TH: drop, no retry
        else 5xx or timeout
            E-->>TH: _add_to_dlq (effectively dropped)
        end
        TH-->>T: __exit__
    end
    T-->>A: swallows any exception
    A-->>U: result or re-raises original exception
```
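The exactly-once emission path in the diagram can be sketched as follows. This is a simplified stand-in, not the real implementation: the class, method bodies, and status strings only mirror the names `_run_internal` / `_maybe_emit_telemetry` and the statuses from this PR.

```python
# Minimal sketch of the try/finally wrapping that guarantees one telemetry
# event per run() call, on both the success and the exception exit paths.
class SketchAnonymizer:
    def __init__(self) -> None:
        self.emitted: list[str] = []  # stand-in for the telemetry handler

    def _run_internal(self, data):
        if data is None:
            raise ValueError("bad input")
        return {"rows": len(data)}

    def _maybe_emit_telemetry(self, status: str) -> None:
        try:
            # the real code checks opt-outs, builds the event, enqueues it
            self.emitted.append(status)
        except Exception:
            pass  # telemetry is best-effort: never break the caller

    def run(self, data):
        status = "ERROR"
        try:
            result = self._run_internal(data)
            status = "COMPLETED"
            return result
        finally:
            # runs on success AND on exception: exactly one event per call
            self._maybe_emit_telemetry(status)
```

On success the `finally` block sees `COMPLETED`; on an exception it emits `ERROR` and then re-raises the original, matching the diagram's two exit paths.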
Reviews (3): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..."
```python
from anonymizer.telemetry import (
    NOT_APPLICABLE,
    AnonymizerEvent,
    TaskEnum,
    TaskStatusEnum,
    TelemetryHandler,
    avg_tokens_per_record,
    classify_model_host,
    collect_model_hosts,
    sort_join_aliases,
)
```
When `NEMO_TELEMETRY_ENABLED=false` is set but `config.emit_telemetry` is still `True`, execution falls through the first guard and calls `_build_telemetry_event`, which encodes every row of `input_df[COL_TEXT]` through tiktoken before the event is silently discarded inside `enqueue()`. For large datasets this imposes O(n) tokenization overhead on every run even with the env-var opt-out active. The `_telemetry_enabled()` guard (already present in `telemetry.py`) needs to be checked here, before the event is built.
Suggested change:

```diff
 from anonymizer.telemetry import (
     NOT_APPLICABLE,
     AnonymizerEvent,
     TaskEnum,
     TaskStatusEnum,
     TelemetryHandler,
+    _telemetry_enabled,
     avg_tokens_per_record,
     classify_model_host,
     collect_model_hosts,
     sort_join_aliases,
 )
```
Good catch — fixed in a57c0c7. Added the _telemetry_enabled() short-circuit in _maybe_emit_telemetry before _build_telemetry_event so the env-var opt-out path no longer pays the tiktoken cost. Updated test_env_var_disables_emission to verify the short-circuit (no event reaches the handler at all).
```python
try:
    if not getattr(config, "emit_telemetry", True):
        return
    event = self._build_telemetry_event(
```
Second half of the fix: short-circuit before `_build_telemetry_event` when the env-var opt-out is active. Without this guard, tiktoken encodes all records and all step models are resolved even though the resulting event will be discarded by `enqueue()`.
Suggested change:

```diff
 try:
     if not getattr(config, "emit_telemetry", True):
         return
+    if not _telemetry_enabled():
+        return
     event = self._build_telemetry_event(
```
```python
with patch("httpx.AsyncClient", side_effect=RuntimeError("boom")):
    asyncio.run(handler._flush_events.__wrapped__(handler)) if False else None  # noqa
    # call _flush_events with an event present
```
This line is unreachable dead code — the `if False` condition prevents it from ever executing. It looks like a leftover from a debugging session and should be removed to keep the test body clean.
Suggested change:

```diff
 with patch("httpx.AsyncClient", side_effect=RuntimeError("boom")):
-    asyncio.run(handler._flush_events.__wrapped__(handler)) if False else None  # noqa
     # call _flush_events with an event present
```
Removed in a57c0c7. Leftover from when I was debugging unwrapped vs wrapped coroutines.
```python
failure_counts = _collect_failure_counts(failed)
hosts = _resolve_model_hosts(self._resolved_providers)

return AnonymizerEvent(
```
😅 Yeah — it's mostly a 30-kwarg AnonymizerEvent(...) constructor call. The actual computation is already extracted into helpers (_collect_step_models, _collect_failure_counts, _resolve_model_hosts, _transformation_type_string, etc.); the method body is just wiring those into the event. I tried sketching a split and didn't find a clean boundary that wasn't just hiding the constructor behind a wrapper. Open to suggestions if you see one.
nah... other than a builder/factory or passing in a state object that pops things off (like models could be passed in then extract). but meh
```python
def _step_to_field(step: str) -> str:
    """Map a FailedRecord.step (workflow_name) to a schema failure-count field key."""
    if step == "entity-detection":
```
Done in a57c0c7 — converted to match/case. The two regex-matched cases use case _ if <regex>.match(step): guards which works cleanly.
```python
"entity_validator": sort_join_aliases(det.entity_validator or []),
"entity_augmenter": det.entity_augmenter or NOT_APPLICABLE,
# latent_detector only runs in rewrite mode
"latent_detector": (det.latent_detector or NOT_APPLICABLE) if has_rewrite else NOT_APPLICABLE,
```
seems kinda weird, could make a _default_na() method for all these not applicable, but seems like there's ONE case where it's has_substitute instead of has_rewrite, so prolly not worth it
Agreed — the has_substitute vs has_rewrite asymmetry is exactly why a generic _default_na() would obscure more than it'd save. Leaving as-is.
```python
        return ModelHostEnum.NVIDIA_INTERNAL
    if "openrouter.ai" in endpoint:
        return ModelHostEnum.OPENROUTER
    if any(h in endpoint for h in ("localhost", "127.0.0.1", "0.0.0.0", "[::1]")):
```
is this going character by character? or is `endpoint` a list?
endpoint is a single URL string — the in checks are substring searches against it (e.g. "build.nvidia.com" in "https://build.nvidia.com/v1/chat" → True). Added a clarifying comment + pulled the local-host list to a named tuple in a57c0c7 so it reads more obviously.
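To make the substring semantics concrete, a standalone sketch; `LOCAL_HOSTS` and the helper name here are illustrative, only the string-`in`-string behavior is the point.

```python
# `h in endpoint` on two strings is a substring search, not per-character
# iteration or list membership. LOCAL_HOSTS mirrors the tuple in the
# snippet above.
LOCAL_HOSTS = ("localhost", "127.0.0.1", "0.0.0.0", "[::1]")

def is_local_endpoint(endpoint: str) -> bool:
    return any(h in endpoint for h in LOCAL_HOSTS)
```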
```python
    captured_events: list[AnonymizerEvent],
    stub_input: AnonymizerInput,
) -> None:
    anonymizer, _, _, _ = _make_anonymizer()
```
I can't remember, does

```diff
-anonymizer, _, _, _ = _make_anonymizer()
+anonymizer, * = _make_anonymizer()
```

work too or does it complain that `*` is unused?
Yep — anonymizer, *_ = _make_anonymizer() works (just _ alone would error since _make_anonymizer returns a 4-tuple; you need *_ to slurp the rest). Applied across all 13 sites in a57c0c7.
```markdown
## Telemetry and Privacy

NeMo Anonymizer collects anonymous run-level telemetry to help prioritize product improvements. One event is sent per `Anonymizer.run()` / `Anonymizer.preview()` call, containing only technical metadata: the replacement strategy in use, models used, model hosts (e.g. `nvidia-build`, `openrouter`, `other`), input-record counts, run duration, and failure attribution by pipeline step. **No user data, record contents, prompts, or model outputs are collected.**
```
as with NSS, should we also link to/have this info in our docs?
Done in a57c0c7 — added a "Telemetry and Privacy" section to docs/index.md mirroring NSS's pattern, and the README link now points to that anchor (#telemetry-and-privacy).
Greptile fixes:

- Short-circuit on `NEMO_TELEMETRY_ENABLED=false` before `_build_telemetry_event`, so disabling telemetry no longer pays the tiktoken cost on every record
- Remove dead `if False else None` line in `test_client_setup_failure_routes_to_dlq`
- Document on `_add_to_dlq` that DLQ retries are unreachable in fire-and-flush usage; preserved for DD shape compatibility

Matt's suggestions:

- `_step_to_field`: if/elif chain -> match/case
- `classify_model_host`: clarify that `host in endpoint` is substring search on a URL string, not iteration over characters
- Test helpers: `anonymizer, _, _, _` -> `anonymizer, *_` (13 call sites)
- Add Telemetry and Privacy section to docs/index.md (mirroring NSS); README links to that anchor

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```
…telemetry

# Conflicts:
#	src/anonymizer/interface/anonymizer.py
```
Summary

- `src/anonymizer/telemetry.py` with `AnonymizerEvent` + `TelemetryHandler` (DD-style fire-and-flush, NSS-style env handling)
- `Anonymizer.run()` / `Anonymizer.preview()` wrapped with try/except/finally so every exit path (completed/error/canceled) emits exactly one event
- `NEMO_SESSION_PREFIX=anonymizer-` and `NEMO_DEPLOYMENT_TYPE=sdk|cli` via `os.environ.setdefault` so DataDesigner's existing per-LLM-call telemetry is filterable to anonymizer-driven runs
- Opt-out via the `NEMO_TELEMETRY_ENABLED=false` env var, `AnonymizerConfig(emit_telemetry=False)`, or the `--no-emit-telemetry` CLI flag

Event shape matches the `anonymizer_event` schema in nemo-telemetry (38 fields, raw integer counts, tiktoken `cl100k_base` for `avgTokensPerRecord` mirroring DD's column-stat approach, `modelHosts` as a sorted-deduped list).

Known limitation

`FailedRecord.step` is workflow-name granularity today, so `entityValidatorFailureCount` and `entityAugmenterFailureCount` cannot be separated — both bucket into `entityDetectionFailureCount`. Same for the rewrite pipeline.
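A sketch of the `avgTokensPerRecord` computation mentioned above. Per this PR the real helper tokenizes with tiktoken's `cl100k_base` encoding; the encoder is injected here as a parameter (my illustration, not the real signature) so the sketch runs without tiktoken installed.

```python
from typing import Callable

def avg_tokens_per_record(texts: list[str], encode: Callable[[str], list]) -> float:
    # Real usage would pass: tiktoken.get_encoding("cl100k_base").encode
    if not texts:
        return 0.0
    return sum(len(encode(t)) for t in texts) / len(texts)
```

With a whitespace tokenizer standing in for tiktoken, `avg_tokens_per_record(["a b c", "d e"], str.split)` averages 3 and 2 tokens to 2.5.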
Coordination

This PR can land in main safely — CI has `NEMO_TELEMETRY_ENABLED=false`. Alternatively, we wait until the `nemo-telemetry` MR lands, flip to true, and then merge this.

Test plan

- `uv run pytest` — 721 tests pass (including 56 new unit + 21 new integration)
- `make format-check`, `make copyright-check`, `make lock-check`
- `anonymizer validate ... --no-emit-telemetry` accepted
- `anonymizer run` emits a real event against the staging endpoint (post-merge, post-schema-deploy)