Detect and replace sensitive entities in text using LLM-powered workflows.
- Detect entities using GLiNER-PII and LLM-based augmentation and validation
- Replace with 4 strategies — LLM-generated substitute, redact, annotate, or hash (deterministic, local)
- Preview results before full runs with `display_record()` visualization
git clone https://github.com/NVIDIA-NeMo/Anonymizer.git
cd Anonymizer
make install

By default, Anonymizer uses models hosted on build.nvidia.com: GLiNER-PII for entity detection and a text LLM for augmentation/validation. You can also bring your own models via custom provider configs.
The default build.nvidia.com (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.
Request and token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with preview() on a small sample, then move to your own endpoint for production data and usage.
export NVIDIA_API_KEY="your-nvidia-api-key"

Tip: All examples below use `uv run` to invoke commands. If you prefer, activate the venv with `source .venv/bin/activate` and run commands directly.
DATA_URL="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv"
# Preview on a small sample
uv run anonymizer preview --source $DATA_URL --text-column biography --replace redact --num_records 3
# Full run with output file
uv run anonymizer run --source $DATA_URL --text-column biography --replace redact --output result.csv
# Validate config without running
uv run anonymizer validate --source $DATA_URL --text-column biography --replace hash

Run `anonymizer --help` or `anonymizer <subcommand> --help` for all options.
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Redact
DATA_URL = "https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv"
# Uses default model providers (build.nvidia.com) via NVIDIA_API_KEY env var
anonymizer = Anonymizer()
config = AnonymizerConfig(replace=Redact())
preview = anonymizer.preview(
config=config,
data=AnonymizerInput(source=DATA_URL, text_column="biography"),
num_records=3,
)
# Visualize with entity highlights and replacement map
preview.display_record()
# Most important columns only
preview.dataframe
# Full pipeline trace, including internal underscore-prefixed columns
preview.trace_dataframe

For custom model endpoints, pass a providers YAML:
anonymizer = Anonymizer(model_providers="path/to/model_providers.yaml")

Anonymizer has been tested most extensively on English-language data. Multilingual quality has not yet been evaluated systematically across languages, domains, and models.
Although testing so far has been primarily in English, the supported entity set is not limited to U.S.-specific identifiers. Detection and anonymization can also apply to international formats such as non-U.S. phone numbers, addresses, legal references, and national or regional identification numbers, though coverage will vary by language, region, and model configuration.
If you are working with another language, we encourage you to experiment on a small sample first with preview(), validate detected entities and transformed output carefully, and adjust your model providers and model configs as needed.
| Strategy | Output for "Alice" (first_name) | Configurable |
|---|---|---|
| Substitute | Maya | instructions |
| Redact | [REDACTED_FIRST_NAME] | format_template |
| Annotate | <Alice, first_name> | format_template |
| Hash | <HASH_FIRST_NAME_3bc51062973c> | format_template, algorithm, digest_length |
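
To make the table concrete, here is a minimal standard-library sketch of what the three local strategies might do to a detected span. It is an illustration, not Anonymizer's implementation: the helper names (`redact`, `annotate`, `hash_entity`, `replace_entities`), the default templates, the `{digest}` placeholder, and the way the digest is derived are all assumptions chosen to mimic the shape of the outputs shown above. The actual hash values Anonymizer produces may differ.

```python
import hashlib

# Hypothetical helpers that mimic the table's output shapes.
# Default templates and digest derivation are assumptions, not library behavior.


def redact(text: str, label: str, format_template: str = "[REDACTED_{label}]") -> str:
    """Replace the entity with a marker that reveals only its label."""
    return format_template.format(text=text, label=label.upper())


def annotate(text: str, label: str, format_template: str = "<{text}, {label}>") -> str:
    """Keep the original text and tag it with its label."""
    return format_template.format(text=text, label=label)


def hash_entity(
    text: str,
    label: str,
    format_template: str = "<HASH_{label}_{digest}>",
    algorithm: str = "sha256",
    digest_length: int = 12,
) -> str:
    """Replace the entity with a truncated hex digest; same input, same output."""
    digest = hashlib.new(algorithm, text.encode("utf-8")).hexdigest()[:digest_length]
    return format_template.format(label=label.upper(), digest=digest)


def replace_entities(text, entities, strategy):
    """Apply a strategy to (start, end, label) spans, right to left so
    earlier offsets stay valid while the string changes length."""
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + strategy(text[start:end], label) + text[end:]
    return text


sample = "Alice met Bob in Paris."
entities = [(0, 5, "first_name"), (10, 13, "first_name"), (17, 22, "city")]

print(replace_entities(sample, entities, redact))
# [REDACTED_FIRST_NAME] met [REDACTED_FIRST_NAME] in [REDACTED_CITY].
print(replace_entities(sample, entities, annotate))
# <Alice, first_name> met <Bob, first_name> in <Paris, city>.
print(replace_entities(sample, entities, hash_entity))
```

Because the hash depends only on the entity text, repeated mentions of the same value map to the same token, which keeps records linkable without exposing the original. Substitute is omitted here because it relies on an LLM rather than a local transformation.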
from anonymizer import AnonymizerConfig, Annotate, Hash, Redact, Substitute
# LLM-generated contextual replacements
AnonymizerConfig(replace=Substitute())
# Constant redaction
AnonymizerConfig(replace=Redact(format_template="****"))
# Annotation with entity tagging
AnonymizerConfig(replace=Annotate(format_template="<{text}-|-{label}>"))
# Deterministic hash with short digest
AnonymizerConfig(replace=Hash(algorithm="sha256", digest_length=8))

This repo ships a Claude Code skill at skills/anonymizer/ that elicits your dataset's privacy requirements, recommends Rewrite or Replace with a strategy, and drafts a runnable script for you to iterate on. The skill should work with other coding agents that support skills, but development and testing have focused on Claude Code at this stage.
Install via skills.sh:
npx skills add NVIDIA-NeMo/Anonymizer

After installation, invoke it with /anonymizer from within Claude Code, or describe what you want to anonymize and let it auto-trigger.
make install-dev # Install with dev dependencies
make test # Run tests
make coverage # Run with coverage report
make format-check # Lint + format check (read-only)
anonymizer --help # CLI usage
make install-pre-commit # Install pre-commit hooks

- Python 3.11+
- NeMo Data Designer (installed as dependency)
- NVIDIA API key for default model providers (GLiNER-PII + text LLM), or custom model endpoints
NeMo Anonymizer collects anonymous run-level telemetry to help prioritize product improvements. One event is sent per Anonymizer.run() / Anonymizer.preview() call, containing only technical metadata: the replacement strategy in use, models used, model hosts (e.g. nvidia-build, openrouter, other), input-record counts, run duration, and failure attribution by pipeline step. No user data, record contents, prompts, or model outputs are collected. See the Telemetry and Privacy docs for the full field list.
You may opt out of telemetry at any time:
- For one CLI invocation: pass `--no-emit-telemetry`

  uv run anonymizer run --source data.csv --text-column text --replace redact --no-emit-telemetry

- In the SDK: set `emit_telemetry=False` on `AnonymizerConfig`

  config = AnonymizerConfig(replace=Redact(), emit_telemetry=False)

- For the current shell: set the environment variable

  export NEMO_TELEMETRY_ENABLED=false
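
For notebooks or scripts where exporting a shell variable is inconvenient, the same variable can be set from Python before Anonymizer is used. This is a sketch under an assumption: that the library reads `NEMO_TELEMETRY_ENABLED` from the process environment when it runs, just as it would after `export`. The helper name `disable_nemo_telemetry` is hypothetical.

```python
import os


def disable_nemo_telemetry() -> None:
    """Set the opt-out variable for this process and any child processes."""
    os.environ["NEMO_TELEMETRY_ENABLED"] = "false"


# Call this before importing or running Anonymizer.
# Assumption: the variable is read from the environment at run time.
disable_nemo_telemetry()
print(os.environ["NEMO_TELEMETRY_ENABLED"])  # false
```

If in doubt, the `emit_telemetry=False` option on `AnonymizerConfig` is the more explicit route, since it does not depend on when the environment is read.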
Aggregate usage data (such as which models are most popular) will be shared back with the community. It is not used to track any individual user behavior.
Use of third-party endpoints, including NVIDIA Build: Anonymizer can be configured to use various inference endpoints, including build.nvidia.com, OpenRouter, or local model servers. If you choose to use a third-party endpoint, that endpoint's own terms of service and privacy practices apply independently of this library. Any opt-out you exercise within Anonymizer does not extend to data collection by your chosen endpoint.
Apache License 2.0 — see LICENSE for details.