A Python script that identifies and merges duplicate correspondents in your Paperless installation by clustering similarly spelled names using fuzzy string matching.
When OCR processing in Paperless doesn't work perfectly, the same correspondent might be added multiple times with various spelling mistakes. For example:
- Deutsche Rentenversicherung
- Deutsche Rente Versicherung
- Deutsche Renteversicherung
This tool helps you identify these duplicates and merge them into a single canonical correspondent, updating all associated documents automatically.
- API-based: Uses the Paperless REST API (no direct database access required)
- Fuzzy matching: Clusters similar correspondent names by string similarity score (Python's difflib.SequenceMatcher)
- Interactive CLI: User-friendly interface to review and confirm merges
- Safe operations: Dry-run mode to preview changes before applying them
- Automatic updates: Updates all document references when merging correspondents
- Pagination support: Handles large numbers of correspondents and documents
- Python 3.7 or higher
- Access to a Paperless instance with REST API enabled
- API token from your Paperless instance
- Clone or download this repository
- Install dependencies:
pip install -r requirements.txt
Set the following environment variables:
export PAPERLESS_URL=http://localhost:8000
export PAPERLESS_TOKEN=your_api_token_here
- Log into your Paperless web interface
- Go to Settings → API Tokens
- Create a new API token
- Copy the token and set it as PAPERLESS_TOKEN
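As a rough illustration of how a script might pick up these two variables, here is a minimal sketch. The function name `load_config` is hypothetical, and the `Authorization: Token ...` header form is an assumption based on the token authentication Paperless-ngx documents; the actual script may read its configuration differently.

```python
import os


def load_config(env=os.environ):
    """Read connection settings from environment variables (hypothetical helper)."""
    url = env.get("PAPERLESS_URL", "").rstrip("/")
    token = env.get("PAPERLESS_TOKEN", "")
    # The URL must carry its scheme, otherwise HTTP clients reject it.
    if not url.startswith(("http://", "https://")):
        raise ValueError("PAPERLESS_URL must include http:// or https://")
    if not token:
        raise ValueError("PAPERLESS_TOKEN is not set")
    # Assumption: token authentication via the Authorization header.
    headers = {
        "Authorization": f"Token {token}",
        "Accept": "application/json",
    }
    return url, headers
```

Failing early on a missing scheme or an empty token gives a clearer message than a connection error later on.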
python dedupe_correspondents.py
The script will:
- Connect to your Paperless API
- Fetch all correspondents
- Cluster similar names using fuzzy matching
- Display each cluster and prompt for your input
- Merge duplicates by updating document references and deleting duplicate correspondents
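The "fetch all correspondents" step has to cope with paginated responses. The sketch below shows one way to do that, assuming the API returns pages shaped like `{"results": [...], "next": <url or null>}`. That page shape, the helper names `fetch_json` and `fetch_all`, and the `/api/correspondents/` path in the usage note are assumptions, not taken from the script itself.

```python
import json
import urllib.request


def fetch_json(url, headers):
    """GET a URL and decode the JSON body (performs a real network call)."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_all(first_url, headers, get=fetch_json):
    """Follow the 'next' links until every page has been collected."""
    items = []
    url = first_url
    while url:
        page = get(url, headers)
        items.extend(page["results"])
        # Assumption: the last page reports "next" as null (None).
        url = page.get("next")
    return items
```

A call such as `fetch_all(base_url + "/api/correspondents/", headers)` would then return every correspondent regardless of page size. The `get` parameter exists only so the loop can be exercised without a running server.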
To preview changes without making any modifications:
python dedupe_correspondents.py --dry-run
or
python dedupe_correspondents.py -n
The script uses Python's difflib.SequenceMatcher to calculate similarity scores between correspondent names. Names with a similarity score above 0.85 (the default threshold) are grouped into clusters.
The clustering uses a graph-based approach: if correspondent A is similar to B, and B is similar to C, then A, B, and C are grouped together.
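The transitive grouping described above amounts to finding connected components in a graph whose edges join similar names. A minimal sketch using `difflib.SequenceMatcher` and a union-find structure follows. The function names are hypothetical, and details such as comparing the raw strings without normalization and treating the threshold as inclusive are assumptions; the script's `CorrespondentClusterer` may differ.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity score in [0, 1] between two names."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_names(names, threshold=0.85):
    """Group names into connected components of the 'similar to' graph."""
    parent = list(range(len(names)))

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose score reaches the threshold
    # (inclusive comparison is an assumption).
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    # Only groups with two or more members are potential duplicates.
    return [group for group in groups.values() if len(group) >= 2]
```

Called on the three "Deutsche Rentenversicherung" variants plus an unrelated name, this returns a single cluster containing the three variants. The pairwise comparison is quadratic in the number of correspondents, which is usually acceptable for a few hundred names.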
For each cluster with 2 or more correspondents, you'll see:
======================================================================
Cluster 1 of 3
======================================================================
Similar correspondents found:
----------------------------------------------------------------------
1. [42] Deutsche Rentenversicherung (15 documents)
2. [67] Deutsche Rente Versicherung (3 documents)
3. [89] Deutsche Renteversicherung (8 documents)
----------------------------------------------------------------------
Options:
- Enter a number (1-N) to use that correspondent as canonical
- Enter 'n' followed by a new name to create a new canonical name
- Enter 's' to skip this cluster
- Enter 'q' to quit
Your choice:
When you choose to merge a cluster:
- Select canonical correspondent: Choose which existing correspondent to keep, or enter a new name
- Update documents: All documents referencing duplicate correspondents are updated to point to the canonical one
- Update name: If you provided a new name, the canonical correspondent's name is updated
- Delete duplicates: The duplicate correspondent entries are removed
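The order of these steps matters: documents are re-pointed before any correspondent is deleted. The sketch below builds the list of HTTP requests such a merge might issue, without sending them. The function name `plan_merge` is hypothetical, and the endpoint paths and payload fields are assumptions modeled on the Paperless-ngx REST layout, not taken from the script.

```python
def plan_merge(canonical_id, duplicate_ids, doc_ids_by_correspondent, new_name=None):
    """Return the (method, path, body) requests for one cluster merge."""
    requests = []
    # 1. Point every document of each duplicate at the canonical correspondent.
    for dup_id in duplicate_ids:
        for doc_id in doc_ids_by_correspondent.get(dup_id, []):
            requests.append(
                ("PATCH", f"/api/documents/{doc_id}/", {"correspondent": canonical_id})
            )
    # 2. Rename the canonical correspondent if a new name was given.
    if new_name is not None:
        requests.append(
            ("PATCH", f"/api/correspondents/{canonical_id}/", {"name": new_name})
        )
    # 3. Delete the duplicates only after their documents have been moved.
    for dup_id in duplicate_ids:
        requests.append(("DELETE", f"/api/correspondents/{dup_id}/", None))
    return requests
```

Separating planning from execution also makes a dry run straightforward: the plan can be printed instead of sent.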
$ python dedupe_correspondents.py
Testing API connection...
✓ Connected successfully
Fetching correspondents from Paperless API...
Found 127 correspondents.
Clustering similar names...
Found 5 cluster(s) with potential duplicates.
======================================================================
Cluster 1 of 5
======================================================================
Similar correspondents found:
----------------------------------------------------------------------
1. [42] Deutsche Rentenversicherung (15 documents)
2. [67] Deutsche Rente Versicherung (3 documents)
----------------------------------------------------------------------
Options:
- Enter a number (1-N) to use that correspondent as canonical
- Enter 'n' followed by a new name to create a new canonical name
- Enter 's' to skip this cluster
- Enter 'q' to quit
Your choice: 1
Merging cluster...
Canonical: [42] Deutsche Rentenversicherung
Duplicates to merge: 1
Updating 3 document(s) from [67] Deutsche Rente Versicherung
Deleting duplicate: [67] Deutsche Rente Versicherung
✓ Successfully merged cluster (3 documents updated)
...
To adjust the similarity threshold (default: 0.85), edit the CorrespondentClusterer initialization in the script:
clusterer = CorrespondentClusterer(similarity_threshold=0.90) # More strict
clusterer = CorrespondentClusterer(similarity_threshold=0.80) # More lenient
Lower values (e.g., 0.80) group more names together and catch more potential duplicates, at the cost of more false matches; higher values (e.g., 0.90) are more conservative.
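To get a feel for what a threshold change does, it can help to score a single pair directly with `difflib.SequenceMatcher`. The pair below is a made-up example, and `is_match` is a hypothetical helper, not part of the script.

```python
from difflib import SequenceMatcher


def is_match(a, b, threshold):
    """True if the similarity score of a and b reaches the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


# "Amazon" and "Amazom" share 5 of 6 characters in order,
# so the score is 2 * 5 / (6 + 6), about 0.83.
score = SequenceMatcher(None, "Amazon", "Amazom").ratio()

print(is_match("Amazon", "Amazom", 0.80))  # True: grouped at the lenient setting
print(is_match("Amazon", "Amazom", 0.90))  # False: kept apart at the strict setting
```

Short names are more sensitive to the threshold than long ones, because a single wrong character costs a larger share of the score.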
If you get connection errors:
- Verify PAPERLESS_URL is correct (include http:// or https://)
- Check that your Paperless instance is running and accessible
- Ensure the API is enabled in your Paperless configuration
If authentication fails:
- Verify your API token is correct
- Check that the token hasn't expired
- Ensure the token has the necessary permissions
The script includes automatic retry logic for rate limiting. If you encounter persistent rate limit errors, you may need to:
- Reduce the number of operations per second
- Contact your Paperless administrator about rate limits
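The script's own retry logic is not shown here. As a generic sketch of the idea, the function below retries a request with exponential backoff while the server answers with HTTP 429. The name `call_with_retry`, its parameters, and the delay schedule are all assumptions.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or attempts run out.

    send must return a (status, body) tuple; max_attempts must be at least 1.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the final attempt; hand the response back.
    return status, body
```

Raising `base_delay` is one way to reduce the number of operations per second if rate limit errors persist.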
- Dry-run mode: Test the script without making changes
- Transaction-like behavior: Each cluster merge is atomic (all or nothing)
- Error handling: Failed operations are reported without crashing
- Confirmation prompts: Review each cluster before merging
This script is provided as-is for use with Paperless installations.
Feel free to submit issues or pull requests if you find bugs or have suggestions for improvements.