Paperless Correspondent Deduplication Tool

A Python script that identifies and merges duplicate correspondents in your Paperless installation by clustering similarly spelled names using fuzzy string matching.

Overview

When OCR processing in Paperless doesn't work perfectly, the same correspondent might be added multiple times with various spelling mistakes. For example:

Deutsche Rentenversicherung
Deutsche Rente Versicherung
Deutsche Renteversicherung

This tool helps you identify these duplicates and merge them into a single canonical correspondent, updating all associated documents automatically.

Features

API-based: Uses the Paperless REST API (no direct database access required)
Fuzzy matching: Clusters similar correspondent names using string similarity scores from Python's difflib.SequenceMatcher
Interactive CLI: User-friendly interface to review and confirm merges
Safe operations: Dry-run mode to preview changes before applying them
Automatic updates: Updates all document references when merging correspondents
Pagination support: Handles large numbers of correspondents and documents

Requirements

Python 3.7 or higher
Access to a Paperless instance with REST API enabled
API token from your Paperless instance

Installation

Clone or download this repository
Install dependencies:

pip install -r requirements.txt

Configuration

Set the following environment variables:

export PAPERLESS_URL=http://localhost:8000
export PAPERLESS_TOKEN=your_api_token_here

Getting Your API Token

Log into your Paperless web interface
Go to Settings → API Tokens
Create a new API token
Copy the token and set it as PAPERLESS_TOKEN
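As a rough sketch of how a script might consume these two variables, the snippet below reads them and builds request headers. The function name load_config and its validation rules are illustrative assumptions, not taken from dedupe_correspondents.py; the "Authorization: Token ..." header format is the one Paperless-ngx documents for token authentication.

```python
import os


def load_config(env=None):
    """Read connection settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    url = env.get("PAPERLESS_URL", "").rstrip("/")
    token = env.get("PAPERLESS_TOKEN", "")
    if not url.startswith(("http://", "https://")):
        raise ValueError("PAPERLESS_URL must start with http:// or https://")
    if not token:
        raise ValueError("PAPERLESS_TOKEN is not set")
    # Token authentication: the token travels in the Authorization header.
    headers = {"Authorization": f"Token {token}", "Accept": "application/json"}
    return url, headers
```

Failing early on a missing scheme or an empty token gives a clearer message than a connection error later on.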

Usage

Basic Usage

python dedupe_correspondents.py

The script will:

Connect to your Paperless API
Fetch all correspondents
Cluster similar names using fuzzy matching
Display each cluster and prompt for your input
Merge duplicates by updating document references and deleting duplicate correspondents
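Fetching all correspondents means following the paginated list responses. The sketch below assumes the usual response envelope with "results" and "next" keys; fetch_all and get_json are hypothetical names, and the actual script may page differently.

```python
def fetch_all(get_json, first_url):
    """Collect items from every page by following 'next' links.

    get_json is any callable that takes a URL and returns the decoded
    JSON body as a dict (an assumption made so the loop is easy to test).
    """
    items = []
    url = first_url
    while url:
        page = get_json(url)
        items.extend(page["results"])
        # The last page carries next = None, which ends the loop.
        url = page.get("next")
    return items
```

In a real run, get_json would wrap an authenticated HTTP GET against a URL such as PAPERLESS_URL followed by /api/correspondents/.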

Dry-Run Mode

To preview changes without making any modifications:

python dedupe_correspondents.py --dry-run

or

python dedupe_correspondents.py -n
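For illustration, both spellings of the flag can be declared with argparse as shown below. This is a minimal sketch; build_parser is a hypothetical name, and the script's real argument handling may differ.

```python
import argparse


def build_parser():
    """Build a parser that accepts --dry-run and its short form -n."""
    parser = argparse.ArgumentParser(
        description="Merge duplicate Paperless correspondents."
    )
    parser.add_argument(
        "-n",
        "--dry-run",
        action="store_true",
        help="show what would change without modifying anything",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse exposes --dry-run as the attribute dry_run.
    print("dry run" if args.dry_run else "live run")
```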

How It Works

Clustering Algorithm

The script uses Python's difflib.SequenceMatcher to calculate a similarity score between 0 and 1 for each pair of correspondent names. Names whose score is above the threshold (0.85 by default) are grouped into clusters.
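A minimal sketch of the scoring step is shown below. The helper name similarity and the case-folding before comparison are assumptions for illustration; the script may normalize names differently.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score in [0, 1]; 1.0 means the strings are identical."""
    # Case-folding is an assumption so that "ACME" and "Acme" compare equal.
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


if __name__ == "__main__":
    pairs = [
        ("Deutsche Rentenversicherung", "Deutsche Rente Versicherung"),
        ("Deutsche Rentenversicherung", "Deutsche Bank"),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

The first pair scores above the default threshold of 0.85, while the second falls well below it.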

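The clustering uses a graph-based approach: if correspondent A is similar to B, and B is similar to C, then A, B, and C are grouped together, even if A and C do not meet the threshold directly. Each cluster is therefore a connected component of the similarity graph.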
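The transitive grouping can be sketched with a small union-find structure, as below. This is one possible way to compute the groups, not necessarily how CorrespondentClusterer does it; cluster_names is a hypothetical function, and comparing every pair is quadratic in the number of names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def cluster_names(names, threshold=0.85):
    """Group names whose pairwise similarity chains exceed the threshold."""
    parent = list(range(len(names)))

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        score = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
        if score > threshold:
            parent[find(i)] = find(j)  # Join the two groups.

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    # Only groups with at least two members are potential duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Because groups are joined transitively, a long chain of near-matches can pull fairly different names into one cluster, which is one reason each cluster is reviewed before merging.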

Interactive Interface

For each cluster with 2 or more correspondents, you'll see:

======================================================================
Cluster 1 of 3
======================================================================

Similar correspondents found:
----------------------------------------------------------------------
  1. [42] Deutsche Rentenversicherung (15 documents)
  2. [67] Deutsche Rente Versicherung (3 documents)
  3. [89] Deutsche Renteversicherung (8 documents)
----------------------------------------------------------------------

Options:
  - Enter a number (1-N) to use that correspondent as canonical
  - Enter 'n' followed by a new name to create a new canonical name
  - Enter 's' to skip this cluster
  - Enter 'q' to quit

Your choice:
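The prompt options above could be parsed roughly as follows. The function parse_choice, its return values, and the exact input syntax for a new name ("n" followed by a space and the name) are assumptions for illustration.

```python
def parse_choice(raw, cluster_size):
    """Turn raw input into an (action, value) pair.

    Actions: "pick" (value is a 0-based index), "rename" (value is the
    new name), "skip", "quit", or "invalid".
    """
    text = raw.strip()
    if text.lower() == "s":
        return ("skip", None)
    if text.lower() == "q":
        return ("quit", None)
    if text.isdecimal():
        number = int(text)
        if 1 <= number <= cluster_size:
            return ("pick", number - 1)
        return ("invalid", None)
    parts = text.split(None, 1)
    if len(parts) == 2 and parts[0].lower() == "n":
        return ("rename", parts[1].strip())
    return ("invalid", None)
```

Returning "invalid" instead of raising lets the caller repeat the prompt for the same cluster.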

Merging Process

When you choose to merge a cluster:

Select canonical correspondent: Choose which existing correspondent to keep, or enter a new name
Update documents: All documents referencing duplicate correspondents are updated to point to the canonical one
Update name: If you provided a new name, the canonical correspondent's name is updated
Delete duplicates: The duplicate correspondent entries are removed
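The order of these steps matters: documents should be reassigned before any duplicate is deleted, otherwise they may lose their correspondent. The sketch below only builds the list of API calls for one merge, which is also roughly what a dry run could print. The function plan_merge and its input shape are hypothetical, and the endpoint paths are assumed to follow the usual Paperless REST layout.

```python
def plan_merge(canonical_id, duplicates, new_name=None):
    """Return the ordered API calls for one merge as (method, path, payload).

    duplicates maps each duplicate correspondent ID to the IDs of the
    documents that currently reference it.
    """
    calls = []
    # 1. Point every affected document at the canonical correspondent.
    for doc_ids in duplicates.values():
        for doc_id in doc_ids:
            calls.append(
                ("PATCH", f"/api/documents/{doc_id}/", {"correspondent": canonical_id})
            )
    # 2. Rename the canonical correspondent if a new name was given.
    if new_name is not None:
        calls.append(
            ("PATCH", f"/api/correspondents/{canonical_id}/", {"name": new_name})
        )
    # 3. Delete the duplicates only after all documents have moved.
    for dup_id in duplicates:
        calls.append(("DELETE", f"/api/correspondents/{dup_id}/", None))
    return calls
```

Separating planning from execution keeps the dry-run and live paths identical up to the point where requests are actually sent.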

Example Session

$ python dedupe_correspondents.py

Testing API connection...
✓ Connected successfully
Fetching correspondents from Paperless API...
Found 127 correspondents.
Clustering similar names...
Found 5 cluster(s) with potential duplicates.

======================================================================
Cluster 1 of 5
======================================================================

Similar correspondents found:
----------------------------------------------------------------------
  1. [42] Deutsche Rentenversicherung (15 documents)
  2. [67] Deutsche Rente Versicherung (3 documents)
----------------------------------------------------------------------

Options:
  - Enter a number (1-N) to use that correspondent as canonical
  - Enter 'n' followed by a new name to create a new canonical name
  - Enter 's' to skip this cluster
  - Enter 'q' to quit

Your choice: 1

Merging cluster...
  Canonical: [42] Deutsche Rentenversicherung
  Duplicates to merge: 1
  Updating 3 document(s) from [67] Deutsche Rente Versicherung
  Deleting duplicate: [67] Deutsche Rente Versicherung
  ✓ Successfully merged cluster (3 documents updated)

...

Customization

Similarity Threshold

To adjust the similarity threshold (default: 0.85), edit the CorrespondentClusterer initialization in the script:

clusterer = CorrespondentClusterer(similarity_threshold=0.90)  # More strict
clusterer = CorrespondentClusterer(similarity_threshold=0.80)  # More lenient

Lower values (e.g., 0.80) group names more aggressively: they catch more potential duplicates but also produce more false matches. Higher values (e.g., 0.90) are more conservative and may miss some duplicates.

Troubleshooting

Connection Errors

If you get connection errors:

Verify PAPERLESS_URL is correct (include http:// or https://)
Check that your Paperless instance is running and accessible
Ensure the API is enabled in your Paperless configuration

Authentication Errors

If authentication fails:

Verify your API token is correct
Check that the token hasn't expired
Ensure the token has the necessary permissions

API Rate Limiting

The script includes automatic retry logic for rate limiting. If you encounter persistent rate limit errors, you may need to:

Reduce the number of operations per second
Contact your Paperless administrator about rate limits
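One common way to implement this kind of retry is exponential backoff on HTTP 429 responses, sketched below. This is not the script's actual retry code; call_with_retry, its parameters, and the (status, body) return convention of send are assumptions for illustration.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429.

    send is any callable returning a (status_code, body) tuple. The wait
    doubles after each rate-limited attempt: 1s, 2s, 4s, ...
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the final attempt; let the caller decide.
    return status, body
```

Raising base_delay is a simple way to reduce the number of operations per second if rate limit errors persist.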

Safety Features

Dry-run mode: Test the script without making changes
Transaction-like behavior: Each cluster merge is atomic (all or nothing)
Error handling: Failed operations are reported without crashing
Confirmation prompts: Review each cluster before merging

License

This script is provided as-is for use with Paperless installations.

Contributing

Feel free to submit issues or pull requests if you find bugs or have suggestions for improvements.
