jiox/paperless-deduplicate-correspondents

Paperless Correspondent Deduplication Tool

A Python script that identifies and merges duplicate correspondents in your Paperless installation by clustering similar-sounding names using fuzzy string matching.

Overview

When OCR processing in Paperless doesn't work perfectly, the same correspondent might be added multiple times with various spelling mistakes. For example:

  • Deutsche Rentenversicherung
  • Deutsche Rente Versicherung
  • Deutsche Renteversicherung

This tool helps you identify these duplicates and merge them into a single canonical correspondent, updating all associated documents automatically.

Features

  • API-based: Uses the Paperless REST API (no direct database access required)
  • Fuzzy matching: Clusters similar correspondent names by their difflib.SequenceMatcher similarity score
  • Interactive CLI: User-friendly interface to review and confirm merges
  • Safe operations: Dry-run mode to preview changes before applying them
  • Automatic updates: Updates all document references when merging correspondents
  • Pagination support: Handles large numbers of correspondents and documents

Requirements

  • Python 3.7 or higher
  • Access to a Paperless instance with REST API enabled
  • API token from your Paperless instance

Installation

  1. Clone or download this repository

  2. Install dependencies:

pip install -r requirements.txt

Configuration

Set the following environment variables:

export PAPERLESS_URL=http://localhost:8000
export PAPERLESS_TOKEN=your_api_token_here
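
As a rough illustration of how a script might consume these two variables, here is a minimal sketch. The function name load_config is hypothetical and not taken from the script. The "Authorization: Token ..." header format is the one used by the Paperless REST API's token authentication.

```python
import os


def load_config(env=None):
    """Return (base_url, headers) built from PAPERLESS_URL and PAPERLESS_TOKEN.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    url = env.get("PAPERLESS_URL", "").rstrip("/")  # drop a trailing slash, if any
    token = env.get("PAPERLESS_TOKEN", "")
    if not url or not token:
        raise SystemExit("Set PAPERLESS_URL and PAPERLESS_TOKEN first.")
    headers = {"Authorization": f"Token {token}"}
    return url, headers
```

The returned headers dictionary can then be passed to whichever HTTP client the script uses.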

Getting Your API Token

  1. Log into your Paperless web interface
  2. Go to Settings → API Tokens
  3. Create a new API token
  4. Copy the token and set it as PAPERLESS_TOKEN

Usage

Basic Usage

python dedupe_correspondents.py

The script will:

  1. Connect to your Paperless API
  2. Fetch all correspondents
  3. Cluster similar names using fuzzy matching
  4. Display each cluster and prompt for your input
  5. Merge duplicates by updating document references and deleting duplicate correspondents
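
Step 2 has to cope with paginated responses. The sketch below shows one way to collect every item, assuming the API returns pages shaped like {"results": [...], "next": <url or null>}. The helper fetch_all and its get_json parameter are hypothetical names used for illustration, and the script's own implementation may differ.

```python
def fetch_all(get_json, first_url):
    """Follow 'next' links page by page and collect every item from 'results'.

    `get_json` is any callable that takes a URL and returns the decoded JSON
    body as a dict (for example a thin wrapper around an HTTP client).
    """
    items = []
    url = first_url
    while url:
        page = get_json(url)
        items.extend(page["results"])
        url = page.get("next")  # None (or missing) on the last page ends the loop
    return items
```

Passing the HTTP call in as a function keeps the pagination logic independent of the HTTP library and lets it be exercised without a running Paperless instance.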

Dry-Run Mode

To preview changes without making any modifications:

python dedupe_correspondents.py --dry-run

or

python dedupe_correspondents.py -n
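
For readers curious how the two flag spellings can map to a single option, here is a minimal argparse sketch. The function name parse_args and the help text are assumptions; only the flags --dry-run and -n come from this README.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; argv=None means 'use sys.argv[1:]'."""
    parser = argparse.ArgumentParser(
        description="Deduplicate Paperless correspondents"
    )
    parser.add_argument(
        "-n", "--dry-run",
        action="store_true",
        help="preview changes without modifying anything",
    )
    return parser.parse_args(argv)
```

A script built this way would check args.dry_run before every write and print the intended change instead of sending it.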

How It Works

Clustering Algorithm

The script uses Python's difflib.SequenceMatcher to calculate similarity scores between correspondent names. Names with a similarity score above 0.85 (default) are grouped into clusters.

The clustering uses a graph-based approach: if correspondent A is similar to B, and B is similar to C, then A, B, and C are grouped together.
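
The idea can be sketched in a few lines. This is an illustrative version, not the script's actual code: it compares every pair of names with difflib.SequenceMatcher and joins similar names with a small union-find structure, which produces the transitive grouping described above. The function names similarity and cluster_names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_names(names, threshold=0.85):
    """Group names whose pairwise similarity exceeds `threshold`, transitively.

    Returns only the groups that contain two or more names.
    """
    parent = list(range(len(names)))  # union-find forest, one node per name

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) > threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

Comparing every pair is quadratic in the number of correspondents, which is usually acceptable for a few hundred or a few thousand names.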

Interactive Interface

For each cluster with 2 or more correspondents, you'll see:

======================================================================
Cluster 1 of 3
======================================================================

Similar correspondents found:
----------------------------------------------------------------------
  1. [42] Deutsche Rentenversicherung (15 documents)
  2. [67] Deutsche Rente Versicherung (3 documents)
  3. [89] Deutsche Renteversicherung (8 documents)
----------------------------------------------------------------------

Options:
  - Enter a number (1-N) to use that correspondent as canonical
  - Enter 'n' followed by a new name to create a new canonical name
  - Enter 's' to skip this cluster
  - Enter 'q' to quit

Your choice:
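
The four options can be handled by a small parser like the one below. This is a sketch under assumptions: the function name parse_choice, the returned tuples, and the rule that the new name is separated from 'n' by a space are all illustrative and may not match the script.

```python
def parse_choice(raw, cluster_size):
    """Translate the user's input into an (action, value) pair.

    Returns one of:
      ("select", index)  - zero-based index of the chosen correspondent
      ("rename", name)   - new canonical name entered after 'n'
      ("skip", None), ("quit", None), or ("invalid", None)
    """
    text = raw.strip()
    if text.lower() == "q":
        return ("quit", None)
    if text.lower() == "s":
        return ("skip", None)
    if text.isdecimal() and 1 <= int(text) <= cluster_size:
        return ("select", int(text) - 1)
    head, _, rest = text.partition(" ")
    if head.lower() == "n" and rest.strip():
        return ("rename", rest.strip())
    return ("invalid", None)
```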

Merging Process

When you choose to merge a cluster:

  1. Select canonical correspondent: Choose which existing correspondent to keep, or enter a new name
  2. Update documents: All documents referencing duplicate correspondents are updated to point to the canonical one
  3. Update name: If you provided a new name, the canonical correspondent's name is updated
  4. Delete duplicates: The duplicate correspondent entries are removed
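
The order of these steps matters: documents are reassigned before any correspondent is deleted. The sketch below builds the list of API requests for one merge without sending them. The function name plan_merge and the (method, path, payload) tuples are hypothetical, the paths follow the Paperless REST API's /api/documents/ and /api/correspondents/ resources, and the script itself may issue its requests differently.

```python
def plan_merge(canonical_id, duplicates, new_name=None):
    """Return the ordered requests needed to merge one cluster.

    `duplicates` maps each duplicate correspondent id to the list of
    document ids that currently reference it.
    """
    ops = []
    # 1. Point every affected document at the canonical correspondent.
    for doc_ids in duplicates.values():
        for doc_id in doc_ids:
            ops.append(
                ("PATCH", f"/api/documents/{doc_id}/", {"correspondent": canonical_id})
            )
    # 2. Optionally rename the canonical correspondent.
    if new_name is not None:
        ops.append(
            ("PATCH", f"/api/correspondents/{canonical_id}/", {"name": new_name})
        )
    # 3. Delete the duplicates only after their documents have been moved.
    for dup_id in duplicates:
        ops.append(("DELETE", f"/api/correspondents/{dup_id}/", None))
    return ops
```

Building the plan first also gives a natural dry-run: print the list instead of executing it.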

Example Session

$ python dedupe_correspondents.py

Testing API connection...
✓ Connected successfully
Fetching correspondents from Paperless API...
Found 127 correspondents.
Clustering similar names...
Found 5 cluster(s) with potential duplicates.

======================================================================
Cluster 1 of 5
======================================================================

Similar correspondents found:
----------------------------------------------------------------------
  1. [42] Deutsche Rentenversicherung (15 documents)
  2. [67] Deutsche Rente Versicherung (3 documents)
----------------------------------------------------------------------

Options:
  - Enter a number (1-N) to use that correspondent as canonical
  - Enter 'n' followed by a new name to create a new canonical name
  - Enter 's' to skip this cluster
  - Enter 'q' to quit

Your choice: 1

Merging cluster...
  Canonical: [42] Deutsche Rentenversicherung
  Duplicates to merge: 1
  Updating 3 document(s) from [67] Deutsche Rente Versicherung
  Deleting duplicate: [67] Deutsche Rente Versicherung
  ✓ Successfully merged cluster (3 documents updated)

...

Customization

Similarity Threshold

To adjust the similarity threshold (default: 0.85), edit the CorrespondentClusterer initialization in the script:

clusterer = CorrespondentClusterer(similarity_threshold=0.90)  # More strict
clusterer = CorrespondentClusterer(similarity_threshold=0.80)  # More lenient

Lower values (e.g., 0.80) group more names together, catching more potential duplicates at the cost of more false positives. Higher values (e.g., 0.90) are more conservative and may miss some duplicates.
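
To get a feel for the scale, it helps to compute a score by hand. In the sketch below the names "Telekom" and "Telecom" are made up for illustration. They share six of seven characters in matching positions, so SequenceMatcher gives 2 × 6 / 14 ≈ 0.857: grouped at thresholds of 0.80 and 0.85, but not at 0.90. The helper is_duplicate is a hypothetical name.

```python
from difflib import SequenceMatcher


def is_duplicate(a, b, threshold):
    """True when the similarity score of a and b is above the threshold."""
    return SequenceMatcher(None, a, b).ratio() > threshold


score = SequenceMatcher(None, "Telekom", "Telecom").ratio()
print(f"{score:.3f}")                             # 0.857
print(is_duplicate("Telekom", "Telecom", 0.80))   # True
print(is_duplicate("Telekom", "Telecom", 0.90))   # False
```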

Troubleshooting

Connection Errors

If you get connection errors:

  • Verify PAPERLESS_URL is correct (include http:// or https://)
  • Check that your Paperless instance is running and accessible
  • Ensure the API is enabled in your Paperless configuration

Authentication Errors

If authentication fails:

  • Verify your API token is correct
  • Check that the token hasn't expired
  • Ensure the token has the necessary permissions

API Rate Limiting

The script includes automatic retry logic for rate limiting. If you encounter persistent rate limit errors, you may need to:

  • Reduce the number of operations per second
  • Contact your Paperless administrator about rate limits
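
For reference, retry logic of this kind typically looks like the sketch below. It is not the script's code. It assumes the server signals rate limiting with HTTP status 429, and the function name call_with_retry, its parameters, and the exponential backoff schedule are all illustrative.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429.

    `send` is a callable returning (status_code, body). Between attempts the
    function waits base_delay, then 2x, 4x, ... seconds. `max_attempts` must
    be at least 1. The last response is returned if every attempt is limited.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return status, body
```

Raising base_delay is one way to reduce the number of operations per second when a server limits aggressively.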

Safety Features

  • Dry-run mode: Test the script without making changes
  • Transaction-like behavior: Each cluster merge is atomic (all or nothing)
  • Error handling: Failed operations are reported without crashing
  • Confirmation prompts: Review each cluster before merging

License

This script is provided as-is for use with Paperless installations.

Contributing

Feel free to submit issues or pull requests if you find bugs or have suggestions for improvements.
