Interactive Flask dashboard for semantic analysis and comparison of legal document corpora. Features advanced search, relevance scoring (cosine similarity + cross-encoder), NLP analytics with visualizations, and multi-format export. Built with Chart.js, Tailwind CSS, and HTMX.

Corpora Comparison App

A Flask-based web dashboard for analyzing and comparing two document corpora. The demo compares the Labour Program Collection with the Wage Earner Protection Program Collection.

Features

Core Functionality

  • Interactive Overview: Summary statistics, category distribution, and top keywords
  • Advanced Search & Filter: Multi-dimensional filtering with boolean/regex search
  • Analytics Dashboard: Interactive charts and visualizations with Chart.js
  • Corpus Analysis: Detailed NLP analysis including embeddings, outliers, and statistics
  • Data Export: CSV and Excel export functionality

Technical Features

  • Normalized quality scoring (0-1 scale)
  • Real-time search with HTMX
  • Responsive design with Tailwind CSS
  • SQLite database backend
  • Advanced analytics with pandas and numpy

Screenshots

Main Dashboard

Overview page showing summary statistics, category distribution, and top keywords.

Search and Filter

Advanced search interface with multi-dimensional filtering and boolean/regex search capabilities.

Analytics Dashboard

Interactive charts and visualizations powered by Chart.js.

Corpus Analysis

Detailed NLP analysis including embeddings, outliers, and statistical insights.

Quick Start

Local Development

  1. Clone the repository

    git clone https://github.com/pierreolivierbonin/Corpora-Comparison-App.git
    cd Corpora-Comparison-App
  2. Create virtual environment

    python -m venv venv
    
    # Windows
    venv\Scripts\activate
    
    # Linux/Mac
    source venv/bin/activate
  3. Install dependencies

    Option A: Using uv (recommended - much faster)

    # Install uv if you haven't already
    pip install uv
    
    # Install dependencies with uv
    uv pip install -r requirements.txt

    Option B: Using pip

    pip install -r requirements.txt
  4. Run the dashboard

    python app.py
  5. Open in browser

    http://localhost:8080
    

Project Structure

Corpora-Comparison-App/
├── app.py                          # Main Flask application
├── templates/                      # HTML templates
│   ├── base.html                  # Base template
│   ├── dashboard.html             # Main dashboard
│   ├── search_results.html        # Search results
│   ├── corpus_analysis.html       # Corpus analysis
│   └── error.html                 # Error page
├── overlap_results/               # Database files
│   ├── labour_vs_wage_earner_comparison.db
│   └── annotations.db
├── corpus_analysis/               # Corpus analysis data
│   ├── labour/                    # Labour Code analysis
│   └── wage_earner/               # Wage Earner analysis
├── requirements.txt               # Python dependencies
└── README.md                      # This file

Data Files

Note: The data files required to run this dashboard are not included in the Git repository due to their large size. You will need to obtain them separately.

Required Files

The application requires the following directory structure:

overlap_results/
├── labour_vs_wage_earner_comparison.db   # Main comparison database
└── annotations.db                         # User annotations and saved filters

corpus_analysis/
├── labour/                                # Labour Code corpus analysis
│   ├── *.png                             # Visualization images
│   ├── analysis_report.txt
│   ├── metadata_summary.csv
│   ├── outliers.csv
│   └── nlp_analysis/
└── wage_earner/                          # Wage Earner corpus analysis
    ├── *.png
    ├── analysis_report.txt
    ├── metadata_summary.csv
    ├── outliers.csv
    ├── nlp_analysis/
    └── source_data/

How to Obtain Data Files

Contact the project maintainer to obtain the required data files. Once you have them:

  1. Create the directories if they don't exist:

    mkdir -p overlap_results corpus_analysis
  2. Place the database files in overlap_results/

  3. Place the corpus analysis folders in corpus_analysis/
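Before starting the app, it can help to confirm that the files are where the dashboard expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_data_files` is hypothetical, and the path list is copied from the "Required Files" listing above (only the named files are checked, not the `*.png` images or sub-folders).

```python
from pathlib import Path

# Paths copied from the "Required Files" listing; adjust if your layout differs.
REQUIRED_PATHS = [
    "overlap_results/labour_vs_wage_earner_comparison.db",
    "overlap_results/annotations.db",
    "corpus_analysis/labour/analysis_report.txt",
    "corpus_analysis/labour/metadata_summary.csv",
    "corpus_analysis/labour/outliers.csv",
    "corpus_analysis/wage_earner/analysis_report.txt",
    "corpus_analysis/wage_earner/metadata_summary.csv",
    "corpus_analysis/wage_earner/outliers.csv",
]


def missing_data_files(root="."):
    """Return the required paths (relative to root) that do not exist."""
    root = Path(root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("[OK] All required data files found")
```

Run it from the repository root so the relative paths resolve correctly.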

Alternative: Using Your Own Data

If you want to use this dashboard with your own corpus comparison data, ensure your SQLite database follows the expected schema. See the database queries in app.py for the required table structure and columns.
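To compare your own database against the queries in app.py, you can list its tables and columns. The sketch below uses only the standard `sqlite3` module; the function name `describe_schema` is hypothetical, and it makes no claim about which tables the app actually requires.

```python
import sqlite3


def describe_schema(conn):
    """Return {table_name: [column names]} for each table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # Double any embedded quotes so the table name is a safe identifier.
        quoted = table.replace('"', '""')
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema
```

For example, open your database with `sqlite3.connect("path/to/your.db")`, pass the connection to `describe_schema`, and check the result against the table and column names used in the SQL statements in app.py.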

Relevance Score Explained

The dashboard uses a Relevance Score (0.0 to 1.0) combining:

  • 60% Semantic Similarity: Calculated from cosine distance between embeddings (1 - cosine_distance)
  • 40% Cross-Encoder Rerank Score: Deep learning model measuring semantic relevance

Scores of 0.7 or higher indicate high-quality matches with strong semantic similarity.
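The weighting above can be sketched as a small function. This is an illustration of the stated 60/40 formula, not the code from app.py: the function name is hypothetical, and it assumes the rerank score is already on a 0-1 scale and that the result is clamped to the 0.0-1.0 range.

```python
def relevance_score(cosine_distance, rerank_score):
    """Combine semantic similarity (60%) and rerank score (40%)."""
    semantic_similarity = 1.0 - cosine_distance
    score = 0.6 * semantic_similarity + 0.4 * rerank_score
    # Assumption: clamp so the result always stays within 0.0 to 1.0.
    return min(1.0, max(0.0, score))
```

For a cosine distance of 0.2 and a rerank score of 0.5, this gives 0.6 × 0.8 + 0.4 × 0.5 = 0.68, just below the 0.7 threshold.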

Match Types

Results are also classified by match type based on the rerank score:

  • Strong Match (rerank ≥ 0.15): High confidence matches
  • Related (0.05 ≤ rerank < 0.15): Moderate relevance
  • Weak (rerank < 0.05): Low relevance
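The thresholds above translate directly into a classifier. The function name `match_type` is hypothetical; only the cut-off values and labels come from the list.

```python
def match_type(rerank):
    """Classify a result by its rerank score using the documented thresholds."""
    if rerank >= 0.15:
        return "Strong Match"
    if rerank >= 0.05:
        return "Related"
    return "Weak"
```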

Technology Stack

  • Backend: Flask 3.0.0
  • Data Processing: pandas 2.2.3, numpy 1.26.4
  • Visualization: Chart.js 4.4.0
  • Frontend: Tailwind CSS, HTMX
  • Database: SQLite
  • Export: openpyxl (Excel), CSV

Configuration

Environment Variables (Optional)

SECRET_KEY=your-secret-key-here
FLASK_ENV=production
FLASK_DEBUG=False
DB_PATH=./overlap_results/labour_vs_wage_earner_comparison.db
ANNOTATIONS_DB=./overlap_results/annotations.db
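One way an application might read these variables is shown below. This is a sketch under assumptions, not the logic in app.py: the `load_config` helper and the fallback secret key are hypothetical, and the default paths simply repeat the values listed above.

```python
import os


def load_config(environ=None):
    """Read the optional settings, falling back to defaults when unset."""
    env = os.environ if environ is None else environ
    debug_flag = env.get("FLASK_DEBUG", "False").strip().lower()
    return {
        # Hypothetical fallback; always set a real SECRET_KEY in production.
        "SECRET_KEY": env.get("SECRET_KEY", "dev-only-change-me"),
        "DEBUG": debug_flag in {"1", "true", "yes"},
        "DB_PATH": env.get(
            "DB_PATH", "./overlap_results/labour_vs_wage_earner_comparison.db"
        ),
        "ANNOTATIONS_DB": env.get(
            "ANNOTATIONS_DB", "./overlap_results/annotations.db"
        ),
    }
```

Passing a plain dictionary as `environ` makes the function easy to check without touching the real environment.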

Customization

Edit app.py:

# Color scheme
COLORS = {
    "primary": "#6B9BD1",
    "secondary": "#B47E9A",
    ...
}

# Corpus names
REFERENCE_CORPUS = "Labour Code Collection"
TARGET_CORPUS = "Wage Earner Collection"

API Endpoints

The dashboard exposes the following HTTP endpoints:

  • GET / - Main dashboard
  • POST /search - Search comparisons
  • GET /api/analytics - Analytics data
  • GET /api/visualizations - Visualization data
  • GET /api/corpora - Available corpora
  • GET /corpus/<name> - Corpus analysis page
  • POST /export/csv - Export to CSV
  • POST /export/excel - Export to Excel
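The GET endpoints under /api/ can be queried from a script while the dashboard is running. The sketch below uses only the standard library and assumes that these endpoints return JSON and that the app listens on http://localhost:8080 as in Quick Start; the helper names are hypothetical, and the POST endpoints are left out because their expected form fields are not documented here.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # address from the Quick Start section


def endpoint_url(path, base_url=BASE_URL):
    """Build the full URL for an endpoint path such as '/api/analytics'."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def get_json(path, base_url=BASE_URL, timeout=10):
    """GET an endpoint and decode the body, assuming it is JSON."""
    with urlopen(endpoint_url(path, base_url), timeout=timeout) as response:
        return json.load(response)


# Example (requires the dashboard to be running):
# corpora = get_json("/api/corpora")
```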

Performance

Optimization Tips

  • Limit search results for faster response
  • Use quality score filter to reduce data
  • Consider pagination for large result sets
  • Caching is enabled for expensive operations
  • Data loading is optimized with pandas
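The first three tips (limiting results, filtering by score, and paginating) can be combined in a single query. The sketch below is illustrative only: the table name `comparisons` and its columns `id` and `relevance_score` are hypothetical and will not match the real schema, which is defined by the queries in app.py.

```python
import sqlite3


def fetch_page(conn, page, page_size=50, min_score=0.0):
    """Return one page of rows at or above min_score, best matches first."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    offset = (page - 1) * page_size
    # Hypothetical table and column names; parameters are bound, not interpolated.
    query = (
        "SELECT id, relevance_score FROM comparisons "
        "WHERE relevance_score >= ? "
        "ORDER BY relevance_score DESC, id "
        "LIMIT ? OFFSET ?"
    )
    return conn.execute(query, (min_score, page_size, offset)).fetchall()
```

Sorting on a second, unique column (`id` here) keeps page boundaries stable when several rows share the same score.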

Troubleshooting

Common Issues

Database not found

  • Ensure you have obtained the required data files (see Data Files section)
  • Verify overlap_results/ directory exists with the database files
  • Check file paths are correct in your configuration

Import errors

  • Install all dependencies: pip install -r requirements.txt
  • Check Python version (3.10+ recommended)

Charts not displaying

  • Check browser console for JavaScript errors
  • Verify Chart.js CDN is accessible
  • Clear browser cache

Development

Adding New Features

  1. Update app.py with new routes/logic
  2. Create/modify templates in templates/
  3. Update styles if needed (using Tailwind classes)
  4. Test locally before deployment

Running Tests

# Test imports
python -c "from app import app; print('[OK]')"

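# Test database connection (read-only mode fails if the file is missing
# instead of silently creating an empty database)
python -c "import sqlite3; conn = sqlite3.connect('file:overlap_results/labour_vs_wage_earner_comparison.db?mode=ro', uri=True); conn.execute('SELECT 1'); print('[OK] Database connected')"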

Contributing

This is a research project. For questions or issues:

  1. Check existing documentation
  2. Check error logs for specific issues

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Attribution Requirements

When using this software, you must comply with the Apache License 2.0 attribution requirements:

  1. Include Copyright Notice: You must include a copy of the Apache 2.0 license with any distribution of this software.

  2. State Modifications: If you modify this software, you must clearly indicate what changes were made.

  3. Retain Attribution: All original copyright notices, patent notices, trademark notices, and attribution notices from the source code must be retained.

  4. Include NOTICE File: You must include a readable copy of the NOTICE file in any derivative works.

Example attribution in documentation or about page:

This software includes code from Corpora Comparison App
Copyright 2024-2025 Pierre-Olivier Bonin
Licensed under the Apache License, Version 2.0

For the complete license terms, see the LICENSE file.

Acknowledgments

  • Flask framework
  • Chart.js for visualizations
  • Tailwind CSS for styling
  • HTMX for dynamic interactions

Version

  • Version: 1.0.0
  • Last Updated: 2025-10-09
  • Python: 3.10+
  • Flask: 3.0.0
