A comprehensive Flask-based web dashboard for analyzing and comparing two document corpora. The demo compares the Labour Program Collection with the Wage Earner Protection Program Collection.
- Interactive Overview: Summary statistics, category distribution, and top keywords
- Advanced Search & Filter: Multi-dimensional filtering with boolean/regex search
- Analytics Dashboard: Interactive charts and visualizations with Chart.js
- Corpus Analysis: Detailed NLP analysis including embeddings, outliers, and statistics
- Data Export: CSV and Excel export functionality
- Normalized quality scoring (0-1 scale)
- Real-time search with HTMX
- Responsive design with Tailwind CSS
- SQLite database backend
- Advanced analytics with pandas and numpy
Overview page showing summary statistics, category distribution, and top keywords
Advanced search interface with multi-dimensional filtering and boolean/regex search capabilities
Interactive charts and visualizations powered by Chart.js
Detailed NLP analysis including embeddings, outliers, and statistical insights
- Clone the repository
  git clone https://github.com/pierreolivierbonin/Corpora-Comparison-App.git
  cd Corpora-Comparison-App
- Create virtual environment
  python -m venv venv
  # Windows
  venv\Scripts\activate
  # Linux/Mac
  source venv/bin/activate
- Install dependencies
  Option A: Using uv (recommended - much faster)
  # Install uv if you haven't already
  pip install uv
  # Install dependencies with uv
  uv pip install -r requirements.txt
  Option B: Using pip
  pip install -r requirements.txt
- Run the dashboard
  python app.py
- Open in browser
  http://localhost:8080
Labour-Dashboard/
├── app.py # Main Flask application
├── templates/ # HTML templates
│ ├── base.html # Base template
│ ├── dashboard.html # Main dashboard
│ ├── search_results.html # Search results
│ ├── corpus_analysis.html # Corpus analysis
│ └── error.html # Error page
├── overlap_results/ # Database files
│ ├── labour_vs_wage_earner_comparison.db
│ └── annotations.db
├── corpus_analysis/ # Corpus analysis data
│ ├── labour/ # Labour Code analysis
│ └── wage_earner/ # Wage Earner analysis
├── requirements.txt # Python dependencies
└── README.md # This file
Note: The data files required to run this dashboard are not included in the Git repository due to their large size. You will need to obtain them separately.
The application requires the following directory structure:
overlap_results/
├── labour_vs_wage_earner_comparison.db # Main comparison database
└── annotations.db # User annotations and saved filters
corpus_analysis/
├── labour/ # Labour Code corpus analysis
│ ├── *.png # Visualization images
│ ├── analysis_report.txt
│ ├── metadata_summary.csv
│ ├── outliers.csv
│ └── nlp_analysis/
└── wage_earner/ # Wage Earner corpus analysis
  ├── *.png
  ├── analysis_report.txt
  ├── metadata_summary.csv
  ├── outliers.csv
  ├── nlp_analysis/
  └── source_data/
Contact the project maintainer to obtain the required data files. Once you have them:
- Create the directories if they don't exist:
  mkdir -p overlap_results corpus_analysis
- Place the database files in overlap_results/
- Place the corpus analysis folders in corpus_analysis/
If you want to use this dashboard with your own corpus comparison data, ensure your SQLite database follows the expected schema. See the database queries in app.py for the required table structure and columns.
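Before pointing the dashboard at your own database, it can help to list the tables and columns the file actually contains and compare them with the queries in app.py. The following is a minimal sketch using only the standard library; the helper name `describe_schema` is hypothetical and is not part of app.py.

```python
import sqlite3


def describe_schema(db_path: str) -> dict[str, list[str]]:
    """Return {table_name: [column names]} for every table in a SQLite file.

    Hypothetical helper for illustration; it is not defined in app.py.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists every object in the database; keep tables only.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema: dict[str, list[str]] = {}
        for table in tables:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            quoted = table.replace('"', '""')
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()
```

For example, calling `describe_schema("./overlap_results/labour_vs_wage_earner_comparison.db")` and printing the result gives a quick view of what your file provides, which you can then check against the table and column names used in app.py.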
The dashboard uses a Relevance Score (0.0 to 1.0) combining:
- 60% Semantic Similarity: Calculated from cosine distance between embeddings (1 - cosine_distance)
- 40% Cross-Encoder Rerank Score: Deep learning model measuring semantic relevance
Higher scores (≥0.7) indicate high-quality matches with strong semantic similarity.
Results are also classified by match type based on the rerank score:
- Strong Match (rerank ≥ 0.15): High confidence matches
- Related (0.05 ≤ rerank < 0.15): Moderate relevance
- Weak (rerank < 0.05): Low relevance
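The scoring rules above can be sketched in a few lines of Python. This is an illustration, not the code from app.py: it assumes the two components are combined as a plain weighted sum, that the rerank score is already on a 0-1 scale, and that the result is clamped to [0, 1]. The function names are hypothetical.

```python
def relevance_score(cosine_distance: float, rerank_score: float) -> float:
    """Combine the two signals with the 60/40 weights described above.

    Assumption: a simple weighted sum, clamped to the 0.0-1.0 range.
    """
    semantic_similarity = 1.0 - cosine_distance
    score = 0.6 * semantic_similarity + 0.4 * rerank_score
    return min(1.0, max(0.0, score))


def match_type(rerank_score: float) -> str:
    """Classify a result using the rerank thresholds listed above."""
    if rerank_score >= 0.15:
        return "Strong Match"
    if rerank_score >= 0.05:
        return "Related"
    return "Weak"


if __name__ == "__main__":
    # Illustrative values only, not taken from the real database.
    for distance, rerank in [(0.10, 0.80), (0.35, 0.10), (0.60, 0.01)]:
        print(distance, rerank, relevance_score(distance, rerank), match_type(rerank))
```

Note that the match type depends only on the rerank score, so a result can have a high Relevance Score through semantic similarity and still be labelled Related or Weak.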
- Backend: Flask 3.0.0
- Data Processing: pandas 2.2.3, numpy 1.26.4
- Visualization: Chart.js 4.4.0
- Frontend: Tailwind CSS, HTMX
- Database: SQLite
- Export: openpyxl (Excel), CSV
SECRET_KEY=your-secret-key-here
FLASK_ENV=production
FLASK_DEBUG=False
DB_PATH=./overlap_results/labour_vs_wage_earner_comparison.db
ANNOTATIONS_DB=./overlap_results/annotations.db
Edit app.py:
# Color scheme
COLORS = {
"primary": "#6B9BD1",
"secondary": "#B47E9A",
...
}
# Corpus names
REFERENCE_CORPUS = "Labour Code Collection"
TARGET_CORPUS = "Wage Earner Collection"
The dashboard provides REST API endpoints:
- GET / - Main dashboard
- POST /search - Search comparisons
- GET /api/analytics - Analytics data
- GET /api/visualizations - Visualization data
- GET /api/corpora - Available corpora
- GET /corpus/<name> - Corpus analysis page
- POST /export/csv - Export to CSV
- POST /export/excel - Export to Excel
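As a rough illustration of calling the GET endpoints from a script, the sketch below uses only the standard library. It assumes the dashboard is running locally on port 8080 and that the /api/... routes return JSON; the response format is not documented here, so treat that as an assumption to verify. The helper names are hypothetical.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Default address from the installation steps; change it if you run elsewhere.
BASE_URL = "http://localhost:8080"


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Build a full URL for one of the dashboard routes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def get_json(path: str, base_url: str = BASE_URL, timeout: float = 10.0):
    """GET a route and decode the body as JSON (assumed response format)."""
    with urlopen(endpoint_url(path, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `get_json("/api/corpora")` would fetch the list of available corpora. The POST routes are left out because their expected request bodies are not described in this document.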
- Limit search results for faster response
- Use quality score filter to reduce data
- Consider pagination for large result sets
- Caching enabled for expensive operations
- Data loading optimized with pandas
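If you do add pagination for large result sets, the core logic is small. The sketch below slices an in-memory list of rows into pages; it is a generic illustration with a hypothetical `paginate` helper, not a description of how app.py currently returns results.

```python
from math import ceil
from typing import Sequence, TypeVar

T = TypeVar("T")


def paginate(rows: Sequence[T], page: int, per_page: int = 50) -> tuple[list[T], int]:
    """Return (rows for the requested 1-based page, total number of pages)."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive integers")
    total_pages = max(1, ceil(len(rows) / per_page))
    start = (page - 1) * per_page
    # Slicing past the end yields an empty list rather than raising.
    return list(rows[start:start + per_page]), total_pages
```

For very large tables, applying LIMIT and OFFSET in the SQL query avoids loading every row into memory first; the slicing version is simply the easiest to drop into existing code.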
Database not found
- Ensure you have obtained the required data files (see Data Files section)
- Verify the overlap_results/ directory exists with the database files
- Check file paths are correct in your configuration
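One detail worth knowing when diagnosing this error: `sqlite3.connect()` silently creates an empty database file if the path does not exist (as long as the directory does), so a successful connection alone does not prove the data is there. A minimal sketch of a stricter check, using a hypothetical `check_database` helper and a read-only URI connection:

```python
import sqlite3
from pathlib import Path


def check_database(db_path: str) -> bool:
    """Return True only if db_path exists and can be read as a SQLite file.

    Hypothetical helper for troubleshooting; it is not part of app.py.
    """
    path = Path(db_path)
    if not path.is_file():
        return False
    # mode=ro opens read-only, so nothing is created or modified.
    uri = path.resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return False
    try:
        # Reading sqlite_master fails if the file is not a SQLite database.
        conn.execute("SELECT name FROM sqlite_master LIMIT 1").fetchall()
        return True
    except sqlite3.DatabaseError:
        return False
    finally:
        conn.close()
```

Calling `check_database("./overlap_results/labour_vs_wage_earner_comparison.db")` from the project root returns False when the file is missing, without leaving a stray empty file behind.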
Import errors
- Install all dependencies: pip install -r requirements.txt
- Check Python version (3.10+ recommended)
Charts not displaying
- Check browser console for JavaScript errors
- Verify Chart.js CDN is accessible
- Clear browser cache
- Update app.py with new routes/logic
- Create/modify templates in templates/
- Update styles if needed (using Tailwind classes)
- Test locally before deployment
# Test imports
python -c "from app import app; print('[OK]')"
# Test database connection
python -c "import sqlite3; conn = sqlite3.connect('./overlap_results/labour_vs_wage_earner_comparison.db'); print('[OK] Database connected')"
This is a research project. For questions or issues:
- Check existing documentation
- Check error logs for specific issues
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
When using this software, you must comply with the Apache License 2.0 attribution requirements:
- Include Copyright Notice: You must include a copy of the Apache 2.0 license with any distribution of this software.
- State Modifications: If you modify this software, you must clearly indicate what changes were made.
- Retain Attribution: All original copyright notices, patent notices, trademark notices, and attribution notices from the source code must be retained.
- Include NOTICE File: You must include a readable copy of the NOTICE file in any derivative works.
Example attribution in documentation or about page:
This software includes code from Corpora Comparison App
Copyright 2024-2025 Pierre-Olivier Bonin
Licensed under the Apache License, Version 2.0
For the complete license terms, see the LICENSE file.
- Flask framework
- Chart.js for visualizations
- Tailwind CSS for styling
- HTMX for dynamic interactions
Version: 1.0.0
Last Updated: 2025-10-09
Python: 3.10+
Flask: 3.0.0