WQSurrogateModels is a FastAPI backend for WQI5-based water quality assessment. It provides a direct WQI5 formula baseline, surrogate regression models, API endpoints, and scripts for reproducing the experiments.
Scope: this repository assesses current water quality state from five physicochemical indicators. It does not perform temporal forecasting because the committed dataset does not contain timestamps.
It provides:
- a direct_wqi5 baseline
- surrogate regression models
- /api/v2/* endpoints for WaterMirror and other HTTP clients
- reproducibility scripts and experiment documentation
This project is part of a two-repository system:
- WaterMirror: cross-platform mobile frontend for data entry, CSV upload, and result visualization
- WQSurrogateModels: FastAPI backend and reproducibility repository for WQI5-based current-state water quality assessment
WaterMirror depends on the API contract exposed by this repository. WQSurrogateModels can also be used independently through curl, Postman, or custom scripts.
- serves a FastAPI backend for WQI5 assessment
- supports a direct_wqi5 formula baseline
- supports surrogate regression models: lr, mpr, svm, rf, xgboost, lightgbm
- provides reproducibility scripts and experiment configuration
- keeps compatibility with legacy endpoints while treating /api/v2/* as the primary contract
- direct_wqi5: computes the WQI5 score directly from the documented formula.
- surrogate model: a regression model trained to approximate WQI5 scores from the same five indicators.
- complete-input model: a model that requires all five indicators: DO, BOD, NH3N, EC, and SS.
- missing-indicator experiment: an experiment that evaluates model behavior when one or more indicators are unavailable. The committed complete-input artifacts are not incomplete-input models.
- 107-window stress test: a repository-specific synthetic perturbation analysis over consecutive external hold-out windows. It is not a new validation method and should not be called cross-validation.
flowchart LR
A[WaterMirror user input or CSV upload] --> B[WaterMirror frontend]
B --> C[POST /api/v2/assessment or /api/v2/assessment/csv/summary]
C --> D[WQSurrogateModels FastAPI service]
D --> E[Input validation and assessment warnings]
E --> F{Model selection}
F --> G[direct_wqi5 baseline]
F --> H[Surrogate regressors: lr mpr svm rf xgboost lightgbm]
G --> I[WQI5 score category rating range]
H --> I
I --> J[Result payload]
J --> B
Copy .env.example to .env and adjust values if needed.
cp .env.example .env
Key variables:
- MODEL_DIR=models
- DEFAULT_MODEL=direct_wqi5
- API_HOST=0.0.0.0
- API_PORT=8001
- AUTO_PORT=false
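As a rough illustration of how these variables could be consumed at startup, the sketch below reads them from a mapping and falls back to the defaults listed above. The `Settings` class and `load_settings` function are hypothetical names used only for this example. They are not taken from the repository's configuration code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in .env.example."""

    model_dir: str
    default_model: str
    api_host: str
    api_port: int
    auto_port: bool


def load_settings(env=None):
    """Build Settings from a mapping; defaults to the process environment."""
    env = os.environ if env is None else env
    return Settings(
        model_dir=env.get("MODEL_DIR", "models"),
        default_model=env.get("DEFAULT_MODEL", "direct_wqi5"),
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8001")),
        # Assumption: only the string "true" (any letter case) enables the flag.
        auto_port=env.get("AUTO_PORT", "false").strip().lower() == "true",
    )
```

Passing an explicit dictionary instead of relying on `os.environ` keeps the function easy to exercise in tests.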
pip install .
For development and tests:
pip install -e ".[dev]"
Local or externally provided scikit-learn surrogate artifacts should be loaded
with the compatible scikit-learn version used during export. Model binaries are
not committed to Git; see models/production_model_manifest.json
for the expected local paths.
To also enable the full set of surrogate models (xgboost, lightgbm):
pip install -e ".[dev,models]"
Model binaries are local artifacts and are not committed to Git. The current model package exports one complete-input API artifact for each surrogate model:
models/LightGBM/modelLGBMVer.2.0-50000-seed0.pkl
models/LR/modelLRVer.2.0-50000-seed0.pkl
models/MPR/modelMPRVer.2.0-50000-seed3.pkl
models/RF/modelRFVer.2.0-50000-seed0.pkl
models/SVM/modelSVMVer.2.0-50000-seed3.pkl
models/XGBoost/modelXGBVer.2.0-50000-seed2.pkl
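Because these binaries are not committed, a quick presence check before starting the server can save a confusing failure later. The following is a minimal sketch that only tests whether the files listed above exist. `EXPECTED_ARTIFACTS` and `missing_artifacts` are hypothetical names, and the sketch does not read models/production_model_manifest.json, whose schema is not described here.

```python
from pathlib import Path

# Relative paths copied from the list above; the mapping itself is illustrative.
EXPECTED_ARTIFACTS = {
    "lightgbm": "models/LightGBM/modelLGBMVer.2.0-50000-seed0.pkl",
    "lr": "models/LR/modelLRVer.2.0-50000-seed0.pkl",
    "mpr": "models/MPR/modelMPRVer.2.0-50000-seed3.pkl",
    "rf": "models/RF/modelRFVer.2.0-50000-seed0.pkl",
    "svm": "models/SVM/modelSVMVer.2.0-50000-seed3.pkl",
    "xgboost": "models/XGBoost/modelXGBVer.2.0-50000-seed2.pkl",
}


def missing_artifacts(repo_root="."):
    """Return the sorted model types whose artifact file is not present."""
    root = Path(repo_root)
    return sorted(
        name
        for name, relative_path in EXPECTED_ARTIFACTS.items()
        if not (root / relative_path).is_file()
    )


if __name__ == "__main__":
    for name in missing_artifacts():
        print(f"missing artifact for {name}: {EXPECTED_ARTIFACTS[name]}")
```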
The committed model artifact manifest is:
models/production_model_manifest.json
The manifest filename is retained for compatibility with existing scripts. In this documentation, it refers to local inference artifacts, not evidence of a formally validated deployment.
Each artifact is extracted from the complete-input full_reference
result with the lowest external 10,714-row hold-out MAE for that model type.
These artifacts remain complete-input WQI5 surrogates and require:
DO, BOD, NH3N, EC, SS
They should not be interpreted as models for incomplete-input cases. Legacy
API artifacts are kept locally under models/archive/legacy_v1/ for
traceability. Experiment bundles remain under ignored results_* folders.
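The complete-input requirement can be illustrated with a small guard that rejects payloads lacking any of the five indicators. This is a sketch under the assumption that a missing indicator appears as an absent key or a `None` value. The function names are hypothetical and the service's real validation may differ.

```python
REQUIRED_INDICATORS = ("DO", "BOD", "NH3N", "EC", "SS")


def find_missing_indicators(payload):
    """Return the required indicators that are absent or None in the payload."""
    return [name for name in REQUIRED_INDICATORS if payload.get(name) is None]


def require_complete_input(payload):
    """Raise ValueError when a complete-input model would lack an indicator."""
    missing = find_missing_indicators(payload)
    if missing:
        raise ValueError(
            "complete-input models require all five indicators; missing: "
            + ", ".join(missing)
        )
```

For example, `find_missing_indicators({"DO": 7.2, "BOD": 2.1})` reports `NH3N`, `EC`, and `SS` as missing.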
python main.py
If API_PORT is already occupied, the server fails fast with a clear error message by default. For local development, you can opt in to automatic fallback ports:
AUTO_PORT=true
With AUTO_PORT=true, the server tries API_PORT first and then scans upward (8002, 8003, ...) until it finds a free port.
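The fail-fast and fallback behavior described above can be sketched with the standard library. This is an illustration of the idea only. `port_is_free`, `choose_port`, and the `max_tries` limit are assumptions made for the example, not the repository's startup code.

```python
import socket


def port_is_free(host, port):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def choose_port(host, api_port, auto_port, max_tries=20):
    """Return api_port if free; otherwise scan upward when auto_port is set."""
    if port_is_free(host, api_port):
        return api_port
    if not auto_port:
        # Fail fast with an actionable message instead of a raw bind error.
        raise RuntimeError(
            f"Port {api_port} is already in use. "
            "Free it, change API_PORT, or set AUTO_PORT=true."
        )
    last = min(api_port + max_tries, 65535)
    for candidate in range(api_port + 1, last + 1):
        if port_is_free(host, candidate):
            return candidate
    raise RuntimeError(f"No free port found in {api_port + 1}..{last}.")
```

A probe of this kind is inherently racy: another process can take the port between the check and the server's own bind, so the server should still handle a bind failure.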
Primary endpoints live under /api/v2/*.
POST /api/v2/assessment
{ "DO": 7.2, "BOD": 2.1, "NH3N": 0.3, "EC": 450, "SS": 12, "model_type": "lightgbm" }
Legacy compatibility endpoints such as POST /predict, POST /score/total/, and GET /status are retained but deprecated.
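For scripted clients, the request above can be built with the Python standard library. The sketch assumes the service runs locally on the default API_PORT (8001). The helper names are hypothetical, and the response is returned as parsed JSON without assuming any particular field names, since the response schema is not shown here.

```python
import json
import urllib.request


def build_assessment_request(base_url, indicators, model_type="direct_wqi5"):
    """Build a POST request for /api/v2/assessment without sending it."""
    payload = dict(indicators, model_type=model_type)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v2/assessment",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_assessment(request, timeout=10.0):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    sample = {"DO": 7.2, "BOD": 2.1, "NH3N": 0.3, "EC": 450, "SS": 12}
    # Assumption: a local server started with python main.py on port 8001.
    request = build_assessment_request("http://localhost:8001", sample, "lightgbm")
    print(post_assessment(request))
```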
User and API:
Methodology:
Experiments and statistics:
- Revised Experiment Protocol
- Sample-Size Experiments
- Missing-Indicator Experiments
- Statistical Analysis
- Statistics Output Guide
Archive:
Run:
pip install -e ".[dev]"
python scripts/reproduce_results.py --config configs/experiment_config.yaml --output-dir results/verification_run
If you use the local WQI conda environment and want to run the full experiment (all models including xgboost/lightgbm):
conda activate WQI
pip install -e ".[models]"
python scripts/reproduce_results.py --config configs/experiment_config.yaml --output-dir results/verification_run
To protect archived result outputs, the script refuses to overwrite an existing results directory unless --overwrite is passed explicitly.
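The overwrite protection can be pictured as a small guard that runs before any output is written. The sketch below is an assumption about the general shape of such a check, with a hypothetical function name. In particular, whether the real script deletes old files when --overwrite is passed is not stated, so this version leaves existing files in place.

```python
from pathlib import Path


def prepare_output_dir(output_dir, overwrite=False):
    """Create the output directory, refusing to reuse an existing one by default."""
    path = Path(output_dir)
    if path.exists() and not overwrite:
        raise FileExistsError(
            f"{path} already exists; pass --overwrite to reuse it."
        )
    # Assumption: with overwrite enabled, existing files are kept and may be replaced.
    path.mkdir(parents=True, exist_ok=True)
    return path
```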
Run the missing-indicator core experiments:
python scripts/run_missing_indicator_experiments.py \
--config configs/missing_indicator_config.yaml \
--output-dir results/missing_indicator_core_run \
--compute-device gpu \
--gpu-id 0
This workflow saves model artifacts, internal-test predictions, external
10,714-row inference predictions, summary metrics, confidence intervals,
paired tests, and stress-scenario summaries into the selected output directory.
Run the missing-indicator workflow with single-indicator missing settings, event-window stress testing, the 107-window stress test, and CPU-only timing support:
python scripts/run_missing_indicator_robustness_experiments.py \
--config configs/missing_indicator_robustness_config.yaml \
--output-dir results/missing_indicator_robustness_run
python scripts/measure_missing_indicator_cpu_timing.py \
--output-dir results/missing_indicator_robustness_run
python scripts/run_stress107_event_windows.py \
--artifact-dir results/missing_indicator_robustness_run \
--output-dir results/stress107_run
python scripts/export_missing_indicator_robustness_excel.py \
--output-dir results/stress107_run
The 107-window stress test divides the external 10,714-row hold-out into
107 consecutive event windows and applies 30%, 100%, and 300% synthetic
perturbations. The stress107 filename prefix is repository-specific. It should
not be described as 107-fold cross-validation; these are event locations, not
training-validation folds.
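To make the distinction from cross-validation concrete, the sketch below splits row positions into consecutive windows and perturbs one window at a time, with no retraining involved. Two details are assumptions made for illustration: windows are sized as evenly as possible, and a perturbation level of 30% is applied as a multiplicative increase of the affected values. The repository's scripts may define both differently.

```python
# Perturbation levels from the text, expressed as fractions (30%, 100%, 300%).
PERTURBATION_LEVELS = (0.30, 1.00, 3.00)


def consecutive_windows(n_rows, n_windows):
    """Return (start, stop) bounds of n_windows consecutive, near-equal windows."""
    base, extra = divmod(n_rows, n_windows)
    bounds = []
    start = 0
    for index in range(n_windows):
        size = base + (1 if index < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def perturb_window(values, window, level):
    """Return a copy of values with rows inside the window scaled by (1 + level)."""
    start, stop = window
    perturbed = list(values)
    for position in range(start, stop):
        perturbed[position] = values[position] * (1.0 + level)
    return perturbed


if __name__ == "__main__":
    windows = consecutive_windows(10714, 107)
    print(len(windows), windows[0], windows[-1])
```

Each window marks where a synthetic event is injected into the hold-out data. The models are evaluated on the perturbed rows but never refit, which is why the windows are event locations rather than folds.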
Prepare result tables and local inference artifacts from the organized result bundle:
python scripts/prepare_statistics_outputs.py \
--bundle-dir results/manuscript_package \
--complete-input-gpu-dir results/complete_input_gpu \
--output-dir statistics/outputs \
--update-production-model \
--archive-legacy-50000-artifacts
The --update-production-model flag name is retained for script compatibility.
It updates local inference artifacts and the model artifact manifest.
Result-table outputs are written to:
- statistics/outputs/complete_input_performance.csv
- statistics/outputs/missing_indicator_robustness.csv
- statistics/outputs/cpu_only_timing.csv
- statistics/outputs/stress107_summary.csv
- statistics/outputs/bootstrap_ci.csv
- statistics/outputs/paired_error_tests.csv
- statistics/outputs/sample_size_sensitivity.csv
- statistics/outputs/sample_size_metrics_by_fold.csv
GPU and multicore CPU acceleration may be used to reproduce the model-comparison experiments. CPU-only timing is reported separately as a rough inference-time reference for constrained CPU environments.
Prepare the sample-size result tables from the consolidated local sample-size run:
python scripts/prepare_sample_size_outputs.py \
--metrics-dir results/sample_size_experiments/metrics \
--output-dir statistics/outputs
Large experiment outputs are organized under results/ and are not committed
to Git. The current local layout is:
- results/complete_input_cpu/: complete-input repeated validation on CPU.
- results/complete_input_gpu/: complete-input repeated validation with GPU acceleration for supported models.
- results/reduced_indicator_cpu/: reduced-indicator experiment on CPU.
- results/reduced_indicator_gpu/: reduced-indicator experiment with GPU acceleration for supported models.
- results/missing_indicator_core/: core missing-indicator experiment with saved models, predictions, metrics, confidence intervals, paired tests, and stress summaries.
- results/missing_indicator_robustness/: single-indicator and combined missing-indicator robustness results, including CPU-only timing outputs.
- results/stress107/: 107 sequential event-window stress-test outputs.
- results/manuscript_package/: organized CSV files and Excel workbooks for result tables and discussion.
- results/sample_size_experiments/: consolidated 1,000, 5,000, 10,000, and 50,000 row sample-size experiment outputs.
Model binaries under models/*/*.pkl are also local artifacts and are not
committed. models/production_model_manifest.json records the expected local
paths and source experiment artifacts for the six supported model families.
The table below describes the current reproducibility workflow. Archived exploratory scripts may use GridSearchCV and library defaults; see docs/original-benchmark-protocol.md.
| Model | Library | Preprocessing | Key Hyperparameters |
|---|---|---|---|
| direct_wqi5 | formula baseline | none | direct WQI5 equation |
| lr | scikit-learn | mean imputation + standard scaling | default LinearRegression() |
| mpr | scikit-learn | mean imputation + polynomial features + standard scaling | degree=2, include_bias=False |
| svm | scikit-learn | mean imputation + standard scaling | kernel=rbf, C=10.0, epsilon=0.1 |
| rf | scikit-learn | mean imputation | n_estimators=300, random_state=0, n_jobs=-1 |
| xgboost | xgboost | mean imputation | n_estimators=300, max_depth=6, learning_rate=0.05, subsample=0.9, colsample_bytree=0.9, random_state=0 |
| lightgbm | lightgbm | mean imputation | n_estimators=300, learning_rate=0.05, random_state=0 |
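As an illustration of how the scikit-learn rows of this table translate into estimators, the sketch below assembles one pipeline per model with the listed preprocessing order and hyperparameters. It is a reading of the table, not the repository's training code. `build_sklearn_pipelines` is a hypothetical name, and the xgboost and lightgbm rows are omitted so the example depends only on scikit-learn.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR


def build_sklearn_pipelines():
    """Return one unfitted pipeline per scikit-learn model in the table."""
    return {
        "lr": make_pipeline(
            SimpleImputer(strategy="mean"),
            StandardScaler(),
            LinearRegression(),
        ),
        "mpr": make_pipeline(
            SimpleImputer(strategy="mean"),
            PolynomialFeatures(degree=2, include_bias=False),
            StandardScaler(),
            LinearRegression(),
        ),
        "svm": make_pipeline(
            SimpleImputer(strategy="mean"),
            StandardScaler(),
            SVR(kernel="rbf", C=10.0, epsilon=0.1),
        ),
        "rf": make_pipeline(
            SimpleImputer(strategy="mean"),
            RandomForestRegressor(n_estimators=300, random_state=0, n_jobs=-1),
        ),
    }
```

Each pipeline expects a feature matrix with the five indicator columns (DO, BOD, NH3N, EC, SS) and can be fitted with `pipeline.fit(X, y)`, where `y` holds the WQI5 scores.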
Repeated validation uses stratified random splits over WQI5 categories with seeds 0, 1, 2, 3, 4.
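A stratified random split keeps the proportion of each WQI5 category roughly equal in the training and test parts, and repeating it over fixed seeds makes the runs reproducible. The sketch below shows the idea with the standard library only. The function name and the 20% test fraction in the usage example are hypothetical, and the repository may rely on scikit-learn utilities such as `StratifiedShuffleSplit` instead.

```python
import random
from collections import defaultdict

SEEDS = (0, 1, 2, 3, 4)


def stratified_split(categories, test_fraction, seed):
    """Split row indices so each category contributes test_fraction to the test set."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for index, category in enumerate(categories):
        by_category[category].append(index)

    train, test = [], []
    # Iterate categories in sorted order so a given seed always yields the same split.
    for category in sorted(by_category):
        indices = by_category[category]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    labels = ["A"] * 10 + ["B"] * 20  # placeholder category labels
    for seed in SEEDS:
        train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=seed)
        print(seed, len(train_idx), len(test_idx))
```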
- data/: processed datasets and subsets
- models/: local inference manifest and artifact paths; model binaries are not committed
- src/: API and reusable backend logic
- scripts/: reproducibility runners
- archive/legacy_training/: archived exploratory training scripts from the older src/training layout
- configs/: experiment settings
- tests/: pytest suite
Apache License 2.0. See LICENSE.