Skip to main content
Back to Use Cases

Open-data capability demonstration: Water & Environment

Wastewater Effluent Data Quality

Continuous effluent monitoring is only as useful as the data is trustworthy. This is a worked, fully open demonstration of how we make it trustworthy: statistics to find the anomalies, and declarative rules to encode the physics so impossible records can never pass.

Data analysed

1,382

days of full-scale works data

Physical invariant

100%

of rows pass COD ≥ BOD

Validation rules

8

SHACL shapes (R1–R8)

The Challenge

The UK is rolling out continuous water-quality monitoring on wastewater assets under the Environment Act 2021, and the Environment Agency is exploring whether that data can serve as a regulatory tool and a transparent record for the public. The hard question underneath is data quality: calibration drift, fouled probes, telemetry gaps and transcription errors all masquerade as real water-quality signals. Before continuous monitoring can carry regulatory weight, someone has to be able to say, defensibly, which readings are trustworthy.

To demonstrate our method on real, open data, we analysed a full-scale wastewater treatment works dataset of 1,382 daily records carrying the regulated effluent suite (ammonia, BOD, COD, total nitrogen, plus inflow and outflow). Source: a public dataset released under CC BY-SA 4.0.

The quality-assurance battery

Two layers work together. Statistics finds the anomalies; a declarative ontology and SHACL rule set of eight shapes (R1–R8: date, four determinand ranges, non-empty record, the COD ≥ BOD invariant and a flow-conservation check) encodes the domain physics that decides which anomalies are merely unusual and which are impossible. The six check families below are reproducible and open source.

Five years of wastewater effluent data-quality analysis: ammonia, BOD, COD and total nitrogen with flagged statistical outliers.
QA pass over five years of effluent determinands. The series is complete and internally consistent, yet the battery still surfaces dozens of statistical outliers and a multi-year COD baseline drift of +115 mg/L that no single-point check would catch.
#CheckWhat it catches
1Completeness against cadenceDetects telemetry dropouts and maintenance gaps where a monitor reports less often than its design interval.
2Stuck-sensor / flatline detectionFlags runs of invariant values, the signature of a frozen or fouled probe reporting a constant.
3Robust outlier detection (MAD)Median-absolute-deviation spikes, so genuine pollution events are not masked by their own influence on a mean-based threshold.
4Physical-range plausibilityPer-determinand non-negative bands (ammonia, BOD, COD, total nitrogen) that reject impossible readings.
5COD >= BOD cross-parameter invariantChemical Oxygen Demand can never be below Biochemical Oxygen Demand. A record where it is, is physically impossible and is rejected, something no single-column check can see.
6Baseline drift estimationQuantifies slow shifts between service visits, the calibration-and-fouling drift that single-point range checks miss.

Outcome

The analysis is complete and internally consistent: the COD ≥ BOD invariant holds on every one of 1,382 rows, a strong curation signal. But the quality-assurance pass still surfaces genuine structure a trend analyst must handle, including a multi-year COD baseline drift and dozens of statistical outliers per determinand. The declarative SHACL rules catch every planted physically-impossible record and pass every clean one, with a named reason for each rejection. The verdict on whether the data can be trusted is therefore auditable, which is exactly what a regulator needs.

The full ontology, the SHACL data-quality shapes, the statistical analysis and the worked examples are published open source as a case study in Open Ontologies, our open data-validation platform.

"Classification of whether data can be trusted should be declarative, not learned. Statistics finds the anomalies; the ontology encodes the physics, such as the simple fact that chemical oxygen demand can never fall below biochemical oxygen demand, that decides which anomalies are impossible. That separation is what makes data quality auditable."

Tesseract Academy

Explore the open-source case study

Ontology, SHACL rules and analysis on GitHub. Data under CC BY-SA 4.0.