A practical guide to modern data quality architecture
Data Quality on Read (DQOR) is a modern data architecture principle that mirrors the well-known "Schema on Read" approach used in data lakes and lakehouses. Instead of enforcing a strict schema or performing cleansing, validation, and enrichment at ingest time (which can slow pipelines, block velocity, or require perfect upfront knowledge), DQOR defers data quality processing until the moment the data is actually read or queried.
The core idea is simple:
Ingest raw data as fast as possible, then apply profiling, validation, corrections, and enhancements on demand—at read time.
This delivers agility, cost efficiency, and flexibility while still providing downstream consumers with trustworthy views of the data when they need them—without ever overwriting or losing the original source material.
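As a concrete illustration of the pattern, here is a minimal Python sketch (with entirely hypothetical names, not DataRadar's implementation): records are stored verbatim at ingest, and quality rules run only when a consumer reads.

# Minimal Data Quality on Read sketch: ingest is a fast append of the raw
# record; validation runs lazily at read time and never mutates the source.
raw_store = []  # stands in for a data lake or object store

def ingest(record: dict) -> None:
    raw_store.append(record)  # no schema enforcement, no cleansing

def read(validators: dict) -> list:
    results = []
    for rec in raw_store:
        annotated = dict(rec)  # copy: the raw record is never overwritten
        annotated["_issues"] = [name for name, check in validators.items()
                                if not check(rec)]
        results.append(annotated)
    return results

ingest({"postcode": "EC1A 1BB"})
ingest({"postcode": "banana"})
rules = {"postcode_shape": lambda r: any(c.isdigit() for c in r.get("postcode", ""))}
print(read(rules))

Consumers that need strict quality pass in demanding validators; exploratory users can read the raw records with none at all.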
DataRadar and its underlying open-source engine bytefreq are purpose-built implementations of Data Quality on Read, designed from the ground up for real-world messy data—especially open data sources.
Open data is a treasure trove, but it comes with well-known risks: inconsistent formats, encoding problems, missing or malformed values, and silent schema drift.
Traditional approaches force you to either build heavy upfront ETL (slow and brittle) or risk propagating errors downstream.
DQOR changes the game: ingest raw, profile instantly, and let consumers choose their level of quality—dramatically de-risking solutions built on open data.
Most data profiling tools struggle with truly global, multilingual datasets. Many assume Latin scripts, falter on encoding detection, or lack deep Unicode analysis.
DataRadar excels here with sophisticated, client-side handling of international data:
Example: profiling a global places dataset surfaces the scripts present in each field and flags mixed-script values.
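The standard library is enough to sketch the idea (a rough approximation only: deriving a script from the Unicode character name is far cruder than DataRadar's actual analysis):

# Rough per-value script detection; the first word of a character's Unicode
# name ("CYRILLIC CAPITAL LETTER A" -> "CYRILLIC") stands in for its script.
import unicodedata

def scripts_in(value: str) -> set:
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in value if ch.isalpha()}

for place in ["München", "東京", "Москва", "Tokyo東京"]:
    found = scripts_in(place)
    note = "  <- mixed scripts" if len(found) > 1 else ""
    print(f"{place}: {sorted(found)}{note}")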
This enables safe exploration of international open data (e.g., global POI datasets, multilingual surveys) while spotting encoding issues or mixed-script anomalies early.
DataRadar provides a progressive ladder of DQOR capabilities:
Upload CSV, Excel, JSON/NDJSON, or paste a URL. All processing happens client-side—your data never leaves your machine. Instantly see frequency distributions, pattern discovery, script detection, format inference, and recommended fixes.
For nested or semi-structured data, enable "Flat Enhanced" output. This produces a tabular-friendly structure where every field is enriched with multiple parallel layers:
.raw – the original, untouched value, preserved for provenance: it guarantees the source data is never overwritten or lost, providing an immutable audit trail back to the original open data feed.
.HU / .LU – anonymized pattern masks (working across all scripts) for safe sharing and review.
.Rules – automatically inferred properties and suggested treatments (e.g., is_unix_timestamp, std_datetime, is_numeric, is_uk_postcode, latitude/longitude ranges).
The key philosophy: suggestions, never mandates.
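A hedged sketch of how such masks can be produced follows; the mapping rules here are assumptions for illustration, not bytefreq's exact specification:

# Unicode-aware pattern masking in the spirit of the .HU/.LU layers.
# Assumed mapping: uppercase letters -> A, other letters -> a, digits -> 9;
# punctuation and whitespace pass through. Low grain collapses repeated symbols.
import itertools
import unicodedata

def high_grain(value: str) -> str:
    masked = []
    for ch in value:
        cat = unicodedata.category(ch)  # category lookup works across all scripts
        if cat.startswith("L"):
            masked.append("A" if cat == "Lu" else "a")
        elif cat == "Nd":
            masked.append("9")
        else:
            masked.append(ch)
    return "".join(masked)

def low_grain(value: str) -> str:
    return "".join(k for k, _ in itertools.groupby(high_grain(value)))

print(high_grain("EC1A 1BB"))  # AA9A 9AA
print(low_grain("EC1A 1BB"))   # A9A 9A

Because the masks contain no original characters, frequency tables of them can be shared for review without exposing the underlying values.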
Multiple competing rules or treatments can coexist for the same field (e.g., alternative date parsing strategies, different standardisation options). Downstream users remain in control—they can:
use the original .raw value as-is, or
adopt a suggested treatment (e.g., std_datetime).
This non-destructive, provenance-first approach ensures trust, reproducibility, and flexibility.
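To picture the shape of Flat Enhanced output, the toy flattening step below shows how nested JSON becomes dotted column paths onto which the parallel layers are attached (illustrative only):

# Toy flattening: nested JSON -> dotted paths, with the source value kept
# under a .raw suffix. DataRadar additionally attaches .HU/.LU/.Rules layers.
def flatten(obj: dict, prefix: str = "") -> dict:
    flat = {}
    for key, val in obj.items():
        path = f"{prefix}{key}"
        if isinstance(val, dict):
            flat.update(flatten(val, prefix=f"{path}."))
        else:
            flat[f"{path}.raw"] = val
    return flat

event = {"properties": {"time": 1766181640870, "mag": 1.2}}
print(flatten(event))
# {'properties.time.raw': 1766181640870, 'properties.mag.raw': 1.2}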
Open-source bytefreq CLI for larger local files; proven Spark-based enterprise engine for massive scale.
Paste USGS feed URL:
https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson
Instant profile reveals nested structure and epoch timestamps.
Enable Flat Enhanced → export flattened NDJSON.
You get columns such as:
properties.time.raw → 1766181640870
properties.time.Rules.is_unix_timestamp → "milliseconds"
properties.time.Rules.std_datetime → "2025-12-19 22:00:40 UTC"
Load the export into Pandas (Polars works the same way; the filename here is illustrative):
import pandas as pd

df = pd.read_json("all_hour_flat_enhanced.ndjson", lines=True)  # flattened NDJSON export
df['event_time'] = df['properties.time.Rules.std_datetime']  # Adopt the suggestion
df['event_time_raw'] = df['properties.time.raw']  # Or keep the original epoch if preferred
The original is always available for verification.
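How might a rule such as is_unix_timestamp conclude "milliseconds"? One plausible approach is a magnitude heuristic: second-precision epochs for recent dates sit near 10^9, millisecond-precision ones near 10^12. The sketch below is an assumption for illustration, not DataRadar's documented inference logic.

# Assumed magnitude heuristic for epoch unit inference (illustrative only).
from datetime import datetime, timezone

def epoch_unit(value: int):
    if 10**9 <= value < 10**10:    # roughly years 2001-2286, in seconds
        return "seconds"
    if 10**12 <= value < 10**13:   # the same range, in milliseconds
        return "milliseconds"
    return None

v = 1766181640870
print(epoch_unit(v))  # milliseconds
print(datetime.fromtimestamp(v / 1000, tz=timezone.utc))  # 2025-12-19 22:00:40.870000+00:00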
Upload a CSV of global points of interest (names in native scripts, mixed addresses).
DataRadar instantly detects scripts, patterns, and potential geo fields. Export with provenance layers intact for safe downstream use or stakeholder review.
If the upstream data changes, simply re-profile: the .raw values remain the single source of truth.
Minutes to insight, not days of cleaning
Catch issues early without risking original data
Raw values preserved forever; suggestions are optional and auditable
Full client-side processing + cross-script masking
Handles real international data out-of-the-box
Consumers choose their preferred treatment—or none
Adapts to drift without destructive rewrites
No infrastructure needed for exploration
Data Quality on Read turns open data—from local councils to global feeds—into a reliable, low-risk, and auditable foundation.
Ready to experience it? Visit dataradar.co.uk, paste any open data URL, and see the insights appear instantly.