A practical guide to modern data quality architecture
Data Quality on Read (DQOR) is a modern data architecture principle that mirrors the well-known "Schema on Read" approach used in data lakes and lakehouses. Instead of enforcing a strict schema or performing cleansing, validation, and enrichment at ingest time (which can slow pipelines, block velocity, or require perfect upfront knowledge), DQOR defers data quality processing until the moment the data is actually read or queried.
The core idea is simple:
Ingest raw data as fast as possible, then apply profiling, validation, corrections, and enhancements on demand—at read time.
This delivers agility, cost efficiency, and flexibility while still providing downstream consumers with trustworthy views of the data when they need them—without ever overwriting or losing the original source material.
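As a concrete illustration of the pattern, here is a minimal Python sketch (with entirely hypothetical names, not DataRadar's implementation): records are stored verbatim at ingest, and quality rules run only when a consumer reads.

# Minimal Data Quality on Read sketch: ingest is a fast append of the raw
# record; validation runs lazily at read time and never mutates the source.
raw_store = []  # stands in for a data lake or object store

def ingest(record: dict) -> None:
    raw_store.append(record)  # no schema enforcement, no cleansing

def read(validators: dict) -> list:
    results = []
    for rec in raw_store:
        annotated = dict(rec)  # copy: the raw record is never overwritten
        annotated["_issues"] = [name for name, check in validators.items()
                                if not check(rec)]
        results.append(annotated)
    return results

ingest({"postcode": "EC1A 1BB"})
ingest({"postcode": "banana"})
rules = {"postcode_shape": lambda r: any(c.isdigit() for c in r.get("postcode", ""))}
print(read(rules))

Consumers that need strict quality pass in demanding validators; exploratory users can read the raw records with none at all.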
DataRadar and its underlying open-source engine bytefreq are purpose-built implementations of Data Quality on Read, designed from the ground up for real-world messy data—especially open data sources.
Open data is a treasure trove, but it comes with well-known risks: inconsistent formats, encoding problems, missing or malformed values, and silent schema drift.
Traditional approaches force you to either build heavy upfront ETL (slow and brittle) or risk propagating errors downstream.
DQOR changes the game: ingest raw, profile instantly, and let consumers choose their level of quality—dramatically de-risking solutions built on open data.
Most data profiling tools struggle with truly global, multilingual datasets. Many assume Latin scripts, falter on encoding detection, or lack deep Unicode analysis.
DataRadar excels here with sophisticated, client-side handling of international data:
Example: profiling a global places dataset surfaces the scripts present in each field and flags mixed-script values.
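The standard library is enough to sketch the idea (a rough approximation only: deriving a script from the Unicode character name is far cruder than DataRadar's actual analysis):

# Rough per-value script detection; the first word of a character's Unicode
# name ("CYRILLIC CAPITAL LETTER A" -> "CYRILLIC") stands in for its script.
import unicodedata

def scripts_in(value: str) -> set:
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in value if ch.isalpha()}

for place in ["München", "東京", "Москва", "Tokyo東京"]:
    found = scripts_in(place)
    note = "  <- mixed scripts" if len(found) > 1 else ""
    print(f"{place}: {sorted(found)}{note}")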
This enables safe exploration of international open data (e.g., global POI datasets, multilingual surveys) while spotting encoding issues or mixed-script anomalies early.
DataRadar provides a progressive ladder of DQOR capabilities:
Upload CSV, Excel, JSON/NDJSON, or paste a URL. All processing happens client-side—your data never leaves your machine. Instantly see frequency distributions, pattern discovery, script detection, format inference, and recommended fixes.
For nested or semi-structured data, enable "Flat Enhanced" output. This produces a tabular-friendly structure where every field is enriched with multiple parallel layers:
.raw – the original, untouched value, preserved for provenance: it guarantees the source data is never overwritten or lost, providing an immutable audit trail back to the original open data feed.
.HU / .LU – anonymized pattern masks (working across all scripts) for safe sharing and review.
.Rules – automatically inferred properties and suggested treatments (e.g., is_unix_timestamp, std_datetime, is_numeric, is_uk_postcode, latitude/longitude ranges).
The key philosophy: suggestions, never mandates.
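A hedged sketch of how such masks can be produced follows; the mapping rules here are assumptions for illustration, not bytefreq's exact specification:

# Unicode-aware pattern masking in the spirit of the .HU/.LU layers.
# Assumed mapping: uppercase letters -> A, other letters -> a, digits -> 9;
# punctuation and whitespace pass through. Low grain collapses repeated symbols.
import itertools
import unicodedata

def high_grain(value: str) -> str:
    masked = []
    for ch in value:
        cat = unicodedata.category(ch)  # category lookup works across all scripts
        if cat.startswith("L"):
            masked.append("A" if cat == "Lu" else "a")
        elif cat == "Nd":
            masked.append("9")
        else:
            masked.append(ch)
    return "".join(masked)

def low_grain(value: str) -> str:
    return "".join(k for k, _ in itertools.groupby(high_grain(value)))

print(high_grain("EC1A 1BB"))  # AA9A 9AA
print(low_grain("EC1A 1BB"))   # A9A 9A

Because the masks contain no original characters, frequency tables of them can be shared for review without exposing the underlying values.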
Multiple competing rules or treatments can coexist for the same field (e.g., alternative date parsing strategies, different standardisation options). Downstream users remain in control—they can:
use the original .raw value as-is, or
adopt a suggested treatment (e.g., std_datetime).
This non-destructive, provenance-first approach ensures trust, reproducibility, and flexibility.
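To picture the shape of Flat Enhanced output, the toy flattening step below shows how nested JSON becomes dotted column paths onto which the parallel layers are attached (illustrative only):

# Toy flattening: nested JSON -> dotted paths, with the source value kept
# under a .raw suffix. DataRadar additionally attaches .HU/.LU/.Rules layers.
def flatten(obj: dict, prefix: str = "") -> dict:
    flat = {}
    for key, val in obj.items():
        path = f"{prefix}{key}"
        if isinstance(val, dict):
            flat.update(flatten(val, prefix=f"{path}."))
        else:
            flat[f"{path}.raw"] = val
    return flat

event = {"properties": {"time": 1766181640870, "mag": 1.2}}
print(flatten(event))
# {'properties.time.raw': 1766181640870, 'properties.mag.raw': 1.2}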
Open-source bytefreq CLI for larger local files; proven Spark-based enterprise engine for massive scale.
Paste USGS feed URL:
https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson
Instant profile reveals nested structure and epoch timestamps.
Enable Flat Enhanced → export flattened NDJSON.
You get columns such as:
properties.time.raw → 1766181640870
properties.time.Rules.is_unix_timestamp → "milliseconds"
properties.time.Rules.std_datetime → "2025-12-19 22:00:40 UTC"
Load the export into Pandas (Polars works the same way; the filename here is illustrative):
import pandas as pd

df = pd.read_json("all_hour_flat_enhanced.ndjson", lines=True)  # flattened NDJSON export
df['event_time'] = df['properties.time.Rules.std_datetime']  # Adopt the suggestion
df['event_time_raw'] = df['properties.time.raw']  # Or keep the original epoch if preferred
The original is always available for verification.
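How might a rule such as is_unix_timestamp conclude "milliseconds"? One plausible approach is a magnitude heuristic: second-precision epochs for recent dates sit near 10^9, millisecond-precision ones near 10^12. The sketch below is an assumption for illustration, not DataRadar's documented inference logic.

# Assumed magnitude heuristic for epoch unit inference (illustrative only).
from datetime import datetime, timezone

def epoch_unit(value: int):
    if 10**9 <= value < 10**10:    # roughly years 2001-2286, in seconds
        return "seconds"
    if 10**12 <= value < 10**13:   # the same range, in milliseconds
        return "milliseconds"
    return None

v = 1766181640870
print(epoch_unit(v))  # milliseconds
print(datetime.fromtimestamp(v / 1000, tz=timezone.utc))  # 2025-12-19 22:00:40.870000+00:00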
Upload a CSV of global points of interest (names in native scripts, mixed addresses).
DataRadar instantly detects scripts, patterns, and potential geo fields. Export with provenance layers intact for safe downstream use or stakeholder review.
If the upstream data changes, simply re-profile: the .raw values remain the single source of truth.
Minutes to insight, not days of cleaning
Catch issues early without risking original data
Raw values preserved forever; suggestions are optional and auditable
Full client-side processing + cross-script masking
Handles real international data out-of-the-box
Consumers choose their preferred treatment—or none
Adapts to drift without destructive rewrites
No infrastructure needed for exploration
Data Quality on Read turns open data—from local councils to global feeds—into a reliable, low-risk, and auditable foundation.
Ready to experience it? Visit dataradar.co.uk, paste any open data URL, and see the insights appear instantly.