DataRadar

Data Quality on Read: Design and Tutorial

A practical guide to modern data quality architecture

What is Data Quality on Read?

Data Quality on Read (DQOR) is a modern data architecture principle that mirrors the well-known "Schema on Read" approach used in data lakes and lakehouses. Instead of enforcing strict schema, cleansing, validation, or enrichment at ingest time (which can slow pipelines, block velocity, or require perfect upfront knowledge), DQOR defers data quality processing until the moment the data is actually read or queried.

The core idea is simple:

Ingest raw data as fast as possible, then apply profiling, validation, corrections, and enhancements on demand—at read time.

This delivers agility, cost efficiency, and flexibility while still providing downstream consumers with trustworthy views of the data when they need them—without ever overwriting or losing the original source material.
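The read-time principle can be sketched in a few lines of Python. This is a toy illustration of the concept, not DataRadar's actual implementation; `RAW_STORE`, `ingest`, and the validators are invented names:

```python
import json

# Toy sketch of DQOR: ingest stores raw records untouched;
# quality checks run lazily, only when a consumer reads.
RAW_STORE = []

def ingest(line: str) -> None:
    """Ingest as fast as possible: no validation, no cleansing."""
    RAW_STORE.append(line)

def read_with_quality(validators: dict) -> list:
    """Apply parsing and validation on demand, at read time."""
    out = []
    for line in RAW_STORE:
        record = {"raw": line}
        try:
            record["parsed"] = record_json = json.loads(line)
            record["issues"] = [name for name, check in validators.items()
                                if not check(record_json)]
        except json.JSONDecodeError:
            record["parsed"] = None
            record["issues"] = ["not_valid_json"]
        out.append(record)  # the raw line is never overwritten
    return out

ingest('{"mag": 4.2}')
ingest('not json at all')
rows = read_with_quality({"has_mag": lambda r: "mag" in r})
# rows[0] passes; rows[1] is flagged, but its raw text is preserved
```

Note that the failing record is flagged rather than dropped: the raw line travels with its issue list, so downstream consumers decide what to do with it.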

DataRadar and its underlying open-source engine bytefreq are purpose-built implementations of Data Quality on Read, designed from the ground up for real-world messy data—especially open data sources.

Why Data Quality on Read Matters for Open Data

Open data is a treasure trove, but it comes with well-known risks: inconsistent formats, encoding problems, undocumented schema changes, and silent upstream drift.

Traditional approaches force you to either build heavy upfront ETL (slow and brittle) or risk propagating errors downstream.

DQOR changes the game: ingest raw, profile instantly, and let consumers choose their level of quality—dramatically de-risking solutions built on open data.

Advanced International and Unicode Support: A Key Differentiator

Most data profiling tools struggle with truly global, multilingual datasets. Many assume Latin scripts, falter on encoding detection, or lack deep Unicode analysis.

DataRadar excels here with sophisticated, client-side handling of international data: script detection, mixed-script anomaly flagging, and encoding analysis all run in the browser. Profiling a global places dataset, for example, surfaces native-script names and encoding problems immediately.

This enables safe exploration of international open data (e.g., global POI datasets, multilingual surveys) while spotting encoding issues or mixed-script anomalies early.

How DataRadar Implements Data Quality on Read

DataRadar provides a progressive ladder of DQOR capabilities:

1. Zero-Install Browser Profiling (WebAssembly)

Upload CSV, Excel, JSON/NDJSON, or paste a URL. All processing happens client-side—your data never leaves your machine. Instantly see frequency distributions, pattern discovery, script detection, format inference, and recommended fixes.
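Pattern discovery of this kind is commonly done with character-class masking. A minimal sketch of the idea, assuming a bytefreq-style high-grain mask (upper case to A, lower case to a, digits to 9, punctuation kept):

```python
from collections import Counter

def high_grain_mask(value: str) -> str:
    """Mask letters and digits, keep punctuation, to reveal a value's shape."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)

# Frequency of masked patterns exposes outliers at a glance.
postcodes = ["SW1A 1AA", "EC2V 7HH", "ec2v7hh", "10115"]
profile = Counter(high_grain_mask(v) for v in postcodes)
# Counter({'AA9A 9AA': 2, 'aa9a9aa': 1, '99999': 1})
```

Two values share the expected shape, while the lower-case and all-digit entries stand out as candidates for cleaning, without anything being changed yet.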

2. Flat Enhanced Mode – The DQOR Superpower

For nested or semi-structured data, enable "Flat Enhanced" output. This produces a tabular-friendly structure where every field is enriched with multiple parallel layers: the original .raw value preserved verbatim, plus one or more rule-based .Rules.* suggestions (such as a standardised datetime).

The key philosophy: suggestions, never mandates.

Multiple competing rules or treatments can coexist for the same field (e.g., alternative date parsing strategies, different standardisation options). Downstream users remain in control: they can adopt a suggestion, pick between competing treatments, or ignore them all and keep the raw value.

This non-destructive, provenance-first approach ensures trust, reproducibility, and flexibility.

3. CLI and Enterprise Scaling

Open-source bytefreq CLI for larger local files; proven Spark-based enterprise engine for massive scale.

Tutorial: De-Risking an Open Data Project with DQOR

Example 1: Global Earthquake Data (GeoJSON)

Paste the USGS feed URL:

https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson

Instant profile reveals nested structure and epoch timestamps.
Enable Flat Enhanced → export flattened NDJSON.

You get columns such as:

properties.time.raw → 1766181640870
properties.time.Rules.is_unix_timestamp → "milliseconds"
properties.time.Rules.std_datetime → "2025-12-19 22:00:40 UTC"
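The suggested conversion is easy to sanity-check yourself, since the rule also tells you the timestamp is in milliseconds:

```python
from datetime import datetime, timezone

# 1766181640870 is epoch *milliseconds*, per the is_unix_timestamp rule
dt = datetime.fromtimestamp(1766181640870 / 1000, tz=timezone.utc)
dt.strftime("%Y-%m-%d %H:%M:%S UTC")  # "2025-12-19 22:00:40 UTC"
```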

Load into Pandas/Polars (Pandas shown; the NDJSON filename is illustrative):

import pandas as pd

df = pd.read_json("earthquakes_flat.ndjson", lines=True)  # the Flat Enhanced export
df['event_time'] = df['properties.time.Rules.std_datetime']  # adopt the suggestion
# Or keep the original epoch milliseconds if preferred
df['event_time_raw'] = df['properties.time.raw']

The original is always available for verification.

Example 2: International Places Dataset

Upload a CSV of global points of interest (names in native scripts, mixed addresses).

DataRadar instantly detects scripts, patterns, and potential geo fields. Export with provenance layers intact for safe downstream use or stakeholder review.
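As a rough stand-in for that script detection, Unicode character names can be used to approximate which scripts appear in a value. This is a heuristic sketch; the `scripts_in` helper is hypothetical and not DataRadar's actual method:

```python
import unicodedata

def scripts_in(text: str) -> set:
    """Approximate script detection from Unicode character names.

    The first word of a character's name (LATIN, CYRILLIC, CJK, ...)
    is a crude but useful script label; combining marks and digits
    are skipped because they are not alphabetic.
    """
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            found.add(name.split(" ")[0])
    return found

scripts_in("Café Москва")  # {'LATIN', 'CYRILLIC'} — a mixed-script value
```

A value mixing scripts where one script is expected (say, a Cyrillic character inside an otherwise Latin address field) is exactly the kind of anomaly worth flagging early.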

If the upstream data changes, simply re-profile: the .raw values remain the single source of truth.

Benefits for Open Data Projects

Speed: minutes to insight, not days of cleaning

Safety: catch issues early without risking original data

Provenance & Trust: raw values preserved forever; suggestions are optional and auditable

Privacy: full client-side processing plus cross-script masking

Global Readiness: handles real international data out of the box

Flexibility: consumers choose their preferred treatment, or none at all

Resilience: adapts to drift without destructive rewrites

Zero Cost Start: no infrastructure needed for exploration

Get Started

Data Quality on Read turns open data—from local councils to global feeds—into a reliable, low-risk, and auditable foundation.

Ready to experience it? Visit dataradar.co.uk, paste any open data URL, and see the insights appear instantly.

Try DataRadar Now

Download CLI Tool