bytefreq CLI

Command-Line Data Quality Tool

bytefreq is the command-line version of DataRadar, designed for power users who need to profile large datasets (millions of rows) on their desktop or in CI/CD pipelines. Built in Rust with multi-threaded processing, it's fast, reliable, and open source.

Features

Blazing-fast Rust implementation with multi-threaded processing (Rayon)
Supports CSV, Excel (.xlsx, .xls, .xlsb, .ods), JSON, and NDJSON formats
Handles files too large for browser tools (millions of rows)
Pipe-based Unix workflows: cat data.csv | bytefreq
Enhanced JSON output with data quality assertions
Character profiling for encoding issue detection
Configurable masking levels (High/Low Unicode)
Perfect for DIY users, data engineers, and automated pipelines

Installation

Prerequisites

You'll need the Rust toolchain installed:

# Install Rust (if not already installed)

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Install from GitHub

cargo install --git https://github.com/minkymorgan/bytefreq

Verify Installation

bytefreq --version
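
For CI/CD use, it can help to fail early when the binary is missing. The following is a minimal sketch, not part of bytefreq itself: `require_tool` is a hypothetical helper name, and the hint about `~/.cargo/bin` assumes the default location where `cargo install` places binaries.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with bytefreq):
# succeed only if the named command can be found on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    # Assumption: cargo's default install directory is ~/.cargo/bin.
    echo "error: $1 not found on PATH (is ~/.cargo/bin included?)" >&2
    return 1
}

# Typical use at the top of a pipeline script:
# require_tool bytefreq || exit 1
```

Calling `require_tool bytefreq || exit 1` before any profiling step stops the job with a readable message instead of a "command not found" error further down the script.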

Quick Start

Basic CSV Profiling

# Pipe CSV data to bytefreq

cat data.csv | bytefreq

# Use high-grain Unicode masking

cat data.csv | bytefreq -g HU
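
To get a feel for what mask-based profiling reports, the idea can be approximated with standard Unix tools. This is an illustration only, not bytefreq's implementation: it assumes the common masking convention (uppercase letter becomes `A`, lowercase becomes `a`, digit becomes `9`), handles ASCII only, and the function names `mask_high` and `mask_counts` are hypothetical.

```shell
#!/bin/sh
# Illustration only: approximates high-grain masking for ASCII input.
# Assumed convention: uppercase -> A, lowercase -> a, digit -> 9.
mask_high() {
    LC_ALL=C sed -e 's/[A-Z]/A/g' -e 's/[a-z]/a/g' -e 's/[0-9]/9/g'
}

# Count how often each mask occurs, most frequent first.
# The final sed strips the padding that uniq -c puts before the count.
mask_counts() {
    mask_high | sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Two postcode-like values share one mask; "n/a" stands out as a rare one.
printf '%s\n' 'SW1A 1AA' 'EC2M 7PY' 'n/a' | mask_counts
# prints:
# 2 AA9A 9AA
# 1 a/a
```

Rare masks at the bottom of such a count are usually the first place to look for data quality problems.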

JSON Profiling

# Profile NDJSON data

cat data.json | bytefreq -f json

# Low-grain masking for compressed patterns

cat data.json | bytefreq -f json -g LU
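
The "compressed patterns" idea behind low-grain masking can be sketched with standard tools. This is a rough approximation under stated assumptions, not bytefreq's actual algorithm: it assumes low grain means collapsing runs of the same character-class symbol into one, works on ASCII only, and `mask_low` is a hypothetical name.

```shell
#!/bin/sh
# Illustration only: approximates low-grain masking for ASCII input.
# Step 1: map each character to its class symbol (A, a, 9).
# Step 2: squeeze runs of the same class symbol into a single symbol.
mask_low() {
    LC_ALL=C sed -e 's/[A-Z]/A/g' -e 's/[a-z]/a/g' -e 's/[0-9]/9/g' \
        | tr -s 'Aa9'
}

# Values of different lengths collapse to the same compressed pattern.
printf '%s\n' 'Hello World 12345' 'Hi Bob 7' | mask_low
# prints:
# Aa Aa 9
# Aa Aa 9
```

Because length differences disappear, low-grain output tends to produce far fewer distinct patterns than high-grain output, which suits free-text fields such as names.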

Excel File Profiling

# Excel support requires a build with the excel feature enabled

cargo install --git https://github.com/minkymorgan/bytefreq --features excel

# Profile a specific sheet

bytefreq -f excel --excel-path data.xlsx --sheet 1

Character Profiling

# Analyze character frequencies (useful for encoding issues)

cat data.csv | bytefreq -r CP
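
As a rough cross-check on character profiling, raw byte frequencies can be counted with standard tools. This sketch is independent of bytefreq and does not reproduce its report format; `byte_freq` is a hypothetical helper that prints each byte value in hex followed by its count.

```shell
#!/bin/sh
# Illustration only: count how often each byte value occurs on stdin.
# Output lines look like "<hex byte> <count>", most frequent first.
byte_freq() {
    od -An -v -tx1 \
        | tr -s ' ' '\n' \
        | grep -v '^$' \
        | sort | uniq -c | sort -rn \
        | awk '{print $2, $1}'
}

# 'a' is byte 0x61 and occurs twice, so it is listed first.
printf 'aab' | byte_freq | head -n 1
# prints: 61 2

# A lone 0xe9 byte is "é" in Latin-1; UTF-8 would encode it as c3 a9.
printf 'caf\351\n' | byte_freq | grep '^e9 '
# prints: e9 1
```

Unexpected high-bit bytes such as `e9` without a preceding `c3` are a common sign that a file is not valid UTF-8.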

Enhanced JSON Output

# Generate enhanced JSON with data quality assertions

cat data.csv | bytefreq -e

# Flat enhanced JSON (easier parsing)

cat data.csv | bytefreq -E

Documentation

Full documentation, examples, and source code are available on GitHub:

https://github.com/minkymorgan/bytefreq

Need More Power?

If you're working with datasets that exceed millions of rows (billions or trillions of data points), check out our enterprise Spark-based solution:

Explore Enterprise Options