Beta · CSV/Parquet validation in seconds

Turn Messy Datasets Into Model-Ready Data You Can Trust.

Automatically validate your datasets before training with a clear AI Readiness Score™ and actionable fixes.

Most ML failures don’t start in the model. They start in the dataset.

Free during beta · No credit card required
Profiles · Types · Missing
Duplicates · Outliers · Imbalance
AI Readiness Score™
JSON report for CI
dataset_report.html
86
AI Readiness Score™
Open example
Completeness 92% · 3 cols w/ null spikes
Consistency 85% · mixed types detected
Stability 81% · drift risk flagged

Top findings

CRITICAL

label has 7.8% missing values in class “A”. Possible training bias.

WARNING

user_id duplicates detected. Consider deduplication strategy.

OK

timestamp format is consistent across the dataset.

What TrustYourData does

Before you train or deploy, make sure your dataset is ready. Get a structured validation report with an AI Readiness Score™ and clear remediation steps.

Catch data issues before they silently break your models.

No heavy platform. No black-box magic. No randomness.

Task intent in

Validation adapts to what you’re actually trying to do.

  • classification · regression · time_series · analytics
  • target/timestamp/id columns
  • split strategy + constraints
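For illustration, a task intent like the one above could be written to a `task.json` along these lines. Every field name here is an assumption made for the sketch, not the tool's documented schema:

```python
import json

# Hypothetical task-intent file. Field names are illustrative only;
# check the actual schema before relying on them.
task_intent = {
    "task": "classification",   # classification | regression | time_series | analytics
    "target_column": "label",
    "timestamp_column": "timestamp",
    "id_columns": ["user_id"],
    "split": {"strategy": "time_based", "test_fraction": 0.2},
}

with open("task.json", "w") as f:
    json.dump(task_intent, f, indent=2)
```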

Decision artifacts out

Designed as a decision layer for ML workflows.

  • AI Readiness Score™ (0–100)
  • report_confidence (0–100)
  • Hard gating indicators

Actionable remediation

Evidence-backed, prioritized steps — not vague advice.

  • Structured findings (stable IDs)
  • Penalty breakdown by category
  • Top remediation actions (gain/effort)

How it works

A simple, deterministic flow designed for ML engineers.

1. Upload

Provide CSV/Parquet dataset and task intent (JSON).

2. Analyze

Deterministic profiling and task-aware validation checks run.

3. Decide

AI Readiness Score™, hard gates, and confidence are computed.

4. Act

Apply prioritized remediation and gate training in CI.

AI Readiness Score™

Explainable by design. The score is computed using a deterministic penalty model and bounded caps.

readiness_score = 100 − TotalRisk
  1. Findings are produced by deterministic checks (plugins) with stable IDs.
  2. Penalties are computed from severity, category weights, and confidence.
  3. Caps prevent runaway risk (per-category and total).
  4. Hard gating applies when critical risks are detected (e.g., leakage or inference mismatch).
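The capped penalty model above can be sketched in a few lines of Python. The severity penalties, category cap, and total cap below are invented for the illustration, not the product's real parameters:

```python
# Sketch of a capped, deterministic penalty model.
# All weights and caps here are illustrative values.
CATEGORY_CAP = 25.0   # per-category risk ceiling
TOTAL_CAP = 80.0      # total risk ceiling

SEVERITY_PENALTY = {"low": 2.0, "medium": 5.0, "high": 12.0, "critical": 25.0}

def readiness_score(findings):
    """findings: dicts with 'id', 'category', 'severity', 'confidence' (0-1)."""
    per_category = {}
    for f in sorted(findings, key=lambda f: f["id"]):   # stable ordering
        penalty = SEVERITY_PENALTY[f["severity"]] * f["confidence"]
        cat = f["category"]
        per_category[cat] = min(CATEGORY_CAP, per_category.get(cat, 0.0) + penalty)
    total_risk = min(TOTAL_CAP, sum(per_category.values()))
    return round(100.0 - total_risk, 1)

score = readiness_score([
    {"id": "leakage.suspicious_high_association", "category": "leakage",
     "severity": "high", "confidence": 0.9},
    {"id": "quality.high_missingness", "category": "quality",
     "severity": "medium", "confidence": 0.8},
])
```

The caps are why one noisy category cannot drive the score to zero on its own; hard gates (not shown) would sit on top as separate pass/fail conditions.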

What’s included

Findings are grouped by category for operational clarity.

Categories
schema · quality · leakage · split · inference_mismatch · bias_signals
v1
Determinism
stable ordering · fixed sampling policy · golden-test compatible
non-negotiable
Evidence discipline
small stats · tiny samples only · optional strict privacy mode
safe

Built for CI & engineering workflows

Run readiness checks as a gate before training, fine-tuning, or deployment. Emit strict JSON, optionally render HTML, and enforce minimum score thresholds.

CLI

readiness analyze \
  --data dataset.csv \
  --task task.json \
  --out report.json
  • Strict JSON output contract
  • Optional HTML rendering
  • Optional reversible cleaning export

CI gating

  • Fail builds when score < threshold
  • Track regressions over time
  • Enforce data contracts for training/inference
Deterministic outputs
same input → same report
reproducible
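A minimal CI gate over the JSON report might look like the sketch below. It assumes only that the report contains a top-level readiness_score field, as in the example report on this page; the script name and threshold are yours to choose:

```python
import json
import sys

def gate(report_path, min_score=80.0):
    """Return a CI exit code: 0 if the dataset passes the gate, 1 otherwise."""
    with open(report_path) as f:
        report = json.load(f)
    score = report["readiness_score"]
    if score < min_score:
        print(f"FAIL: readiness_score {score} < threshold {min_score}")
        return 1
    print(f"PASS: readiness_score {score} >= threshold {min_score}")
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. python gate_readiness.py report.json 80
    sys.exit(gate(sys.argv[1], float(sys.argv[2]) if len(sys.argv) > 2 else 80.0))
```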

Deployment modes

One scoring engine. Multiple deployment options. From API access to privacy-sensitive and enterprise environments.

API mode (Beta)

The CLI/SDK uploads your dataset to a secure API. The server performs scoring and returns artifacts.

  • Fastest onboarding
  • Great for teams iterating quickly
  • Centralized scoring updates

Privacy mode (planned)

Profiling happens locally. Only structured aggregates are sent for scoring.

  • No raw data leaves your environment
  • Reduced security friction
  • Same deterministic scoring core

Enterprise Runner (planned)

Self-hosted, licensed Runner. Full pipeline runs entirely inside your infrastructure.

  • Data residency & compliance
  • Version pinning
  • Offline operation (optional)

Security & privacy

Designed for predictable behavior and minimal data exposure.

Data handling

  • No model training
  • Encrypted transfer
  • No retention by default (Beta policy)
  • Evidence discipline (small stats; tiny samples only)

Determinism guarantees

  • No randomness or stochastic estimators
  • Stable finding IDs + stable ordering
  • Fixed sampling policy when sampling is required
  • Golden-test compatible outputs

What this is not

Not a feature store · Not a governance suite · Not automated feature engineering · Not a data observability platform

Example report (preview)

A report you can attach to PRs, share in reviews, or use in CI.

{
  "readiness_score": 72,
  "report_confidence": 88,
  "category_risks": {
    "leakage": 10.8,
    "quality": 8.1,
    "schema": 4.2,
    "split": 3.6
  },
  "top_findings": [
    { "id": "leakage.suspicious_high_association", "severity": "high", "column": "future_status_flag" },
    { "id": "quality.high_missingness", "severity": "medium", "column": "income" }
  ],
  "remediation_plan": [
    { "title": "Remove feature 'future_status_flag'", "expected_score_gain": 10.2 },
    { "title": "Impute 'income' missing values", "expected_score_gain": 4.1 }
  ],
  "...": "truncated"
}
See full report
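Because remediation entries carry expected score gains, picking the next fix can be mechanical. A sketch using the example report's values (and assuming, optimistically, that gains compose roughly additively):

```python
# Remediation entries from the example report above.
remediation_plan = [
    {"title": "Remove feature 'future_status_flag'", "expected_score_gain": 10.2},
    {"title": "Impute 'income' missing values", "expected_score_gain": 4.1},
]

# Pick the single highest-impact action first.
best = max(remediation_plan, key=lambda a: a["expected_score_gain"])

# Rough projection only: a capped penalty model means gains
# need not compose exactly additively.
projected = 72 + best["expected_score_gain"]
```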

What you see

  • AI Readiness Score™ + penalty breakdown
  • Top critical issues and warnings
  • Category risks (schema/quality/leakage/...)
  • Remediation plan ranked by gain/effort

What you do next

  • Apply fixes (recommended) with a clear plan
  • Optionally apply safe, reversible auto-fixes
  • Re-run to verify improvements
  • Gate training on a minimum score

Join the free beta

Get early access, influence the roadmap, and lock in launch pricing. Free during beta — no credit card required.