
Inferring JSON Schema from Real Data: When and Why

May 4, 2026 · 6 min read · By the JsonDevKit team

JSON Schema is a vocabulary for annotating and validating JSON documents. It describes the structure of JSON data: what fields exist, what types they hold, which are required, and what constraints apply. Schemas are used in OpenAPI specifications, form generation libraries, configuration validation, and data pipeline contracts. But writing a JSON Schema from scratch is verbose and error-prone. For many use cases, inferring a schema from real data and then refining it is faster and produces more accurate results.

What JSON Schema Does

At its core, a JSON Schema is a JSON document that describes another JSON document. A simple schema for a user object looks like this:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string", "minLength": 1 },
    "email": { "type": "string", "format": "email" },
    "active": { "type": "boolean" }
  },
  "required": ["id", "name", "email"]
}

This schema says: the data is an object with four properties, three of which are required. The id must be an integer, name must be a non-empty string, email must be a valid email format, and active is an optional boolean. Validators like ajv (JavaScript), jsonschema (Python), or any other JSON Schema implementation can check data against this schema and report exactly which fields fail and why.

Manual vs Inferred Schemas

Writing schemas by hand makes sense when you are designing an API from scratch. You know exactly what fields you want, what types they should have, and what constraints to apply. The schema is the specification, and the implementation conforms to it.

But many real-world scenarios work in the opposite direction. You have data — from a third-party API, a legacy database, a CSV import, or a webhook — and you need a schema that describes it. In these cases, the data comes first and the schema is derived from it. Writing the schema by hand means reading through sample responses, counting fields, checking types, and mapping nested structures. For a response with 30+ fields and three levels of nesting, this takes significant time and is easy to get wrong.

Schema inference automates the mechanical parts: it reads the JSON, determines the type of each field, identifies required vs optional fields (based on null values), and generates the nested structure. You then add the human parts: descriptions, constraints, examples, and enum values.

How Inference Works

A schema inference tool analyzes a JSON sample and produces a schema through several steps:

Type detection. Each value is mapped to a JSON Schema type: strings become "string", numbers become "number" or "integer" (depending on whether they have decimal points), booleans become "boolean", null becomes "null", arrays become "array", and objects become "object".

Required fields. If a field has a non-null value in the sample, it is marked as required. Fields with null values are treated as optional with a union type (e.g., {"type": ["string", "null"]}).

Nested structures. Objects within objects become nested schema definitions. The tool recurses into each level.

Array items. For arrays, the tool inspects all items and generates an items schema. If all items have the same shape, you get a single schema. If items vary, the tool may generate a oneOf or fall back to a loose schema.
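The steps above can be sketched in a few lines of JavaScript. This is a simplified illustration of the approach, not the algorithm of any particular tool: it infers array items from the first element only, and it marks null fields as optional with a plain `"null"` type rather than the union type a production tool would emit.

```javascript
// Minimal schema inference sketch (illustrative, not production-grade).
function inferSchema(value) {
  if (value === null) return { type: "null" };
  if (Array.isArray(value)) {
    // Real tools merge the shapes of all items; we look at the first only.
    return { type: "array", items: value.length ? inferSchema(value[0]) : {} };
  }
  switch (typeof value) {
    case "string":
      return { type: "string" };
    case "boolean":
      return { type: "boolean" };
    case "number":
      // Integral numbers become "integer", everything else "number".
      return { type: Number.isInteger(value) ? "integer" : "number" };
    case "object": {
      const properties = {};
      const required = [];
      for (const [key, v] of Object.entries(value)) {
        properties[key] = inferSchema(v); // recurse into nested structures
        if (v !== null) required.push(key); // non-null fields marked required
      }
      return { type: "object", properties, required };
    }
  }
}

console.log(JSON.stringify(inferSchema({ id: 1, tags: ["a"] }), null, 2));
```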

Try it yourself

Paste any JSON and see the inferred schema instantly. Open the JSON Schema Generator →

Example 1: Generating a Schema for OpenAPI Documentation

You are documenting an internal API that has been running for months without a formal specification. You grab a sample response from the /api/products endpoint:

{
  "products": [
    {
      "id": "prod_abc123",
      "name": "Ergonomic Keyboard",
      "price": 129.99,
      "currency": "USD",
      "in_stock": true,
      "categories": ["electronics", "peripherals"],
      "dimensions": {
        "width_cm": 45.0,
        "height_cm": 3.5,
        "depth_cm": 15.0,
        "weight_kg": 0.85
      },
      "created_at": "2025-11-20T08:00:00Z"
    }
  ],
  "total": 142,
  "page": 1,
  "per_page": 20
}

Running this through a schema generator gives you a complete schema with nested definitions for the product, dimensions, and pagination wrapper. You then refine it for OpenAPI: add description fields, set minimum and maximum on numeric fields, add format: "date-time" to timestamps, and add enum values for currency if you know the allowed set.

The inferred schema gives you 80% of the work done in seconds. The remaining 20% is domain knowledge that only you have.
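An excerpt of the refined schema might look like the following. The descriptions, bounds, and currency set are illustrative assumptions; only you know the real constraints for your API:

```json
{
  "price": {
    "type": "number",
    "description": "Unit price in the currency given by `currency`.",
    "minimum": 0
  },
  "currency": {
    "type": "string",
    "description": "ISO 4217 currency code.",
    "enum": ["USD", "EUR", "GBP"]
  },
  "created_at": {
    "type": "string",
    "format": "date-time",
    "description": "Creation timestamp, RFC 3339 format."
  }
}
```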

Example 2: Validating Webhook Payloads

A payment provider sends webhook events to your endpoint. Their documentation is incomplete, so you capture a few real payloads and infer schemas from them:

{
  "event_type": "payment.completed",
  "event_id": "evt_9f8e7d6c",
  "timestamp": "2026-04-30T16:45:00Z",
  "data": {
    "payment_id": "pay_1a2b3c",
    "amount": 9999,
    "currency": "usd",
    "status": "succeeded",
    "customer": {
      "id": "cus_xyz789",
      "email": "buyer@example.com"
    },
    "metadata": {
      "order_id": "ORD-2026-0042",
      "source": "web"
    }
  }
}

The inferred schema tells you the exact shape your webhook handler should expect. You can use it directly with a validation library to reject malformed payloads before processing. This is especially valuable for webhooks because the provider may change their payload format, and validation catches the change immediately instead of letting bad data corrupt your system.

After inference, add constraints: amount should have minimum: 0, currency should be an enum, and event_type should list all known event types.

Example 3: Schema-Driven Mock Data Generation

JSON Schema is not just for validation — it is also a specification that tools can use to generate test data. Once you have a schema for your API response, you can feed it to a mock data generator to create realistic test fixtures.

The workflow is: capture a real response, infer the schema, refine it with constraints and examples, then use a library like json-schema-faker (JavaScript) or hypothesis-jsonschema (Python) to generate hundreds of valid test payloads. Each generated payload will respect the types, required fields, and constraints in your schema.

// Using the inferred + refined schema for test data
import { JSONSchemaFaker } from "json-schema-faker";

const productSchema = {
  type: "object",
  properties: {
    id: { type: "string", pattern: "^prod_[a-z0-9]{6}$" },
    name: { type: "string", minLength: 1, maxLength: 200 },
    price: { type: "number", minimum: 0, maximum: 99999.99 },
    currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
    in_stock: { type: "boolean" },
    categories: {
      type: "array",
      items: { type: "string" },
      minItems: 1,
      maxItems: 5,
    },
  },
  required: ["id", "name", "price", "currency", "in_stock"],
};

// Generate 50 test products
const testProducts = Array.from({ length: 50 }, () =>
  JSONSchemaFaker.generate(productSchema)
);

This approach is far more reliable than hand-writing test fixtures. The schema ensures every generated object is valid, and the constraints produce realistic data rather than random noise.

Try it yourself

Generate Pydantic models from the same JSON for Python validation. Open the JSON to Pydantic tool →

Refining Inferred Schemas

Inference gets the structure right but cannot infer intent. Here are the most important refinements to make:

Add descriptions. Every property should have a human-readable description that explains what it represents. This matters for OpenAPI docs, form generation, and anyone reading the schema.

Add constraints. Inference knows that price is a number, but not that it must be non-negative. Add minimum, maximum, minLength, maxLength, pattern, and format as appropriate.

Add enums. If a string field only takes certain values (like status codes or country codes), add an enum array. The inference tool sees one sample value and cannot know the full set.

Add examples. The examples keyword provides sample values that documentation tools and mock generators can use.

Use $defs for reuse. If the same sub-schema appears in multiple places (like an address object used in both billing and shipping), extract it into $defs and reference it with $ref.

Review required fields. Inference marks a field as required if it appears in the sample. But some fields might be present in this sample and absent in others. Cross-reference with documentation or additional samples.
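The $defs refinement looks like this in practice. The field names here are hypothetical, chosen only to show the shape:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "billing_address": { "$ref": "#/$defs/address" },
    "shipping_address": { "$ref": "#/$defs/address" }
  },
  "$defs": {
    "address": {
      "type": "object",
      "properties": {
        "street": { "type": "string" },
        "city": { "type": "string" },
        "postal_code": { "type": "string" }
      },
      "required": ["street", "city"]
    }
  }
}
```

Extracting the shared sub-schema means a change to the address shape happens in one place instead of two.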

JSON Schema in the Broader Ecosystem

JSON Schema is not an isolated tool. It connects to a wide ecosystem:

OpenAPI. Every OpenAPI 3.x specification uses JSON Schema to define request and response bodies. If you generate a schema from your API responses, you are most of the way to an OpenAPI spec.

Validation libraries. Libraries like ajv (JavaScript, among the fastest JSON Schema validators), jsonschema (Python), and everit-json-schema (Java) validate data against a schema at runtime.

Form generation. Libraries like JSON Forms and react-jsonschema-form generate UI forms directly from a JSON Schema, including labels, input types, and validation messages.

Code generation. Tools can generate TypeScript interfaces, Pydantic models, Go structs, and other typed representations from a JSON Schema, keeping your types synchronized across languages and services.

LLM structured output. JSON Schema is the standard format for defining the expected output shape when calling LLMs with structured output mode. The schema constrains the model to produce valid JSON matching your specification.

Try it yourself

Generate Zod schemas from JSON for TypeScript runtime validation. Open the JSON to Zod tool →

When to Write Schemas by Hand

Inference is the right starting point when you have data and need a schema. But there are cases where hand-writing makes more sense:

Designing new APIs. When you are defining what an endpoint should accept, the schema comes first and the implementation follows. There is no data to infer from yet.

Complex validation logic. JSON Schema supports if/then/else, allOf, oneOf, and dependentRequired for conditional validation. These cannot be inferred from samples.
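For instance, a conditional schema of this kind might look like the following (a hedged sketch with invented field names; no sample set could reveal this rule):

```json
{
  "type": "object",
  "properties": {
    "country": { "type": "string" },
    "postal_code": { "type": "string" }
  },
  "if": {
    "properties": { "country": { "const": "US" } },
    "required": ["country"]
  },
  "then": {
    "properties": { "postal_code": { "pattern": "^[0-9]{5}(-[0-9]{4})?$" } },
    "required": ["postal_code"]
  }
}
```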

Polymorphic data. If an array can contain multiple object types distinguished by a discriminator field, inference will struggle to produce the right oneOf structure.

For everything else — documenting existing APIs, validating incoming data, generating test fixtures, feeding schemas to form builders — starting with inference and refining is the pragmatic approach.

Further Reading

For the complete JSON Schema specification, including advanced features like conditional schemas, recursive schemas, and vocabulary extensions, see the official JSON Schema website.