What is a JSON Schema?

JSON Schema is a declarative language for defining, documenting, and validating the structure of JSON data. A schema specifies required fields, data types, and constraints so that the data conforms to a specific format.
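To make "required fields" concrete, here is a minimal Python sketch that uses only the standard library. The "required" keyword is standard JSON Schema, but whether this product honors it is not stated here. The helper name missing_required and the sample data are made up for illustration.

```python
import json

# A small schema: "name" must be present; "age" is optional.
schema = json.loads("""
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "number"}
  },
  "required": ["name"]
}
""")


def missing_required(schema: dict, data: dict) -> list:
    """Return the required field names that are absent from data."""
    return [key for key in schema.get("required", []) if key not in data]


print(missing_required(schema, {"name": "Ada"}))  # []
print(missing_required(schema, {"age": 36}))      # ['name']
```

This checks presence only. A full validator would also check types and constraints.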

Basic Concepts

Fields

Fields are the basic building blocks of your schema. Each field represents a piece of information you want to extract.

Field Types

Each field must have a type. The available types are:

  • string: For text values (names, descriptions, addresses)
  • number: For numerical values (amounts, counts, measurements)
  • boolean: For yes/no or true/false values
  • enum: For values from a predefined list of options (written as a string field with an "enum" list, as in the Enum Schema example)
  • object: For grouping related fields together
  • array: For lists of items
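As a rough illustration of what each type name means for a parsed value, the sketch below maps the types to Python checks. The function name matches_type is hypothetical and not part of any product API. enum is left out because the examples in this document express it as a keyword on a string field rather than as a type of its own.

```python
def matches_type(value, type_name: str) -> bool:
    """Check a Python value against one schema type name."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "object":
        return isinstance(value, dict)
    if type_name == "array":
        return isinstance(value, list)
    raise ValueError(f"unsupported type: {type_name}")


print(matches_type("Berlin", "string"))  # True
print(matches_type(True, "number"))      # False
print(matches_type([1, 2], "array"))     # True
```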

Return As List

return_as_list returns the extracted data as a list: either a list of objects or a list of strings or numbers. This is useful for extracting data from tables or lists.
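How return_as_list is configured is not shown in this section, so the sketch below only illustrates the two output shapes described above. All field names and values are invented.

```python
# Hypothetical outputs; the values are invented for illustration.
list_of_objects = [
    {"item": "Widget", "quantity": 2},
    {"item": "Gadget", "quantity": 5},
]
list_of_values = ["Widget", "Gadget"]

# A list of objects can be read row by row, like a table.
total_quantity = sum(row["quantity"] for row in list_of_objects)
print(total_quantity)       # 7
print(len(list_of_values))  # 2
```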

Extract Per Page

extract_per_page applies the defined schema to each page separately. This is useful for multi-page documents that have the same structure on every page.
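The output format of per-page extraction is not specified here. Assuming it yields one result object per page, in page order, the results could be consumed along these lines. The field names invoice_number and total are hypothetical.

```python
# Hypothetical per-page results: one object per page, same schema on each.
page_results = [
    {"invoice_number": "A-100", "total": 120.0},
    {"invoice_number": "A-101", "total": 80.5},
]

# Pair each result with a 1-based page number for reporting.
for page_number, result in enumerate(page_results, start=1):
    print(page_number, result["invoice_number"], result["total"])

grand_total = sum(result["total"] for result in page_results)
print(grand_total)  # 200.5
```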

Examples

Basic Schema

Extract a string value.

{
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    }
  }
}

Enum Schema

Extract a status from a predefined list of options.

{
  "type": "object",
  "properties": {
    "status": {
      "type": "string",
      "enum": ["pending", "approved", "rejected"]
    }
  }
}
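
One way to check an extracted value against the enum list, sketched in Python with the standard library only. The function name status_is_allowed and the sample inputs are illustrative, not part of any product API.

```python
import json

# The Enum Schema example, parsed into a Python dict.
schema = json.loads("""
{
  "type": "object",
  "properties": {
    "status": {
      "type": "string",
      "enum": ["pending", "approved", "rejected"]
    }
  }
}
""")


def status_is_allowed(data: dict) -> bool:
    """True when data["status"] is one of the schema's enum options."""
    allowed = schema["properties"]["status"]["enum"]
    return data.get("status") in allowed


print(status_is_allowed({"status": "approved"}))  # True
print(status_is_allowed({"status": "archived"}))  # False
```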

Object Schema

Extract an address object.

{
  "type": "object",
  "properties": {
    "address": {
      "type": "object",
      "properties": {
        "street": {
          "type": "string"
        },
        "city": {
          "type": "string"
        },
        "state": {
          "type": "string"
        },
        "zip": {
          "type": "string"
        }
      }
    }
  }
}
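
A result that matches this schema nests the address fields under a single key. The sketch below uses an invented result to show how the nested values might be read in Python.

```python
# Hypothetical extraction result matching the address schema above.
result = {
    "address": {
        "street": "1 Main St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701",
    }
}

address = result["address"]
# zip is a string in the schema, so leading zeros would be preserved.
one_line = f'{address["street"]}, {address["city"]}, {address["state"]} {address["zip"]}'
print(one_line)  # 1 Main St, Springfield, IL 62701
```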

List Schema

Extract a list of company names.

{
  "type": "object",
  "properties": {
    "company_names": {
      "type": "array",
      "items": {
        "type": "string",
        "description": "The name of the company"
      }
    }
  }
}

Table Schema

Extract a transaction table from account statements.

{
  "type": "object",
  "properties": {
    "transactions": {
      "type": "array",
      "items": {
        "type": "object",
        "description": "A single transaction row",
        "properties": {
          "date": {
            "type": "string",
            "description": "Date of the transaction"
          },
          "type": {
            "type": "string",
            "description": "Type of transaction (deposit, withdrawal, etc.)"
          },
          "amount": {
            "type": "number",
            "description": "Transaction amount"
          },
          "description": {
            "type": "string",
            "description": "Transaction description or merchant name"
          },
          "running_balance": {
            "type": "number",
            "description": "Balance after transaction"
          }
        }
      }
    }
  }
}
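
Rows extracted with this schema can be sanity-checked before further use. The following Python sketch checks each row against the field names and types from the schema. The helper row_problems and the sample rows are invented for illustration.

```python
# Field names and types copied from the table schema above.
ROW_TYPES = {
    "date": str,
    "type": str,
    "amount": (int, float),
    "description": str,
    "running_balance": (int, float),
}


def row_problems(row: dict) -> list:
    """List the fields in a row that are missing or have the wrong type."""
    problems = []
    for field, expected in ROW_TYPES.items():
        if field not in row:
            problems.append(f"{field}: missing")
        elif isinstance(row[field], bool) or not isinstance(row[field], expected):
            problems.append(f"{field}: wrong type")
    return problems


# Hypothetical rows; the values are invented.
good_row = {"date": "2024-01-05", "type": "deposit", "amount": 250.0,
            "description": "Payroll", "running_balance": 1250.0}
bad_row = {"date": "2024-01-06", "type": "withdrawal", "amount": "40.00",
           "description": "ATM"}

print(row_problems(good_row))  # []
print(row_problems(bad_row))   # ['amount: wrong type', 'running_balance: missing']
```

Catching an amount that arrived as text, as in bad_row, is the main reason to declare amount and running_balance as number rather than string.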