API Schemas: Exploring Validation and Code Generation

Andreas Eberhart
10 min read · Jan 5, 2025


Recently, I traveled to API Days in Paris, where I attended the JSON Schema Conference and helped staff the joint JSON Schema, AsyncAPI, OpenAPI, and GraphQL booth. Overall, the vibe around JSON Schema was very positive, but there was some grumbling about problems with code generation. Specifically, people complained about certain schema constructs not being supported by the tooling for their target language. There are also ongoing debates about how the different schema languages compare to each other and which one “is better”. This thread, for instance, debates XML Schema vs. JSON Schema.

There are also alternative schema languages being proposed. JSON Type Definition is a minimal subset of JSON Schema and is designed to “Enable code generation from JTD schemas”. JSON Compact Schema uses some of JSON Schema and “aims for compatibility where possible”, but differs vastly in some areas. The Concise Data Definition Language is yet another alternative and looks more like a grammar.

In this article, we first look at the basics. We evolve a basic API client with the help of OpenAPI and JSON Schema and highlight the problems this solves. The second part identifies common problem areas where implementations tend to behave incorrectly and inconsistently. Finally, we outline some general guidelines for tools to avoid these problems.

OpenAPI Basics

Let’s look at some basics first. With the rise of AJAX and Single Page (JavaScript) Applications, JSON has replaced XML as the go-to serialization format. OpenAPI has also emerged as the de facto standard for describing APIs. Therefore, we focus on calling an OpenAPI JSON service from Java. We will start with a vanilla client and, step by step, introduce validation and code generation to make our lives easier.

Note that the concepts apply equally to clients written in other languages and to code that implements the server side.

cURL

We will use the Petstore’s getUserByName service. We can test the service (using the username: user1) using the Swagger API tooling. The system presents us with a cURL command line to perform the call from a shell. We encode the username in the URL, set an HTTP header, and run a GET request:

curl -X 'GET' \
  'https://petstore.swagger.io/v2/user/user1' \
  -H 'accept: application/json'

Vanilla Java Client

That’s a good start. We can easily translate this to Java:

URLConnection con = new URL("https://petstore.swagger.io/v2/user/user1")
        .openConnection();
con.setRequestProperty("Accept", "application/json");
try (InputStream in = con.getInputStream()) {
    String raw = IOUtils.toString(in, Charset.defaultCharset());
    System.out.println(raw);
}

OpenAPI Description

The Swagger UI is driven by the OpenAPI description of the service:

{
  "host": "petstore.swagger.io",
  "basePath": "/v2",
  "schemes": [ "https" ],
  "paths": {
    "/user/{username}": {
      "get": {
        "operationId": "getUserByName",
        "produces": [ "application/json" ],
        "parameters": [
          {
            "name": "username",
            "in": "path",
            "description": "The name that needs to be fetched. Use user1 for testing. ",
            "required": true,
            "type": "string"
          }
        ],
        "responses": {
          "200": {
            "description": "successful operation",
            "schema": {
              "$ref": "#/definitions/User"
            }
            ...

This description contains a lot of the information we find in our Java client:

  • Scheme, host, and basePath make up the service URL
  • The parameter and the path template describe how to attach the username to the URL
  • We learn that the service answers in JSON and that the response conforms to a schema called #/definitions/User (more on that later)

Our First Stub

A stub hides the details of the underlying network communication and allows accessing the service as if it were local. Using the information of the OpenAPI description, a stub could look like this:

public class Store {
    String scheme;
    String host;
    String basePath;

    public Store(String scheme, String host, String basePath) {
        this.scheme = scheme;
        this.host = host;
        this.basePath = basePath;
    }

    public String getUserByName(String username) throws IOException {
        URLConnection con = new URL(scheme + "://" + host + basePath + "/user/"
                + URLEncoder.encode(username, Charset.defaultCharset())).openConnection();
        con.setRequestProperty("Accept", "application/json");
        try (InputStream in = con.getInputStream()) {
            return IOUtils.toString(in, Charset.defaultCharset());
        }
    }
}

We find service name, operation id, parameter name and type, scheme, host, and basePath in the code. Currently, the method getUserByName simply returns a string. We will fix that in the next section.

Using this stub derived from the OpenAPI description, our client becomes a lot simpler:

new Store("https", "petstore.swagger.io", "/v2").getUserByName("user1")

Interpreting the Result

Let’s assume we need to access the user’s ID. Since we’re not using JavaScript, we need to parse the JSON and access it via some API. In the Java world, this is typically done using the Jackson ObjectMapper. The server response looks like this:

{
  "id": 2020,
  "username": "user1",
  "firstName": "Rita",
  ...
}

The following code parses the result, makes sure it is an object, gets the id field and, if it is present, makes sure it is an integer and prints its value. In other words, it validates the result in order to avoid null pointer or number parsing exceptions:

String res = new Store("https", "petstore.swagger.io", "/v2")
        .getUserByName("user1");
ObjectMapper om = new ObjectMapper();
JsonNode tree = om.readTree(res);
if (!(tree instanceof ObjectNode))
    throw new RuntimeException("expected a JSON object");
JsonNode id = tree.get("id");
if (id == null)
    System.out.println("no id present");
else if (id.isIntegralNumber())
    System.out.println(id.asLong());
else
    throw new RuntimeException("illegal id");

Making Use of JSON Schema

In the service definition above, we saw that the response references a JSON Schema via the pointer #/definitions/User. This schema is defined in the bottom part of the OpenAPI description:

  "definitions": {
"User": {
"type": "object",
"properties": {
"id": {
"type": "integer",
"format": "int64"
},
...
}
}
}

Using this schema, we can enhance our stub by having getUserByName return a Java object instead of a string. The User class is defined according to the JSON Schema. It contains an id getter that returns a 64-bit long value, or null if no id is present:

public class User {
    JsonNode tree;

    public User(String json) throws IOException {
        ObjectMapper om = new ObjectMapper();
        this.tree = om.readTree(json);
        if (!(tree instanceof ObjectNode))
            throw new RuntimeException("expected a JSON object");
    }

    public Long getId() {
        JsonNode id = tree.get("id");
        if (id == null)
            return null;
        if (id.isIntegralNumber())
            return id.asLong();
        else
            throw new RuntimeException("illegal id");
    }
}

The only changes to the stub are the User return type and returning new User(json) rather than json directly:

public User getUserByName(String username) throws IOException {
    URLConnection con = new URL(scheme + "://" + host + basePath + "/user/"
            + URLEncoder.encode(username, Charset.defaultCharset())).openConnection();
    con.setRequestProperty("Accept", "application/json");
    try (InputStream in = con.getInputStream()) {
        String json = IOUtils.toString(in, Charset.defaultCharset());
        return new User(json);
    }
}

Since the checks are now performed by the stub, our client can be simplified to:

new Store("https", "petstore.swagger.io", "/v2")
.getUserByName("user1")
.getId()

What Did We Achieve So Far?

So far this looks pretty good! Our client code is compact and convenient. This is achieved by two core concepts:

Code Generation

In the sections above, we wrote a simplified stub by hand, based on the information we found in the OpenAPI description. In practice, there are a number of tools available for this job.

Code generation dramatically reduces the amount of code required to write the client. We can call services without having to worry about the details of HTTP methods and headers, and we can access data without having to worry about the internal JSON structure.

Validation

Apart from the HTTP paths and parameters, the schema information allows us to validate the response. It’s good to trust the server to deliver the data in the structure that was promised, but it’s even better to double check. Therefore, the manual versions of the client contain checks about the response being an object and the id (if present) being an integer. The generated code not only provides a nice way of accessing the data, it also validates the schema so the client does not have to.

Great! So What’s the Problem?

So far things look great. Why are people complaining then? Obviously the example so far is very simplistic. Let’s look at some common issues:

Mapping Primitive Types

Every runtime environment has built-in primitive types that must be mapped to JSON types in order for the stub API to feel natural. Schemas often use the format keyword to provide hints to implementations, especially when dealing with date/time values. Runtimes also have certain rules about which types can be cast to others. A SQL database, for example, may treat the numbers 0 and 1 as false and true.

This is a common source of inconsistencies. In our example, id has the format int64 and is therefore mapped to a 64-bit long value. If the code generation library chose a 32-bit int, the stub API might not be able to process a valid response.
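A minimal sketch of this failure mode (the value is made up, but legal for an int64 field):

// An id declared with "format": "int64" may exceed the 32-bit range.
long idFromServer = 4_000_000_000L;

// A generator that mapped the field to a 32-bit int either has to reject the
// value or silently corrupts it through overflow:
int mappedId = (int) idFromServer;
System.out.println(mappedId); // prints -294967296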

Required Values

Programming languages have different ways of expressing that a value is required. In Java, an integral value can be represented as a primitive long or as a Long object. The difference is that a Long may be null, while a long is always present and defaults to 0. In our example, id is not a required field of the user object. Therefore, the return type of getId was chosen to be the nullable Long.
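To illustrate with the stub from above (a small sketch reusing the classes we wrote earlier):

Long id = new Store("https", "petstore.swagger.io", "/v2")
        .getUserByName("user1")
        .getId();

// Long can represent a missing id as null ...
if (id == null)
    System.out.println("no id present");

// ... whereas the primitive long cannot: a generator that chose long would have
// to default a missing id to 0, indistinguishable from a real id of 0.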

Open vs. Closed

By default, a JSON Schema object is still valid if the server adds an unknown key. This is a common situation when a new version of an API adds a new field to a response. XML Schema, by contrast, treats an element with unknown child elements as invalid by default and requires this behaviour to be overridden explicitly with xsd:any.

In runtime environments, this behaviour varies between implementations. Consider our Jackson JSON parser. We wrote the User class by adding a getter around the raw JSON node object. Alternatively, Jackson offers a mapping to plain old Java objects. We can define the User class as follows:

public class User {
    public Integer id;
    public String username;
    ...
}

A JSON stream can then be parsed to the User object as follows:

User user = om.readValue(json, User.class);

This looks convenient; however, the semantics differ from JSON Schema because additional fields will cause parsing errors. To fix this, we have to add the following class annotation:

@JsonIgnoreProperties(ignoreUnknown = true)
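For illustration, here is a sketch of the annotated class parsing a response that contains a field unknown to this client version (the extra nickname field is made up):

@JsonIgnoreProperties(ignoreUnknown = true)
public class User {
    public Integer id;
    public String username;
}

// A newer server version may add fields this client does not know about yet.
String json = "{\"id\":2020,\"username\":\"user1\",\"nickname\":\"rita\"}";

// Without the annotation, readValue throws an UnrecognizedPropertyException here;
// with it, the unknown field is silently dropped.
User user = new ObjectMapper().readValue(json, User.class);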

Compositional Types

Consider a scenario where an email might be a single string or an array of strings. In JSON Schema, this can be defined as:

"email": {
"anyOf": {
{"type": "string"},
{"type": "array", "items": {"type": "string"}}
}
}

Mapping this structure to code is not that straightforward. A common pattern is for generators to expose methods to detect the content (also see this example of the Corvus .NET toolkit dealing with an anyOf type for otherNames):

if (object.email.isString()) {
    String s = object.email.asString();
}
if (object.email.isStringArray()) {
    String[] s = object.email.asStringArray();
}
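One way such a wrapper could be implemented on top of the raw Jackson tree; a sketch, where the class name EmailValue and its methods are made up to mirror the hypothetical generated API above:

public class EmailValue {
    private final JsonNode node;

    public EmailValue(JsonNode node) {
        this.node = node;
    }

    public boolean isString() {
        return node.isTextual();
    }

    public String asString() {
        return node.asText();
    }

    public boolean isStringArray() {
        // An array only counts if every element is a string, matching the schema branch.
        if (!node.isArray())
            return false;
        for (JsonNode item : node)
            if (!item.isTextual())
                return false;
        return true;
    }

    public String[] asStringArray() {
        String[] result = new String[node.size()];
        for (int i = 0; i < node.size(); i++)
            result[i] = node.get(i).asText();
        return result;
    }
}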

Inheritance

Class hierarchies are common in object-oriented software. When serializing these objects to JSON, a discriminator field is typically used to identify the type of an instance. Let’s assume we have a base class Animal and a subclass Cat. The server can use the field petType to signal which type of animal we are dealing with:

{
  "id": 12345,
  "petType": "Cat"
}

This allows code generators to create the type hierarchy accordingly:

Animal animal = res.getAnimal();
if (animal instanceof Cat) {
    Cat cat = (Cat) animal;
    ...
}

Note that the discriminator is currently defined as an OpenAPI-specific extension to JSON Schema.
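With Jackson, for instance, the discriminator can be mapped onto the class hierarchy using polymorphic type handling; a sketch, based on the hypothetical Animal and Cat classes:

// The petType field selects the concrete subclass during deserialization.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "petType")
@JsonSubTypes({ @JsonSubTypes.Type(value = Cat.class, name = "Cat") })
public class Animal {
    public long id;
}

public class Cat extends Animal {
}

Animal animal = om.readValue("{\"id\":12345,\"petType\":\"Cat\"}", Animal.class);
System.out.println(animal instanceof Cat); // prints true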

Non-Structural Constraints

Schema languages allow defining constraints on legal value ranges. We can limit the pattern and length of strings, restrict the size of arrays, and much more. We call these non-structural constraints because they focus on value ranges rather than on the structure of objects.

An approach taken up by some tools is to map these constraints to a constraint system in the respective runtime. In Java, the Bean Validation framework is a candidate.

Let’s consider a user property aboutMe which should be a string between 10 and 200 characters:

"aboutMe": {
"type": "string",
"description": "About Me must be between 10 and 200 characters",
"minLength": 10,
"maxLength": 200
}

This translates to the following annotation:

public class User {

    @Size(min = 10, max = 200,
          message = "About Me must be between 10 and 200 characters")
    private String aboutMe;
}
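At runtime, a Validator from the Bean Validation API can then check instances against these annotations; a sketch, assuming User also gained a setter for aboutMe:

Validator validator = Validation.buildDefaultValidatorFactory().getValidator();

User user = new User();
user.setAboutMe("too short"); // 9 characters, violates the minimum of 10

// Each violated constraint produces a ConstraintViolation carrying the message above.
Set<ConstraintViolation<User>> violations = validator.validate(user);
for (ConstraintViolation<User> violation : violations)
    System.out.println(violation.getMessage());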

Commonalities

All of these examples show that it is far from trivial to map a schema language designed to work with any runtime to the types and frameworks of a specific runtime. In this context, mapping means two things:

First, the validation behaviour must be the same.

Second, the generated code must provide programmatic access to all parts of a valid message. For example, our User plain old Java object loses any additional properties sent by the server. This is a case where parts of a valid message can no longer be read by the client.

To a tool implementer, this may not seem like a big deal, and it is certainly tempting (and generally a good idea) to use existing, well-established frameworks such as the Jackson JSON library from our examples. However, one has to carefully consider the long-term effects these design decisions have under certain circumstances, for example when an older client version communicates with a newer server that sends along additional data.
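If a plain-old-Java-object mapping is used anyway, Jackson can at least retain unknown fields rather than dropping them, for example via @JsonAnySetter and @JsonAnyGetter; a sketch:

public class User {
    public Integer id;
    public String username;

    // Fields not declared above are collected here instead of being lost,
    // so an older client can still inspect them or pass them through.
    private final Map<String, Object> additionalProperties = new LinkedHashMap<>();

    @JsonAnySetter
    public void setAdditionalProperty(String name, Object value) {
        additionalProperties.put(name, value);
    }

    @JsonAnyGetter
    public Map<String, Object> getAdditionalProperties() {
        return additionalProperties;
    }
}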

Guidelines for Avoiding These Issues

It seems that trying to kill two birds with one stone, i.e. solving JSON data access and validation in a single step, is the root cause of many problems. Why not separate the two? There are plenty of great JSON Schema validators out there that are fully compliant with the specification.

Once a message passes validation, we can worry about presenting the data to the application. A rule of thumb should be to never use a type that is more restrictive than its schema counterpart. It is probably also a good idea to retain the raw data and expose it via the API as a last resort, rather than forcing it into an ill-fitting shape and losing information in the process.
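As a sketch of this two-step approach, here using the networknt json-schema-validator as one example of a dedicated validator (the library choice and the file name user.schema.json are illustrative, not prescribed):

// json holds the raw response string returned by the stub.
ObjectMapper om = new ObjectMapper();
JsonNode response = om.readTree(json);

// Step 1: validate against the original, unmodified JSON Schema.
JsonSchemaFactory factory = JsonSchemaFactory.getInstance(SpecVersion.VersionFlag.V7);
JsonSchema schema = factory.getSchema(new FileInputStream("user.schema.json"));
Set<ValidationMessage> errors = schema.validate(response);
if (!errors.isEmpty())
    throw new RuntimeException("invalid response: " + errors);

// Step 2: only now present the data to the application, keeping the raw tree
// around as a fallback for anything the typed view cannot express.
User user = new User(json);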

For high-volume applications, there might be a performance penalty unless a high-performance validator is available. For other scenarios, however, the impact should be minimal and the benefits far outweigh this cost.

Finally, truth be told, the rapid pace of development on JSON Schema certainly does not help with stability and conformance in the JSON Schema ecosystem. This is actually another reason to use a dedicated validation library when writing code generators: it allows you to focus on good type mappings.

What are your thoughts on this issue? We are planning to look deeper into good API patterns that can be generated from JSON Schema.
