Case Study: Data Serialization in a DropWizard Application Using Apache Avro
By David Carr

When adopting a service-oriented architecture, one of the things we needed to do was define the data format each service endpoint takes as input and produces as output. In some cases, the input is simple and easily represented as URL segments or query parameters. In most cases, however, either the request body or response body needs to deal with a richer data object. For requests, this data needs to be parsed, validated, and coerced into a format that the application logic can use. For responses, we need to be able to generate the desired data format from the objects generated by the application logic.

There are many approaches to this sort of problem. A common approach is to use either JSON or XML as the serialization format, and a library to bind said serialization format to Java objects; GSON, Jackson, XStream, and JAXB are all examples of libraries that support this sort of approach. If your service is being used directly by web browser clients, or is publicly accessible, you may have to use either JSON or XML, in which case the remainder of this post may not be applicable. For communications between internal services, however, you may have more flexibility in choosing a serialization format.

There are a number of libraries available that use a binary serialization format. Examples include Protocol Buffers, Apache Thrift, Apache Avro, and Kryo. While the approaches used in these libraries vary quite a bit, in general they use a binary format for greater efficiency in serialization, network transfer, and deserialization. Depending on the library, they may also provide tools to generate classes from some form of schema definition, capabilities for schema evolution and handling schema mismatches, cross-platform support, or support for generic data.

The remainder of this post will focus on Apache Avro in particular, as that is the library I chose to evaluate in more detail. It seemed like a reasonably modern, active library in the space, with a rich feature set and decent design approach. The main differentiating features that interested me included:

  • Support for both code generation (called “specific data” in Avro parlance) and dynamic typing (called “generic data” in Avro parlance); a brief sketch of the difference follows this list
  • The schema definition is always available, allowing more intelligent handling of mismatches and eliminating the need to manage manually-assigned field IDs
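To make that first distinction concrete, here's a minimal sketch; the inline schema is a trivial placeholder, and User stands in for a hypothetical Avro-generated class:

```groovy
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord

// Generic data: no generated classes; fields are addressed by name at runtime
Schema schema = new Schema.Parser().parse(
    '{"type": "record", "name": "User", "fields": [{"name": "name", "type": "string"}]}')
GenericRecord genericUser = new GenericData.Record(schema)
genericUser.put('name', 'Alice')

// Specific data: a class generated from the same schema gives typed,
// compile-time-checked accessors (User is a hypothetical generated class)
User specificUser = User.newBuilder().setName('Alice').build()
```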

In our case, we piloted Avro usage in a DropWizard-based service written mostly in Groovy with a Gradle build. Initially, the service had used the built-in Jackson serialization to/from JSON. When porting the resources to using Avro, we:

  • created Avro schema files for the different data types that were used (an illustrative schema appears after this list)
  • added the gradle-avro-plugin to our build process, to generate Java classes from the schema files (see the build snippet below)
  • configured MessageBodyReaders and MessageBodyWriters for the Avro-generated classes (sketched below)
  • updated our resources to use the Avro-generated classes as request method parameters and response entities (example below)
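For illustration, a minimal Avro schema file (.avsc) might look like the following; the record name and fields are hypothetical, not taken from the actual service:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.dto",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```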
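Wiring the plugin into the build takes just a few lines. The plugin id and versions below are illustrative (they've changed across releases), so check the gradle-avro-plugin README for current coordinates:

```groovy
// build.gradle (Groovy DSL)
plugins {
    id 'groovy'
    id 'com.github.davidmc24.gradle.plugin.avro' version '1.9.1' // id/version vary by release
}

dependencies {
    // The Avro runtime is required by the generated classes
    implementation 'org.apache.avro:avro:1.11.3'
}

// By default the plugin generates Java classes from schemas under src/main/avro
```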
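Here's a sketch of what the writer side of that looks like (the reader is analogous), assuming the javax.ws.rs API that DropWizard exposes; this is a simplified version, not our exact implementation:

```groovy
import org.apache.avro.io.BinaryEncoder
import org.apache.avro.io.EncoderFactory
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.avro.specific.SpecificRecord

import javax.ws.rs.core.MediaType
import javax.ws.rs.core.MultivaluedMap
import javax.ws.rs.ext.MessageBodyWriter
import java.lang.annotation.Annotation
import java.lang.reflect.Type

// Generic over any Avro "specific record"; concrete subclasses bind a media type
abstract class AbstractAvroMessageBodyWriter<T extends SpecificRecord> implements MessageBodyWriter<T> {
    private final Class<T> type

    protected AbstractAvroMessageBodyWriter(Class<T> type) { this.type = type }

    boolean isWriteable(Class<?> clazz, Type genericType, Annotation[] annotations, MediaType mediaType) {
        type.isAssignableFrom(clazz)
    }

    long getSize(T t, Class<?> clazz, Type genericType, Annotation[] annotations, MediaType mediaType) {
        -1L // length not known in advance
    }

    void writeTo(T t, Class<?> clazz, Type genericType, Annotation[] annotations,
                 MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream out) {
        def datumWriter = new SpecificDatumWriter<T>(type)
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null)
        datumWriter.write(t, encoder)
        encoder.flush()
    }
}
```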
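A resource method can then take and return the generated types directly. Everything below (the path, media type, and User class) is illustrative:

```groovy
import javax.ws.rs.Consumes
import javax.ws.rs.POST
import javax.ws.rs.Path
import javax.ws.rs.Produces

@Path('/users')
@Consumes('application/vnd.example.user+avro')
@Produces('application/vnd.example.user+avro')
class UserResource {
    @POST
    User create(User request) {
        // glue code converts to/from domain classes (see rough edge 2 below)
        User.newBuilder(request).build()
    }
}
```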

This went pretty smoothly. There are currently three rough edges on the service side:

  1. While all of the logic for interacting with Avro for different “specific records” is generic (and, indeed, we implemented it all in generic abstract MessageBodyReader/Writer classes), associating a specific class with a specific media type appears to require a class definition with the appropriate generic parameter and annotations; a sketch of this boilerplate follows this list. Ideally, I’d like to be able to register these without needing a class definition for each media type. This appears to be a JAX-RS/Jersey limitation.
  2. We needed some glue code to convert messages from the generated classes to our corresponding domain classes and back again. I’ve come to accept that there isn’t a magic bullet to completely eliminate the need for some form of glue when it comes to data serialization. One approach is to use convention within your domain classes to specify a mapping to the serialization format, along with annotations or some external configuration to deal with cases where the convention is violated. Another approach is to configure the mapping completely, whether via a custom serializer instance or a configuration file. For now, the approach we’re using is writing methods to convert between the different objects; an example follows this list.
  3. Since the entities used are Avro-generated (meaning we can’t easily add annotations to them), we can’t use DropWizard’s built-in validation system. This hasn’t been much of an issue for us yet. If it becomes an issue, we could consider switching to Avro’s support for reflection-based mapping instead of using generated classes.
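To illustrate the boilerplate in point 1, each media type ends up needing a small concrete class along these lines (names are hypothetical):

```groovy
import javax.ws.rs.Produces
import javax.ws.rs.ext.Provider

// Exists solely to bind a generic parameter and a media type annotation together
@Provider
@Produces('application/vnd.example.user+avro')
class UserAvroWriter extends AbstractAvroMessageBodyWriter<User> {
    UserAvroWriter() { super(User) }
}
```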
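And the glue methods from point 2 are plain and mechanical; a sketch, with DomainUser standing in for a hypothetical domain class:

```groovy
// Hand-written glue between the Avro-generated User and a domain class
class UserConverter {
    static DomainUser toDomain(User avro) {
        new DomainUser(name: avro.name, email: avro.email)
    }

    static User toAvro(DomainUser user) {
        User.newBuilder().setName(user.name).setEmail(user.email).build()
    }
}
```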

When we worked on adding Avro support to the client application, we found that we wanted to use Avro-generated classes as well. Instead of generating them in the client application’s build, we refactored the service’s build to package the generated classes as a JAR and published it in our internal Maven repository. Whenever the client application needed an updated copy of the JAR, we simply updated the version of the service’s “DTO” component in the client’s build file, and it would then pull down the generated classes for the updated Avro schemas.
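In Gradle terms, the service build publishes the artifact along roughly these lines; the coordinates and repository URL are illustrative, and the exact DSL depends on your Gradle version:

```groovy
apply plugin: 'maven-publish'

publishing {
    publications {
        dto(MavenPublication) {
            groupId = 'com.example'
            artifactId = 'user-service-dto'
            from components.java // the JAR containing the Avro-generated classes
        }
    }
    repositories {
        maven { url 'https://repo.example.com/internal' }
    }
}
```

The client application then depends on it like any other library, e.g. com.example:user-service-dto:1.2.0.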

In the client application, we’re using Netflix Feign with the feign-avro extension as the service client. That’s working out really well. Occasionally, we have a client-side need for a method or two in the generated classes to manipulate data in a specific way. We expect that Groovy extension modules will work well for that.
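For reference, a Feign client for one of these endpoints looks roughly like the following. Note that AvroEncoder and AvroDecoder are my shorthand for whatever encoder/decoder classes the feign-avro extension provides (I'm not reproducing its exact API here), and the interface, media type, and URL are illustrative:

```groovy
import feign.Feign
import feign.Headers
import feign.RequestLine

interface UserClient {
    @RequestLine('POST /users')
    @Headers('Content-Type: application/vnd.example.user+avro')
    User create(User request)
}

// AvroEncoder/AvroDecoder stand in for the feign-avro extension's classes
UserClient client = Feign.builder()
        .encoder(new AvroEncoder())
        .decoder(new AvroDecoder())
        .target(UserClient, 'http://user-service.internal:8080')
```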

Generally, I like how the solution is working. It’s really nice to get IDE feedback on whether you’re using the right types and fields. It’s also nice to be able to specify defaults in the schema to control how missing fields should be handled, and to detect at runtime whether a mismatched schema is incompatible with your application’s expectations, without having to write lots of checks for missing fields. When you do have a schema mismatch, the errors that Avro gives can sometimes be a little tricky to figure out. At development time, it can be convenient to also have JSON available so you can look at the data that’s being read/written.
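That runtime detection can be done with Avro's SchemaCompatibility helper (available as of Avro 1.7.7); here's a sketch with placeholder schemas:

```groovy
import org.apache.avro.Schema
import org.apache.avro.SchemaCompatibility

// The reader adds an "email" field with a default, so older data stays readable
Schema writer = new Schema.Parser().parse(
    '{"type": "record", "name": "User", "fields": [{"name": "name", "type": "string"}]}')
Schema reader = new Schema.Parser().parse(
    '{"type": "record", "name": "User", "fields": [' +
    '{"name": "name", "type": "string"},' +
    '{"name": "email", "type": "string", "default": ""}]}')

def result = SchemaCompatibility.checkReaderWriterCompatibility(reader, writer)
assert result.type == SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE
```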

