DDIA Chapter 4 - Encoding and Evolution
This chapter introduces two concepts: encoding, how data structures are represented as bytes on disk or over the network, and evolution, how the structure of the data (i.e. the schema) can change over time while maintaining backward and forward compatibility.
Backward and Forward Compatibility
In a modern distributed application, an upgrade involving a schema change typically cannot be deployed to all components of the system simultaneously. In addition, you might not be able to take down your database to rewrite all the old data into the new schema. So you may have components running new code reading data in the old schema, and potentially components running old code reading data that uses the new schema. In these cases, both backward and forward compatibility are desirable.
- Backward Compatibility: Newer code can read the data written by older code.
- Forward Compatibility: Older code can read the data written by newer code.
Different ways of encoding data may allow for either backward or forward compatibility or both (or neither).
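As a toy illustration (plain Python dicts standing in for decoded records; the "email" field is hypothetical), the two directions of compatibility look like this:

```python
# Hypothetical example: newer code added an optional "email" field.

# Backward compatibility: newer code reads a record written by older code,
# which lacks the new field, and falls back to a default.
old_record = {"userName": "Martin"}
email = old_record.get("email")  # None if the older writer didn't include it

# Forward compatibility: older code reads a record written by newer code
# and simply ignores the fields it doesn't know about.
new_record = {"userName": "Martin", "email": "martin@example.com"}
known_fields = {"userName"}
visible = {k: v for k, v in new_record.items() if k in known_fields}
```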
Encoding
Language Specific Formats
- Examples: java.io.Serializable for Java or pickle for Python.
- Usually very convenient, but they lock you into a specific programming language for both reading and writing.
- Can introduce security risks, as decoding needs to allow instantiation of arbitrary object types (see the pickle sketch below).
- Usually do not provide a means for backward or forward compatibility.
- Can be slow.
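A minimal pickle sketch showing both the convenience and the risk: one call serializes an arbitrary object graph, but `pickle.loads` will happily instantiate arbitrary classes, so it must never be used on untrusted input.

```python
import pickle

class Person:
    def __init__(self, user_name, favorite_number):
        self.user_name = user_name
        self.favorite_number = favorite_number

# Convenient: one call serializes an arbitrary object graph...
data = pickle.dumps(Person("Martin", 1337))

# ...but only Python code can read it back, and unpickling untrusted
# bytes can execute arbitrary code while reconstructing objects.
restored = pickle.loads(data)
print(restored.user_name, restored.favorite_number)
```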
JSON, XML and Binary Variants
JSON, XML and CSV are textual formats that produce (somewhat) human-readable data. However, they may have issues with encoding numbers (e.g. distinguishing an integer from a float), the schema representation can be quite complicated, and they typically result in a large file size, especially when the data contains many numbers.
JSON and XML have binary variants (e.g. MessagePack and BSON) that are somewhat more efficient at storing numbers, but they typically still store every field name for every record.
Thrift and Protocol Buffers
Apache Thrift was originally developed at Facebook, whereas Protocol Buffers originated at Google.
Thrift: The Missing Guide is a better intro to Thrift than the official documentation.
Both define an interface definition language (IDL) used to describe a data structure, as well as compilers for different programming languages that take an IDL definition and generate code to read and write data in that format.
Avro
Apache Avro is a binary encoding format that grew out of the Hadoop ecosystem.
Parquet
Apache Parquet differs from Avro in that Parquet is column-oriented whereas Avro is row-oriented. This makes Parquet an efficient way to store and retrieve large data sets and is especially useful for data analysis applications. Being column-oriented, it can use more efficient compression schemes for each column, since the values stored contiguously all have a homogeneous data type.
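A minimal sketch of column-oriented storage, assuming the pyarrow package is installed (the file name is just an example): write a small table, then read back only the column you need.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table; Parquet stores each column's values contiguously.
table = pa.table({
    "userName": ["Martin", "Ada"],
    "favoriteNumber": [1337, 42],
})
pq.write_table(table, "people.parquet")

# Analytical queries can read just the columns they need.
favorites = pq.read_table("people.parquet", columns=["favoriteNumber"])
print(favorites)
```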
Example: Storing a struct in different binary formats
Example record, which we'll encode in various binary formats:
{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}
With whitespace removed, the textual JSON encoding takes 81 bytes.
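A quick check of that figure in Python:

```python
import json

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Compact separators drop all whitespace from the encoding.
compact = json.dumps(record, separators=(",", ":")).encode("utf-8")
print(len(compact))  # 81
```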
MessagePack example
The MessagePack encoding is 66 bytes, only about 20% smaller than the textual representation. The field names are included for every record, bloating the size.
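A sketch of the same record with MessagePack, assuming the msgpack package is installed:

```python
import msgpack

record = {"userName": "Martin", "favoriteNumber": 1337, "interests": ["daydreaming", "hacking"]}

# Binary encoding, but every field name is still written out.
packed = msgpack.packb(record)
print(len(packed))  # 66
```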
Thrift example
Thrift IDL:
struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}
The Thrift Compact Protocol encodes the struct in 34 bytes. It uses a variable-length integer encoding, so 1337 is encoded in 2 bytes rather than 8. It also stores field tags instead of field names, as well as a code identifying the type of each field. (Protocol Buffers use a similar encoding.)
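A minimal sketch of that variable-length integer encoding: zigzag maps small positive and negative values to small unsigned values, and the varint then uses 7 payload bits per byte, with the top bit flagging that more bytes follow. (The Thrift compact protocol applies zigzag to its integer types; Protocol Buffers applies it only to sint fields.)

```python
def zigzag(n: int) -> int:
    # Interleave positive and negative values: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    # 7 data bits per byte, least-significant group first;
    # the high bit marks "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

print(len(varint(zigzag(1337))))  # 2 bytes instead of a fixed 8-byte i64
```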
Avro example
Avro IDL:
record Person {
  string userName;
  union { null, long } favoriteNumber = null;
  array<string> interests;
}
The Avro encoding for this struct is 32 bytes. Like Thrift, Avro encodes integers with a variable-length encoding. Unlike Thrift and Protocol Buffers, there are no field tags identifying each field or codes identifying its type. Instead, the code reading the data relies on the schema to interpret the bytes, so it must know the exact schema with which the data was written.
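A sketch of the same encoding using Python's fastavro library (an assumption; any Avro implementation would do). schemaless_writer emits just the record body, with no header and no field names or tags:

```python
import io
from fastavro import parse_schema, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
})

record = {"userName": "Martin", "favoriteNumber": 1337, "interests": ["daydreaming", "hacking"]}

buf = io.BytesIO()
schemaless_writer(buf, schema, record)  # values only; the reader needs the schema
print(len(buf.getvalue()))  # 32
```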
Schema Evolution
Field tags are critical to the encoding of data for both Thrift and Protocol Buffers. Field names can be changed without impact, but field tags cannot be changed without invalidating data already written.
- Forward compatibility: a new schema can add a field (using a new field tag); old code simply skips the bytes following an unrecognized field tag.
- Backward compatibility: a new field can be added as long as it is optional; otherwise new code reading old data would not find the new required field.
- Removing a field: to maintain both forward and backward compatibility, the removed field must be optional, and its field tag can never be reused.
Avro handles schema evolution by storing the writer's schema in the file. When code with a newer schema reads the file, it resolves its own (reader's) schema against the writer's schema found in the file. To maintain compatibility, only fields with a default value can be added or removed.
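A sketch of that resolution with fastavro (again an assumption): data written with an old Person schema is read with a newer schema that adds favoriteNumber with a default, and the reader fills in the default.

```python
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

old_schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
})

new_schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
})

# Old code wrote this record with the old schema...
buf = io.BytesIO()
schemaless_writer(buf, old_schema, {"userName": "Martin"})
buf.seek(0)

# ...and new code reads it by resolving the writer's schema against its own.
print(schemaless_reader(buf, old_schema, new_schema))
# {'userName': 'Martin', 'favoriteNumber': None}
```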
Modes of Dataflow
Data can flow between processes in different ways. As these processes don’t share memory, the data must be encoded as a sequence of bytes to move from one process to the other.
Dataflow through Databases
The process writing to the database encodes the data and the process reading from the database decodes it.
Dataflow through Services: REST and RPC
The client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response.
Message-passing Dataflow
In asynchronous message passing (using message brokers or actors), nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient.