Transpiling protobuf definitions to runtypes in TypeScript

Written by Hikmat Gurbanli and Henri Normak

In this blog post, we discuss an approach we have been actively using to generate runtypes from protobuf definitions. As this approach is quite specific to our use case, we have not published it as an open-source library, although we might in the future. In this post, we look at the parts of the process independently of the library they belong to.

What is protobuf?

Protocol buffers provide a language-neutral, platform-neutral, extensible mechanism for serializing structured data in a forward-compatible and backward-compatible way. It’s like JSON, except it’s smaller and faster, and it generates native language bindings.

If you’ve been following our posts, you will know that Kafka plays a central role in our distributed event-driven system.

For message payloads, we have chosen protobuf as our data format. This is due to its efficient serialization, which reduces the disk usage of our Kafka messages, as well as its support for types and structured messages. The latter plays a key part in the topic of this blog post.

What are runtypes?

We have previously discussed why we have chosen TypeScript. That said, one of the properties of TypeScript is type erasure — when the code runs it is still just plain old JavaScript with all of the issues that this brings about.

One of these issues is data integrity and validation: how do we ensure that the structure of the payloads we receive via API requests and responses, or from Kafka messages, matches the types we have defined in our TypeScript code? This is where runtime type validation libraries come in, providing a way to define types that are usable both during compilation and for runtime validation.

While there are several such data validation libraries out there, we have chosen Runtypes. Most of the other libraries work in a similar way, so for the rest of this post, you can assume that “runtypes” refers to any such library of your choosing, while the rest of the approach remains the same.
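To make the idea concrete, here is a minimal hand-rolled sketch of what such a library provides: a single definition that yields both a runtime check and a compile-time type. The names Runtype, Static, str and record are hypothetical stand-ins written for this example, not the API of Runtypes or any other library.

```typescript
// Hand-rolled stand-in for a validation library. The names Runtype, Static,
// str and record are hypothetical and only loosely mirror such libraries.
interface Runtype<T> {
  // Returns the value typed as T when it is valid, throws otherwise.
  check(value: unknown): T;
}

// Derives the compile-time type from a runtime definition.
type Static<R> = R extends Runtype<infer T> ? T : never;
type Checked<S> = { [K in keyof S]: Static<S[K]> };

const str: Runtype<string> = {
  check(value: unknown): string {
    if (typeof value !== "string") {
      throw new TypeError("expected a string");
    }
    return value;
  },
};

function record<S extends { [key: string]: Runtype<unknown> }>(
  shape: S,
): Runtype<Checked<S>> {
  return {
    check(value: unknown): Checked<S> {
      if (typeof value !== "object" || value === null) {
        throw new TypeError("expected an object");
      }
      const input = value as { [key: string]: unknown };
      for (const key in shape) {
        shape[key].check(input[key]); // validate every declared field
      }
      return value as Checked<S>;
    },
  };
}

// One definition yields both the validator and the static type.
const Shipment = record({ tracking_number: str });
type Shipment = Static<typeof Shipment>; // { tracking_number: string }
```

Calling Shipment.check on a parsed Kafka payload either returns a value the compiler treats as a Shipment or throws, which is the guarantee the rest of this post builds on.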

Why transpile protobufs to runtypes?

Runtypes libraries work by offering an API to define a data structure, against which data is validated at runtime. This means that in order to provide this for our protobuf messages, we must keep the two definitions (protobufs and runtypes) in sync. The easiest way to do that is to generate one from the other, i.e. to avoid the human error involved in converting protobuf messages to runtypes structures by hand.

This transpilation could also go the other way around, from runtypes to protobufs. However, as we have previous experience with generating TypeScript code using the compiler API, it seemed logical for us to do the generation in this direction. It also aligns with some of our other internal tooling, for example for generating runtypes from OpenAPI definitions for API servers. We might talk more about that in the future.

Requirements for transpiling protobufs to runtypes

As our needs only cover parts of the protobuf syntax, the library we have built does not support every protobuf feature. To scope the tool, the protobuf features we need are as follows:

- Package definitions — we use these as means of namespacing messages, for example by topic
- Message definitions — we use these for individual messages
- Enum definitions
- Scalar field types
- Map and repeated field types
- Using message/enum as field type
- Nested messages or enums
- Import statements
- Optional fields — we built this library prior to proto3 support for optional fields, thus we derived our own approach here

Here is the example set of protobufs we’ll use for the remainder of this post:

runtype_optional is a custom option, which we use to provide additional information to the generator. You can see it reflected later in the optional key in the intermediate structure and as a Partial in the resulting runtypes.

The expectation is that the library will generate 2 TypeScript files from this input, shipment.ts and common.ts, one per protobuf package.

Transpilation process

The 3 stages of transpilation — parse, process, generate

This process involves 3 stages. We have to parse the protobufs into some version of an AST (Abstract Syntax Tree) which we can then process in order to rename and restructure the output. Finally, we can then generate TypeScript code from this intermediate representation.

This separation of stages allows us to, in principle, swap out any of the three without affecting the others. For example, if we decide to switch the runtypes library, only the last stage would need to be rewritten. Similarly, if we need to produce runtypes from some other source, we could provide a different parser, and the rest of the pipeline would still work.
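The separation can be sketched as three functions behind a small interface. This is only an illustration of the idea: the Stages interface, the transpile function and the toy stages below are hypothetical names and do not reflect the actual shape of our library.

```typescript
// Hypothetical stage contracts; names and shapes are illustrative only.
interface Stages<Parsed, Processed> {
  parse(source: string): Parsed; // protobuf text -> AST
  process(parsed: Parsed): Processed; // AST -> flattened dictionary
  generate(processed: Processed): string; // dictionary -> TypeScript source
}

function transpile<P, Q>(stages: Stages<P, Q>, source: string): string {
  return stages.generate(stages.process(stages.parse(source)));
}

// Toy stages that only track message names, to show the wiring.
const toyStages: Stages<string[], string[]> = {
  parse: (source) => source.split(/\s+/).filter((name) => name.length > 0),
  process: (names) => names.slice().sort(),
  generate: (names) =>
    names.map((name) => `export const ${name} = {};`).join("\n"),
};

// Swapping the last stage leaves parsing and processing untouched.
const commentOnly: Stages<string[], string[]> = {
  ...toyStages,
  generate: (names) => names.map((name) => `// ${name}`).join("\n"),
};
```

Because each stage only depends on the output type of the previous one, replacing generate (for a different validation library) or parse (for a different source format) does not require touching the others.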

Let’s now go through these one at a time.

Parsing

The first step is to load and parse the protobuf files. We rely on the parse function from the protobufjs library for this. Calling the function with the contents of a protobuf file returns an IParserResult object. From this object we can read the package key (the package name defined in the proto file) and the root key (a hierarchical data structure that represents the file contents as a JSON schema).

For example, here is what the root key looks like for shipment.proto. Note that we have omitted some of the output for simplicity:

{
  nested: {
    shipment: {
      nested: {
        Destination: {
          fields: {
            name: {
              type: "string",
              id: 1,
            },
            address: {
              type: "common.Address",
              id: 2,
            },
          },
        },
        ShippingOption: {
          values: {
            GROUND: 0,
            AIR: 1,
          },
        },
        Shipment: {
          fields: {
            tracking_number: {
              type: "string",
              id: 1,
            },
            destination: {
              type: "Destination",
              id: 2,
            },
            ...
          },
          nested: {
            ...
          },
        },
      },
    },
  },
}

Processing

Before generating TypeScript code we process the parsed structure in order to flatten and simplify it. This flattening of the structure is required because runtypes does not support nested (reusable) type definitions and thus we have to create separate TypeScript statements for all nested definitions.

This processing also simplifies the generator code later by providing a dictionary of namespaces, where each key is a concatenation of the namespace (package) and the message hierarchy. For example, the dictionary for the shipment package will have the following keys:

- shipment.Destination
- shipment.ShippingOption
- shipment.Shipment.Delivery.Type
- shipment.Shipment.Delivery
- shipment.Shipment
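The flattening can be sketched as a depth-first walk over the parsed structure. The ParsedNode interface and the flatten function below are assumptions made for this example: they only model the keys visible in the output above, and the fields of Delivery are left empty because they are not shown in this post.

```typescript
// Minimal shape of the parsed output shown above (only the keys used here).
interface ParsedNode {
  fields?: { [name: string]: { type: string; id: number } };
  values?: { [name: string]: number };
  nested?: { [name: string]: ParsedNode };
}

type Dictionary = { [key: string]: ParsedNode };

// Walks the tree depth-first and records every message or enum under its
// full dotted path. Nested definitions are added before their parent.
function flatten(
  node: ParsedNode,
  path: string[] = [],
  out: Dictionary = {},
): Dictionary {
  const nested = node.nested ?? {};
  for (const name of Object.keys(nested)) {
    const child = nested[name];
    const childPath = [...path, name];
    flatten(child, childPath, out);
    // Packages only carry `nested`, so they do not become entries themselves.
    if (child.fields !== undefined || child.values !== undefined) {
      out[childPath.join(".")] = child;
    }
  }
  return out;
}

const root: ParsedNode = {
  nested: {
    shipment: {
      nested: {
        Destination: { fields: { name: { type: "string", id: 1 } } },
        ShippingOption: { values: { GROUND: 0, AIR: 1 } },
        Shipment: {
          fields: { tracking_number: { type: "string", id: 1 } },
          nested: {
            Delivery: {
              fields: {}, // the fields of Delivery are not shown in this post
              nested: { Type: { values: { ENVELOPE: 0, BOX: 1 } } },
            },
          },
        },
      },
    },
  },
};

const keys = Object.keys(flatten(root));
```

With this input, keys contains the five entries listed above, in the same order.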

The value for each key is a reflection of the message or enum from protobuf with all of its fields, as well as some additional properties such as kind, name, namespace, aliases, optional and private. These all serve to guide the runtypes generation in later steps. For example, kind categorises the fields into 4 kinds (repeated, map, scalar and message), each of which generates its runtypes slightly differently.

The value of the kind field is determined with the following logic:

- repeated and map are detected by checking the boolean properties with the same names in the schema
- scalar is detected by checking if the field type exists in a predefined scalar type list
- message is assigned to all other fields as the fallback (enums, nested messages etc.)
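The rules above translate into a short function. This is a sketch under assumptions: SchemaField is a hypothetical minimal shape that carries the repeated and map booleans mentioned above, and the scalar list is the standard set of protobuf scalar types.

```typescript
type FieldKind = "repeated" | "map" | "scalar" | "message";

// Assumed minimal field shape; the real schema objects carry more properties.
interface SchemaField {
  type: string;
  repeated?: boolean;
  map?: boolean;
}

// The scalar value types defined by the protobuf language.
const SCALAR_TYPES = new Set<string>([
  "double", "float", "int32", "int64", "uint32", "uint64",
  "sint32", "sint64", "fixed32", "fixed64", "sfixed32", "sfixed64",
  "bool", "string", "bytes",
]);

function detectKind(field: SchemaField): FieldKind {
  if (field.repeated) return "repeated";
  if (field.map) return "map";
  if (SCALAR_TYPES.has(field.type)) return "scalar";
  return "message"; // fallback: enums, nested messages, imported types
}
```

Note that a repeated field of a scalar type is still classified as repeated, since the container check comes first.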

After the mapping, the values for the shipment.Shipment.Delivery.Type and shipment.Shipment keys of the dictionary look like this:

{
  "shipment.Shipment.Delivery.Type": {
    namespace: "shipment.Shipment.Delivery.Type",
    name: "Type",
    values: {
      ENVELOPE: 0,
      BOX: 1,
    },
    private: false,
  },
  "shipment.Shipment": {
    namespace: "shipment.Shipment",
    name: "Shipment",
    fields: [
      {
        kind: "scalar",
        name: "tracking_number",
        type: "string",
        optional: false,
      },
      {
        kind: "message",
        name: "destination",
        type: "shipment.Destination",
        optional: false,
      },
      {
        kind: "repeated",
        name: "deliveries",
        type: "shipment.Shipment.Delivery",
        optional: false,
      },
      {
        kind: "message",
        name: "shipping_option",
        type: "shipment.ShippingOption",
        optional: false,
      },
    ],
    private: false,
  },
}

Notice how the type property of each field has been normalised to a key from the dictionary we extracted before. This helps map uses of any of the runtypes across all of the files involved. It also helps us generate the import statements later.

Now we have a dictionary where each key/value pair corresponds to a future node in the TypeScript AST. However, the name and fields properties are not quite there yet.

First, let’s look into name. It will be used to name the variable in the TypeScript code. The current value of the property is exactly the same as in the protobuf definition, and that can be problematic. For example, the Delivery message is defined inside Shipment and its name in the dictionary is Delivery. However, protobuf would allow us to create another message with the same name outside Shipment, which would lead to two runtypes with the same variable name.

To distinguish types in such cases, the name property is modified to reflect the hierarchy. We do that by concatenating the type names in the path, so in our case Delivery becomes ShipmentDelivery. Additionally, we have added a custom alias option to protobuf through which one can configure a custom name to be used instead.
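A minimal sketch of this naming rule follows. The variableName function and its parameters are hypothetical; in particular, how the alias reaches this step is an assumption made for the example.

```typescript
// typePath holds the message/enum names below the package, outermost first.
// An alias, when configured, takes precedence over the derived name.
function variableName(typePath: string[], alias?: string): string {
  return alias ?? typePath.join("");
}

const nested = variableName(["Shipment", "Delivery"]);
const topLevel = variableName(["Delivery"]);
const aliased = variableName(["Shipment", "Delivery"], "Drop");
```

Here the nested message becomes ShipmentDelivery while a hypothetical top-level Delivery keeps its name, so the two no longer collide.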

In fields, we need to map the current types to their equivalents in TypeScript/runtypes. This is rather straightforward for scalar types, since there is a matching runtype for each of them. Numbers are a small exception: JavaScript/TypeScript, and therefore runtypes, does not distinguish between number types (int32, int64, float etc.), so all of them are mapped to a single runtype (Number). In essence, scalar types are mapped as follows:

bool -> Boolean
string -> String
bytes -> Buffer
int32, int64, float etc. -> Number
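Expressed as code, the mapping is a plain lookup table. The names NUMERIC_SCALARS, SCALAR_TO_RUNTYPE and mapScalar are hypothetical; the numeric list is the standard set of protobuf numeric scalar types.

```typescript
// All numeric protobuf scalars collapse into a single runtype.
const NUMERIC_SCALARS = [
  "double", "float", "int32", "int64", "uint32", "uint64",
  "sint32", "sint64", "fixed32", "fixed64", "sfixed32", "sfixed64",
];

const SCALAR_TO_RUNTYPE: { [scalar: string]: string } = {
  bool: "Boolean",
  string: "String",
  bytes: "Buffer",
};
for (const scalar of NUMERIC_SCALARS) {
  SCALAR_TO_RUNTYPE[scalar] = "Number";
}

// Returns the runtype name for a scalar, or undefined for custom types.
function mapScalar(type: string): string | undefined {
  return SCALAR_TO_RUNTYPE[type];
}
```

Returning undefined for anything outside the table lets the caller fall through to the handling of custom types.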

When it comes to custom types (message/enum), we drop the package name from the type if it matches the package name currently being mapped. So for example, the variable name will be Shipment, not ShipmentShipment.

This means the type of the deliveries field (shipment.Shipment.Delivery) will drop the shipment part, but the address field will keep its type as common.Address. This will allow us to generate a simpler import statement later, importing the entire common package, instead of having to destructure all imports.

In addition, the same logic we used for name applies here, so that each field refers to the correct runtype. In the end, the deliveries field will have the type ShipmentDelivery.

After all of this processing, the shipment.Shipment entry in the dictionary looks like this:
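The two rules for custom types can be sketched together. The resolveFieldType function is hypothetical and assumes single-segment package names such as shipment and common; how nested types from a foreign package are named is also an assumption.

```typescript
// Resolves a normalised dictionary key into the name used in generated code.
// Assumes the first dot-separated segment is the package name.
function resolveFieldType(type: string, currentPackage: string): string {
  const [packageName, ...typePath] = type.split(".");
  const localName = typePath.join(""); // same concatenation as for `name`
  return packageName === currentPackage
    ? localName // same package: refer to the variable directly
    : `${packageName}.${localName}`; // other package: keep the prefix
}

const deliveries = resolveFieldType("shipment.Shipment.Delivery", "shipment");
const address = resolveFieldType("common.Address", "shipment");
```

Keeping the prefix for foreign types means the generated file can refer to common.Address through a single import of the common package.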

{
  "shipment.Shipment": {
    namespace: "shipment.Shipment",
    name: "Shipment",
    fields: [
      {
        kind: "scalar",
        name: "tracking_number",
        type: "String",
        optional: false,
      },
      {
        kind: "message",
        name: "destination",
        type: "Destination",
        optional: false,
      },
      {
        kind: "repeated",
        name: "deliveries",
        type: "ShipmentDelivery",
        optional: false,
      },
      {
        kind: "message",
        name: "shipping_option",
        type: "ShippingOption",
        optional: false,
      },
    ],
    private: false,
  },
}

With these entries, we can proceed to generating the TypeScript code.

Generating TypeScript code

TypeScript comes with a really powerful compiler API, which can be used to generate TypeScript code by describing a graph of nodes in a source file.

Although very verbose, this API is quite straightforward to use. For example, here is a function that converts fields with kind scalar into the appropriate TypeScript nodes:

With many utilities of this type, we can map over the structure we extracted above, resulting in an array of Nodes, which we can print into a string using tools from the compiler API:
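As a minimal sketch of both steps, assuming the typescript package (4.0 or later, for ts.factory) is installed: the helper names scalarFieldToNode, recordStatement and printNodes are hypothetical, and Record is assumed to be the object runtype of the validation library in use. This illustrates the technique rather than reproducing our generator.

```typescript
import * as ts from "typescript";

// Processed scalar field, shaped like the dictionary entries above.
interface ScalarField {
  kind: "scalar";
  name: string;
  type: string; // already mapped to a runtype name, e.g. "String"
}

// Builds the AST node for one property, e.g. `tracking_number: String`.
function scalarFieldToNode(field: ScalarField): ts.PropertyAssignment {
  return ts.factory.createPropertyAssignment(
    field.name,
    ts.factory.createIdentifier(field.type),
  );
}

// Wraps the properties in `export const <name> = Record({ ... });`.
// `Record` is an assumed name for the library's object runtype.
function recordStatement(name: string, fields: ScalarField[]): ts.Statement {
  const call = ts.factory.createCallExpression(
    ts.factory.createIdentifier("Record"),
    undefined,
    [ts.factory.createObjectLiteralExpression(fields.map(scalarFieldToNode), true)],
  );
  const declaration = ts.factory.createVariableDeclaration(
    name,
    undefined,
    undefined,
    call,
  );
  return ts.factory.createVariableStatement(
    [ts.factory.createModifier(ts.SyntaxKind.ExportKeyword)],
    ts.factory.createVariableDeclarationList([declaration], ts.NodeFlags.Const),
  );
}

// Prints nodes into source text using the compiler's printer.
function printNodes(nodes: ts.Node[]): string {
  const file = ts.createSourceFile("generated.ts", "", ts.ScriptTarget.Latest);
  const printer = ts.createPrinter({ newLine: ts.NewLineKind.LineFeed });
  return nodes
    .map((node) => printer.printNode(ts.EmitHint.Unspecified, node, file))
    .join("\n");
}

const source = printNodes([
  recordStatement("Shipment", [
    { kind: "scalar", name: "tracking_number", type: "String" },
  ]),
]);
```

The resulting string contains an exported Shipment constant whose object literal holds one property per field. Fields of the other kinds would need their own node builders, for example wrapping the identifier in a call expression for repeated fields.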

This stringified source can then be written into a file, which can be prettified, linted, etc. to match the formatting practices of the repository where the tool is being used. In our case, we generate a file per protobuf package, reaching the structure we mentioned above of having shipment.ts and common.ts files.

Summary

In this post, we explored how we can take a structured definition written in protobuf and transpile it to a set of runtypes to be used in TypeScript. We discussed the internal steps of the process, how the definitions are parsed, processed, and subsequently converted into runtypes.

We also touched on why this tooling is necessary. It extends the type guarantees we have in TypeScript to runtime validation, offering a second layer of safety in our codebases by validating data from external sources such as Kafka messages.

This is by no means an exhaustive example. We might at some point open-source the library we are using. However, as it is quite specific to our use case, it might not be general enough to be widely useful. Let us know if you have a need for something like this, or are already using some library to achieve a similar goal. As always, we enjoy hearing any feedback from you!

Transporeon Visibility Hub Tech Blog

Transporeon Visibility Hub is Europe’s leading real-time transportation visibility platform, powering supply chain visibility for the world’s biggest companies