Sixfold’s tech stack is written in TypeScript and runs on Kubernetes as tens of decoupled microservices, with inter-service communication handled by Kafka. Integrating a Python-based ML model into this stack was no trivial challenge.
Our goal was for the service to require as little manual work and maintenance as possible, including the process of model updates.
In this blog post we will focus on a single microservice, which takes care of calculating the ETA for a vehicle driving towards a stop.
Our machine learning code is written in Python on purpose, so that we can rely on the vast set of ML libraries out there. Some teams opt for converting a pretrained model (weights, trees, etc.) to run on a different language runtime like Node.js or Java to match the production environment. We did not want to restrict ourselves to only those ML approaches that have implementations in multiple languages. Additionally, with a Python implementation we can share 100% of the data preprocessing and prediction post-processing steps.
ETA service architecture
The pretrained model is hosted by a Python web service called model-worker, which lives in the same codebase as the ETA service and is accessed over a REST API. Both services share the same database, but model-worker has read-only access.
The ETA service sends some data to the model-worker over the REST API, and the model-worker queries the rest from the database. Machine learning models tend to be really data hungry, meaning a lot of data aggregation and processing needs to happen before the model can make predictions. The model is tightly coupled to the way its features are processed, so we decided to let model-worker calculate the necessary features itself.
The model-worker is fairly independent of the TypeScript part, except for the database structure, which it shares with the ETA service.
Because of this, before the Python tests are run, a TypeScript pretest hook spins up an empty database with the correct structure from the main service. We found this to be the least confusing way of handling the coupling, rather than trying to keep the database structure in sync between Python and TypeScript code.
There are likely alternative solutions, so we can’t say we fully endorse this setup; however, it fit our criteria and matched our needs within the time we had to build the service.
As for the model, we are currently using a boosted-trees model, which makes the service very CPU-intensive but relatively moderate in memory consumption.
From our experience with Node.js, we wanted to make sure that the model-worker was also an asynchronous web server, minimising CPU usage during idle periods between requests or while waiting on IO. For that reason we went with the aiohttp web server.
Below is a snippet of code showing how the model-worker web server is set up:
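A minimal sketch of what such a setup might look like (the handler names, port, and stub pipeline are illustrative; the real service loads its trained pipeline from disk):

```python
# Minimal aiohttp-based model worker sketch; names are illustrative.
from aiohttp import web

def load_pipeline():
    # In the real service the trained pipeline is loaded once at startup;
    # a stub stands in for it here.
    class StubPipeline:
        def predict(self, features):
            return [3600.0 for _ in features]  # one ETA (seconds) per row
    return StubPipeline()

async def handle_predict(request: web.Request) -> web.Response:
    payload = await request.json()
    pipeline = request.app["pipeline"]
    predictions = pipeline.predict(payload["features"])
    return web.json_response({"eta_seconds": predictions})

def create_app() -> web.Application:
    app = web.Application()
    app["pipeline"] = load_pipeline()  # load once, reuse across requests
    app.add_routes([web.post("/predict", handle_predict)])
    return app

if __name__ == "__main__":
    # No auth: the server is only reachable from inside the cluster.
    web.run_app(create_app(), port=8080)
```

Loading the pipeline once at application startup keeps request handlers free of blocking work, so the event loop stays responsive under load.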
As this particular web server is not accessible from outside the Kubernetes cluster, it has no authentication logic.
Our services are automatically scaled by Kubernetes based on the current load. Usually model-worker runs about 7–10 instances; however, we have load-tested the setup at 10x that without major issues.
Testing ML services
How do you test something that produces slightly different values over time? After all, COVID-19 was a good example of how the ML ETA needed to adjust drastically to changes in the environment, the world, and consequently the underlying data. It would be tedious to fix tests after every model update.
For testing we employ multiple approaches to make sure every step of the system works as expected.
For feature calculation, raw data is inserted into the database and tests assert that the calculated features are correct, just like any other test involving a database.
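As a toy illustration of this pattern (the schema and the feature here are invented; the real tests run against the service’s actual schema), such a feature test might look like:

```python
# Toy feature-calculation test: insert raw rows, assert the derived feature.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicle_positions (vehicle_id TEXT, speed_kmh REAL)")
conn.executemany(
    "INSERT INTO vehicle_positions VALUES (?, ?)",
    [("v1", 60.0), ("v1", 80.0), ("v1", 100.0)],
)

# Hypothetical feature: average recent speed per vehicle.
(avg_speed,) = conn.execute(
    "SELECT AVG(speed_kmh) FROM vehicle_positions WHERE vehicle_id = 'v1'"
).fetchone()
assert avg_speed == 80.0
```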
All code related to feature preprocessing and prediction post-processing (manual heuristics for edge cases) is implemented as scikit-learn transformers. Every transformer has its own set of unit tests making sure it works in isolation.
An additional set of unit tests makes sure that all transformers and the current model work when put together, and that no transformer messes up the features or their order. This is enforced with handpicked test features that must result in a prediction within a given range. A simplified example: a vehicle 100 km away that is currently driving and does not need to take breaks must get a prediction between 1 and 3 hours. The range is deliberately relaxed; the main point is to highlight obvious problems in the code.
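A sketch of both testing layers, with an invented transformer and a dummy regressor standing in for the real boosted-trees model:

```python
# Illustrative testing layers; the class, features, and model are invented.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import make_pipeline

class KmToMeters(BaseEstimator, TransformerMixin):
    """Tiny example transformer: convert a distance feature from km to metres."""
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn
    def transform(self, X):
        return np.asarray(X, dtype=float) * 1000.0

# Layer 1: every transformer is unit-tested in isolation.
def test_km_to_meters():
    assert KmToMeters().transform([[2.5]]).tolist() == [[2500.0]]

# Layer 2: handpicked features pushed through the assembled pipeline must
# land in a deliberately loose range. The dummy model predicts the training
# mean, so a 100 km trip must come out between 1 h and 3 h.
def test_pipeline_sanity():
    pipeline = make_pipeline(KmToMeters(), DummyRegressor(strategy="mean"))
    pipeline.fit([[50.0], [150.0]], [3600.0, 10800.0])  # targets in seconds
    (eta,) = pipeline.predict([[100.0]])
    assert 3600.0 <= eta <= 10800.0
```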
API tests cover the entire service, making sure it gives intuitive predictions for reasonable requests. This is mainly to ensure that changes in the API do not break the model: for example, the model should not start filling in missing values just because the key name for a value changed.
Sixfold has mock test vehicles with test transports driving every second of the day and night. These test the entire system end-to-end, making sure the ML ETA always stays within a reasonable range and that changes in other parts of the system do not affect it.
In summary, we test the ML code very thoroughly and in different setups, and it rarely fails. Most mistakes result in completely unreasonable predictions that are highlighted by tests or monitoring. We will go into more detail on how we do ML monitoring in a future blog post.
Packaging and loading the model
A trained model can be thought of as code with state: weights, split points for trees, the number of trees, min, max, and average values for features, and so on. How do we release this code together with its runtime state?
What if you used MinMaxScaler to preprocess all your numeric features to be between 0 and 1, and LabelEncoder to assign numbers to categorical values? The state for these conversions needs to be stored. Additionally, we also post-process raw predictions: for example, we make sure no predicted ETA is faster than the fastest driving time we have seen. This information also needs to be stored for the code to work.
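To make that state concrete, here is what fitted scikit-learn preprocessors actually carry (the feature ranges and categories are illustrative):

```python
# The fitted state -- not just the code -- must ship with the model.
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

scaler = MinMaxScaler().fit([[0.0], [250.0]])          # learns data_min_/data_max_
encoder = LabelEncoder().fit(["truck", "van", "car"])  # learns classes_

print(scaler.data_min_, scaler.data_max_)  # learned numeric range
print(list(encoder.classes_))              # learned category mapping

# Without this state, the same raw input would map to different model inputs:
print(scaler.transform([[125.0]]))  # [[0.5]]
print(encoder.transform(["van"]))   # [2]
```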
The above example is quite minimal, yet pretty quickly you end up with an explosion of state files that all need to be loaded and used correctly for the entire prediction pipeline to work.
We considered two approaches for serializing the model pipeline, including preprocessing and post-processing: publish the code and state as a Python library that loads its state on first run, or use Python tools to serialize the code together with its state into a binary file. Both have their own sets of pros and cons.
Our first idea was to publish a Python library from Kubeflow. This way we could write our own load_pipeline() library method that takes care of loading everything correctly in the production environment, using asset files included within the library. As an added benefit, models would be versioned in a standard way and a new model could be installed with pip. The entire pipeline would be more compatible with different versions of Python and libraries, though still not fully.
Our second idea was to use pickle to serialize the entire pipeline state to a binary file, upload this file to cloud storage, and handle versioning ourselves. We would then write a custom build task in setup.py that automatically downloads the correct model from cloud storage during the CI build. Finally, we would make sure that the Python version and libraries match between training and inference in production.
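A minimal sketch of the pickle approach (the function names and artifact layout are our illustration, not the exact production code):

```python
# Pickle ties the artifact to the exact Python/library versions used in
# training, so we record the version next to the model and check at load time.
import pickle
import sys

def save_pipeline(pipeline, path: str) -> None:
    artifact = {
        "python": sys.version_info[:2],  # recorded at training time
        "pipeline": pipeline,
    }
    with open(path, "wb") as f:
        pickle.dump(artifact, f)

def load_pipeline(path: str):
    with open(path, "rb") as f:
        artifact = pickle.load(f)
    if artifact["python"] != sys.version_info[:2]:
        raise RuntimeError("model was pickled under a different Python version")
    return artifact["pipeline"]
```

In the real setup the file produced by save_pipeline would be uploaded to cloud storage, and the CI build task would download and load it in the inference image.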
We opted for the second approach due to its utter simplicity. We might need to change this in the future, but for now it has worked despite its naivety.
As discussed in our previous blog post about training, we use Kubeflow to train and evaluate the model. If the Kubeflow evaluation passes, the entire update process goes into motion: the model file is written to cloud storage, a pull request with all the necessary information is opened, and the team is notified via Slack.
A model update pull request can be approved and merged by anyone: its description contains all the necessary steps, information, and human-readable metrics.
Adding machine learning to our product took several months due to the sheer complexity and scope of the task. It is also not really a one-off process: you need continuous model updates to avoid drift, every model update needs to be evaluated, live monitoring is a must, and so on. Skipping any of these steps is a disaster waiting to happen.
Taking the plunge from cleverly selected rules and constants to learning the behaviour of drivers on Sixfold from data was definitely risky, but it yielded great benefits. It is one of the main reasons we have the market-leading ETA accuracy that our clients rely on every day.