Doubling our ETA accuracy with machine learning in production

Apr 23, 2021

This is the first post in our series about machine learning

Written by Taivo Käsper, Anton Potapchuk, and Maksim Mišin

If asked to describe Sixfold in one sentence, a reasonable answer would be: “We estimate time of arrival for millions of road freight shipments.” The estimated time of arrival (ETA) is the core of our business. We compute millions of ETA predictions daily, and have done so without major incidents for almost four years.

Initially we were mostly relying on various heuristics, but in the last year, we have switched over to a machine learning model that has worked without major flaws and provided significantly better predictions for our customers.

This is the first in a series of three blog posts about ETA. In this part we introduce the problem, our dataset, and our deployment strategy. In later posts we will describe how we run the model in production and monitor its performance.

What problem are we trying to solve?

At first glance, ETA prediction seems relatively straightforward. After all, online map providers such as Google Maps or HERE Maps seem to predict estimated travel times for all sorts of traffic modes without an issue. Indeed, as long as one has access to speed limits for trucks and can get an up-to-date road network graph, the travel time calculation essentially reduces to an optimal path calculation.
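To illustrate what "reduces to an optimal path calculation" means, here is a minimal sketch of Dijkstra's algorithm over a toy road graph, where each edge costs its length divided by the truck speed limit. The graph, the node names, and the function name are made up for illustration; real routing engines are considerably more sophisticated.

```python
import heapq


def fastest_travel_time(graph, source, target):
    """Dijkstra over edges given as (neighbour, length_km, speed_limit_kmh).

    Returns the minimal driving time in hours, or None if unreachable.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        hours, node = heapq.heappop(queue)
        if node == target:
            return hours
        if hours > best.get(node, float("inf")):
            continue  # stale queue entry, a faster path was already found
        for neighbour, length_km, speed_kmh in graph.get(node, []):
            candidate = hours + length_km / speed_kmh
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None


# Toy network with invented distances and truck speed limits.
roads = {
    "A": [("B", 160.0, 80.0), ("C", 60.0, 60.0)],
    "C": [("B", 40.0, 80.0)],
    "B": [],
}

# The detour via C (1.0 h + 0.5 h) beats the direct road (2.0 h).
print(fastest_travel_time(roads, "A", "B"))
```

Note that this only models pure driving time, which is exactly the limitation discussed next.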

The question becomes much less straightforward once we remember that truck drivers are people! For instance, Google Maps reports travel time between Berlin and Madrid to be approximately 23 hours.

However, according to our data, this journey would realistically take around two and a half days, assuming the driver is fully rested at the start. Until we have self-driving vehicles, truck drivers are not going to drive non-stop — they are going to take all sorts of necessary, as well as some unnecessary or unexpected, breaks along the route. The potential detours, as well as safe driving hours regulations, make the problem even less tractable.
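A rough back-of-the-envelope sketch shows how quickly mandatory breaks inflate the 23 hours of pure driving. It assumes a simplified reading of the EU driving-time rules (a 45-minute break after 4.5 hours of driving, at most 9 hours of driving per day, 11 hours of daily rest). These parameters and the function name are assumptions for illustration and are not taken from our model.

```python
def elapsed_hours(driving_hours, max_stint=4.5, break_hours=0.75,
                  max_daily=9.0, rest_hours=11.0):
    """Wall-clock hours needed to complete `driving_hours` of pure driving.

    Simplified rules: a break after every `max_stint` hours of driving and
    a daily rest after `max_daily` hours of driving.
    """
    remaining = driving_hours
    elapsed = 0.0
    driven_today = 0.0
    stint = 0.0
    while remaining > 1e-9:
        if driven_today >= max_daily - 1e-9:
            elapsed += rest_hours  # daily rest, then a fresh day
            driven_today = 0.0
            stint = 0.0
            continue
        if stint >= max_stint - 1e-9:
            elapsed += break_hours  # mandatory break between stints
            stint = 0.0
            continue
        leg = min(remaining, max_stint - stint, max_daily - driven_today)
        elapsed += leg
        remaining -= leg
        driven_today += leg
        stint += leg
    return elapsed


print(elapsed_hours(23.0))  # 47.25
```

Even this optimistic schedule takes about two days, and it ignores loading, fuel stops, traffic, and detours, which helps explain why the observed journey time is longer still.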

At Sixfold, we have decided to adopt a machine learning (ML) approach to solving this challenge. Unlike classical algorithms, ML models can learn from the data we have access to. Given rich enough input, they can detect drivers’ preferred routes, routines, and the potential delays that truck drivers are likely to encounter along the way.

The ML ETA model was our first, and arguably the most important, machine learning model we deployed to production. Before doing so we discussed numerous pipelines and approaches. In the end we decided to settle on the one that is described below.

This approach turned out to be pretty robust and has been running in the production environment for over a year without major changes. That’s quite remarkable considering that it’s the same year during which our data has increased tenfold.

Getting a bit more quantitative, we managed to almost halve the error for long-term estimates (four or more hours) compared to our previous model based on heuristics. The win is even larger when compared to naive driving time estimates that are based only on the route between two points.

Of course, we are aware that the approach we have adopted is far from perfect — the error is not zero after all! If you have any ideas or suggestions feel free to leave a comment or get in touch with us!

Dataset

On a high level, Sixfold operates with two types of data.

The first is vehicle GPS positions or telemetry data. We get this through thousands of different external providers — fleet management systems, mobile phones, various kinds of specialised logistics packages, other aggregators and the list goes on and on. The data quality varies from extremely high fidelity to almost completely unreliable.

The second type of data is the transport plans of large shippers delivering goods to their customers all across Europe. As with GPS data, we get it through all sorts of integrations of varying quality. Due to the volume and velocity of our data, our data warehouse has been big from pretty much day one.

Initially we managed to get by with a managed PostgreSQL offering by Google Cloud — Cloud SQL, but soon enough we decided to migrate to a proper cloud data warehouse solution.

Between Snowflake and BigQuery we chose BigQuery, because of its extensive support of geographical functions and good integration with Google’s ecosystem. The only downside of BigQuery is that it isn’t suited for frequently mutating data. Tables containing such data were kept in our old PostgreSQL data warehouse. Since BigQuery can access data in Cloud SQL databases it hasn’t caused any issues for us.

Another piece of our data pipeline is Google’s managed Airflow — Cloud Composer. We introduced it once we realized our data is way too noisy and confusing for ML training — models usually perform better when they are trained on clean data.

To ensure data quality, we created an Apache Airflow workflow that detects complex data integrity issues such as address problems or suspicious GPS signals. All these tools come together to produce our ETA model. We train it on past transports stored in our data warehouse. Suspicious-looking examples are filtered out with the help of the Airflow output. The training dataset consists of transport information, GPS positions of trucks, and a number of derived features, such as vehicle movement history, speed, moving averages, etc.
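As an illustration of what a "suspicious GPS signal" check might look like, the sketch below flags pings that imply an implausible speed relative to the last accepted ping. The 120 km/h threshold, the data layout, and the function names are assumptions made for this example; the actual checks in the Airflow workflow are not described here.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def flag_suspicious_pings(pings, max_speed_kmh=120.0):
    """Return indices of pings whose implied speed exceeds the threshold.

    `pings` is a time-sorted list of (unix_seconds, lat, lon). Speed is
    measured against the last ping that was not flagged, so one bad point
    does not poison the points that follow it.
    """
    suspicious = []
    last_good = None
    for i, (ts, lat, lon) in enumerate(pings):
        if last_good is not None:
            prev_ts, prev_lat, prev_lon = last_good
            hours = (ts - prev_ts) / 3600.0
            dist = haversine_km(prev_lat, prev_lon, lat, lon)
            # Non-increasing timestamps are treated as suspicious too.
            if hours <= 0 or dist / hours > max_speed_kmh:
                suspicious.append(i)
                continue
        last_good = (ts, lat, lon)
    return suspicious


pings = [
    (0, 52.52, 13.405),      # Berlin
    (600, 52.52, 13.55),     # roughly 10 km east, 10 minutes later
    (1200, 40.4168, -3.7038),  # jump to Madrid: physically impossible
    (1800, 52.52, 13.70),    # back on the plausible track
]
print(flag_suspicious_pings(pings))  # [2]
```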

We decided not to cache these additional features (that is, not to use a feature store), as they can be calculated on the fly inside BigQuery: the query runs in less than 10 minutes and processes close to a TB of data. The final output is stored in Parquet format in Google’s Cloud Storage.
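To give a feel for one of the derived features mentioned above, here is a small sketch of a trailing moving average over a series of speed readings. It is written in plain Python purely for illustration; in practice such features are computed inside BigQuery, and the window size and function name here are hypothetical.

```python
from collections import deque


def trailing_moving_average(values, window):
    """Mean of the last `window` values at each position.

    The first positions average over fewer values, since a full window
    is not yet available.
    """
    recent = deque(maxlen=window)
    running_sum = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            running_sum -= recent[0]  # this value is about to be evicted
        recent.append(value)
        running_sum += value
        averages.append(running_sum / len(recent))
    return averages


# Invented speed readings in km/h, including a stop in the middle.
speeds_kmh = [80.0, 60.0, 0.0, 0.0, 70.0]
print(trailing_moving_average(speeds_kmh, window=3))
```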

Kubeflow

Our entire ML ETA training workflow is developed as a single Kubeflow pipeline, visualized in the figure below.

It re-trains a model weekly and runs various checks on the newly trained model. Once the model passes all the tests, the pipeline automatically opens a GitHub pull request, which has to be approved by engineering — this manual step serves as an extra security measure, allowing engineering to look over the numbers and ensure they seem reasonable. The whole training process was specifically designed in this manner so that anyone from engineering can confidently update the model by following step-by-step instructions attached to the pull request. In addition to evaluation steps and automated tests we also have a monitoring Grafana board, as well as end-to-end tests that continuously monitor model performance in both staging and production environments.

We always compare the current live model to the re-trained model to establish a baseline performance and to provide at least some level of protection against potential concept drift. Another level of protection against the unknown is added by re-training the model weekly. We will go further in-depth on monitoring the model in a future post.
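The comparison between the live and the re-trained model can be pictured as a simple gate evaluated on the same held-out transports. The sketch below assumes mean absolute error as the metric and a hypothetical `max_regression` tolerance; the metrics and thresholds actually used in the pipeline are not specified in this post.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def should_propose_model(actual, live_pred, candidate_pred, max_regression=0.0):
    """Decide whether the re-trained model is good enough to propose.

    The candidate passes if its error is no worse than the live model's
    error, allowing for a relative tolerance of `max_regression`.
    """
    live_mae = mean_absolute_error(actual, live_pred)
    candidate_mae = mean_absolute_error(actual, candidate_pred)
    passed = candidate_mae <= live_mae * (1.0 + max_regression)
    return passed, live_mae, candidate_mae


# Invented travel times in hours, for illustration only.
actual_hours = [5.0, 8.0, 12.0]
live_hours = [6.0, 10.0, 9.0]
candidate_hours = [5.5, 8.5, 11.0]

passed, live_mae, candidate_mae = should_propose_model(
    actual_hours, live_hours, candidate_hours
)
print(passed, live_mae, round(candidate_mae, 3))
```

Evaluating both models on identical data is what makes the live model a meaningful baseline: if the data distribution has shifted, both error figures move, and the comparison still holds.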

Research and production stage models

Our machine learning model development started a lot earlier than actual production deployment. We had countless Jupyter notebook prototypes that helped us determine the variables that are important for ETA prediction.

All this research was very useful and sped up the implementation process significantly, but taking the code from notebooks into production is not easy: in notebooks, the data, data preparation, and preprocessing are detached from one another.

Additionally, some of the features we tested in notebooks were too costly to calculate in the production environment on the fly, so we had to discard them or find simpler alternatives.

In the end we learned that notebook code is not easily transferable and requires modifications or even a complete rewrite. A lot of care must be taken to make sure that the data and code used during training match what the model sees in the production environment.

One interesting difference between production and the training environment is related to databases.

Our training data is put together in BigQuery, while production environments use PostgreSQL. The two have some notable implementation differences, meaning that even if the code looks the same, the results can be different.

One example is the ST_Distance function, which in PostgreSQL uses a spheroid calculation, whereas BigQuery currently approximates Earth as a perfect sphere. The same applies to weekday calculation, which starts from 0 in one database and from 1 in the other. Such mistakes can easily go unnoticed, especially when they affect only a few predictions.
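One way to guard against the weekday mismatch is to normalise both conventions to a single one before the value reaches the model. The sketch below assumes the functions in question are PostgreSQL's `EXTRACT(DOW ...)` (0 = Sunday to 6 = Saturday) and BigQuery's `EXTRACT(DAYOFWEEK ...)` (1 = Sunday to 7 = Saturday), and maps both to the ISO convention (1 = Monday to 7 = Sunday). The helper names are hypothetical.

```python
from datetime import date


def iso_weekday_from_postgres_dow(dow):
    """Map PostgreSQL DOW (0=Sunday..6=Saturday) to ISO (1=Monday..7=Sunday)."""
    if not 0 <= dow <= 6:
        raise ValueError(f"PostgreSQL DOW out of range: {dow}")
    return 7 if dow == 0 else dow


def iso_weekday_from_bigquery_dayofweek(day_of_week):
    """Map BigQuery DAYOFWEEK (1=Sunday..7=Saturday) to ISO (1=Monday..7=Sunday)."""
    if not 1 <= day_of_week <= 7:
        raise ValueError(f"BigQuery DAYOFWEEK out of range: {day_of_week}")
    return 7 if day_of_week == 1 else day_of_week - 1


# 23 April 2021 was a Friday: DOW = 5 in PostgreSQL, DAYOFWEEK = 6 in BigQuery.
friday = date(2021, 4, 23).isoweekday()
print(friday,
      iso_weekday_from_postgres_dow(5),
      iso_weekday_from_bigquery_dayofweek(6))  # 5 5 5
```

A unit test that feeds the same timestamp through both paths and compares the normalised values is a cheap way to catch this class of bug.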

Full pipeline

The process of taking data in any shape and form and converting it into a form digestible for machine learning models is called data preprocessing.

Most ML models expect the data to be in a numeric format, so booleans (None, False, True), categoricals (loading, unloading), dates, texts, images etc. all need to be converted into numbers in some way or form. Fortunately, our ML setup is relatively simple and uses tabular data where we only need to convert booleans, categoricals, and fill in some missing values.

There are many different ways in which one can encode (represent) non-numeric values. Ultimately, it’s up to the model developer to pick the encoding format that works the best in a given context. Once decided, the encoding strategy must be the same everywhere, or you would again risk obtaining different predictions in different environments.
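As a small illustration of keeping one encoding strategy everywhere, the sketch below defines the mappings in a single place and fails loudly on values it has never seen, instead of silently guessing. The specific codes (for example -1 for a missing boolean) and the "stop type" naming are assumptions for this example, not the encoding used in our model.

```python
# Single source of truth for categorical codes, shared by training and serving.
STOP_TYPE_CODES = {"loading": 0, "unloading": 1}


def encode_tristate(value):
    """Encode a nullable boolean: None -> -1, False -> 0, True -> 1."""
    if value is None:
        return -1
    if value is True:
        return 1
    if value is False:
        return 0
    raise ValueError(f"expected None, False or True, got {value!r}")


def encode_stop_type(value):
    """Encode a stop type, rejecting categories unknown at training time."""
    try:
        return STOP_TYPE_CODES[value]
    except KeyError:
        raise ValueError(f"unknown stop type: {value!r}") from None


row = {"is_delayed": None, "stop_type": "unloading"}
encoded = [encode_tristate(row["is_delayed"]), encode_stop_type(row["stop_type"])]
print(encoded)  # [-1, 1]
```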

We have experimented with a few different preprocessing strategies and in the end settled on scikit-learn pipelines.

These allow us to write custom encoders and save them alongside the model in a single file. We found this approach to be very reliable and easily testable. In addition to preprocessing, most models also require a post-processing step to impose additional sanity checks, handle outliers, and apply hand-coded heuristic rules. This post-processing step also works nicely with scikit-learn pipelines.

Before settling on scikit-learn pipelines, we tried keeping data preprocessing and the model separate. This meant that whenever one wanted to load the model and make predictions (during evaluation, in production, or in research), one first had to load the code and state of the preprocessing logic, transform the data, and then feed the transformed data into the model in the correct way. This approach ended up being very prone to bugs.

With scikit-learn pipelines, all of this logic is encoded right into the model, removing the errors that would otherwise surface from incorrect model usage.
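A minimal sketch of this idea, assuming scikit-learn and NumPy are installed: a custom encoder, a model, and a post-processing step are bundled into one `Pipeline` object that accepts raw rows. The class names, the toy features, and the choice of `LinearRegression` are hypothetical and only illustrate the structure; they are not our production model.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class StopTypeEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical encoder: column 0 is 'loading'/'unloading', column 1 is km."""

    def fit(self, X, y=None):
        return self  # stateless: the mapping is fixed in code

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        is_unloading = (X[:, 0] == "unloading").astype(float)
        distance_km = X[:, 1].astype(float)
        return np.column_stack([is_unloading, distance_km])


class NonNegativeRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical post-processing wrapper: clips predicted hours at zero."""

    def __init__(self, regressor=None):
        self.regressor = regressor

    def fit(self, X, y):
        self.regressor_ = clone(self.regressor).fit(X, y)
        return self

    def predict(self, X):
        return np.clip(self.regressor_.predict(X), 0.0, None)


pipeline = Pipeline([
    ("encode", StopTypeEncoder()),
    ("model", NonNegativeRegressor(LinearRegression())),
])

# Invented rows: (stop type, remaining distance in km) -> hours to arrival.
X_train = [["loading", 100.0], ["unloading", 200.0],
           ["loading", 300.0], ["unloading", 50.0]]
y_train = [2.0, 4.5, 6.0, 1.5]
pipeline.fit(X_train, y_train)

# Callers pass raw rows; encoding and clipping happen inside the pipeline.
predictions = pipeline.predict([["unloading", 100.0], ["loading", -1000.0]])
print(predictions)
```

Because the fitted pipeline is a single Python object, it can be persisted as one file (for example with `pickle` or `joblib`), so evaluation, research, and production code all load the same artifact and call the same `predict`.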

Summary

It is not easy to move critical parts of business logic to machine learning models, which are often seen as black boxes that may or may not return reasonable results in all possible real-world scenarios.

It took us quite a bit of experimentation and research to come up with the approach described above, finding a middle-ground between trusting ML models and ensuring they are performing predictably in all situations. All of this research led us to an approach that has worked incredibly well, leading to significant improvements in ETA accuracy, without losing stability of the production system.

In upcoming posts we’ll dig deeper into this approach, looking at how the model works in the production environment and how we monitor its performance.