Handling GPS streams of hundreds of thousands of trucks from 700 different systems
Written by Roland Kender
At Sixfold we provide real-time visibility for supply chains (see sixfold.com for more).
To do that, we ingest quite a lot of data from different sources. The two biggest categories of data we deal with are transport data and telemetry data.
The challenges we encountered handling large amounts of transport data have already been covered in a previous blog post about how we handled a 10x increase in incoming transport volume.
The transport data tells us what the planned route is: where and when to pick up or deliver the goods. The telemetry data is what allows us to give a real-time estimate of the pick-up and delivery times. This data comes from GPS units mounted on the vehicles executing the plan.
The challenge of handling real-time GPS location data for a large volume of trucks originating from a vast number of different systems is twofold:
- A large number of integrated trucks produce a lot of location updates — say, 500,000 trucks generating one GPS update per minute equals 8,000+ location update operations per second (in reality it's a multiple of that, of course, because a single GPS update can trigger a cascade of business logic operations).
- The way GPS data is received is not standardized — Sixfold is integrated with over 700 different GPS data providers, each and every one with a different data exchange interface.
You can read about some of the challenges associated with processing large volumes of data, and how we deal with them here at Sixfold, in several previous blog posts — using Kafka on steroids, optimizing database access via batch processing, even fixing the node-postgres library for faster date parsing.
In this post, we'll focus on the not-so-obvious problems one encounters when getting telemetry data (GPS, reefer temperatures, fuel consumption, etc.) from 700 different systems with different interfaces, for thousands of asset-owning companies.
Push, pull and 700 shades of communication protocols
In a perfect world, one would expose a sleek REST API endpoint for receiving truck telemetry updates, all truck-owning companies would implement the API, and the GPS update data would be there. It would then mostly be a matter of scalability to ensure that the service is capable of handling, say, a million requests per minute. Not a trivial task, but with modern tools and cloud platform services available, not impossible either.
Reality is way more complex, of course.
There are hundreds of truck-owning companies with pre-existing systems and interfaces already in place, where implementing yet another integration interface for Sixfold is not practical or even feasible. So instead of forcing our integration partners to use our API, we use whatever interfaces and APIs already exist to get the GPS data in. Usually carriers (companies who own the trucks) use third-party fleet management system providers (FMS in short) to monitor their fleet. In the majority of cases, Sixfold connects to the external interface of the FMS using carrier-specific authentication credentials to retrieve GPS data for the trucks of that particular carrier.
There is an n-to-m relationship between FMS providers and carrier companies — several different carriers may use the same fleet management system, and a single carrier can use more than one FMS provider to monitor their fleet. Such a single relation between a carrier company and an FMS provider is internally called simply an FMS integration at Sixfold. Telemetry data is always exchanged in the context of a particular FMS integration — you need to know how and where to fetch the data and which company owns the assets you are getting telemetry data for.
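To make the concept a bit more concrete, here is a minimal sketch of what such an FMS integration record might look like. The field names are illustrative only, not our actual schema.

```typescript
// Illustrative sketch; the fields are hypothetical, not Sixfold's actual data model.
interface FmsIntegration {
  id: number;                  // stable integration ID (one carrier + one FMS provider)
  carrierId: string;           // the company owning the trucks
  fmsProvider: string;         // which fleet management system to talk to
  credentials: {               // carrier-specific credentials for that FMS interface
    username: string;
    password: string;
  };
  syncIntervalSeconds: number; // how often the sync cycle for this integration runs
}
```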
The interfaces used to get the GPS input data can be split into two broad categories:
- push interfaces, where the data exchange is initiated by the integration partner
- pull interfaces, where Sixfold initiates the data transfer
What does a push or a pull interface really mean?
A push interface is any TCP-based service implemented on the Sixfold side, where the integration partner initiates the connection for data transfer. A pull interface is anything TCP-based implemented on the integration partner's side, where Sixfold initiates the connection for data transfer.
In practice, the protocol on top of the TCP layer can be anything invented in the history of computer science. Well, maybe not exactly, but you get the point. Nice and not-so-nice REST interfaces, SOAP interfaces, direct database connections, SFTP servers using files to exchange GPS pings, raw TCP connections with custom protocols — you name it, we've seen it. It would be nice to REST, but that's not how the world works — so we have adapted to reality and evolved our internal integration tooling to make handling the sharp corners of the world of integrated systems as smooth as possible.
Fortunately, the flexibility of the tooling we use (Kubernetes, Node.js, the entire universe of npmjs libraries) supports our ambitious "integrate the world" goal pretty well. It's relatively easy to expose a new interface using the required protocol, or to start pulling data from a new source via the required protocol, by combining the flexibility of Kubernetes with the vast number of existing npm libraries. The challenges are hidden in the scale, and the devils are lurking in oh-so-many details.
Load balancing data pull
For push-based services, where a TCP-based endpoint (an HTTP server in the simplest case) must be exposed, Kubernetes and Google Cloud provide a good toolset to make your service available in a reliable and scalable fashion. The load balancer takes care of evenly distributing incoming requests, and the Kubernetes autoscaler keeps the proper number of service instances (Node.js-backed pods in our case) running.
Things get more interesting when you have to deal with a relatively large number of data pull operations.
Above, I explained Sixfold's internal concept of an FMS integration — it's a relation between a fleet management system and a carrier company that provides the necessary context (an implementation of a particular interface and authentication credentials) for GPS data fetching.
For each such FMS integration, a data fetching cycle runs periodically to get the telemetry data in. A single occurrence of such a data exchange is called an FMS integration sync cycle — the retrieval of GPS data, at this point in time, for all vehicles managed by this FMS system for this carrier. One FMS integration sync cycle usually fetches data for a differing number of vehicles — from a few trucks up to a few thousand trucks. Depending on the specifics of the FMS interface, a single sync cycle may be as simple as conducting a single HTTP request to get telematics data for 30 trucks — or as complex as running several requests in parallel to paginate through the GPS history of 1,500 trucks.
So there can easily be 10K+ individual data fetching operations running every minute. The incoming data must be parsed, processed and persisted. Depending on the verbosity of the particular interface, parsing can be computationally quite expensive. Given this scale, one cannot and should not try to fit all those concurrent operations into a single instance of the service. The logical solution is to run several instances of the service and spread the data fetching operations evenly across all of them.
As we already use Kubernetes and run several instances (pods in k8s terms) of the service anyway, the problem boils down to the question "how to spread the data fetching operations evenly across all instances of the service".
The solution is actually really simple — most Sixfold services are subscribed to some Kafka topics, and at any given point in time each service pod has some Kafka partitions assigned to it. The total number of partitions a particular Kafka topic has is also known. In addition, each FMS integration has a stable identifier (integration ID) assigned to it. By combining those three variables (the total number of partitions, the partitions assigned to the service pod and the ID of the FMS integration) using the modulo operator, we can ensure that data fetching operations are distributed evenly across service pods and no operation is skipped.
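A minimal sketch of that assignment logic, with hypothetical helper names, could look like this: a pod only runs the sync cycles whose integration ID maps, via modulo, onto one of the Kafka partitions currently assigned to it.

```typescript
// Sketch of the modulo-based work distribution; names are illustrative.
// totalPartitions: number of partitions in the Kafka topic (known and stable)
// assignedPartitions: partitions currently assigned to this pod by the consumer group
function shouldRunOnThisPod(
  integrationId: number,
  totalPartitions: number,
  assignedPartitions: Set<number>,
): boolean {
  const owningPartition = integrationId % totalPartitions;
  return assignedPartitions.has(owningPartition);
}

// On every scheduling tick, each pod walks over all FMS integrations and only
// triggers the sync cycles it "owns". Since each partition is assigned to exactly
// one pod, no integration is skipped and none is run twice.
function pickIntegrationsForThisPod(
  allIntegrationIds: number[],
  totalPartitions: number,
  assignedPartitions: Set<number>,
): number[] {
  return allIntegrationIds.filter((id) =>
    shouldRunOnThisPod(id, totalPartitions, assignedPartitions),
  );
}
```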
The distribution of FMS integrations is not perfectly even — as mentioned above, some integrations fetch data for a few trucks, others for a thousand trucks inside a single sync cycle. However, given the number of integrations each pod needs to run, there is a pretty good chance that each service pod ends up with a more or less similar mix of heavy and lightweight integrations, so this is really not a problem. As you can see from the chart below, CPU resources are quite evenly utilized by the telemetry fetching service pods. Overall we are quite happy with the computation cost of a single FMS integration sync cycle and haven't felt the need for finer granularity of telemetry sync workloads — the current distribution of workloads across different pods is good enough.
Implementation-wise, each integration sync cycle is scheduled and executed as a job.
Somebody ate my memory
One of the most tedious problems we have encountered is a single FMS sync cycle fetching more data than fits into memory, resulting in an out-of-memory error and a process crash. This has happened, for example, when an FMS provider's service was down for a few hours, then became available again, and the integration attempted to retrieve the missing GPS points for the last few hours. For 1,000 trucks, this can be a lot of data. If the FMS provider's interface is implemented verbosely, for example using SOAP, this means even more data. A process crash is problematic because one faulty integration also affects the other telemetry fetching operations running in parallel — all those ongoing operations would crash too.
Over time, we have tuned critical parameters like the amount of GPS history fetched inside a single sync cycle and the amount of memory allocated to a service pod, to reduce the likelihood of an overly ambitious data fetch and an eventual process crash. A crash due to out-of-memory (OOM) cannot be ruled out entirely though, given the large number of different interfaces we connect to, and sometimes things inevitably go wrong. When an OOM crash happens, there is only one question that must be answered to isolate the bad apple — which integration fetched too much data?
Maybe the most natural instinct for an engineer is to configure the Node.js application to dump the contents of the V8 heap into storage when an OOM happens, and to inspect the heap later to find the culprit. But inspecting the heap dump is not necessarily the fastest or simplest way to find the greedy integration — unlike, say, Java Virtual Machine heap dumps, JavaScript's V8 heap dumps can be really tough to interpret.
A better and simpler way is to ensure that the start and end of each telemetry sync cycle are logged properly, and then use the logs to find the integration that never finishes. But that still involves some extra steps — downloading the logs and running scripts over them to find the bad apple.
Fortunately, almost by accident, we have an even more convenient tool in use. As mentioned above, each integration sync cycle is scheduled and executed as a job. Our internal job framework keeps track of each job via a Postgres table. Finding the greedy integration that is causing crashes is as simple as running an SQL query to find the integration sync job that hasn't finished its sync cycle for a while. From that point on, the faulty integration is disabled and an engineer can start inspecting and fixing the isolated case that causes the OOM.
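As a rough illustration (the table and column names below are invented, not our real job framework schema), the query boils down to something like this, here wrapped in a node-postgres call:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the environment

// Illustrative only: integration_sync_jobs and its columns are hypothetical.
// Returns integrations whose sync job started but never finished, i.e. the
// prime suspects after an OOM crash.
async function findStuckSyncJobs(olderThanMinutes: number) {
  const { rows } = await pool.query(
    `SELECT integration_id, started_at
       FROM integration_sync_jobs
      WHERE finished_at IS NULL
        AND started_at < now() - interval '1 minute' * $1
      ORDER BY started_at`,
    [olderThanMinutes],
  );
  return rows;
}
```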
Where is my GPS update?
When your system is integrated with 700 other systems, somewhere something always breaks. Servers go down, interfaces are changed without notifying integration partners, unhandled edge cases surface, authentication credentials stop working — the list goes on.
We don't want to lose GPS data, because otherwise our core business suffers — no real-time GPS, no real-time visibility.
To ensure the quality of our service, we constantly monitor that all integrations are running as expected:
- authentication credentials function properly
- the integration does not throw errors due to a modified interface
- we actually get data from the integration and don't discard it due to parsing errors
In the early days of Sixfold, this approach worked well — there were only a few integrations running, and all glitches in the matrix were easy to investigate. Nowadays, there are thousands of integrations running, and the naive monitoring that once worked well generates so many alerts that it is impossible for an engineering team to investigate everything.
The solution lies in applying data analysis and business knowledge.
You want each alert to be meaningful. Sometimes it is OK that an integration does not return any data — there are no affected transports, so our business does not suffer. Another example is authentication failure — sometimes carriers temporarily turn off data transfer by disabling the authentication credentials, because at that point in time there are no transports monitored by Sixfold. A third example is prioritisation — if there are two alerts up for two different integrations and one integration affects 2 transports while the other affects 3,000 transports, then it is quite clear which of the two alerts should be dealt with first.
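A toy sketch of that kind of filtering and prioritisation might look like this (the alert shape and field names are invented for illustration):

```typescript
// Hypothetical alert shape, invented for illustration.
interface IntegrationAlert {
  integrationId: number;
  kind: "no_data" | "auth_failure" | "parse_error";
  affectedTransports: number; // how many actively monitored transports rely on this integration
}

// Drop alerts that don't affect any monitored transport and surface
// the ones with the biggest business impact first.
function triageAlerts(alerts: IntegrationAlert[]): IntegrationAlert[] {
  return alerts
    .filter((alert) => alert.affectedTransports > 0)
    .sort((a, b) => b.affectedTransports - a.affectedTransports);
}
```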
By combining those nuances of our daily business with "dumb" monitoring data, it is possible to reduce the number of alerts tremendously and focus only on fixing the problems that actually matter. We are still working hard on tooling to properly distill the signal from the noise, provide appropriate tools for the operations people, and automate some actions so that issues get resolved with as little engineering involvement as possible — perfecting the monitoring system is a never-ending task.
Devils and details
Over time, we have extracted common integration patterns and built up good tooling to implement data fetching from any telemetry data provider quickly and efficiently.
Let's look at authentication, for example.
Basic authentication, JWT token-based authentication, client certificate-based authentication, OAuth-based authentication — there are several ways services can be protected, but the list is actually relatively short. So you implement, for example, client certificate-based authentication once and can then reuse it for other FMS integrations that are protected the same way.
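One way to picture that reuse is a small strategy interface per authentication scheme, implemented once and plugged into whichever integration needs it. The names below are invented for the sketch; this is not our actual abstraction.

```typescript
// Sketch only: reusable authentication strategies with invented names.
interface AuthStrategy {
  // Returns whatever must be attached to an outgoing request, e.g. HTTP headers.
  authHeaders(): Promise<Record<string, string>>;
}

class BasicAuth implements AuthStrategy {
  constructor(private username: string, private password: string) {}

  async authHeaders() {
    const token = Buffer.from(`${this.username}:${this.password}`).toString("base64");
    return { Authorization: `Basic ${token}` };
  }
}

class BearerTokenAuth implements AuthStrategy {
  // fetchToken would call the provider's token endpoint (OAuth, JWT, ...).
  constructor(private fetchToken: () => Promise<string>) {}

  async authHeaders() {
    return { Authorization: `Bearer ${await this.fetchToken()}` };
  }
}
```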
Nevertheless, there are always cases that don't fit into common patterns. The OAuth protocol, for example, can be interpreted differently by different providers — it has been surprisingly hard to create a single OAuth client implementation that universally covers all FMS integrations using an OAuth-ish approach. Then there are truly custom authentication protocols that cannot be reused for other integrations — engineers can be quite creative!
Another example is fetching through paginated resources, where the response is split over multiple "pages". The fields of course differ between FMS integration interfaces, but the overall pattern is usually the same — so one can create a common library for fetching paginated resources and reuse it across all integrations.
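A helper along these lines (again, the names are illustrative) lets each integration only describe how to fetch a single page, while the loop over pages is shared:

```typescript
// Generic pagination helper, sketched with invented names.
// fetchPage is the integration-specific part: it loads one page of results
// and tells us whether there is another page to fetch.
async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<{ items: T[]; hasMore: boolean }>,
): Promise<T[]> {
  const all: T[] = [];
  let page = 0;
  let hasMore = true;

  while (hasMore) {
    const result = await fetchPage(page);
    all.push(...result.items);
    hasMore = result.hasMore;
    page += 1;
  }

  return all;
}
```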
A third common problem is related to fetching GPS history. Usually, for each truck you can specify the start and end time of the period you want history for. Also, you don't want to fetch the same telemetry data twice — so you must record the latest time you already have GPS history for, in order not to re-fetch the same data again. Again, the details differ between FMS provider interfaces, but the overall pattern is usually the same, so the natural way forward is to create a common library that abstracts away the recurring patterns.
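The recurring shape of that logic is roughly the following (a sketch with hypothetical names; how the cursor is persisted and how the points are processed is left abstract):

```typescript
// Sketch of cursor-based GPS history fetching; all names are hypothetical.
// loadCursor/saveCursor persist the "latest timestamp already fetched" per truck,
// e.g. in Postgres; fetchHistory is the integration-specific call to the FMS interface.
async function syncGpsHistory(
  truckId: string,
  loadCursor: (truckId: string) => Promise<Date>,
  saveCursor: (truckId: string, until: Date) => Promise<void>,
  fetchHistory: (truckId: string, from: Date, to: Date) => Promise<unknown[]>,
): Promise<number> {
  const from = await loadCursor(truckId); // where we left off last time
  const to = new Date();                  // fetch up to "now"

  const points = await fetchHistory(truckId, from, to);
  // ... parse, process and persist the GPS points here ...

  await saveCursor(truckId, to); // remember progress so this range is never re-fetched
  return points.length;
}
```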
Those were some examples of common patterns that can be abstracted away into standard library functions in this particular problem domain. There is of course much more to it, given the number of different integrations and interfaces we need to deal with.
Summary
The task of fetching and transforming essentially the same data over different interfaces may seem trivial and tedious at first. But once you add the scale factor and a limited amount of engineering resources to the game, things get far more interesting — you must start thinking hard about how to manage the zoo.
- How to develop new integrations efficiently?
- How to maintain and monitor existing integrations?
- How to distinguish monitoring signal from noise?
- How to deal with the sheer amount of data?
- How to monitor the quality of incoming telemetry data and filter out outliers (GPS signals are sometimes wildly inaccurate)?
- How to diagnose and recover fast from crashes?
We still have a single engineering team taking care of all the telemetry-related integration efforts (among many other duties), and so far we have adapted to the increasing number of integrated services and the growing volume of incoming data by doing what every proper engineering organisation does — automating things and improving the tooling. Of course, our system is far from perfect and the journey never ends — there is always something to fix or improve.