So this is what happens: you begin building machine learning systems to solve problems, to provide predictions and estimates of variables that improve operations within a certain business context. They evolve and gain substance; they become better, sharper, and more precise. Complexity arises and needs to be tamed. Your systems grow as your models need to be more flexible, scalable, and platform-agnostic, and then there comes a time when, for your own sanity, a more structured approach becomes a necessity. Uber has Michelangelo, Databricks has MLflow, Airbnb has Aerosolve, and so on; each is searching for the optimal set of tools and frameworks to put their machine learning processes into production.
That, at least, has been our story at Rubikloud, and we know we are not alone on that quest. The problem, however, is that most traditional software engineering approaches fall short when it comes to dealing with the dynamic data beast that lives at the core of a machine learning system. This is an issue that has been reported over and over again. We know what the challenges are but, so far, there is no consensus on how to deal with them.
At Rubikloud we productize complex models as a way of automating and improving core operations of a growing group of enterprise retailers. We build software systems around them (let’s call them “machine learning systems” from now on) so that they can interact with different data sources, collect parameters from different stakeholders, and communicate their outputs properly to interactive user interfaces. Our business relies on us doing this quickly and efficiently. Those models and their effective integration into existing workflows are our product. They should have robust and satisfying lives.
As our services matured, we imposed some restrictions on ourselves to take care of them:
- Our machine learning systems need to be easily transferable to a variety of cloud computing platforms.
- Our modelling workflows need to be, to the greatest extent possible, unaware of which retailer they are serving: data distributions may change, and some features may be unavailable. Generalization, with flexibility for customization, must be possible.
- Our modelling workflows need to be designed with scale in mind: usually, they are batch processes that should be able to handle and persist hundreds of gigabytes of information per run. These runs should have predictable (and short) durations.
- Onboarding new clients into our current product offerings must require minimal boilerplate code, but it should also allow feature and model selection, parameter optimization, pre-deployment evaluation, post-deployment monitoring, methodical experimentation and iteration, and minor client specializations.
- Our data scientists must be able to take models from prototyping to production and actively collaborate with machine learning and production engineers during the process. Our code is their code.
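To give a flavour of what "minimal boilerplate with minor client specializations" can look like in practice, here is a small illustrative sketch (the names and structure are hypothetical, not our actual configuration system): a base workflow configuration shared by all clients, with a per-client override that only states what differs.

```python
from copy import deepcopy

def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge a per-client override into a base config.

    Values in `override` win; nested dicts are merged key by key,
    so an override only needs to mention the keys it changes.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical defaults shared by every retailer's workflow.
BASE_CONFIG = {
    "features": ["price", "promo_flag", "week_of_year"],
    "model": {"type": "gradient_boosting", "n_estimators": 500},
}

# Onboarding a new client then reduces to a small override.
client_config = merge_configs(
    BASE_CONFIG,
    {"model": {"n_estimators": 200}},
)
```

The point of the pattern is that generic workflow code reads only from the merged configuration, so client differences live in data rather than in forked code paths.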
Fulfilling these requirements involves not only an understanding of our models and their mathematical subtleties but also of the broad computing ecosystem where they live and play.
Our current skyline
Over the last few years, after many prototypes and iterations followed by some careful extraction of patterns, we have started building generic systems for handling some of these requirements. At the base of most of our solutions we rely on RkLuigi, an in-house (and always evolving) extension of Spotify's Luigi suited to our particular needs, with abstractions for machine learning workflows and workflow-agnostic tasks. RkLuigi allows us to turn the broad machine learning methodology into well-structured modular processes organized as directed acyclic graphs. Here are some examples of projects we have been working on recently:
- A system for registering features per client and also generating feature tables for training and predicting different models.
- A framework for ensembling large numbers of predictive models in a hierarchical fashion.
- A system for checking the health of essential components of our modelling workflows after each production run.
- An abstract workflow for performing pre-deployment evaluations of predictive and generative models.
- A configuration pattern for detaching client-specific components of generic modelling workflows.
- A system for persisting reusable trained models.
- A generic A/B testing framework adaptable to quite different products.
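To make the workflow-as-a-directed-graph idea concrete, here is a minimal sketch in plain Python. RkLuigi is internal, so none of these names come from it; in Luigi itself the same structure is expressed as `luigi.Task` subclasses whose `requires()` methods declare the edges of the graph.

```python
class Task:
    """A tiny stand-in for a workflow task: a name, its upstream
    dependencies, and an action to run once they are complete."""

    def __init__(self, name, requires=(), action=None):
        self.name = name
        self.requires = list(requires)
        self.action = action or (lambda: None)
        self.done = False

def run(task, log):
    """Run a task's dependencies depth-first, then the task itself,
    so the whole graph executes in a valid topological order."""
    if task.done:
        return
    for upstream in task.requires:
        run(upstream, log)
    task.action()
    task.done = True
    log.append(task.name)

# A hypothetical training workflow expressed as a directed graph.
ingest = Task("ingest")
features = Task("features", requires=[ingest])
train = Task("train", requires=[features])
evaluate = Task("evaluate", requires=[train, features])

log = []
run(evaluate, log)
# Dependencies always run before their dependents, and shared
# upstream tasks (here, `features`) run only once.
```

Structuring workflows this way is what makes the modularity pay off: a failed run can resume from the last completed node, and tasks shared between workflows (feature generation, health checks, evaluations) plug into many graphs unchanged.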
Each project was born organically from our need to become better at what we do, balancing our resources and time. Each journey has come with its own lessons about the many ways machine learning systems in production expand, misbehave and sometimes break. With this new series, we would like to share some of those learnings and eventually create a space to discuss the inherent challenges and difficulties with the Toronto machine learning community.
Stay tuned for our second installment!