Much of the buzz around machine learning lately has been around novel applications of deep learning models. We have anthropomorphized these models, and they have captured our imagination: they dream, play games at superhuman levels, and read x-rays better than physicians. While these deep learning models are incredibly powerful, with incredible ingenuity built into them, they are not humans, nor are they much more than “sufficiently large parametric models trained with gradient descent on sufficiently many examples.” In my experience, this is not the hard part about machine learning.
Beyond the flashy headlines, the high-level math, and the computation-heavy calculations, the whole point of machine learning, as with computing and software before it, has been its application to real-world outcomes. Invariably, this means dealing with the realities of messy data, generating robust predictions, and automating decisions. Most of the time, this does not involve anything headline-worthy, as Jeff Bezos, CEO of Amazon, notes:
“But much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more.
Though less visible, much of the impact of machine learning will be of this type – quietly but meaningfully improving core operations.”
— Jeff Bezos, 2016 Amazon shareholder letter
Just as much of the impact of machine learning is beneath the surface, the hard parts of machine learning are not usually sexy. I would argue that the hard parts about machine learning fall into two areas: generating robust predictions and building machine learning systems.
One of the biggest challenges ML practitioners deal with on a day-to-day basis is, simply put, reality. Real-world data is messy, incredibly messy. Missing data, incorrect data, sparse data, misformatted data, unlabelled data, inconsistent data … I could go on. As the adage says: “Garbage in, garbage out.”
One of the very first things ML practitioners learn is that having a strong data engineering process is critical to any data-related endeavor. Cleaning, extracting, imputing, and removing outliers are all necessary (but decidedly unsexy) things that must be done. The best machine learning companies take this aspect of the system very seriously. On top of that, it’s important to have domain expertise in your data. It’s not enough to know that you have 10 real-valued columns; you need to know what they mean in order to properly use them in a model. Ensuring you have the right mix of skills is essential to achieving this goal.
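To make the cleaning steps above a little more concrete, here is a minimal sketch in plain Python. It imputes missing values with the median and clips (rather than removes) out-of-range values. The function name, the sample data, and the bounds are all hypothetical; in practice the bounds are exactly the kind of thing that has to come from domain expertise.

```python
from statistics import median


def clean_column(values, lower=None, upper=None):
    """Impute missing entries with the median, then clip to [lower, upper].

    `values` is a list of floats where None marks a missing entry.
    `lower` and `upper` are hypothetical domain limits supplied by someone
    who knows what the column means.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = median(observed)  # the median is less sensitive to outliers than the mean

    cleaned = []
    for v in values:
        v = fill if v is None else v
        if lower is not None:
            v = max(v, lower)
        if upper is not None:
            v = min(v, upper)
        cleaned.append(v)
    return cleaned


# Hypothetical daily unit sales with one missing day and one data-entry error.
daily_units = [12.0, 15.0, None, 14.0, 9000.0, 13.0]
print(clean_column(daily_units, lower=0.0, upper=500.0))
```

Clipping is only one possible policy; dropping the row or flagging it for review may be more appropriate depending on what the column represents.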
The next thing that most real-world ML practitioners learn very quickly is that “the map is not the territory” or, stated more bluntly, “all models are wrong but some are useful.” All models have limitations. The hard part about using them is not coding up the algorithm or tuning the hyperparameters; it’s figuring out how to robustly map their output to real-world outcomes. Here are some examples where naively using a model can go wrong:
| Model and application | What can go wrong |
|---|---|
| Linear model for demand forecasting | Tells you to order 10B units of red lipstick because of an outlier data point that was not seen in training. |
| Collaborative Filtering for Product Recommendations | Recommends more microwaves to a customer that just bought one. |
| Deep Neural Network for Image Classification | Tagging people of a certain race as “Gorillas” |
| Chatbots to interact with users | Learns to send racist and sexually-charged messages from its interactions with other users |
Rubikloud has been facing this challenge from the beginning, and we’ve come up with a lot of techniques to deal with these issues. We’ve tackled it both on the modelling side (e.g. Bayesian priors/regularization, ensembles, hierarchical models and pooling data) and by adding a business logic layer on top of our raw model outputs that ensures we always produce “reasonable” results. This added robustness is essential to our predictions because the users of our products depend on these predictions to drive decisions that affect millions of dollars every day. It’s not enough to build a model. To solve real-world problems, you have to understand the problem domain and the limitations of the model (including the biases that are latent in your data) and, frequently, build an additional layer to encapsulate things not captured by the model or data, such as business rules, safety limits, and common sense. That’s why it’s important to have a deep understanding of what the models do, their limitations and where they work best.
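As an illustration of what such a business logic layer might look like, here is a minimal sketch for the demand forecasting example. It is not Rubikloud's actual implementation: the `OrderLimits` fields, their default values, and the fallback rule are all assumptions made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderLimits:
    """Hypothetical business rules for one product."""
    min_units: int = 0
    max_units: int = 10_000
    max_growth: float = 3.0  # never order more than 3x last period's sales


def apply_business_rules(raw_forecast: float, last_period_units: int,
                         limits: OrderLimits) -> int:
    """Map a raw model output to an order quantity that stays within limits."""
    # If the model produced NaN or infinity, fall back to last period's sales.
    if not math.isfinite(raw_forecast):
        raw_forecast = float(last_period_units)
    # Cap growth relative to recent history, then enforce the absolute limits.
    growth_cap = last_period_units * limits.max_growth
    capped = min(raw_forecast, growth_cap)
    bounded = max(limits.min_units, min(capped, limits.max_units))
    return round(bounded)


limits = OrderLimits()
print(apply_business_rules(1e10, 400, limits))          # the "10B lipsticks" case
print(apply_business_rules(-50.0, 400, limits))         # a negative forecast
print(apply_business_rules(float("nan"), 400, limits))  # the model failed
```

A growth cap tied to last period's sales would block any order for a product with no sales history, so a real system would need a separate rule for new products. The point is only that these rules live outside the model and are easy to read, test, and change.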
Building Machine Learning Systems
As with most things, the complexity is not in the individual parts but in the interactions that arise when putting those parts together. This is true whether we’re building planes, computer chips or, especially, ML systems. The difference is that we have many decades of experience with the former two; for ML systems, we’ve had less than a decade. Best practices, frameworks, and war stories are just beginning to emerge on how to build large-scale machine learning systems.
When building production-grade ML systems, there are many more considerations besides just the model. Along with the traditional issues of building large-scale software systems, ML systems require an additional layer of considerations related to data and models. One of the most difficult aspects of building best practices in this area is that the domain is even more important than it is for traditional software. For example, how you build a system to handle clickstream data versus survey data, or transaction data versus IoT data, is going to be dramatically different. And it’s not just the data store: the data processing pipeline, the models, and the technologies are all probably going to be different.
Here is a small subset of the challenges that one faces when building large-scale ML systems:
- SQL or the plethora of NoSQL options
- Cloud vs. on-premise
- Managed service or in-house
Extract, Transform, Load (ETL):
- Build vs. buy?
- Data validation
- Data aggregation
- Complexity of adding new data sources
- Schema on Read vs. Schema on Write
- API service or built-in library?
- How often should you retrain?
- Monitoring quality-of-results (e.g. accuracy, ROC, etc.)?
- Versioning of models/results?
- Who builds your system: data scientist, developer or a mix of both?
- Programming language, libraries and framework used for model prototyping vs. production system (e.g. TensorFlow, Spark MLlib, scikit-learn, Stan)
- Home-grown ML algorithm vs. library?
- Scalability (single machine vs. cluster algorithm)
- Full data set vs. sampling
ML System Architecture:
- Consistency/reproducibility of models
- Idempotency of processing steps
- Granularity of ML pipeline tasks (e.g. feature engineering and model fitting as one step, or as several smaller steps)
- Scheduling jobs
- Error handling (e.g. missing data, cannot fit, etc.)
- How do you test model code?
- Unit tests? Integration tests? Benchmarks?
- Testing model code mechanics vs. quality?
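To illustrate the last distinction in the list, testing the mechanics of model code versus testing its quality, here is a small sketch using a deliberately trivial model. The one-variable least squares fit, the synthetic data, and the "beat the mean baseline" threshold are all assumptions chosen for the example, not a recommended benchmark.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit: all x values are identical")
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def mean_abs_error(ys, preds):
    return mean(abs(y - p) for y, p in zip(ys, preds))


def test_mechanics():
    # Mechanics: exact recovery on noiseless data, clear error on bad input.
    a, b = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
    assert abs(a - 1.0) < 1e-9 and abs(b - 2.0) < 1e-9
    try:
        fit_line([1.0, 1.0], [2.0, 3.0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for degenerate input")


def test_quality():
    # Quality: on held-out data the model must beat a predict-the-mean baseline.
    rng = random.Random(0)  # fixed seed keeps the test reproducible
    xs = [float(i) for i in range(200)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 5.0) for x in xs]
    train_x, test_x = xs[::2], xs[1::2]
    train_y, test_y = ys[::2], ys[1::2]
    a, b = fit_line(train_x, train_y)
    model_err = mean_abs_error(test_y, [a + b * x for x in test_x])
    baseline_err = mean_abs_error(test_y, [mean(train_y)] * len(test_y))
    assert model_err < baseline_err


test_mechanics()
test_quality()
```

Mechanics tests are fast and deterministic, so they fit in an ordinary unit test suite. Quality tests depend on data and thresholds, so they tend to behave more like benchmarks that are tracked over time.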
While there are a handful of examples of very successful large-scale ML systems, such as those deployed by Google or Facebook, the honest fact is that you are not Google. For example, Rubikloud ingests first-party transaction data from each one of our clients, each of which has a different data schema, legacy technology stack and data size. For us, figuring out how to be nimble in on-boarding, validating and mapping data is of critical importance. This is in contrast to, say, an e-commerce site with full control of and access to its primary data source (the e-commerce site itself). Similar differences exist in other situations, such as a skunkworks ML project started within a big bank versus a 10-person startup. There just hasn’t been enough collective experience on how to build these ML systems for us to draw conclusions on best practices for what works and what doesn’t.
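As a rough illustration of the on-boarding problem described above, the sketch below maps rows from two clients with different schemas onto one canonical record and validates them along the way. The client names, column names, and canonical fields are invented for the example and do not describe any real client or Rubikloud's platform.

```python
from datetime import date

# Hypothetical per-client mappings from source column names to canonical names.
CLIENT_MAPPINGS = {
    "client_a": {"txn_dt": "date", "sku_id": "sku", "qty": "units"},
    "client_b": {"SaleDate": "date", "ProductCode": "sku", "Quantity": "units"},
}


def to_canonical(client, row):
    """Rename one raw row to the canonical schema and validate its values."""
    mapping = CLIENT_MAPPINGS[client]
    missing = [src for src in mapping if src not in row]
    if missing:
        raise ValueError(f"{client}: missing columns {missing}")

    record = {dst: row[src] for src, dst in mapping.items()}
    # Parse and validate; bad values raise here, not deep inside a model.
    record["date"] = date.fromisoformat(record["date"])
    record["units"] = int(record["units"])
    if not record["sku"]:
        raise ValueError(f"{client}: empty sku")
    return record


print(to_canonical("client_a", {"txn_dt": "2017-03-01", "sku_id": "LIP-RED", "qty": "3"}))
print(to_canonical("client_b", {"SaleDate": "2017-03-01", "ProductCode": "LIP-RED", "Quantity": 3}))
```

Keeping the mapping as data rather than code means that adding a new client is mostly a configuration change, which is one way to stay nimble when every client looks different.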
At Rubikloud, we’ve been facing these types of challenges from the beginning. Fortunately, we’ve adapted to these challenges by building teams around the core functions of a data science product. For example, we have a data engineering team focused on building and utilizing our retail data platform, Rubicore, which handles all ETL of client data into our system, ensuring a consistent, scalable, validated view of our data. Our data science engineering team builds out the ML framework and pipelines that deal with the plethora of issues such as deploying models in production, feature and model versioning, and integration of the numerous ML libraries that we use. As with any large system, there is no one-size-fits-all solution, and each team is constantly evolving our ML system to meet our ever-changing needs as a company.
Machine Learning: It Ain’t Easy
Building systems is hard; building machine learning systems that give robust predictions is especially hard. Rubikloud has been facing these challenges since its inception, and they can’t be solved with a 5-minute ML tutorial or a 4-month boot camp. They’re solved by real-world deployments of battle-tested systems, gradual evolution and incremental iteration, and bright people who are constantly learning how to do things better. My hope is that in 5 years, as an industry, we’ll be further along in maturity on how to effectively use machine learning to solve real-world business problems. But it ain’t going to be easy.
Rubikloud is a great place to work. We tackle interesting problems from building large-scale machine learning systems to making sure our predictions aren’t racist. If you get excited about applying machine learning to real-world problems, Rubikloud is always looking for ambitious and curious data scientists: rubikloud.com/careers.
1 – While it’s true that we’ve had machine learning, statistics and data mining/analysis software systems in the past, the modern incarnation poses different challenges compared to the proprietary environments and systems that were characteristic of the early days.