When deploying a model to production, there are two important questions to ask:
- Should the model return predictions in real time?
- Could the model be deployed to the cloud?
The first question forces us to choose between real-time and batch inference; the second, between cloud and edge computing.
Real-Time vs. Batch Inference
Real-time inference is a straightforward and intuitive way to work with a model: you give it an input, and it returns you a prediction. This approach is used when prediction is required immediately. For example, a bank might use real-time inference to verify whether a transaction is fraudulent before finalizing it.
Batch inference, on the other hand, is cheaper to run and easier to implement. Inputs that have been previously collected are processed all at once. Batch inference is used for evaluations (when running on static test datasets), ad-hoc campaigns (such as selecting customers for email marketing campaigns), or in situations where immediate predictions aren’t necessary. Batch inference can also be a cost or speed optimization of real-time inference: you precompute predictions in advance and return them when requested.
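To make this concrete, here is a minimal sketch of a batch-inference job in Python, assuming a pre-trained scikit-learn-style model saved with joblib and a CSV of previously collected inputs (both file names and the chunk size are illustrative):

```python
# A minimal batch-inference sketch: score previously collected inputs in
# chunks and store the predictions for later use. The model and file names
# are illustrative; a batch job scheduler could trigger this script nightly.
import joblib
import pandas as pd

model = joblib.load("model.joblib")  # assumed: a pre-trained model whose features match the CSV columns
results = []

# Process the collected inputs chunk by chunk to keep memory usage bounded
for chunk in pd.read_csv("collected_inputs.csv", chunksize=10_000):
    chunk["prediction"] = model.predict(chunk)
    results.append(chunk)

pd.concat(results).to_csv("precomputed_predictions.csv", index=False)
```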
Running real-time inference is much more challenging and costly than batch inference. The model must always be up and return predictions with low latency, which requires an infrastructure and monitoring setup that may be unique even across projects within the same company. Therefore, if getting a prediction immediately is not critical for the business, stick to batch inference and be happy.
However, for many companies, real-time inference does make a difference in terms of accuracy and revenue. This is true for search engines, recommendation systems, and ad click predictions, so investing in real-time inference infrastructure is more than justified.
For more details on real-time vs. batch inference, check out these posts:
– Deploy machine learning models in production environments by Microsoft
– Batch Inference vs Online Inference by Luigi Patruno
Cloud vs. Edge Computing
In cloud computing, data is usually transferred over the internet and processed on a centralized server. On the other hand, in edge computing data is processed on the device where it was generated, with each device handling its own data in a decentralized way. Examples of edge devices are phones, laptops, and cars.
Streaming services like Netflix and YouTube typically run their recommender systems in the cloud: their apps and websites send user data to the servers to get recommendations. Cloud computing is relatively easy to set up, and you can scale computing resources almost indefinitely (or at least until it's economically sensible). However, cloud infrastructure heavily depends on a stable internet connection, and sensitive user data should not be transferred over the internet.
Edge computing was developed to overcome cloud limitations and works where cloud computing cannot. A self-driving engine runs on the car itself, so it can still react quickly without a stable internet connection. Smartphone authentication systems (like iPhone's FaceID) run on the phone because transferring sensitive user data over the internet is not a good idea, and users need to unlock their phones even without an internet connection. However, for edge computing to be viable, the edge device needs to be sufficiently powerful, or alternatively, the model must be lightweight and fast. This gave rise to model compression methods, such as low-rank approximation, knowledge distillation, pruning, and quantization. If you want to learn more about model compression, here is a great place to start: Awesome ML Model Compression.
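As a taste of one of these techniques, here is a hedged sketch of post-training dynamic quantization with PyTorch; the toy model is an assumption, and the other methods (pruning, distillation, low-rank approximation) follow different recipes:

```python
# A sketch of post-training dynamic quantization with PyTorch.
# The tiny model below is a stand-in for a real network.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Quantize the Linear layers' weights to int8 to shrink the model and speed
# up CPU inference on resource-constrained edge devices
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized_model)
```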
For a deeper dive into Edge and Cloud Computing, read these posts:
– What’s the Difference Between Edge Computing and Cloud Computing? by NVIDIA
– Edge Computing vs Cloud Computing: Major Differences by Mounika Narang
Easy Deployment & Demo
“Production is a spectrum. For some teams, production means generating nice plots from notebook results to show to the business team. For other teams, production means keeping your models up and running for millions of users per day.” Chip Huyen, Why data scientists shouldn’t need to know Kubernetes
Deploying models to serve millions of users is a task for a large team, so as a Data Scientist / ML Engineer, you won't be left alone.
However, sometimes you do need to deploy alone. Maybe you are working on a pet or study project and would like to create a demo. Maybe you are the first Data Scientist / ML Engineer in the company and you need to bring some business value before the company decides to scale the Data Science team. Maybe all your colleagues are so busy with their own tasks that you are asking yourself whether it's easier to deploy the model yourself rather than wait for support. You are not the first and definitely not the last to face these challenges, and there are solutions to help you.
To deploy a model, you need a server (instance) where the model will be running, an API to communicate with the model (send inputs, get predictions), and (optionally) a user interface to accept input from users and show them predictions.
Google Colab is Jupyter Notebook on steroids. It is a great tool to create demos that you can share. It does not require any specific installation from users, it offers free servers with GPU to run the code, and you can easily customize it to accept any inputs from users (text files, images, videos). It is very popular among students and ML researchers (here is how DeepMind researchers use it). If you are interested in learning more about Google Colab, start here.
FastAPI is a framework for building APIs in Python. You may have heard of Flask; FastAPI is similar, but simpler to code, more specialized towards APIs, and faster. For more details, check out the official documentation. For practical examples, read APIs for Model Serving by Goku Mohandas.
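To give a feel for it, here is a minimal sketch of a prediction endpoint in FastAPI; the model file, the flat list-of-floats input schema, and the endpoint path are illustrative assumptions, not a prescribed setup:

```python
# A minimal prediction endpoint sketch with FastAPI.
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # assumed: a pre-trained scikit-learn model


class PredictionRequest(BaseModel):
    features: List[float]


@app.post("/predict")
def predict(request: PredictionRequest):
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}
```

If the file is named main.py, you can serve it with `uvicorn main:app` and send a POST request with a JSON body to /predict.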
Streamlit is an easy tool to create web applications. It is easy, I really mean it. And applications turn out to be nice and interactive — with images, plots, input windows, buttons, sliders,… Streamlit offers Community Cloud where you can publish apps for free. To get started, refer to the official tutorial.
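For illustration, a tiny Streamlit sketch might look like this; the predict function is a placeholder for your own model call:

```python
# A tiny Streamlit sketch: one text input and a button that shows a "prediction".
import streamlit as st


def predict(text: str) -> str:
    return text.upper()  # placeholder for a real model


user_input = st.text_input("Enter some text")
if st.button("Predict"):
    st.write("Prediction:", predict(user_input))
```

Save it as app.py and run `streamlit run app.py` to get an interactive web page.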
Cloud Platforms. Google and Amazon do a great job making the deployment process painless and accessible. They offer paid end-to-end solutions to train and deploy models (storage, compute instances, APIs, monitoring tools, workflows,…). These solutions are easy to start with and also offer wide functionality to support specific needs, so many companies build their production infrastructure with cloud providers.
If you would like to learn more, here are the resources to review:
– Deploy your side-projects at scale for basically nothing by Alex Olivier
– Deploy models for inference by Amazon
– Deploy a model to an endpoint by Google
Monitoring
Like all software systems in production, ML systems must be monitored. Monitoring helps quickly detect and localize bugs and prevent catastrophic system failures.
Technically, monitoring means collecting logs, calculating metrics from them, displaying these metrics on dashboards like Grafana, and setting up alerts for when metrics fall outside expected ranges.
What metrics should be monitored? Since an ML system is a subclass of a software system, start with operational metrics. Examples are CPU/GPU utilization of the machine, its memory and disk space; the number of requests sent to the application, response latency, and error rate; network connectivity. For a deeper dive into monitoring operational metrics, check out the post An Introduction to Metrics, Monitoring, and Alerting by Justin Ellingwood.
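As a small illustration, here is a sketch of exporting two operational metrics (request count and latency) with the prometheus_client library; the library choice, metric names, and port are assumptions, since any metrics stack would do:

```python
# A sketch of exposing operational metrics with prometheus_client.
# The metrics can then be scraped and displayed on a dashboard such as Grafana.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter("app_requests_total", "Total number of requests")
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")


@REQUEST_LATENCY.time()
def handle_request() -> None:
    REQUEST_COUNT.inc()
    time.sleep(random.random() / 10)  # stand-in for real work


if __name__ == "__main__":
    start_http_server(8000)  # metrics are exposed at :8000/metrics for scraping
    while True:
        handle_request()
```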
While operational metrics are about machine, network, and application health, ML-related metrics check model accuracy and input consistency.
Accuracy is the most important thing we care about. The model might still return predictions, but those predictions could be entirely off-base, and you won't realize it until the model is evaluated. If you're fortunate enough to work in a domain where natural labels become available quickly (as in recommender systems), simply collect these labels as they come in and evaluate the model continuously. However, in many domains, labels might either take a long time to arrive or not come in at all. In such cases, it's beneficial to monitor something that could indirectly indicate a potential drop in accuracy.
Why could model accuracy drop at all? The most widespread reason is that production data has drifted from the training/test data. In the Computer Vision domain, you can visually see that the data has drifted: images became darker or lighter, the resolution changed, or there are now more indoor images than outdoor ones.
To automatically detect data drift (it is also called “data distribution shift”), continuously monitor model inputs and outputs. The inputs to the model should be consistent with those used during training; for tabular data, this means that column names as well as the mean and variance of the features must be the same. Monitoring the distribution of model predictions is also valuable. In classification tasks, for example, you can track the proportion of predictions for each class. If there’s a notable change — like if a model that previously categorized 5% of instances as Class A now categorizes 20% as such — it’s a sign that something definitely happened. To learn more about data drift, check out this great post by Chip Huyen: Data Distribution Shifts and Monitoring.
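As a rough illustration of such a check, here is a sketch that flags drift in a single numeric feature using a two-sample Kolmogorov-Smirnov test from scipy; the feature name, sample sizes, and the 0.05 threshold are illustrative assumptions:

```python
# A minimal drift-check sketch: compare a production feature sample to a
# training reference with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp


def detect_drift(reference: np.ndarray, production: np.ndarray, alpha: float = 0.05) -> bool:
    """Return True if the production distribution differs significantly from the reference."""
    statistic, p_value = ks_2samp(reference, production)
    return p_value < alpha


# Usage: compare an "age" feature from training data to recent production inputs
reference_age = np.random.normal(35, 10, size=5_000)   # stand-in for training data
production_age = np.random.normal(42, 12, size=1_000)  # stand-in for recent inputs
print("Drift detected:", detect_drift(reference_age, production_age))
```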
There is much more left to say about monitoring, but we must move on. You can check these posts if you feel like you need more information:
– Monitoring Machine Learning Systems by Goku Mohandas
– A Comprehensive Guide on How to Monitor Your Models in Production by Stephen Oladele
Model Updates
If you deploy the model to production and do nothing to it, its accuracy diminishes over time. In most cases, this is explained by data distribution shifts. The input data may change format. User behavior keeps changing without any apparent reason. Epidemics, crises, and wars may suddenly happen and break all the rules and assumptions that worked previously. “Change is the only constant.” - Heraclitus.
That is why production models must be regularly updated. There are two types of updates: model updates and data updates. During a model update, the algorithm or training strategy is changed. Model updates do not need to happen regularly; they are usually done ad hoc, when the business task changes, a bug is found, or the team has time for research. In contrast, a data update is when the same algorithm is trained on newer data. Regular data updates are a must for any ML system.
A prerequisite for regular data updates is setting up an infrastructure that can support automatic dataflows, model training, evaluation, and deployment.
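To make this concrete, here is a high-level sketch of such a pipeline; every function is a trivial stand-in for real data, training, evaluation, and deployment code, and the 0.90 accuracy threshold is an illustrative assumption:

```python
# A high-level sketch of an automated data-update pipeline.
def load_latest_data():
    return [([1.0, 2.0], 1), ([3.0, 4.0], 0)]  # stand-in for an automated data pull


def train(data):
    return lambda features: 1  # stand-in for training the same algorithm on newer data


def evaluate(model, data) -> float:
    return sum(model(x) == y for x, y in data) / len(data)


def deploy(model) -> None:
    print("New model deployed")  # stand-in for the deployment step


def data_update_pipeline(accuracy_threshold: float = 0.90) -> None:
    data = load_latest_data()
    model = train(data)
    if evaluate(model, data) >= accuracy_threshold:
        deploy(model)
    else:
        print("Retrained model is below the threshold; keeping the current one")


data_update_pipeline()
```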
It’s crucial to highlight that data updates should occur with little to no manual intervention. Manual efforts should be primarily reserved for data annotation (while ensuring that data flow to and from annotation teams is fully automated), maybe making final deployment decisions, and addressing any bugs that may surface during the training and deployment phases.
Once the infrastructure is set up, the frequency of updates is merely a value in the config file. How often should the model be updated with newer data? The answer is: as frequently as is feasible and economically sensible. If increasing the frequency of updates brings more value than it costs, definitely go for the increase. However, in some scenarios, training every hour might not be feasible, even if it would be highly profitable. For instance, if a model depends on human annotations, this process can become a bottleneck.
Training from scratch or fine-tuning on new data only? It's not a binary decision but rather a blend of both. Frequently fine-tuning the model is sensible since it's more cost-effective and quicker than training from scratch. However, occasionally, training from scratch is also necessary. It's crucial to understand that fine-tuning is primarily an optimization of cost and time. Typically, companies start with the straightforward approach of training from scratch, gradually incorporating fine-tuning as the project expands and evolves.
To find out more about model updates, check out this post:
– To retrain, or not to retrain? Let’s get analytical about ML model updates by Emeli Dral et al.
Testing in Production
Before the model is deployed to production, it must be thoroughly evaluated. We have already discussed pre-production (offline) evaluation in the previous post (see the section “Model Evaluation”). However, you never know how the model will perform in production until you deploy it. This gave rise to testing in production, which is also referred to as online evaluation.
Testing in production doesn’t mean recklessly swapping out your reliable old model for a newly trained one and then anxiously awaiting the first predictions, ready to roll back at the slightest hiccup. Never do that. There are smarter and safer strategies to test your model in production without risking losing money or customers.
A/B testing is the most popular approach in the industry. With this method, traffic is randomly divided between the existing and the new model in some proportion. Both models make predictions for real users; the predictions are saved and later carefully inspected. It is useful to compare not only model accuracies but also business-related metrics, like conversion or revenue, which sometimes may be negatively correlated with accuracy.
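As a rough illustration, here is a sketch of deterministic traffic assignment by hashing the user ID; the 10% share for the new model is an arbitrary assumption:

```python
# A sketch of traffic assignment for an A/B test: each user is routed to the
# existing or the new model based on a hash of their ID, so the same user
# always lands in the same group.
import hashlib


def assign_model(user_id: str, new_model_share: float = 0.10) -> str:
    """Route a user to the 'new' or 'existing' model, consistently across requests."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "new" if bucket < new_model_share * 100 else "existing"


print(assign_model("user_42"))
```

Hashing the user ID (rather than flipping a coin per request) keeps each user consistently in one group, which is what the statistical analysis assumes.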
A/B testing highly relies on statistical hypothesis testing. If you want to learn more about it, here is the post for you: A/B Testing: A Complete Guide to Statistical Testing by Francesco Casalegno. For engineering implementation of the A/B tests, check out Online AB test pattern.
Shadow deployment is the safest way to test the model. The idea is to send all the traffic to the existing model and return its predictions to the end user in the usual way, while at the same time also sending all the traffic to a new (shadow) model. The shadow model's predictions are not used anywhere; they are only stored for future analysis.
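A sketch of what a shadow-deployment request handler might look like (both “models” below are trivial stand-ins):

```python
# A sketch of shadow deployment: the user always receives the existing model's
# prediction, while the shadow model's prediction is only logged for analysis.
import logging

logging.basicConfig(level=logging.INFO)


def existing_model(x: float) -> float:
    return x * 2  # stand-in for the production model


def shadow_model(x: float) -> float:
    return x * 2 + 0.1  # stand-in for the new model


def handle_request(features: float) -> float:
    prediction = existing_model(features)        # served to the user
    shadow_prediction = shadow_model(features)   # stored, never served
    logging.info("served=%s shadow=%s", prediction, shadow_prediction)
    return prediction


print(handle_request(3.0))
```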
Canary release. You may think of it as “dynamic” A/B testing. A new model is deployed in parallel with the existing one. At the beginning, only a small share of traffic is sent to the new model, for instance 1%; the other 99% is still served by the existing model. If the new model's performance is good enough, its share of traffic is gradually increased and evaluated again, then increased and evaluated again, until all traffic is served by the new model. If at some stage the new model does not perform well, it is removed from production and all traffic is directed back to the existing model.
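Here is a sketch of a canary rollout loop; the traffic-share schedule and the quality check are illustrative assumptions:

```python
# A sketch of a canary rollout: the new model's traffic share grows in stages,
# and the rollout is aborted if quality drops at any stage.
def new_model_performs_well(share: float) -> bool:
    """Stand-in for a real evaluation of the new model at this traffic share."""
    return True


def canary_rollout(stages=(0.01, 0.05, 0.25, 0.50, 1.00)) -> None:
    for share in stages:
        print(f"Routing {share:.0%} of traffic to the new model")
        if not new_model_performs_well(share):
            print("Rolling back: all traffic returns to the existing model")
            return
    print("Rollout complete: the new model serves all traffic")


canary_rollout()
```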
Here is the post that explains it a bit more:
– Shadow Deployment Vs. Canary Release of ML Models by Bartosz Mikulski.
In this chapter, we learned about a whole new set of challenges that arise once the model is deployed to production. The operational and ML-related metrics of the model must be continuously monitored to quickly detect and fix bugs if they arise. The model must be regularly retrained on newer data because its accuracy diminishes over time, primarily due to data distribution shifts. We discussed the high-level decisions to make before deploying the model (real-time vs. batch inference and cloud vs. edge computing), each with its own advantages and limitations. We covered tools for easy deployment and demos for the infrequent cases when you must do it alone. We learned that the model must be evaluated in production in addition to offline evaluations on static datasets. You never know how the model will work in production until you actually release it; this problem gave rise to “safe” and controlled production tests: A/B tests, shadow deployments, and canary releases.
This was also the final chapter of the “Building Better ML Systems” series. If you have stayed with me from the beginning, you know now that an ML system is much more than just a fancy algorithm. I really hope this series was helpful, expanded your horizons, and taught you how to build better ML systems.
Thank you for reading!