Scaling the use of AI/ML by building Continuous Integration (CI) / Continuous Delivery (CD) / Continuous Training (CT) pipelines for ML based applications
Background
In my previous article:
MLOps in Practice — De-constructing an ML Solution Architecture into 10 components
I talked about the importance of building CI/CT/CD solutions to automate the ML pipelines. The aim of MLOps automation is to continuously test and integrate code changes, continuously train new models with new data, upgrade model performance when required, and continuously serve and deploy models to a production environment in a safe, agile and automated way.
In today’s article, we are going to dive into the topic of MLOps automation. Specifically, we are going to cover the following components:
- Why is MLOps automation necessary?
- A high-level introduction to DevOps and CI/CD and its relevance to ML
- What’s special about MLOps compared to DevOps?
- A sample CI/CD architecture for ML based systems
Now let’s get started by understanding why MLOps automation is necessary.
Why is MLOps automation necessary?
Building ML systems is an extremely iterative process. These iterations can be seen in the following two aspects:
- Experiment-driven: Firstly, training ML models is very experiment-driven. To get satisfactory performance from a specific ML model, data scientists generally need to conduct hundreds (or even thousands) of experiment runs across combinations of different feature engineering techniques, model architecture definitions, and hyperparameters. Manually experimenting with all these potential combinations and selecting the best-performing one can be very tedious and time-consuming, particularly when one data science team needs to manage tens (or even hundreds) of ML applications simultaneously.
- Data-dependent: Secondly, unlike traditional software, ML based applications are extremely data-dependent. They potentially require frequent updates depending on how quickly the underlying data changes. When frequent updates become necessary, manually re-training an ML model and deploying the newer version into the production environment can be very time-consuming. More importantly, manual updates can significantly slow down the application release cycle, which can lead to missed market opportunities or unmet regulatory requirements.
Without MLOps automation, it is very difficult for any organization to scale the use of AI/ML without constantly recruiting additional data science or ML engineering resources for each new ML use case / application.
Another point worth mentioning, before we get into the detail, is that implementing ML in a production environment doesn’t only mean deploying your model as an API for prediction. Rather, it means deploying an ML pipeline that can automate the retraining and deployment of new models. Setting up a CI/CD system enables you to automatically test and deploy new pipeline implementations.
What are DevOps and CI/CD?
According to Wikipedia,
DevOps is a methodology in the software development and IT industry. Used as a set of practices and tools, DevOps integrates and automates the work of software development (Dev) and IT operations (Ops) as a means for improving and shortening the systems development life cycle.
The key essence of DevOps is “automation”, to speed up the software development and release cycle. As was explained in the first part of this article, ML based applications need fast updates and releases, because the performance of ML models depends on constantly evolving data profiles.
Hence, although quite different from traditional software, ML based applications are still fundamentally a piece of software. Therefore, when data scientists and ML engineers deploy ML based applications, there are still huge benefits in adopting the DevOps principles and methodologies that traditional software development has leveraged for years.
As many data scientists do not come from a software engineering or computer science background, the concepts of DevOps and CI/CD may be unfamiliar to them. Therefore, before we get into the detail of designing CI/CD pipelines for ML based applications, let’s first level-set and quickly go through what a CI/CD pipeline encompasses.
To my fellow data scientists: please do not be scared away by DevOps and CI/CD. It is not that difficult, particularly when you do not aim to become a DevOps expert. I will try to explain them in relatively simple layman’s terms. In fact, understanding how it works at a high level can bring many benefits. For example, you can get a feel for what your peers / counterparts like ML engineers and DevOps engineers are working on, which can make communication with them much smoother. If you wish, you can also expand your skills from developing ML models to building CI/CD pipelines, gradually becoming a full-stack data scientist.
You can think of a CI/CD pipeline as an automated workflow comprising a series of steps that need to be executed before the software is reliably and securely deployed to the production environment. CI/CD pipelines are the backbone of the DevOps methodology. A CI/CD pipeline is generally defined in a yaml file. As suggested by the name, a CI/CD pipeline generally comprises two parts:
- Continuous Integration (CI) — The tasks in a CI pipeline are focused on testing and validating the source code and building the code into executable artifacts and usable applications. The testing in the CI pipeline is generally divided into two categories: unit testing and integration testing. Unit testing makes sure each function (def function()) works as expected, and integration testing focuses on verifying that the different modules or services used by your software work well together. Integration testing for ML based systems is around making sure every step (ingestion, splitting, transforming, training, evaluation and prediction) of the ML pipeline works well with the others.
- Continuous Delivery (CD) — The tasks in CD are generally triggered after a successful CI run, with CD focusing on automating the infrastructure provisioning and application release process. Nowadays Terraform is widely used for infrastructure provisioning, and it is recommended to test the Terraform scripts as well before deploying. Beyond infrastructure provisioning, at the CD stage the pre-built artifacts are deployed into a staging / testing environment, which is like a cloned version of the production environment. In this staging environment, rigorous test scripts are executed to identify bugs that slipped through the initial pre-build testing process.
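Tying the two parts together, below is a minimal sketch of what such a yaml pipeline definition might look like, written in GitHub Actions syntax. The job names, test paths, image tag and deployment script are illustrative assumptions, not prescriptions:

```yaml
name: ml-ci-cd
on:
  push:
    branches: [main]

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Unit tests
        run: pytest tests/unit
      - name: Integration tests
        run: pytest tests/integration
      - name: Build artifact
        run: docker build -t my-ml-app:${{ github.sha }} .

  cd:
    needs: ci          # CD runs only after a successful CI job
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Provision infrastructure
        run: terraform apply -auto-approve
      - name: Deploy to staging
        run: ./scripts/deploy.sh staging
      - name: Staging tests
        run: pytest tests/staging
```

The `needs: ci` line is what encodes the “CD is triggered after a successful CI run” relationship described above.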
All the above tasks in a CI/CD pipeline can be executed continuously and automatically. Therefore, the true value of a CI/CD pipeline lies in automation. Finally, the composition of a CI/CD pipeline is a design decision. There is no single right answer: you can decide which tasks to include depending on the needs of your own applications.
What’s special about MLOps compared to DevOps?
We will talk about how MLOps differs from DevOps separately for CI and CD. Let’s start with CI.
CI for ML based systems
For the CI part, one of the key differences between MLOps and DevOps relates to the scope of testing. As was explained in the previous part, the testing involved in a CI pipeline for traditional software is mainly unit testing and integration testing, which focus on the quality of the code. Due to the uniqueness of ML based systems, the testing should be expanded to include data testing and model testing. Data testing includes data quality checks, unit testing of the feature engineering logic, validity of the train/test data split, and verifying data distributions. Model testing includes model performance validation against the test dataset, model output schema checks, and testing the performance of model prediction services in terms of throughput and latency.
CD for ML based systems
The CD part of traditional software is the delivery of a piece of software or a service, while the CD part of an ML based system is about deploying a multi-step pipeline that automatically retrains and deploys models, which adds an extra layer of complexity. Therefore, there are two types of CD pipelines for ML based systems. One is continuous delivery for ML training pipelines, also called Continuous Training (CT) pipelines, and the other is continuous delivery for the deployment of model prediction services. These two types of CD pipelines also work closely with each other. Generally, a new run of the ML training pipeline will trigger a new deployment of the ML model prediction service.
Continuous Training (CT) pipelines are generally triggered by two types of changes.
- The first type of change is code related. For example, data scientists make changes to the model training source code, including changes to feature engineering logic, model architecture, hyperparameters, or configuration/variables. Code changes will always trigger a new CI/CD pipeline run and a new execution of the ML training pipeline to generate a newer version of the model.
- The second type of change is data related, meaning model retraining is required due to underlying data drift. If there is no code change and only new data is provided, model retraining can be executed by triggering the ML training pipeline in the production environment to create a newer version of the model.
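The two triggers above can be sketched as a simple decision function. The drift score below (total absolute difference between two normalized frequency distributions) and the 0.2 threshold are example assumptions; real systems often use metrics such as PSI or KL divergence with tuned thresholds:

```python
# Sketch of a retraining trigger: code changes always retrain,
# otherwise retrain only when data drift exceeds a threshold.

def distribution_drift(baseline, current):
    """Crude drift score: total absolute difference between two
    normalized frequency distributions over the same bins."""
    keys = set(baseline) | set(current)
    b_total = sum(baseline.values()) or 1
    c_total = sum(current.values()) or 1
    return sum(
        abs(baseline.get(k, 0) / b_total - current.get(k, 0) / c_total)
        for k in keys
    )

def should_retrain(code_changed, baseline, current, threshold=0.2):
    """Decide whether to kick off a new CT pipeline run."""
    if code_changed:
        return True
    return distribution_drift(baseline, current) > threshold
```

A scheduler or monitoring job would call `should_retrain` periodically and, when it returns `True`, trigger the ML training pipeline in the production environment.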
Regardless of the trigger, a newer version of the ML model is generated. If the newer version has better performance, it will be registered in the model registry store, which will trigger the ML deployment service to generate a new model prediction service.
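This “register only if better” gate can be sketched as follows. The registry here is just a dict for illustration; in a real system this would be a call to a model registry service such as MLflow or a cloud-provider equivalent, which is an assumption on my part:

```python
# Sketch of the promotion gate in front of the model registry.

def promote_if_better(registry, name, candidate_metric, model):
    """Register the candidate model only when it beats the
    currently registered version on the evaluation metric."""
    current = registry.get(name)
    if current is None or candidate_metric > current["metric"]:
        registry[name] = {
            "metric": candidate_metric,
            "model": model,
            "version": (current["version"] + 1) if current else 1,
        }
        return True   # registration triggers a new prediction-service deploy
    return False      # candidate discarded; production model unchanged
```

The boolean return value is the hook: a `True` result is what would kick off the CD pipeline for the model prediction service.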
CD for the deployment of the model prediction service follows a new run of the ML training pipeline. In general, there are two types of ML deployment. One is online (real-time) model deployment, where ML models are normally packaged either as REST API endpoints or as self-contained Docker images exposing REST API endpoints. The other is offline (batch) model deployment, where ML models are used to directly score files. With offline model deployment, the trained models are called and fed with a batch of data at a certain interval (such as once per day or once per week, depending on how the models are used in the business context) to periodically generate predictions for use. If you are keen on learning more about ML model deployment patterns, you can refer to my previous article:
MLOps in Practice — Machine Learning (ML) model deployment patterns
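As a small illustration of the offline (batch) pattern described above, the sketch below scores a file of records and writes predictions out for downstream use. The JSON-lines format, file paths and scoring function are hypothetical choices for the example:

```python
# Sketch of offline (batch) scoring: read a batch of records,
# score each one, and write the predictions to an output file.
import json

def batch_score(model, input_path, output_path):
    """Score a JSON-lines file record by record."""
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            record["prediction"] = model(record)
            fout.write(json.dumps(record) + "\n")
```

In production, a job like this would be run on a schedule (e.g. once per day) by an orchestrator rather than invoked by hand.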
In today’s article, we are focusing on the CD pipeline design for the deployment of the model prediction service. Regardless of the type of model deployment for your ML based applications, you should include the following tasks when you design and build the CD pipelines for your ML model deployment:
- Infrastructure provisioning, along with testing of the infrastructure provisioning scripts
- Input data validation to make sure the data sent to the model for prediction meets the required input schema
- Model performance tests by segment to make sure the model still performs as expected
- Throughput and latency tests of the model prediction service. This is particularly relevant for online model services
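Two of the tasks above, input validation and latency testing, can be sketched as follows. The input schema and the 100 ms latency budget are example assumptions rather than requirements of any particular tool:

```python
# Sketches of two CD-stage checks: input schema validation
# and a latency test for the model prediction service.
import time

INPUT_SCHEMA = {"age": int, "income": float}  # illustrative schema

def validate_input(record):
    """Reject prediction requests that do not match the input schema."""
    for field, ftype in INPUT_SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], ftype):
            raise TypeError(f"{field} must be {ftype.__name__}")
    return True

def latency_test(predict, sample, budget_ms=100.0, n=50):
    """Fail if the average prediction latency exceeds the budget."""
    start = time.perf_counter()
    for _ in range(n):
        predict(sample)
    avg_ms = (time.perf_counter() - start) * 1000 / n
    assert avg_ms <= budget_ms, f"avg latency {avg_ms:.1f} ms over budget"
    return avg_ms
```

Throughput testing follows the same pattern, measuring requests per second under concurrent load instead of single-call latency.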
A sample CI/CD/CT architecture for ML based systems
Below is a sample CI/CD/CT reference architecture for ML based systems.
This reference architecture includes the potential tasks respectively for CI, CT and CD.
- Continuous Integration — unit test, integration test, data test and model test
- Continuous Training — data extraction, data validation, feature engineering, model training, model evaluation and model validation
- Continuous Delivery — infrastructure provisioning, pre-production deployment, model throughput testing and model latency testing
There are a few popular CI/CD tools that can be leveraged to implement the above reference architecture, such as GitHub Actions, Azure DevOps Pipelines and Jenkins pipelines. Please leave a comment to let me know if you would like to see an article on the practical implementation of CI/CD/CT pipelines. I am more than happy to write one.
The end
Please feel free to let me know if you have any comments or questions on this topic or other MLOps related topics! I generally publish one article related to data and AI every week. Please feel free to follow me on Medium so that you can get notified when these articles are published.
If you want to see more guides, deep dives, and insights around modern and efficient data+AI stack, please subscribe to my free newsletter — Efficient Data+AI Stack, thanks!
Note: Just in case you haven’t become a Medium member yet (and you really should, as you’ll get unlimited access to Medium), you can sign up using my referral link!
Thanks so much for your support!