Published in · 8 min read · Mar 26
Machine Learning Operations (MLOps) is the practice of applying DevOps principles to machine learning workflows, with the goal of streamlining the deployment and management of machine learning models in production environments. MLOps involves a combination of automation, collaboration, and monitoring to ensure that machine learning models are scalable, reliable, and maintainable.
Azure Machine Learning service is a cloud-based platform that helps organizations implement MLOps best practices with engineering rigor. It brings together people, process, and platform to automate ML-infused software delivery and continuously add value to users. It provides a range of tools and services for data scientists, machine learning engineers, and IT professionals to collaborate on building, training, deploying, and managing machine learning models at scale.
Azure Machine Learning service supports a variety of popular machine learning frameworks and programming languages, and provides tools for automating model training, hyperparameter tuning, and model evaluation. It also integrates with DevOps tools like Azure DevOps, GitHub, and Jenkins, enabling teams to implement continuous integration and continuous deployment (CI/CD) pipelines for machine learning models.
With Azure Machine Learning service, organizations can apply engineering rigor to their machine learning workflows, ensuring that models are built and deployed with the same level of quality and reliability as traditional software applications. By implementing MLOps best practices, organizations can accelerate the time-to-value of their machine learning investments and maximize the impact of their data science initiatives.
In this blog we will learn about training ML models, keeping in mind the people involved (data engineers, data scientists, machine learning engineers), and through a simple example explore the power of AML and its capabilities for building and delivering scalable ML solutions.
This blog focuses on ML training. GitHub reference: [https://github.com/keshavksingh/aml-training]
AML supports several kinds of compute for training ML models: compute instances, compute clusters, and attached computes in the form of Synapse Spark pools and Kubernetes clusters.
We begin by provisioning two Azure Machine Learning workspaces (amlws and amlws2) and an instance of AML Registry (amlregistry).
Azure Machine Learning Registry is a service (currently in Preview) provided by Microsoft Azure that allows users to create, manage, and deploy machine learning models at scale. It is a centralized repository for storing and versioning machine learning models, pipelines, and other artifacts related to the machine learning workflow.
The Azure Machine Learning Registry enables data scientists and machine learning engineers to collaborate and share models with their team members securely. It also provides features such as version control, auditing, and access control to manage the lifecycle of the models. With the registry, users can easily discover and reuse pre-built models, which can help accelerate the development and deployment of machine learning applications.
Furthermore, the Azure Machine Learning Registry integrates with other Azure services such as Azure Kubernetes Service (AKS) and Azure Container Instances (ACI), making it easy to deploy and manage models in production environments.
amlws acts as our pre-production workspace. It hosts all four types of compute, the registry, tracking, workflows (pipelines), a unique MLflow URI (azureml://eastus.api.azureml.ms/mlflow/v1.0/subscriptions/&lt;subscriptionId&gt;/resourceGroups/mlops/providers/Microsoft.MachineLearningServices/workspaces/amlws), and model artifacts. The operations team maintains this workspace and provisions the compute required for data science experimentation. The data engineering community uses it to build data-preparation pipelines, and the data science community uses it for model training. The machine learning engineering team then promotes the outcomes of this workspace to the AML registry and eventually hosts the artifacts for serving. To reiterate, this instance also hosts the central MLflow tracking server for all model artifacts: all local or federated trainings are captured on this MLflow instance for central calibration, evaluation, and consideration for promotion to production.
amlws2 is an example of a federated training instance: it is merely a cloud replacement for the local training environment, used by data scientists to train ML solutions while pointing to amlws for central tracking.
Illustration 1: A DS develops a training script on amlws2 yet points to amlws for compute, model tracking, and training.
The training is conducted as a pipeline step that submits Train.py, a script that trains an IRIS model, using sparkcompute (a Synapse Spark pool) attached to the AMLWS workspace. The MLflow tracking server is also that of AMLWS.
For validation, we confirm that no pipeline has been created on AMLWS2.
AMLWS Workspace — Demonstrates a Pipeline Run for Experiment AMLWS2-IRIS-V1
From the run artifacts we are now positioned to register the model on the AMLWS workspace.
In this illustration we have observed how we designated one AML instance as the hub and used a spoke AML instance to train while tracking centrally.
Illustration 2: Training locally, leveraging local compute and central tracking
In this example we have a data scientist training a model on a local machine (on local compute) while leveraging the central AMLWS MLflow server to record the experimentation details.
Below we observe the trained experiment has been recorded on the workspace AMLWS.
The workspace has recorded the Model generated out of the training.
We illustrated a local training and central tracking scenario.
Illustration 3: Azure Kubernetes Service as compute for training, with tracking on MLflow on the workspace.
This is an interesting one: we leverage an attached AKS cluster to train our model on AKS. AML does an excellent job of out-of-the-box containerization, deploying all code with the needed dependencies for training on the attached AKS. Refer to the linked documentation to learn the steps to attach AKS as compute to an AML workspace instance.
We have trained the model on Azure Kubernetes Service, which provides a scalable service for training demanding models.
Illustration 4: Training on VM cluster compute, with tracking on MLflow
All Models Trained So Far:
AML Registry: Let's register some models to the AML Registry
In this section we evaluate the training metrics for all the experiments and, based on our evaluation, register the qualified models into the AML Registry for inferencing.
Let's suppose we are satisfied with the AKS and VM cluster training outcomes; we now register them into the AML registry.
Pipeline Scheduling As Workflow Management
AML also enables efficient workflow management. For an identified pipeline requiring recurrent training, re-training runs can be scheduled very efficiently through the Python SDK v2, the UI, and the Azure CLI.
AML pipeline is a sequence of connected steps or stages that together perform a machine learning workflow. A pipeline can consist of multiple steps such as data preparation, model training, evaluation, and deployment. Each step in a pipeline can be a job or a set of jobs. Pipelines can be created and managed through the Azure Machine Learning SDK or the Azure portal.
An AML job is a unit of work that runs a specified script or command on a specified compute target. Jobs can be used for various tasks such as training a machine learning model, running batch inferencing on a large dataset, or tuning hyperparameters of a model. Jobs can be submitted through the Azure Machine Learning SDK or through the Azure portal.
Complex Pipeline Demonstration:
Lastly, we will create a complex pipeline with a slightly optimized workflow, using three computes for three respective steps and running Step 1, Step 2, and Step 3 all in parallel to demonstrate this on AML. AML pipelines provide a robust capability to create the desired workflows.
Similarly, we could also set up dependencies between the pipeline steps so that they run sequentially.
Overall, in this blog we have learnt about Azure Machine Learning and its MLOps capabilities. This blog unpacked several facets of the service through the training lens. It opens the door for you to explore much more, such as the feature store, environments, and DevOps for CI/CD releases. Hope you have drawn value from this article and that it gets you one step closer to provisioning your first AML workspace for your organization's ML-infused software development and solutions!
As a pre-read to the deployment and inferencing blog, I recommend one of the best-articulated vlogs on this topic at https://youtu.be/WZ7vS10KPAw from our very own Nishant Thacker. He narrates a deeply knowledgeable summary of AML deployment and inferencing; hope you enjoy it!