- You could use automated ML to find the best model
When a workspace is created, a few supporting resources are created with it in the same resource group.
- Storage account:
- files and notebooks
- metadata of jobs and models
- Key Vault:
- store credentials used by the workspace
- Application Insights:
- monitor predictive services
- Container Registry:
- created when needed
- to store images for ML environments
- Workspace: stores all logs, metrics, outputs, models and snapshots of your code
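As a rough sketch (the resource group, name, and location are examples, not from these notes), creating a workspace with the CLI provisions the supporting resources automatically:
# the storage account, Key Vault and Application Insights are created in the same resource group
az group create --name rg-ml-test --location westeurope
az ml workspace create --name mlw-ml-test --resource-group rg-ml-test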
- Compute
- Types:
- Compute instance: ideal as a dev environment to run Jupyter notebooks
- Compute cluster: CPU or GPU compute nodes, ideal for production workloads, automatically scale
- Serverless: // TODO
- Inference cluster: AKS cluster, ideal to deploy models in production
- Attached compute: attach other Azure compute resources, like Azure Databricks or Synapse Spark pools
- Compute is the most cost-intensive resource; only admins should be allowed to create and manage it, not data scientists
- Datastores
- References to other data services, like Azure Storage Account and Azure Data Lake Storage (Gen2)
- Connection info is stored in the Azure Key Vault
- Two datastores created along with the workspace:
  - `workspacefilestore`: a file share in the storage account, to store files like Jupyter notebooks and Python scripts
  - `workspaceblobstore`: blob storage, to store metrics and output when tracking model training
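For reference, a workspace's datastores can be listed and inspected with the CLI (workspace and resource-group names below are examples):
# list datastores; workspaceblobstore and workspacefilestore exist by default
az ml datastore list -w mlw-ml-test -g rg-ml-test -otable

# show details of the default blob datastore
az ml datastore show --name workspaceblobstore -w mlw-ml-test -g rg-ml-test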
- Models
  - Can be trained with Scikit-learn or PyTorch
  - Common storage formats are the Python pickle file (`.pkl`) and the MLflow MLmodel format
  - Models and corresponding metadata are saved in binary files
  - To persist those files, you can create or register a model in the workspace
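A minimal sketch of registering a model, assuming a locally saved model.pkl (the file and names are illustrative):
# register a local pickle file as a model in the workspace
az ml model create --name my-local-model --version 1 \
  --path ./model.pkl \
  --type custom_model \
  -w mlw-ml-test -g rg-ml-test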
- Environment
  - Stored as an image in the Azure Container Registry created with the workspace
  - Specifies software packages, environment variables, and software settings
  - When you run a script, you specify the environment to be used by the compute target
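As an illustration (the conda file and image are assumptions, the name matches the environment referenced in the command job YAML further down), a custom environment can be registered from a base Docker image plus a conda specification:
# register an environment from a base image and a conda file
az ml environment create --name basic-env-scikit --version 1 \
  --image mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04 \
  --conda-file ./conda.yml \
  -w mlw-ml-test -g rg-ml-test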
- Data
  - Refer to a specific file or folder in a datastore
- Components
  - Reusable code snippets, often representing a step in a pipeline
  - A component consists of a name, version, code, and the environment needed to run the code
  - Can be shared across workspaces
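A hedged sketch of registering a component from a YAML spec (the file name and workspace are assumptions):
# register a reusable component defined in a YAML file
az ml component create --file ./transform-component.yml \
  -w mlw-ml-test -g rg-ml-test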
AzureML has specific built-in roles:
- AzureML Data Scientist
- all actions within the workspace except creating/deleting compute resources, or editing workspace settings
- AzureML Compute Operator
- create/change/manage compute resources
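Role assignments themselves use the regular Azure CLI; a sketch, assuming an example user and the workspace's resource ID:
# grant a user the AzureML Data Scientist role scoped to the workspace
az role assignment create \
  --assignee "user@example.com" \
  --role "AzureML Data Scientist" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/rg-ml-test/providers/Microsoft.MachineLearningServices/workspaces/mlw-ml-test"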
- You often need an ETL process to load the data to Azure Blob, Data Lake or SQL
- Usually done with Azure Synapse Analytics or Azure Databricks, which allow transformation to be distributed across compute nodes
- You could also use Azure ML pipelines, but they may not be as performant
Common model types:
- Classification: Predict a categorical value.
- Regression: Predict a numerical value.
- Time-series forecasting: Predict future numerical values based on time-series data.
- Computer vision: Classify images or detect objects in images.
- Natural language processing (NLP): Extract insights from text.
Several options to train a model:
- Use the visual designer (building a pipeline using built-in and custom components)
- Automated ML
- Run a Jupyter notebook
- Run a script as a job
- Model is stored in the job output
Different types of jobs:
- Command: a single script
- Sweep: hyperparameter tuning when executing a single script
- Pipeline
- AutoML experiments
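For example, a sweep job wraps a command job as a trial and adds a search space; the YAML below is only an illustrative sketch (the script, metric, and parameter names are assumptions):
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  code: src
  command: >-
    python main.py --learning_rate ${{search_space.learning_rate}}
  environment: azureml:basic-env-scikit@latest
compute: azureml:aml-cluster
sampling_algorithm: random
search_space:
  learning_rate:
    type: uniform
    min_value: 0.01
    max_value: 0.1
objective:
  primary_metric: accuracy
  goal: maximize
limits:
  max_total_trials: 20
  max_concurrent_trials: 4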
Models can be created from
- a local file
- local directory
- datastore
- job outputs
# create an MLflow model from job output
az ml model create --name my-model --version 1 \
--path runs:/<job-run-id>/model/ \
--type mlflow_model \
--resource-group rg-ml-test \
--workspace-name mlw-ml-test
|                   | CPU   | GPU   |
| ----------------- | ----- | ----- |
| Cost              | low   | high  |
| Tabular data      | small | large |
| Unstructured data | no    | yes   |
GPUs are more powerful, but cost more.
- General purpose: for testing and development with smaller datasets
- Memory optimized: higher memory-to-CPU ratio, ideal for larger datasets, or working in notebooks
- A Spark cluster consists of a driver node and worker nodes. The driver node distributes the work and summarizes the results.
- Your code needs to be written in a Spark-friendly language, such as Scala, SQL, SparkR, or PySpark
Two types:
- Real-time: the compute needs to be running all the time
- Batch: the compute can be started and shut down as needed
- OpenAI: //TODO
- Serverless: //TODO
Compute:
- ACI (Azure Container Instances)
- Suitable for testing
- AKS (Azure Kubernetes Service)
- For production
- Must create an inference cluster compute target
# create an online endpoint
az ml online-endpoint create --name my-endpoint \
--resource-group rg-ml-test \
--workspace-name mlw-ml-test
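An endpoint alone doesn't serve traffic; a deployment has to be added to it. A sketch, assuming a deployment spec in deployment.yml that references the endpoint and a registered model:
# create a deployment under the endpoint and route all traffic to it
az ml online-deployment create --file deployment.yml \
  --all-traffic \
  --resource-group rg-ml-test \
  --workspace-name mlw-ml-test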
Organized as:
Experiment -> Jobs -> Child jobs
An AutoML job could have multiple child jobs.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src # code directory
command: >-
python main.py
--diabetes-csv ${{inputs.diabetes}}
inputs:
diabetes:
path: azureml:diabetes-data:1
mode: ro_mount
environment: azureml:basic-env-scikit@latest
compute: azureml:aml-instance
experiment_name: diabetes-data-example
description: Train a classification model on diabetes data using a registered dataset as input.
`azureml:diabetes-data:1`
- `azureml:` is the prefix for an existing registered data asset
- `diabetes-data` is the data asset name
- `1` is the data asset version

`code: src`
- Local files in the `src` folder will be uploaded
A sample pipeline in YAML:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: nyc-taxi-pipeline-example
experiment_name: nyc-taxi-pipeline-example
jobs:
transform-job:
type: command
inputs:
raw_data:
type: uri_folder
path: ./data
outputs:
transformed_data:
mode: rw_mount
code: src/transform
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
compute: azureml:cpu-cluster
command: >-
python transform.py
--raw_data ${{inputs.raw_data}}
--transformed_data ${{outputs.transformed_data}}
train-job:
type: command
inputs:
training_data: ${{parent.jobs.transform-job.outputs.transformed_data}}
outputs:
model_output:
mode: rw_mount
test_data:
mode: rw_mount
code: src/train
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
compute: azureml:cpu-cluster
command: >-
python train.py
--training_data ${{inputs.training_data}}
--test_data ${{outputs.test_data}}
--model_output ${{outputs.model_output}}
A script step specifies:
- The code, inputs, outputs
- Which compute target to use
You can trigger a job with:
az ml job create --file job.yml
This uploads your local code and submits the job.
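Once submitted, a job can be followed from the CLI as well (the job name is a placeholder):
# stream the logs of a running job and check its status
az ml job stream --name <job-name> -w mlw-ml-test -g rg-ml-test
az ml job show --name <job-name> -w mlw-ml-test -g rg-ml-test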
- List workspaces
# install the extension
az extension add -n ml -y

az ml workspace list -otable
- Create dataset
# upload local data in a folder, create a dataset in the default datastore
az ml data create \
  --name dataset-test \
  --version 1 \
  --path ./data \
  -w mlw-test-001 \
  -g rg-ml-test

# list dataset
az ml data list \
  -w mlw-test-001 \
  -g rg-ml-test -otable
- Create compute
- List available compute sizes
az ml compute list-sizes \
  --type ComputeInstance \
  -w mlw-test-001 \
  -g rg-ml-test \
  -otable
- Create compute instance
az ml compute create --name ds1_instance \
  --size Standard_DS1_v2 \
  --type ComputeInstance \
  -w mlw-test-001 \
  -g rg-ml-test
- Create compute cluster with a YAML file
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: aml-cluster
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 5
Then create the compute target:
az ml compute create --file compute.yml \
  --resource-group my-resource-group \
  --workspace-name my-workspace