Azure Machine Learning

Overview

  • You could use automated ML (AutoML) to try multiple algorithms and preprocessing options automatically and find the best model

Workspace

Resources overview

Azure resources

When a workspace is created, a few supporting resources are created with it in the same resource group (see the sketch after the list below).

  • Storage account:
    • files and notebooks
    • metadata of jobs and models
  • Key Vault:
    • store credentials used by the workspace
  • Application Insights:
    • monitor predictive services
  • Container Registry:
    • created when needed
    • to store images for ML environments
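A minimal sketch of creating a workspace with the CLI (assuming the ml extension is installed; the names are placeholders), which provisions these supporting resources alongside it:

# create a workspace; the storage account, Key Vault and Application Insights
# are created with it, the container registry only when needed
az ml workspace create \
  --name mlw-ml-test \
  --resource-group rg-ml-test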

Workspace resources

  • Workspace: stores all logs, metrics, outputs, models and snapshots of your code
  • Compute
    • Types:
      • Compute instance: ideal as a dev environment to run Jupyter notebooks
      • Compute cluster: CPU or GPU compute nodes, ideal for production workloads, scales automatically
      • Serverless: // TODO
      • Inference cluster: AKS cluster, ideal to deploy models in production
      • Attached compute: attach other Azure compute resources, like Azure Databricks or Synapse Spark pools
    • Compute is the most cost-intensive resource; only admins, not data scientists, should be allowed to create and manage it
  • Datastores
    • References to other data services, like an Azure Storage Account or Azure Data Lake Storage (Gen2); see the registration sketch after this list
    • Connection info is stored in the Azure Key Vault
    • Two datastores are created along with the workspace:
      • workspacefilestore, a file share in the storage account, to store files like Jupyter notebooks and Python scripts
      • workspaceblobstore, blob storage, to store metrics and output when tracking model training
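Additional datastores can be registered from a YAML definition. A minimal sketch for a blob container (the names and account key are placeholders, assuming key-based authentication):

# datastore.yml
$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: blob_training_data
type: azure_blob
account_name: mystorageaccount
container_name: training-data
credentials:
  account_key: <storage-account-key>

# register the datastore
az ml datastore create --file datastore.yml \
  --resource-group rg-ml-test \
  --workspace-name mlw-ml-test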

Assets

  • Models

    • Can be trained with frameworks like Scikit-learn or PyTorch
    • Common storage formats are the Python pickle file (.pkl) and the MLflow MLmodel format
    • Models and corresponding metadata are saved in binary files
    • To persist those files, you can create or register a model in the workspace
  • Environment

    • Stored as an image in the Azure Container Registry created with the workspace
    • Specifies software packages, environment variables, and software settings
    • When you run a script, you specify which environment the compute target should use (see the environment sketch after this list)
  • Data

    • A data asset refers to a specific file or folder in a datastore
  • Components

    • Reusable code snippets, often representing a step in a pipeline
    • A component consists of name, version, code and environment needed to run the code
    • Can be shared across workspaces
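A minimal sketch of registering a custom environment as mentioned above (the base image and conda file name are assumptions):

# environment.yml, assuming package dependencies live in conda-env.yml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: basic-env-scikit
version: 1
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: conda-env.yml

# register the environment
az ml environment create --file environment.yml \
  --resource-group rg-ml-test \
  --workspace-name mlw-ml-test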

Roles

AzureML has specific built-in roles:

  • AzureML Data Scientist
    • all actions within the workspace except creating/deleting compute resources, or editing workspace settings
  • AzureML Compute Operator
    • create/change/manage compute resources
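These roles are assigned with standard Azure RBAC. A sketch (the assignee and scope are placeholders):

# grant a user the AzureML Data Scientist role on the workspace
az role assignment create \
  --assignee "user@example.com" \
  --role "AzureML Data Scientist" \
  --scope "/subscriptions/<sub-id>/resourceGroups/rg-ml-test/providers/Microsoft.MachineLearningServices/workspaces/mlw-ml-test"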

Data ingestion

Data ETL

  • You often need an ETL process to load the data to Azure Blob, Data Lake or SQL
  • Usually done with Azure Synapse Analytics or Azure Databricks, which allow transformation to be distributed across compute nodes
  • You could also use Azure ML pipelines, but they may not be as performant

Models

Common model types:

  • Classification: Predict a categorical value.
  • Regression: Predict a numerical value.
  • Time-series forecasting: Predict future numerical values based on time-series data.
  • Computer vision: Classify images or detect objects in images.
  • Natural language processing (NLP): Extract insights from text.

How to train a model

Several options to train a model:

  • Use the visual designer (building a pipeline using built-in and custom components)
  • Automated ML
  • Run a Jupyter notebook
  • Run a script as a job
    • Model is stored in the job output

Different types of jobs:

  • Command: a single script
  • Sweep: hyperparameter tuning when executing a single script (see the sketch after this list)
  • Pipeline
  • AutoML experiments
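A minimal sketch of a sweep job YAML for the hyperparameter-tuning case (the script, search space and metric name are assumptions):

# sweep-job.yml
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  code: src
  command: >-
    python main.py
    --learning_rate ${{search_space.learning_rate}}
  environment: azureml:basic-env-scikit@latest
compute: azureml:aml-cluster
sampling_algorithm: grid
search_space:
  learning_rate:
    type: choice
    values: [0.01, 0.1, 1.0]
objective:
  primary_metric: accuracy
  goal: maximize
limits:
  max_total_trials: 3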

Create a model

Models can be created from

  • a local file
  • local directory
  • datastore
  • job outputs
# create an MLflow model from job output
az ml model create --name my-model --version 1 \
  --path runs:/<job-run-id>/model/ \
  --type mlflow_model \
  --resource-group rg-ml-test \
  --workspace-name mlw-ml-test
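Registering from a local path looks similar; a sketch assuming the model files sit in a local ./model folder:

# create a model from a local folder
az ml model create --name my-local-model --version 1 \
  --path ./model \
  --type custom_model \
  --resource-group rg-ml-test \
  --workspace-name mlw-ml-test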

Compute

CPU vs. GPU

|                   | CPU   | GPU   |
| ----------------- | ----- | ----- |
| Cost              | low   | high  |
| Tabular data      | small | large |
| Unstructured data | no    | yes   |

GPUs are more powerful, but cost more

General purpose vs. memory optimized

  • General purpose: for testing and development with smaller datasets
  • Memory optimized: higher memory-to-CPU ratio, ideal for larger datasets, or working in notebooks

Spark

  • A Spark cluster consists of a driver node and worker nodes. The driver node distributes the work and summarizes the results.
  • Your code needs to be written in a Spark-friendly language, such as Scala, SQL, SparkR, or PySpark

Model deployment

Types:

  • Real-time: the compute needs to be running all the time
  • Batch: the compute could be started/shutdown as needed
  • OpenAI: //TODO
  • Serverless: //TODO

Compute:

  • ACI (Azure Container Instances)
    • Suitable for testing
  • AKS (Azure Kubernetes Service)
    • For production
    • Must create an inference cluster compute target
# create an online endpoint
az ml online-endpoint create --name my-endpoint \
  --resource-group rg-ml-test \
  --workspace-name mlw-ml-test
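An endpoint serves no traffic until it has at least one deployment. A minimal sketch of a managed online deployment for the MLflow model registered earlier (the deployment name and instance size are assumptions):

# deployment.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:my-model:1
instance_type: Standard_DS2_v2
instance_count: 1

# create the deployment under the endpoint
az ml online-deployment create --file deployment.yml \
  --resource-group rg-ml-test \
  --workspace-name mlw-ml-test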

Jobs

Organized as:

Experiment -> Jobs -> Child jobs

An AutoML job could have multiple child jobs.

Command job

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src # code directory
command: >-
  python main.py
  --diabetes-csv ${{inputs.diabetes}}
inputs:
  diabetes:
    path: azureml:diabetes-data:1
    mode: ro_mount
environment: azureml:basic-env-scikit@latest
compute: azureml:aml-instance
experiment_name: diabetes-data-example
description: Train a classification model on diabetes data using a registered dataset as input.
  • azureml:diabetes-data:1
    • azureml: is the prefix for an existing registered data asset
    • diabetes-data: the data asset name
    • 1: the data asset version
  • code: src
    • Local files in src folder will be uploaded

Pipeline job

A sample pipeline in YAML:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: nyc-taxi-pipeline-example
experiment_name: nyc-taxi-pipeline-example

jobs:
  transform-job:
    type: command
    inputs:
      raw_data:
          type: uri_folder
          path: ./data
    outputs:
      transformed_data:
        mode: rw_mount
    code: src/transform
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
    compute: azureml:cpu-cluster
    command: >-
      python transform.py
      --raw_data ${{inputs.raw_data}}
      --transformed_data ${{outputs.transformed_data}}

  train-job:
    type: command
    inputs:
      training_data: ${{parent.jobs.transform-job.outputs.transformed_data}}
    outputs:
      model_output:
        mode: rw_mount
      test_data:
        mode: rw_mount
    code: src/train
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
    compute: azureml:cpu-cluster
    command: >-
      python train.py
      --training_data ${{inputs.training_data}}
      --test_data ${{outputs.test_data}}
      --model_output ${{outputs.model_output}}

A script step specifies:

  • The code, inputs, outputs
  • Which compute target to use

Trigger a job via CLI

You could trigger a job with

az ml job create --file job.yml

It uploads your local code (the code folder in the YAML) before submitting the job
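After submission, the job can be inspected from the CLI as well (the job name is a placeholder):

# check job status and stream its logs
az ml job show --name <job-name> -w mlw-test-001 -g rg-ml-test
az ml job stream --name <job-name> -w mlw-test-001 -g rg-ml-test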

Azure ML CLI

  • List workspaces

    # install the extension
    az extension add -n ml -y
    
    az ml workspace list -otable
  • Create dataset

    # upload local data in a folder, create a dataset in the default datastore
    az ml data create \
      --name dataset-test \
      --version 1 \
      --path ./data \
      -w mlw-test-001 \
      -g rg-ml-test
    
    # list dataset
    az ml data list \
      -w mlw-test-001 \
      -g rg-ml-test \
      -otable
  • Create compute

    • List available compute sizes

      az ml compute list-sizes \
        --type ComputeInstance \
        -w mlw-test-001 \
        -g rg-ml-test \
        -otable
    • Create compute instance

      az ml compute create --name ds1_instance \
        --size Standard_DS1_v2 \
        --type ComputeInstance \
        -w mlw-test-001 \
        -g rg-ml-test
    • Create compute cluster with a YAML file

      $schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
      name: aml-cluster
      type: amlcompute
      size: STANDARD_DS3_v2
      min_instances: 0
      max_instances: 5

      Then create the compute target:

      az ml compute create --file compute.yml \
          --resource-group my-resource-group \
          --workspace-name my-workspace