Add codespell to pre-commit configuration #489

Merged · 3 commits · Jan 7, 2025
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -37,5 +37,12 @@ repos:
        language: system
        pass_filenames: false
        always_run: true
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.3.0
+    hooks:
+      - id: codespell
+        additional_dependencies: [tomli]
+        args: ["--toml", "pyproject.toml"]
+
default_language_version:
  python: python3
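The `additional_dependencies: [tomli]` and `args: ["--toml", "pyproject.toml"]` lines work together: they point codespell at the `[tool.codespell]` table added in the next file, and on Python versions before 3.11 codespell needs the `tomli` package to parse TOML at all. A minimal sketch of the lookup codespell performs, as an illustration rather than codespell's actual code:

```python
# Read the [tool.codespell] table from pyproject.toml, the way codespell does
# when invoked with "--toml pyproject.toml". tomllib ships with Python 3.11+;
# older interpreters need the tomli backport (hence additional_dependencies).
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

print(config.get("tool", {}).get("codespell", {}))
# e.g. {'skip': './.git,./pyproject.toml,./.ruff_cache', 'builtin': 'clear', ...}
```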
7 changes: 7 additions & 0 deletions pyproject.toml
@@ -1,3 +1,10 @@
+[tool.codespell]
+# note: pre-commit passes explicit lists of files here, which this skip file list doesn't override -
+skip = "./.git,./pyproject.toml,./.ruff_cache"
+ignore-regex = "\\b(.{1,4}|[A-Z]\\w*T)\\b"
+builtin = "clear"
+quiet-level = 3
+
[tool.ruff]
fix = true
line-length = 120
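The `ignore-regex` added above tells codespell to skip two kinds of tokens: anything one to four characters long, and capitalized identifiers ending in `T`. A rough Python illustration of which tokens the pattern covers (codespell itself treats regex matches as whitespace before spell-checking, so this token-by-token check is a simplification):

```python
import re

# The pattern from ignore-regex above, with the TOML string escaping removed:
# tokens of one to four characters, or capitalized tokens ending in "T".
IGNORE = re.compile(r"(.{1,4}|[A-Z]\w*T)")

for token in ["teh", "accomodate", "RapidsT"]:
    skipped = IGNORE.fullmatch(token) is not None
    print(f"{token!r} -> {'ignored' if skipped else 'spell-checked'}")
# 'teh' -> ignored (four characters or fewer)
# 'accomodate' -> spell-checked (and flagged as a typo)
# 'RapidsT' -> ignored (capitalized and ends in T)
```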
2 changes: 1 addition & 1 deletion source/cloud/aws/sagemaker.md
@@ -78,7 +78,7 @@ Once inside JupyterLab you should be able to upload the [Running RAPIDS hyperpar

RAPIDS can also be used in [SageMaker Estimators](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).
Estimators allow you to launch training jobs on ephemeral VMs which SageMaker manages for you.
-With this option, your Notebook Isntance doesn't need to have a GPU... you are only charged for GPU instances for the time that your training job is running.
+With this option, your Notebook Instance doesn't need to have a GPU... you are only charged for GPU instances for the time that your training job is running.

All you’ll need to do is bring in your RAPIDS training script and libraries as a Docker container image and ask Amazon SageMaker to run copies of it in parallel on a specified number of GPU instances.

2 changes: 1 addition & 1 deletion source/cloud/azure/azure-vm-multi.md
@@ -14,7 +14,7 @@ $ pip install dask-cloudprovider[azure]

### 2. Configure your Azure Resources

-Set up your [Azure Resouce Group](https://cloudprovider.dask.org/en/latest/azure.html#resource-groups), [Virtual Network](https://cloudprovider.dask.org/en/latest/azure.html#virtual-networks), and [Security Group](https://cloudprovider.dask.org/en/latest/azure.html#security-groups) according to [Dask Cloud Provider instructions](https://cloudprovider.dask.org/en/latest/azure.html#authentication).
+Set up your [Azure Resource Group](https://cloudprovider.dask.org/en/latest/azure.html#resource-groups), [Virtual Network](https://cloudprovider.dask.org/en/latest/azure.html#virtual-networks), and [Security Group](https://cloudprovider.dask.org/en/latest/azure.html#security-groups) according to [Dask Cloud Provider instructions](https://cloudprovider.dask.org/en/latest/azure.html#authentication).

### 3. Create a Cluster

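The truncated "Create a Cluster" step launches the cluster with Dask Cloud Provider. A hedged sketch of what that typically looks like; the resource names below are placeholders for the resources configured in step 2, not values taken from this page:

```python
from dask.distributed import Client
from dask_cloudprovider.azure import AzureVMCluster

# Placeholder names -- substitute the resource group, virtual network, and
# security group created in step 2.
cluster = AzureVMCluster(
    resource_group="my-resource-group",
    vnet="my-vnet",
    security_group="my-security-group",
    n_workers=2,
)
client = Client(cluster)
```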
2 changes: 1 addition & 1 deletion source/cloud/azure/azureml.md
@@ -134,7 +134,7 @@ ml_client.begin_create_or_update(gpu_compute).result()
A [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls&view=azureml-api-2#access-data-from-a-datastore-uri-like-a-filesystem-preview) is a reference to a blob storage location (path) on your Azure account. You can copy-and-paste the datastore URI from the AzureML Studio UI:

1. Select **Data** from the left-hand menu > **Datastores** > choose your datastore name > **Browse**
-2. Find the file/folder containing your dataset and click the elipsis (...) next to it.
+2. Find the file/folder containing your dataset and click the ellipsis (...) next to it.
3. From the menu, choose **Copy URI** and select **Datastore URI** format to copy into your notebook.

![Screenshot of access datastore uri screen](../../images/azureml-access-datastore-uri.png)
2 changes: 1 addition & 1 deletion source/cloud/gcp/vertex-ai.md
@@ -16,7 +16,7 @@ RAPIDS can be deployed on [Vertex AI Workbench](https://cloud.google.com/vertex-
6. After customizing any other aspects of the machine you wish, click **CREATE**.

```{tip}
-If you want to select a different GPU or select other hardware options you can select "Advanced Options" at the bottom and then make changes in the "Machine type" seciton.
+If you want to select a different GPU or select other hardware options you can select "Advanced Options" at the bottom and then make changes in the "Machine type" section.
```

## Install RAPIDS
4 changes: 2 additions & 2 deletions source/examples/rapids-1brc-single-node/notebook.ipynb
@@ -163,15 +163,15 @@
" {\n",
" # Choose a random station from the lookup table for each row in our output\n",
" \"station\": cp.random.randint(0, len(lookup_df) - 1, int(chunksize)),\n",
" # Generate a normal distibution around zero for each row in our output\n",
" # Generate a normal distribution around zero for each row in our output\n",
" # Because the std is the same for every station we can adjust the mean for each row afterwards\n",
" \"measure\": cp.random.normal(0, std, int(chunksize)),\n",
" }\n",
" )\n",
"\n",
" # Offset each measurement by the station's mean value\n",
" df.measure += df.station.map(lookup_df.mean_temp)\n",
" # Round the temprature to one decimal place\n",
" # Round the temperature to one decimal place\n",
" df.measure = df.measure.round(decimals=1)\n",
" # Convert the station index to the station name\n",
" df.station = df.station.map(lookup_df.station)\n",
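The comment about adjusting the mean afterwards leans on a basic property of the normal distribution: drawing from `N(0, std)` and adding `mean` is equivalent to drawing from `N(mean, std)`, so one vectorized draw covers every station. A small CPU-side sketch of the same trick with NumPy (the notebook does this on the GPU with CuPy; the station means here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
station_means = np.array([10.0, 25.0, -3.0])  # hypothetical per-station means
stations = rng.integers(0, len(station_means), size=1_000_000)

# One draw from N(0, std) for every row, then shift by each row's station mean.
measurements = rng.normal(0.0, 5.0, size=stations.size) + station_means[stations]

print(measurements[:5].round(1))
```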
@@ -2179,7 +2179,7 @@
"source": [
"## Closing thoughts\n",
"\n",
"By sharing a Kubernetes cluster between many users who are all launching many ephemeral Dask Clusters to perform their work we are able to balance cost vs user time. Peaks in individual user demands get smoothed out over time in a multi-tenant model, and the overall peaks and troughs of the day are accomodated by the Kubernetes cluster autoscaler.\n",
"By sharing a Kubernetes cluster between many users who are all launching many ephemeral Dask Clusters to perform their work we are able to balance cost vs user time. Peaks in individual user demands get smoothed out over time in a multi-tenant model, and the overall peaks and troughs of the day are accommodated by the Kubernetes cluster autoscaler.\n",
"\n",
"We managed to create a responsive experience for our users where they generally got Dask clusters in a few seconds. We also managed to hit 64% utilization of the GPUs in our cluster, a very respectable number for an interactive cluster. \n",
"\n",
4 changes: 2 additions & 2 deletions source/examples/rapids-azureml-hpo/rapids_csp_azure.py
@@ -78,7 +78,7 @@ def __init__(

def load_hyperparams(self, model_name="XGBoost"):
"""
-Selecting model paramters based on the model we select for execution.
+Selecting model parameters based on the model we select for execution.
Checks if there is a config file present in the path self.CSP_paths['hyperparams'] with
the parameters for the experiment. If not present, it returns the default parameters.

@@ -304,7 +304,7 @@ def train_model(self, X_train, y_train, model_params):
Parameters
----------
X_train : dataframe
-The data for traning
+The data for training
y_train : dataframe
The label to be used for training.
model_params : dict
2 changes: 1 addition & 1 deletion source/examples/rapids-ec2-mnmg/notebook.ipynb
@@ -261,7 +261,7 @@
"\n",
"We'll add new features to the dataframe:\n",
"\n",
"1. We can split the datetime column to retrive year, month, day, hour, day_of_week columns. Find the difference between pickup time and drop off time. \n",
"1. We can split the datetime column to retrieve year, month, day, hour, day_of_week columns. Find the difference between pickup time and drop off time. \n",
"2. Haversine Distance between the pick-up and drop-off coordinates."
]
},
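Feature 2 in the list above is the haversine (great-circle) distance. A plain-Python sketch of the formula for a single pair of coordinates; the notebook computes the same quantity in vectorized form on the GPU, and the example points below are arbitrary:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

# e.g. Times Square to JFK airport -- roughly 21 km as the crow flies
print(haversine_km(40.758, -73.985, 40.641, -73.778))
```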
4 changes: 2 additions & 2 deletions source/examples/rapids-sagemaker-hpo/helper_functions.py
@@ -64,7 +64,7 @@ def validate_dockerfile(rapids_base_container, dockerfile_name="Dockerfile"):
with open(dockerfile_name) as dockerfile_handle:
if rapids_base_container not in dockerfile_handle.read():
raise Exception(
"Dockerfile base layer [i.e. FROM statment] does"
"Dockerfile base layer [i.e. FROM statement] does"
" not match the variable rapids_base_container"
)

@@ -166,7 +166,7 @@ def new_job_name_from_config(
):
"""
Build a jobname string that captures the HPO configuration options.
-This is helpful for intepreting logs and for general book-keeping
+This is helpful for interpreting logs and for general book-keeping
"""
job_name = None
try:
8 changes: 4 additions & 4 deletions source/examples/rapids-sagemaker-hpo/notebook.ipynb
@@ -435,7 +435,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We also need to decide how may total experiments to run, and how many should run in parallel. Below we have a very conservative number of maximum jobs to run so that you don't accidently spawn large computations when starting out, however for meaningful HPO searches this number should be much higher (e.g., in our experiments we often run 100 max_jobs). Note that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) for additional `max_parallel_jobs` parallel workers. "
"We also need to decide how may total experiments to run, and how many should run in parallel. Below we have a very conservative number of maximum jobs to run so that you don't accidentally spawn large computations when starting out, however for meaningful HPO searches this number should be much higher (e.g., in our experiments we often run 100 max_jobs). Note that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) for additional `max_parallel_jobs` parallel workers. "
]
},
{
@@ -635,7 +635,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To build a RAPIDS enabled SageMaker HPO we first need to build a [SageMaker Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html). An Estimator is a container image that captures all the software needed to run an HPO experiment. The container is augmented with entrypoint code that will be trggered at runtime by each worker. The entrypoint code enables us to write custom models and hook them up to data. "
"To build a RAPIDS enabled SageMaker HPO we first need to build a [SageMaker Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html). An Estimator is a container image that captures all the software needed to run an HPO experiment. The container is augmented with entrypoint code that will be triggered at runtime by each worker. The entrypoint code enables us to write custom models and hook them up to data. "
]
},
{
@@ -1491,7 +1491,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We only focus on a single metric, which we call 'final-score', that captures the accuracy of our model on the test data unseen during training. You are of course welcome to add aditional metrics, see [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html). When defining a metric we provide a regular expression (i.e., string parsing rule) to extract the key metric from the output of each Estimator/worker."
"We only focus on a single metric, which we call 'final-score', that captures the accuracy of our model on the test data unseen during training. You are of course welcome to add additional metrics, see [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html). When defining a metric we provide a regular expression (i.e., string parsing rule) to extract the key metric from the output of each Estimator/worker."
]
},
{
@@ -2179,7 +2179,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned in the introduction we find a <span style=\"color:#8735fb; font-size:14pt\"> **12X** </span> speedup in wall clock time and a <span style=\"color:#8735fb; font-size:14pt\"> **4.5x** </span> reduction in cost when comparing between GPU and CPU instances on 100 HPO trials using 10 parallel workers on 10 years of the Airline Dataset (~63M flights). In these experiments we used the XGBoost algorithm with the multi-GPU vs multi-CPU Dask cluster and 10 cross validaiton folds. Below we offer a table with additional details."
"As mentioned in the introduction we find a <span style=\"color:#8735fb; font-size:14pt\"> **12X** </span> speedup in wall clock time and a <span style=\"color:#8735fb; font-size:14pt\"> **4.5x** </span> reduction in cost when comparing between GPU and CPU instances on 100 HPO trials using 10 parallel workers on 10 years of the Airline Dataset (~63M flights). In these experiments we used the XGBoost algorithm with the multi-GPU vs multi-CPU Dask cluster and 10 cross validation folds. Below we offer a table with additional details."
]
},
{
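The 'final-score' cell above says a regular expression extracts the metric from each worker's log output. A hypothetical sketch of that parsing step: the metric name comes from the notebook, but the log format and regex here are illustrative assumptions, not the notebook's actual definition:

```python
import re

# Hypothetical worker log line; the real training script may format it differently.
log_line = "final-score: 0.9042"

# SageMaker metric definitions pair a metric name with a capture-group regex.
metric_definition = {"Name": "final-score", "Regex": r"final-score: (\d+\.\d+)"}

match = re.search(metric_definition["Regex"], log_line)
if match:
    print(float(match.group(1)))  # 0.9042
```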
2 changes: 1 addition & 1 deletion source/examples/rapids-sagemaker-hpo/serve.py
@@ -109,7 +109,7 @@ def load_trained_model():
def predict():
"""
Run CPU or GPU inference on input data,
-called everytime an incoming request arrives
+called every time an incoming request arrives
"""
# parse user input
try:
@@ -92,7 +92,7 @@
"1. Option 1: Specify an additional scope when provisioning the GKE cluster.\n",
"\n",
" When you are provisioning a new GKE cluster, add the `storage-rw` scope.\n",
" This option is only available if you are creating a new cluster from scratch. If you are using an exising GKE cluster, see Option 2.\n",
" This option is only available if you are creating a new cluster from scratch. If you are using an existing GKE cluster, see Option 2.\n",
"\n",
" Example:\n",
"```\n",
@@ -103,7 +103,7 @@
"\n",
"2. Option 2: Grant bucket access to the associated service account.\n",
"\n",
" Find out which service account is associated with your GKE cluster. You can grant the bucket access to the service account as follows: Nagivate to the Cloud Storage console, open the Bucket Details page for the bucket, open the Permissions tab, and click on Grant Access.\n",
" Find out which service account is associated with your GKE cluster. You can grant the bucket access to the service account as follows: Navigate to the Cloud Storage console, open the Bucket Details page for the bucket, open the Permissions tab, and click on Grant Access.\n",
" \n",
"Enter the name of the bucket that your cluster has read-write access to:"
]
@@ -3148,7 +3148,7 @@
"source": [
"### Assign weights for product items\n",
"\n",
"When we assess the accuracy of our machine learning model, we should assign a weight for each product item, to indicate the relative importance of the item. For the M5 competition, the weights are computed from the total sales amount (in US dollars) in the lastest 28 days."
"When we assess the accuracy of our machine learning model, we should assign a weight for each product item, to indicate the relative importance of the item. For the M5 competition, the weights are computed from the total sales amount (in US dollars) in the last 28 days."
]
},
{
@@ -6520,7 +6520,7 @@
"### Define cross-validation folds\n",
"**[Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))** is a statistical method for estimating how well a machine learning model generalizes to an independent data set. The method is also useful for evaluating the choice of a given combination of model hyperparameters.\n",
"\n",
"To estimate the capacity to generalize, we define multiple cross-validation **folds** consisting of mulitple pairs of `(training set, validation set)`. For each fold, we fit a model using the training set and evaluate its accuracy on the validation set. The \"goodness\" score for a given hyperparameter combination is the accuracy of the model on each validation set, averaged over all cross-validation folds.\n",
"To estimate the capacity to generalize, we define multiple cross-validation **folds** consisting of multiple pairs of `(training set, validation set)`. For each fold, we fit a model using the training set and evaluate its accuracy on the validation set. The \"goodness\" score for a given hyperparameter combination is the accuracy of the model on each validation set, averaged over all cross-validation folds.\n",
"\n",
"Great care must be taken when defining cross-validation folds for time-series data. We are not allowed to use the future to predict the past, so the training set must precede (in time) the validation set. Consequently, we partition the data set in the time dimension and assign the training and validation sets using time ranges:"
]
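The time-range folds described above can be sketched with plain date cutoffs: each fold trains on everything before a cutoff and validates on the window immediately after it, so training data always precedes validation data. A hedged illustration with pandas; the column names, dates, and 28-day window are placeholder choices, not values from the notebook:

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.date_range("2015-01-01", periods=120, freq="D"), "sales": range(120)}
)

# Each fold: train strictly before the cutoff, validate on the next 28 days.
cutoffs = pd.to_datetime(["2015-03-01", "2015-04-01"])
folds = []
for cutoff in cutoffs:
    train = df[df["date"] < cutoff]
    valid = df[(df["date"] >= cutoff) & (df["date"] < cutoff + pd.Timedelta(days=28))]
    folds.append((train, valid))

for i, (train, valid) in enumerate(folds):
    print(f"fold {i}: train={len(train)} rows, valid={len(valid)} rows")
```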
@@ -23,7 +23,7 @@
"source": [
"# Multi-Node Multi-GPU XGBoost Example on Azure using dask-cloudprovider\n",
"\n",
"[Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/) is a native cloud intergration library for Dask. It helps manage Dask clusters on different cloud platforms. In this notebook, we will look at how we can use this package to set-up an Azure cluster and run a multi-node multi-GPU (MNMG) example with [RAPIDS](https://rapids.ai/). RAPIDS provides a suite of libraries to accelerate data science pipelines on the GPU entirely. This can be scaled to multiple nodes using Dask as we will see in this notebook. \n",
"[Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/) is a native cloud integration library for Dask. It helps manage Dask clusters on different cloud platforms. In this notebook, we will look at how we can use this package to set-up an Azure cluster and run a multi-node multi-GPU (MNMG) example with [RAPIDS](https://rapids.ai/). RAPIDS provides a suite of libraries to accelerate data science pipelines on the GPU entirely. This can be scaled to multiple nodes using Dask as we will see in this notebook. \n",
"\n",
"For the purposes of this demo, we will use a part of the NYC Taxi Dataset (only the files of 2014 calendar year will be used here). The goal is to predict the fare amount for a given trip given the times and coordinates of the taxi trip. We will download the data from [Azure Open Datasets](https://docs.microsoft.com/en-us/azure/open-datasets/overview-what-are-open-datasets), where the dataset is publicly hosted by Microsoft.\n",
"\n",
Expand Down
2 changes: 1 addition & 1 deletion source/guides/azure/infiniband.md
@@ -1,6 +1,6 @@
# How to Setup InfiniBand on Azure

-[Azure GPU optmized virtual machines](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provide
+[Azure GPU optimized virtual machines](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provide
a low latency and high bandwidth InfiniBand network. This guide walks through the steps to enable InfiniBand to
optimize network performance.

4 changes: 2 additions & 2 deletions source/guides/colocate-workers.md
@@ -54,7 +54,7 @@ After your drivers are installed, you are ready to test your cluster.

### Installing Dask operator with Helm

-The operator has a Helm chart which can be used to manage the installation of the operator. Follow the instructions provided in the [Dask documention](https://kubernetes.dask.org/en/latest/installing.html#installing-with-helm), or alternatively can be installed via:
+The operator has a Helm chart which can be used to manage the installation of the operator. Follow the instructions provided in the [Dask documentation](https://kubernetes.dask.org/en/latest/installing.html#installing-with-helm), or alternatively can be installed via:

```console
$ helm install --create-namespace -n dask-operator --generate-name --repo https://helm.dask.org dask-kubernetes-operator
@@ -233,7 +233,7 @@ client = Client("localhost:8786")

## Example using `KubeCluster`

-In additon to creating clusters via `kubectl` you can also do so from Python with {class}`dask_kubernetes.operator.KubeCluster`. This class implements the Dask Cluster Manager interface and under the hood creates and manages the `DaskCluster` resource for you. You can also generate a spec with make_cluster_spec() which KubeCluster uses internally and then modify it with your custom options. We will use this to add node affinity to the scheduler.
+In addition to creating clusters via `kubectl` you can also do so from Python with {class}`dask_kubernetes.operator.KubeCluster`. This class implements the Dask Cluster Manager interface and under the hood creates and manages the `DaskCluster` resource for you. You can also generate a spec with make_cluster_spec() which KubeCluster uses internally and then modify it with your custom options. We will use this to add node affinity to the scheduler.
In the following example, the same cluster configuration as the `kubectl` example is used.

```python
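The paragraph above mentions generating a spec with make_cluster_spec(), modifying it, and handing it to KubeCluster. A sketch of that flow under stated assumptions: the `node-type` label and its values are hypothetical, and the exact spec layout should be checked against the dask-kubernetes documentation:

```python
from dask_kubernetes.operator import KubeCluster, make_cluster_spec

spec = make_cluster_spec(name="mycluster", n_workers=2)

# Assumed: the cluster's node pools carry a "node-type" label. Pin the
# scheduler pod to CPU-only nodes by patching the generated pod spec.
spec["spec"]["scheduler"]["spec"]["affinity"] = {
    "nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {"matchExpressions": [
                    {"key": "node-type", "operator": "In", "values": ["cpu"]}
                ]}
            ]
        }
    }
}

cluster = KubeCluster(custom_cluster_spec=spec)
```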