Add codespell to pre-commit configuration #489

Merged · 3 commits · Jan 7, 2025
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -37,5 +37,12 @@ repos:
        language: system
        pass_filenames: false
        always_run: true
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.3.0
+    hooks:
+      - id: codespell
+        additional_dependencies: [tomli]
+        args: ["--toml", "pyproject.toml"]
+
default_language_version:
  python: python3
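The `additional_dependencies: [tomli]` and `args: ["--toml", "pyproject.toml"]` lines work together: they point codespell at the `[tool.codespell]` table added in the next file, and on Python versions before 3.11 codespell needs the `tomli` package to parse TOML at all. A minimal sketch of the lookup codespell performs, as an illustration rather than codespell's actual code:

```python
# Read the [tool.codespell] table from pyproject.toml, the way codespell does
# when invoked with "--toml pyproject.toml". tomllib ships with Python 3.11+;
# older interpreters need the tomli backport (hence additional_dependencies).
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

print(config.get("tool", {}).get("codespell", {}))
# e.g. {'skip': './.git,./pyproject.toml,./.ruff_cache', 'builtin': 'clear', ...}
```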
7 changes: 7 additions & 0 deletions pyproject.toml
@@ -1,3 +1,10 @@
+[tool.codespell]
+# note: pre-commit passes explicit lists of files here, which this skip file list doesn't override -
+skip = "./.git,./pyproject.toml,./.ruff_cache"
+ignore-regex = "\\b(.{1,4}|[A-Z]\\w*T)\\b"
+builtin = "clear"
+quiet-level = 3
+
[tool.ruff]
fix = true
line-length = 120
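The `ignore-regex` added above tells codespell to skip two kinds of tokens: anything one to four characters long, and capitalized identifiers ending in `T`. A rough Python illustration of which tokens the pattern covers (codespell itself treats regex matches as whitespace before spell-checking, so this token-by-token check is a simplification):

```python
import re

# The pattern from ignore-regex above, with the TOML string escaping removed:
# tokens of one to four characters, or capitalized tokens ending in "T".
IGNORE = re.compile(r"(.{1,4}|[A-Z]\w*T)")

for token in ["teh", "accomodate", "RapidsT"]:
    skipped = IGNORE.fullmatch(token) is not None
    print(f"{token!r} -> {'ignored' if skipped else 'spell-checked'}")
# 'teh' -> ignored (four characters or fewer)
# 'accomodate' -> spell-checked (and flagged as a typo)
# 'RapidsT' -> ignored (capitalized and ends in T)
```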
2 changes: 1 addition & 1 deletion source/cloud/aws/sagemaker.md
@@ -78,7 +78,7 @@ Once inside JupyterLab you should be able to upload the [Running RAPIDS hyperpar

RAPIDS can also be used in [SageMaker Estimators](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).
Estimators allow you to launch training jobs on ephemeral VMs which SageMaker manages for you.
-With this option, your Notebook Isntance doesn't need to have a GPU... you are only charged for GPU instances for the time that your training job is running.
+With this option, your Notebook Instance doesn't need to have a GPU... you are only charged for GPU instances for the time that your training job is running.

All you’ll need to do is bring in your RAPIDS training script and libraries as a Docker container image and ask Amazon SageMaker to run copies of it in parallel on a specified number of GPU instances.

2 changes: 1 addition & 1 deletion source/cloud/azure/azure-vm-multi.md
@@ -14,7 +14,7 @@ $ pip install dask-cloudprovider[azure]

### 2. Configure your Azure Resources

-Set up your [Azure Resouce Group](https://cloudprovider.dask.org/en/latest/azure.html#resource-groups), [Virtual Network](https://cloudprovider.dask.org/en/latest/azure.html#virtual-networks), and [Security Group](https://cloudprovider.dask.org/en/latest/azure.html#security-groups) according to [Dask Cloud Provider instructions](https://cloudprovider.dask.org/en/latest/azure.html#authentication).
+Set up your [Azure Resource Group](https://cloudprovider.dask.org/en/latest/azure.html#resource-groups), [Virtual Network](https://cloudprovider.dask.org/en/latest/azure.html#virtual-networks), and [Security Group](https://cloudprovider.dask.org/en/latest/azure.html#security-groups) according to [Dask Cloud Provider instructions](https://cloudprovider.dask.org/en/latest/azure.html#authentication).

### 3. Create a Cluster

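The truncated "Create a Cluster" step launches the cluster with Dask Cloud Provider. A hedged sketch of what that typically looks like; the resource names below are placeholders for the resources configured in step 2, not values taken from this page:

```python
from dask.distributed import Client
from dask_cloudprovider.azure import AzureVMCluster

# Placeholder names -- substitute the resource group, virtual network, and
# security group created in step 2.
cluster = AzureVMCluster(
    resource_group="my-resource-group",
    vnet="my-vnet",
    security_group="my-security-group",
    n_workers=2,
)
client = Client(cluster)
```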
2 changes: 1 addition & 1 deletion source/cloud/azure/azureml.md
@@ -134,7 +134,7 @@ ml_client.begin_create_or_update(gpu_compute).result()
A [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls&view=azureml-api-2#access-data-from-a-datastore-uri-like-a-filesystem-preview) is a reference to a blob storage location (path) on your Azure account. You can copy-and-paste the datastore URI from the AzureML Studio UI:

1. Select **Data** from the left-hand menu > **Datastores** > choose your datastore name > **Browse**
-2. Find the file/folder containing your dataset and click the elipsis (...) next to it.
+2. Find the file/folder containing your dataset and click the ellipsis (...) next to it.
3. From the menu, choose **Copy URI** and select **Datastore URI** format to copy into your notebook.

![Screenshot of access datastore uri screen](../../images/azureml-access-datastore-uri.png)
2 changes: 1 addition & 1 deletion source/cloud/gcp/vertex-ai.md
@@ -16,7 +16,7 @@ RAPIDS can be deployed on [Vertex AI Workbench](https://cloud.google.com/vertex-
6. After customizing any other aspects of the machine you wish, click **CREATE**.

```{tip}
-If you want to select a different GPU or select other hardware options you can select "Advanced Options" at the bottom and then make changes in the "Machine type" seciton.
+If you want to select a different GPU or select other hardware options you can select "Advanced Options" at the bottom and then make changes in the "Machine type" section.
```

## Install RAPIDS
4 changes: 2 additions & 2 deletions source/examples/rapids-1brc-single-node/notebook.ipynb
@@ -163,15 +163,15 @@
" {\n",
" # Choose a random station from the lookup table for each row in our output\n",
" \"station\": cp.random.randint(0, len(lookup_df) - 1, int(chunksize)),\n",
" # Generate a normal distibution around zero for each row in our output\n",
" # Generate a normal distribution around zero for each row in our output\n",
" # Because the std is the same for every station we can adjust the mean for each row afterwards\n",
" \"measure\": cp.random.normal(0, std, int(chunksize)),\n",
" }\n",
" )\n",
"\n",
" # Offset each measurement by the station's mean value\n",
" df.measure += df.station.map(lookup_df.mean_temp)\n",
" # Round the temprature to one decimal place\n",
" # Round the temperature to one decimal place\n",
" df.measure = df.measure.round(decimals=1)\n",
" # Convert the station index to the station name\n",
" df.station = df.station.map(lookup_df.station)\n",
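The comment about adjusting the mean afterwards leans on a basic property of the normal distribution: drawing from `N(0, std)` and adding `mean` is equivalent to drawing from `N(mean, std)`, so one vectorized draw covers every station. A small CPU-side sketch of the same trick with NumPy (the notebook does this on the GPU with CuPy; the station means here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
station_means = np.array([10.0, 25.0, -3.0])  # hypothetical per-station means
stations = rng.integers(0, len(station_means), size=1_000_000)

# One draw from N(0, std) for every row, then shift by each row's station mean.
measurements = rng.normal(0.0, 5.0, size=stations.size) + station_means[stations]

print(measurements[:5].round(1))
```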
@@ -2179,7 +2179,7 @@
"source": [
"## Closing thoughts\n",
"\n",
"By sharing a Kubernetes cluster between many users who are all launching many ephemeral Dask Clusters to perform their work we are able to balance cost vs user time. Peaks in individual user demands get smoothed out over time in a multi-tenant model, and the overall peaks and troughs of the day are accomodated by the Kubernetes cluster autoscaler.\n",
"By sharing a Kubernetes cluster between many users who are all launching many ephemeral Dask Clusters to perform their work we are able to balance cost vs user time. Peaks in individual user demands get smoothed out over time in a multi-tenant model, and the overall peaks and troughs of the day are accommodated by the Kubernetes cluster autoscaler.\n",
"\n",
"We managed to create a responsive experience for our users where they generally got Dask clusters in a few seconds. We also managed to hit 64% utilization of the GPUs in our cluster, a very respectable number for an interactive cluster. \n",
"\n",
4 changes: 2 additions & 2 deletions source/examples/rapids-azureml-hpo/rapids_csp_azure.py
@@ -78,7 +78,7 @@ def __init__(

def load_hyperparams(self, model_name="XGBoost"):
"""
-Selecting model paramters based on the model we select for execution.
+Selecting model parameters based on the model we select for execution.
Checks if there is a config file present in the path self.CSP_paths['hyperparams'] with
the parameters for the experiment. If not present, it returns the default parameters.

@@ -304,7 +304,7 @@ def train_model(self, X_train, y_train, model_params):
Parameters
----------
X_train : dataframe
-The data for traning
+The data for training
y_train : dataframe
The label to be used for training.
model_params : dict
2 changes: 1 addition & 1 deletion source/examples/rapids-ec2-mnmg/notebook.ipynb
@@ -261,7 +261,7 @@
"\n",
"We'll add new features to the dataframe:\n",
"\n",
"1. We can split the datetime column to retrive year, month, day, hour, day_of_week columns. Find the difference between pickup time and drop off time. \n",
"1. We can split the datetime column to retrieve year, month, day, hour, day_of_week columns. Find the difference between pickup time and drop off time. \n",
"2. Haversine Distance between the pick-up and drop-off coordinates."
]
},
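Feature 2 in the list above is the haversine (great-circle) distance. A plain-Python sketch of the formula for a single pair of coordinates; the notebook computes the same quantity in vectorized form on the GPU, and the example points below are arbitrary:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

# e.g. Times Square to JFK airport -- roughly 21 km as the crow flies
print(haversine_km(40.758, -73.985, 40.641, -73.778))
```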
4 changes: 2 additions & 2 deletions source/examples/rapids-sagemaker-hpo/helper_functions.py
@@ -64,7 +64,7 @@ def validate_dockerfile(rapids_base_container, dockerfile_name="Dockerfile"):
with open(dockerfile_name) as dockerfile_handle:
if rapids_base_container not in dockerfile_handle.read():
raise Exception(
"Dockerfile base layer [i.e. FROM statment] does"
"Dockerfile base layer [i.e. FROM statement] does"
" not match the variable rapids_base_container"
)

@@ -166,7 +166,7 @@ def new_job_name_from_config(
):
"""
Build a jobname string that captures the HPO configuration options.
-This is helpful for intepreting logs and for general book-keeping
+This is helpful for interpreting logs and for general book-keeping
"""
job_name = None
try:
8 changes: 4 additions & 4 deletions source/examples/rapids-sagemaker-hpo/notebook.ipynb
@@ -435,7 +435,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We also need to decide how may total experiments to run, and how many should run in parallel. Below we have a very conservative number of maximum jobs to run so that you don't accidently spawn large computations when starting out, however for meaningful HPO searches this number should be much higher (e.g., in our experiments we often run 100 max_jobs). Note that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) for additional `max_parallel_jobs` parallel workers. "
"We also need to decide how may total experiments to run, and how many should run in parallel. Below we have a very conservative number of maximum jobs to run so that you don't accidentally spawn large computations when starting out, however for meaningful HPO searches this number should be much higher (e.g., in our experiments we often run 100 max_jobs). Note that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) for additional `max_parallel_jobs` parallel workers. "
]
},
{
@@ -635,7 +635,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To build a RAPIDS enabled SageMaker HPO we first need to build a [SageMaker Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html). An Estimator is a container image that captures all the software needed to run an HPO experiment. The container is augmented with entrypoint code that will be trggered at runtime by each worker. The entrypoint code enables us to write custom models and hook them up to data. "
"To build a RAPIDS enabled SageMaker HPO we first need to build a [SageMaker Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html). An Estimator is a container image that captures all the software needed to run an HPO experiment. The container is augmented with entrypoint code that will be triggered at runtime by each worker. The entrypoint code enables us to write custom models and hook them up to data. "
]
},
{
@@ -1491,7 +1491,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We only focus on a single metric, which we call 'final-score', that captures the accuracy of our model on the test data unseen during training. You are of course welcome to add aditional metrics, see [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html). When defining a metric we provide a regular expression (i.e., string parsing rule) to extract the key metric from the output of each Estimator/worker."
"We only focus on a single metric, which we call 'final-score', that captures the accuracy of our model on the test data unseen during training. You are of course welcome to add additional metrics, see [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html). When defining a metric we provide a regular expression (i.e., string parsing rule) to extract the key metric from the output of each Estimator/worker."
]
},
{
@@ -2179,7 +2179,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned in the introduction we find a <span style=\"color:#8735fb; font-size:14pt\"> **12X** </span> speedup in wall clock time and a <span style=\"color:#8735fb; font-size:14pt\"> **4.5x** </span> reduction in cost when comparing between GPU and CPU instances on 100 HPO trials using 10 parallel workers on 10 years of the Airline Dataset (~63M flights). In these experiments we used the XGBoost algorithm with the multi-GPU vs multi-CPU Dask cluster and 10 cross validaiton folds. Below we offer a table with additional details."
"As mentioned in the introduction we find a <span style=\"color:#8735fb; font-size:14pt\"> **12X** </span> speedup in wall clock time and a <span style=\"color:#8735fb; font-size:14pt\"> **4.5x** </span> reduction in cost when comparing between GPU and CPU instances on 100 HPO trials using 10 parallel workers on 10 years of the Airline Dataset (~63M flights). In these experiments we used the XGBoost algorithm with the multi-GPU vs multi-CPU Dask cluster and 10 cross validation folds. Below we offer a table with additional details."
]
},
{
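The 'final-score' cell above says a regular expression extracts the metric from each worker's log output. A hypothetical sketch of that parsing step: the metric name comes from the notebook, but the log format and regex here are illustrative assumptions, not the notebook's actual definition:

```python
import re

# Hypothetical worker log line; the real training script may format it differently.
log_line = "final-score: 0.9042"

# SageMaker metric definitions pair a metric name with a capture-group regex.
metric_definition = {"Name": "final-score", "Regex": r"final-score: (\d+\.\d+)"}

match = re.search(metric_definition["Regex"], log_line)
if match:
    print(float(match.group(1)))  # 0.9042
```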
2 changes: 1 addition & 1 deletion source/examples/rapids-sagemaker-hpo/serve.py
@@ -109,7 +109,7 @@ def load_trained_model():
def predict():
"""
Run CPU or GPU inference on input data,
-called everytime an incoming request arrives
+called every time an incoming request arrives
"""
# parse user input
try:
@@ -92,7 +92,7 @@
"1. Option 1: Specify an additional scope when provisioning the GKE cluster.\n",
"\n",
" When you are provisioning a new GKE cluster, add the `storage-rw` scope.\n",
" This option is only available if you are creating a new cluster from scratch. If you are using an exising GKE cluster, see Option 2.\n",
" This option is only available if you are creating a new cluster from scratch. If you are using an existing GKE cluster, see Option 2.\n",
"\n",
" Example:\n",
"```\n",
@@ -103,7 +103,7 @@
"\n",
"2. Option 2: Grant bucket access to the associated service account.\n",
"\n",
" Find out which service account is associated with your GKE cluster. You can grant the bucket access to the service account as follows: Nagivate to the Cloud Storage console, open the Bucket Details page for the bucket, open the Permissions tab, and click on Grant Access.\n",
" Find out which service account is associated with your GKE cluster. You can grant the bucket access to the service account as follows: Navigate to the Cloud Storage console, open the Bucket Details page for the bucket, open the Permissions tab, and click on Grant Access.\n",
" \n",
"Enter the name of the bucket that your cluster has read-write access to:"
]
@@ -3148,7 +3148,7 @@
"source": [
"### Assign weights for product items\n",
"\n",
"When we assess the accuracy of our machine learning model, we should assign a weight for each product item, to indicate the relative importance of the item. For the M5 competition, the weights are computed from the total sales amount (in US dollars) in the lastest 28 days."
"When we assess the accuracy of our machine learning model, we should assign a weight for each product item, to indicate the relative importance of the item. For the M5 competition, the weights are computed from the total sales amount (in US dollars) in the last 28 days."
]
},
{
@@ -6520,7 +6520,7 @@
"### Define cross-validation folds\n",
"**[Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))** is a statistical method for estimating how well a machine learning model generalizes to an independent data set. The method is also useful for evaluating the choice of a given combination of model hyperparameters.\n",
"\n",
"To estimate the capacity to generalize, we define multiple cross-validation **folds** consisting of mulitple pairs of `(training set, validation set)`. For each fold, we fit a model using the training set and evaluate its accuracy on the validation set. The \"goodness\" score for a given hyperparameter combination is the accuracy of the model on each validation set, averaged over all cross-validation folds.\n",
"To estimate the capacity to generalize, we define multiple cross-validation **folds** consisting of multiple pairs of `(training set, validation set)`. For each fold, we fit a model using the training set and evaluate its accuracy on the validation set. The \"goodness\" score for a given hyperparameter combination is the accuracy of the model on each validation set, averaged over all cross-validation folds.\n",
"\n",
"Great care must be taken when defining cross-validation folds for time-series data. We are not allowed to use the future to predict the past, so the training set must precede (in time) the validation set. Consequently, we partition the data set in the time dimension and assign the training and validation sets using time ranges:"
]
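The time-range folds described above can be sketched with plain date cutoffs: each fold trains on everything before a cutoff and validates on the window immediately after it, so training data always precedes validation data. A hedged illustration with pandas; the column names, dates, and 28-day window are placeholder choices, not values from the notebook:

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.date_range("2015-01-01", periods=120, freq="D"), "sales": range(120)}
)

# Each fold: train strictly before the cutoff, validate on the next 28 days.
cutoffs = pd.to_datetime(["2015-03-01", "2015-04-01"])
folds = []
for cutoff in cutoffs:
    train = df[df["date"] < cutoff]
    valid = df[(df["date"] >= cutoff) & (df["date"] < cutoff + pd.Timedelta(days=28))]
    folds.append((train, valid))

for i, (train, valid) in enumerate(folds):
    print(f"fold {i}: train={len(train)} rows, valid={len(valid)} rows")
```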
@@ -23,7 +23,7 @@
"source": [
"# Multi-Node Multi-GPU XGBoost Example on Azure using dask-cloudprovider\n",
"\n",
"[Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/) is a native cloud intergration library for Dask. It helps manage Dask clusters on different cloud platforms. In this notebook, we will look at how we can use this package to set-up an Azure cluster and run a multi-node multi-GPU (MNMG) example with [RAPIDS](https://rapids.ai/). RAPIDS provides a suite of libraries to accelerate data science pipelines on the GPU entirely. This can be scaled to multiple nodes using Dask as we will see in this notebook. \n",
"[Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/) is a native cloud integration library for Dask. It helps manage Dask clusters on different cloud platforms. In this notebook, we will look at how we can use this package to set-up an Azure cluster and run a multi-node multi-GPU (MNMG) example with [RAPIDS](https://rapids.ai/). RAPIDS provides a suite of libraries to accelerate data science pipelines on the GPU entirely. This can be scaled to multiple nodes using Dask as we will see in this notebook. \n",
"\n",
"For the purposes of this demo, we will use a part of the NYC Taxi Dataset (only the files of 2014 calendar year will be used here). The goal is to predict the fare amount for a given trip given the times and coordinates of the taxi trip. We will download the data from [Azure Open Datasets](https://docs.microsoft.com/en-us/azure/open-datasets/overview-what-are-open-datasets), where the dataset is publicly hosted by Microsoft.\n",
"\n",
Expand Down
2 changes: 1 addition & 1 deletion source/guides/azure/infiniband.md
@@ -1,6 +1,6 @@
# How to Setup InfiniBand on Azure

-[Azure GPU optmized virtual machines](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provide
+[Azure GPU optimized virtual machines](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provide
a low latency and high bandwidth InfiniBand network. This guide walks through the steps to enable InfiniBand to
optimize network performance.

4 changes: 2 additions & 2 deletions source/guides/colocate-workers.md
@@ -54,7 +54,7 @@ After your drivers are installed, you are ready to test your cluster.

### Installing Dask operator with Helm

-The operator has a Helm chart which can be used to manage the installation of the operator. Follow the instructions provided in the [Dask documention](https://kubernetes.dask.org/en/latest/installing.html#installing-with-helm), or alternatively can be installed via:
+The operator has a Helm chart which can be used to manage the installation of the operator. Follow the instructions provided in the [Dask documentation](https://kubernetes.dask.org/en/latest/installing.html#installing-with-helm), or alternatively can be installed via:

```console
$ helm install --create-namespace -n dask-operator --generate-name --repo https://helm.dask.org dask-kubernetes-operator
@@ -233,7 +233,7 @@ client = Client("localhost:8786")

## Example using `KubeCluster`

-In additon to creating clusters via `kubectl` you can also do so from Python with {class}`dask_kubernetes.operator.KubeCluster`. This class implements the Dask Cluster Manager interface and under the hood creates and manages the `DaskCluster` resource for you. You can also generate a spec with make_cluster_spec() which KubeCluster uses internally and then modify it with your custom options. We will use this to add node affinity to the scheduler.
+In addition to creating clusters via `kubectl` you can also do so from Python with {class}`dask_kubernetes.operator.KubeCluster`. This class implements the Dask Cluster Manager interface and under the hood creates and manages the `DaskCluster` resource for you. You can also generate a spec with make_cluster_spec() which KubeCluster uses internally and then modify it with your custom options. We will use this to add node affinity to the scheduler.
In the following example, the same cluster configuration as the `kubectl` example is used.

```python
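The paragraph above mentions generating a spec with make_cluster_spec(), modifying it, and handing it to KubeCluster. A sketch of that flow under stated assumptions: the `node-type` label and its values are hypothetical, and the exact spec layout should be checked against the dask-kubernetes documentation:

```python
from dask_kubernetes.operator import KubeCluster, make_cluster_spec

spec = make_cluster_spec(name="mycluster", n_workers=2)

# Assumed: the cluster's node pools carry a "node-type" label. Pin the
# scheduler pod to CPU-only nodes by patching the generated pod spec.
spec["spec"]["scheduler"]["spec"]["affinity"] = {
    "nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {"matchExpressions": [
                    {"key": "node-type", "operator": "In", "values": ["cpu"]}
                ]}
            ]
        }
    }
}

cluster = KubeCluster(custom_cluster_spec=spec)
```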