
Commit 88577c7

Updated e2e tests to support an S3-compatible storage bucket from which to download MNIST datasets for disconnected automation
1 parent 9c1e65d commit 88577c7

11 files changed

+331
-19
lines changed

Diff for: .pre-commit-config.yaml

+1
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ repos:
77
- id: trailing-whitespace
88
- id: end-of-file-fixer
99
- id: check-yaml
10+
args: [--allow-multiple-documents]
1011
- id: check-added-large-files
1112
- repo: https://github.com/psf/black
1213
rev: 23.3.0

Diff for: docs/e2e.md

+22-5
@@ -108,8 +108,25 @@ Currently the SDK doesn't support tolerations, so e2e tests can't be executed on
108108
```
109109
poetry run pytest -v -s ./tests/e2e -m openshift --timeout=1200
110110
```
111-
- If the cluster doesn't have NVidia GPU support or GPU nodes have taint then we need to disable NVidia GPU tests by providing proper marker:
112-
```
113-
poetry install --with test,docs
114-
poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_kind_test.py -m 'not nvidia_gpu'
115-
```
111+
112+
## On OpenShift Disconnected clusters
113+
114+
- In addition to the setup phase described above for OpenShift clusters, a disconnected environment requires the following prerequisites:
115+
- Mirror image registry:
116+
- An image mirror registry hosts the set of container images required by the applications and services locally, so images can be pulled without an external network connection. It also ensures continuous operation and deployment capability in a network-isolated environment.
117+
- PyPI mirror index:
118+
- When installing Python packages in a disconnected environment, the pip command can fail because it cannot reach external package indexes. This can be resolved by setting up a PyPI mirror index on a separate endpoint in the same environment.
119+
- S3-compatible storage:
120+
- Some of our distributed training examples require an external storage solution so that all nodes can access the same data in a disconnected environment (for example, common datasets and model files).
121+
- A MinIO instance (S3-compatible storage) can be deployed in the disconnected environment using `/tests/e2e/minio_deployment.yaml` or using the support methods in the e2e test suite.
122+
- The following environment variables configure the PIP index URL for accessing the required common Python packages and the S3 or MinIO storage for your Ray Train script or interactive session.
123+
```
124+
export RAY_IMAGE=quay.io/project-codeflare/ray@sha256:<image-digest> # prefer an image digest over an image tag in a disconnected environment
125+
PIP_INDEX_URL=https://<bastion-node-endpoint-url>/root/pypi/+simple/ \
126+
PIP_TRUSTED_HOST=<bastion-node-endpoint-url> \
127+
AWS_DEFAULT_ENDPOINT=<s3-compatible-storage-endpoint-url> \
128+
AWS_ACCESS_KEY_ID=<s3-compatible-storage-access-key> \
129+
AWS_SECRET_ACCESS_KEY=<s3-compatible-storage-secret-key> \
130+
AWS_STORAGE_BUCKET=<storage-bucket-name> \
131+
AWS_STORAGE_BUCKET_MNIST_DIR=<storage-bucket-MNIST-datasets-directory>
132+
```
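The variables above are read back inside the training script to decide whether to pull MNIST from the bucket or from the public mirror. A minimal sketch of that read-back, assuming the same variable names (the function name here is illustrative, not part of the SDK):

```python
import os

def storage_bucket_config():
    """Collect the S3/MinIO settings exported above.

    Returns None when AWS_DEFAULT_ENDPOINT is unset or empty, which is the
    signal (as in tests/e2e/mnist.py) to fall back to the public MNIST mirror.
    """
    endpoint = os.environ.get("AWS_DEFAULT_ENDPOINT")
    if not endpoint:
        return None
    return {
        "endpoint": endpoint,
        "access_key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "bucket": os.environ.get("AWS_STORAGE_BUCKET"),
        "mnist_dir": os.environ.get("AWS_STORAGE_BUCKET_MNIST_DIR"),
    }
```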

Diff for: tests/e2e/local_interactive_sdk_oauth_test.py

+3
@@ -28,6 +28,8 @@ def test_local_interactives(self):
2828
self.run_local_interactives()
2929

3030
def run_local_interactives(self):
31+
ray_image = get_ray_image()
32+
3133
auth = TokenAuthentication(
3234
token=run_oc_command(["whoami", "--show-token=true"]),
3335
server=run_oc_command(["whoami", "--show-server=true"]),
@@ -46,6 +48,7 @@ def run_local_interactives(self):
4648
worker_cpu_limits=1,
4749
worker_memory_requests=1,
4850
worker_memory_limits=4,
51+
image=ray_image,
4952
verify_tls=False,
5053
)
5154
)

Diff for: tests/e2e/minio_deployment.yaml

+163
@@ -0,0 +1,163 @@
1+
---
2+
kind: PersistentVolumeClaim
3+
apiVersion: v1
4+
metadata:
5+
name: minio-pvc
6+
spec:
7+
accessModes:
8+
- ReadWriteOnce
9+
resources:
10+
requests:
11+
storage: 20Gi
12+
volumeMode: Filesystem
13+
---
14+
kind: Secret
15+
apiVersion: v1
16+
metadata:
17+
name: minio-secret
18+
stringData:
19+
# change the username and password to your own values.
20+
# ensure that the user is at least 3 characters long and the password at least 8
21+
minio_root_user: minio
22+
minio_root_password: minio123
23+
---
24+
kind: Deployment
25+
apiVersion: apps/v1
26+
metadata:
27+
name: minio
28+
spec:
29+
replicas: 1
30+
selector:
31+
matchLabels:
32+
app: minio
33+
template:
34+
metadata:
35+
creationTimestamp: null
36+
labels:
37+
app: minio
38+
spec:
39+
volumes:
40+
- name: data
41+
persistentVolumeClaim:
42+
claimName: minio-pvc
43+
containers:
44+
- resources:
45+
limits:
46+
cpu: 250m
47+
memory: 1Gi
48+
requests:
49+
cpu: 20m
50+
memory: 100Mi
51+
readinessProbe:
52+
tcpSocket:
53+
port: 9000
54+
initialDelaySeconds: 5
55+
timeoutSeconds: 1
56+
periodSeconds: 5
57+
successThreshold: 1
58+
failureThreshold: 3
59+
terminationMessagePath: /dev/termination-log
60+
name: minio
61+
livenessProbe:
62+
tcpSocket:
63+
port: 9000
64+
initialDelaySeconds: 30
65+
timeoutSeconds: 1
66+
periodSeconds: 5
67+
successThreshold: 1
68+
failureThreshold: 3
69+
env:
70+
- name: MINIO_ROOT_USER
71+
valueFrom:
72+
secretKeyRef:
73+
name: minio-secret
74+
key: minio_root_user
75+
- name: MINIO_ROOT_PASSWORD
76+
valueFrom:
77+
secretKeyRef:
78+
name: minio-secret
79+
key: minio_root_password
80+
ports:
81+
- containerPort: 9000
82+
protocol: TCP
83+
- containerPort: 9090
84+
protocol: TCP
85+
imagePullPolicy: IfNotPresent
86+
volumeMounts:
87+
- name: data
88+
mountPath: /data
89+
subPath: minio
90+
terminationMessagePolicy: File
91+
image: >-
92+
quay.io/minio/minio:RELEASE.2024-06-22T05-26-45Z
93+
# In case of disconnected environment, use image digest instead of tag
94+
# For example : <mirror_registry_endpoint>/minio/minio@sha256:6b3abf2f59286b985bfde2b23e37230b466081eda5dccbf971524d54c8e406b5
95+
args:
96+
- server
97+
- /data
98+
- --console-address
99+
- :9090
100+
restartPolicy: Always
101+
terminationGracePeriodSeconds: 30
102+
dnsPolicy: ClusterFirst
103+
securityContext: {}
104+
schedulerName: default-scheduler
105+
strategy:
106+
type: Recreate
107+
revisionHistoryLimit: 10
108+
progressDeadlineSeconds: 600
109+
---
110+
kind: Service
111+
apiVersion: v1
112+
metadata:
113+
name: minio-service
114+
spec:
115+
ipFamilies:
116+
- IPv4
117+
ports:
118+
- name: api
119+
protocol: TCP
120+
port: 9000
121+
targetPort: 9000
122+
- name: ui
123+
protocol: TCP
124+
port: 9090
125+
targetPort: 9090
126+
internalTrafficPolicy: Cluster
127+
type: ClusterIP
128+
ipFamilyPolicy: SingleStack
129+
sessionAffinity: None
130+
selector:
131+
app: minio
132+
---
133+
kind: Route
134+
apiVersion: route.openshift.io/v1
135+
metadata:
136+
name: minio-api
137+
spec:
138+
to:
139+
kind: Service
140+
name: minio-service
141+
weight: 100
142+
port:
143+
targetPort: api
144+
wildcardPolicy: None
145+
tls:
146+
termination: edge
147+
insecureEdgeTerminationPolicy: Redirect
148+
---
149+
kind: Route
150+
apiVersion: route.openshift.io/v1
151+
metadata:
152+
name: minio-ui
153+
spec:
154+
to:
155+
kind: Service
156+
name: minio-service
157+
weight: 100
158+
port:
159+
targetPort: ui
160+
wildcardPolicy: None
161+
tls:
162+
termination: edge
163+
insecureEdgeTerminationPolicy: Redirect
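The manifest's comment recommends pinning the MinIO image by digest rather than by tag in disconnected environments. A small illustrative check (not part of the test suite) for whether an image reference is digest-pinned:

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned by digest, not tag."""
    return bool(DIGEST_RE.search(image_ref))
```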

Diff for: tests/e2e/mnist.py

+86-5
@@ -15,6 +15,7 @@
1515
import os
1616

1717
import torch
18+
import requests
1819
from pytorch_lightning import LightningModule, Trainer
1920
from pytorch_lightning.callbacks.progress import TQDMProgressBar
2021
from torch import nn
@@ -23,9 +24,15 @@
2324
from torchmetrics import Accuracy
2425
from torchvision import transforms
2526
from torchvision.datasets import MNIST
27+
import gzip
28+
import shutil
29+
from minio import Minio
30+
2631

2732
PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
2833
BATCH_SIZE = 256 if torch.cuda.is_available() else 64
34+
35+
local_mnist_path = os.path.dirname(os.path.abspath(__file__))
2936
# %%
3037

3138
print("prior to running the trainer")
@@ -35,6 +42,25 @@
3542
print("ACCELERATOR: is ", os.getenv("ACCELERATOR"))
3643
ACCELERATOR = os.getenv("ACCELERATOR")
3744

45+
STORAGE_BUCKET_EXISTS = "AWS_DEFAULT_ENDPOINT" in os.environ
46+
print("STORAGE_BUCKET_EXISTS: ", STORAGE_BUCKET_EXISTS)
47+
48+
print(
49+
f'Storage_Bucket_Default_Endpoint : is {os.environ.get("AWS_DEFAULT_ENDPOINT")}'
50+
if "AWS_DEFAULT_ENDPOINT" in os.environ
51+
else ""
52+
)
53+
print(
54+
f'Storage_Bucket_Name : is {os.environ.get("AWS_STORAGE_BUCKET")}'
55+
if "AWS_STORAGE_BUCKET" in os.environ
56+
else ""
57+
)
58+
print(
59+
f'Storage_Bucket_Mnist_Directory : is {os.environ.get("AWS_STORAGE_BUCKET_MNIST_DIR")}'
60+
if "AWS_STORAGE_BUCKET_MNIST_DIR" in os.environ
61+
else ""
62+
)
63+
3864

3965
class LitMNIST(LightningModule):
4066
def __init__(self, data_dir=PATH_DATASETS, hidden_size=64, learning_rate=2e-4):
@@ -114,19 +140,74 @@ def configure_optimizers(self):
114140
def prepare_data(self):
115141
# download
116142
print("Downloading MNIST dataset...")
117-
MNIST(self.data_dir, train=True, download=True)
118-
MNIST(self.data_dir, train=False, download=True)
143+
144+
if (
145+
STORAGE_BUCKET_EXISTS
146+
and os.environ.get("AWS_DEFAULT_ENDPOINT") != ""
147+
and os.environ.get("AWS_DEFAULT_ENDPOINT") is not None
148+
):
149+
print("Using storage bucket to download datasets...")
150+
151+
dataset_dir = os.path.join(self.data_dir, "MNIST/raw")
152+
endpoint = os.environ.get("AWS_DEFAULT_ENDPOINT")
153+
access_key = os.environ.get("AWS_ACCESS_KEY_ID")
154+
secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
155+
bucket_name = os.environ.get("AWS_STORAGE_BUCKET")
156+
157+
client = Minio(
158+
endpoint,
159+
access_key=access_key,
160+
secret_key=secret_key,
161+
cert_check=False,
162+
)
163+
164+
if not os.path.exists(dataset_dir):
165+
os.makedirs(dataset_dir)
166+
else:
167+
print(f"Directory '{dataset_dir}' already exists")
168+
169+
# To download datasets from storage bucket's specific directory, use prefix to provide directory name
170+
prefix = os.environ.get("AWS_STORAGE_BUCKET_MNIST_DIR")
171+
# download all files from prefix folder of storage bucket recursively
172+
for item in client.list_objects(bucket_name, prefix=prefix, recursive=True):
173+
file_name = item.object_name[len(prefix) + 1 :]
174+
dataset_file_path = os.path.join(dataset_dir, file_name)
175+
if not os.path.exists(dataset_file_path):
176+
client.fget_object(bucket_name, item.object_name, dataset_file_path)
177+
else:
178+
print(f"File-path '{dataset_file_path}' already exists")
179+
# Unzip files
180+
with gzip.open(dataset_file_path, "rb") as f_in:
181+
with open(dataset_file_path.split(".")[:-1][0], "wb") as f_out:
182+
shutil.copyfileobj(f_in, f_out)
183+
# delete zip file
184+
os.remove(dataset_file_path)
185+
unzipped_filepath = dataset_file_path.split(".")[0]
186+
if os.path.exists(unzipped_filepath):
187+
print(
188+
f"Unzipped and saved dataset file to path - {unzipped_filepath}"
189+
)
190+
download_datasets = False
191+
192+
else:
193+
print("Using default MNIST mirror reference to download datasets...")
194+
download_datasets = True
195+
196+
MNIST(self.data_dir, train=True, download=download_datasets)
197+
MNIST(self.data_dir, train=False, download=download_datasets)
119198

120199
def setup(self, stage=None):
121200
# Assign train/val datasets for use in dataloaders
122201
if stage == "fit" or stage is None:
123-
mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
202+
mnist_full = MNIST(
203+
self.data_dir, train=True, transform=self.transform, download=False
204+
)
124205
self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])
125206

126207
# Assign test dataset for use in dataloader(s)
127208
if stage == "test" or stage is None:
128209
self.mnist_test = MNIST(
129-
self.data_dir, train=False, transform=self.transform
210+
self.data_dir, train=False, transform=self.transform, download=False
130211
)
131212

132213
def train_dataloader(self):
@@ -145,7 +226,7 @@ def test_dataloader(self):
145226

146227
# Init DataLoader from MNIST Dataset
147228

148-
model = LitMNIST()
229+
model = LitMNIST(data_dir=local_mnist_path)
149230

150231
print("GROUP: ", int(os.environ.get("GROUP_WORLD_SIZE", 1)))
151232
print("LOCAL: ", int(os.environ.get("LOCAL_WORLD_SIZE", 1)))
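The new `prepare_data` path downloads each `.gz` object from the bucket and then decompresses it in place. The decompress-and-cleanup step can be sketched as a standalone function, assuming single-extension filenames like the MNIST raw files (this helper is illustrative, not part of the commit):

```python
import gzip
import os
import shutil

def gunzip_and_remove(gz_path: str) -> str:
    """Decompress a downloaded .gz dataset file next to itself and delete
    the archive, mirroring the loop in prepare_data above."""
    out_path = gz_path[: -len(".gz")] if gz_path.endswith(".gz") else gz_path
    with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    os.remove(gz_path)
    return out_path
```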

Diff for: tests/e2e/mnist_pip_requirements.txt

+1
@@ -1,3 +1,4 @@
11
pytorch_lightning==1.9.5
22
torchmetrics==0.9.1
33
torchvision==0.12.0
4+
minio

Diff for: tests/e2e/mnist_raycluster_sdk_aw_kind_test.py

+2-1
@@ -19,6 +19,7 @@ def setup_method(self):
1919

2020
def teardown_method(self):
2121
delete_namespace(self)
22+
delete_kueue_resources(self)
2223

2324
def test_mnist_ray_cluster_sdk_kind(self):
2425
self.setup_method()
@@ -77,7 +78,7 @@ def assert_jobsubmit_withoutlogin_kind(self, cluster, accelerator, number_of_gpu
7778
runtime_env={
7879
"working_dir": "./tests/e2e/",
7980
"pip": "./tests/e2e/mnist_pip_requirements.txt",
80-
"env_vars": {"ACCELERATOR": accelerator},
81+
"env_vars": get_setup_env_variables(ACCELERATOR=accelerator),
8182
},
8283
entrypoint_num_gpus=number_of_gpus,
8384
)

Diff for: tests/e2e/mnist_raycluster_sdk_kind_test.py

+1-1
@@ -77,7 +77,7 @@ def assert_jobsubmit_withoutlogin_kind(self, cluster, accelerator, number_of_gpu
7777
runtime_env={
7878
"working_dir": "./tests/e2e/",
7979
"pip": "./tests/e2e/mnist_pip_requirements.txt",
80-
"env_vars": {"ACCELERATOR": accelerator},
80+
"env_vars": get_setup_env_variables(ACCELERATOR=accelerator),
8181
},
8282
entrypoint_num_gpus=number_of_gpus,
8383
)
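The tests now pass `get_setup_env_variables(ACCELERATOR=accelerator)` as the job's `env_vars` instead of a hard-coded dict, so the storage and PIP settings reach the Ray workers. The real helper lives in the e2e support module and may differ; a plausible sketch of that merge, assuming the variable names from the docs:

```python
import os

# Variables the disconnected setup exports (see docs/e2e.md) that the
# worker environment should inherit. Illustrative, not the real helper.
FORWARDED_VARS = [
    "PIP_INDEX_URL",
    "PIP_TRUSTED_HOST",
    "AWS_DEFAULT_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_STORAGE_BUCKET",
    "AWS_STORAGE_BUCKET_MNIST_DIR",
]

def setup_env_variables(**kwargs) -> dict:
    """Merge forwarded storage/PIP variables with explicit overrides."""
    env = {k: os.environ[k] for k in FORWARDED_VARS if k in os.environ}
    env.update({k: str(v) for k, v in kwargs.items()})
    return env
```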
