[CI] Unpin Dask, adopt CUDA 12.8 and RAPIDS 24.12 #11194

Merged on Feb 8, 2025 (12 commits)
26 changes: 17 additions & 9 deletions doc/contrib/ci.rst
@@ -43,18 +43,26 @@ To make changes to the CI container, carry out the following steps:
Consult :ref:`build_run_docker_locally` for this step.
4. Submit a pull request to `dmlc/xgboost-devops <https://github.com/dmlc/xgboost-devops>`_ with
the proposed changes to the Dockerfile. Make note of the pull request number. Example: ``#204``
-5. Clone `dmlc/xgboost <https://github.com/dmlc/xgboost>`_ and update all references to the
-old container to point to the new container. More specifically, all container image URIs of form
-``492475357299.dkr.ecr.us-west-2.amazonaws.com/[image_repo]:main`` should have its image tag
-(last component) replaced with ``PR-#``, where ``#`` is the pull request number.
-For the example above,
-we'd replace ``492475357299.dkr.ecr.us-west-2.amazonaws.com/xgb-ci.gpu:main`` with
-``492475357299.dkr.ecr.us-west-2.amazonaws.com/xgb-ci.gpu:PR-204``.
+5. Clone `dmlc/xgboost <https://github.com/dmlc/xgboost>`_. Locate the file
+``ops/pipeline/get-image-tag.sh``, which should have a single line
+
+.. code-block:: bash
+
+IMAGE_TAG=main
+
+To use the new container, revise the file as follows:
+
+.. code-block:: bash
+
+IMAGE_TAG=PR-XX
+
+where ``XX`` is the pull request number.
+
6. Now submit a pull request to `dmlc/xgboost <https://github.com/dmlc/xgboost>`_. The CI will
run tests using the new container. Verify that all tests pass.
7. Merge the pull request in ``dmlc/xgboost-devops``. Wait until the CI completes on the ``main`` branch.
-8. Go back to the pull request for ``dmlc/xgboost`` and revise all the container references to use
-the old tag ``:main``.
+8. Go back to the pull request for ``dmlc/xgboost`` and change ``ops/pipeline/get-image-tag.sh``
+back to ``IMAGE_TAG=main``.
9. Merge the pull request in ``dmlc/xgboost``.

.. _build_run_docker_locally:
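Steps 5 and 8 of the revised instructions come down to flipping a single assignment in ``ops/pipeline/get-image-tag.sh``. The following is a minimal sketch of that toggle, not part of the PR: the helper name ``set_image_tag`` is hypothetical, ``PR-204`` reuses the example pull request number from step 4, and the script edits a scratch copy rather than the real file so it can run outside an XGBoost checkout.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the repository): rewrite the IMAGE_TAG
# assignment in the given file, leaving comments and other lines untouched.
set_image_tag() {
  local file="$1" tag="$2" tmp
  tmp="$(mktemp)"
  sed "s/^IMAGE_TAG=.*/IMAGE_TAG=${tag}/" "${file}" > "${tmp}"
  mv "${tmp}" "${file}"
}

# Scratch stand-in for ops/pipeline/get-image-tag.sh.
scratch="$(mktemp)"
printf '%s\n' '## Update the following line to test changes to CI images' '' 'IMAGE_TAG=main' > "${scratch}"

# Step 5: point the CI at the images built from the xgboost-devops pull request.
set_image_tag "${scratch}" "PR-204"
tag_during_test="$(grep '^IMAGE_TAG=' "${scratch}")"

# Step 8: restore the default tag before merging.
set_image_tag "${scratch}" "main"
tag_after_restore="$(grep '^IMAGE_TAG=' "${scratch}")"

echo "${tag_during_test} -> ${tag_after_restore}"
rm -f "${scratch}"
```

Editing the file by hand works just as well; the point is only that one line changes in each direction.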
3 changes: 2 additions & 1 deletion ops/pipeline/build-cpu-arm64.sh
@@ -11,9 +11,10 @@ fi

source ops/pipeline/classify-git-branch.sh
source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

WHEEL_TAG=manylinux_2_28_aarch64
-IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.aarch64:main
+IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.aarch64:${IMAGE_TAG}

echo "--- Build CPU code targeting ARM64"
set -x
3 changes: 2 additions & 1 deletion ops/pipeline/build-cpu.sh
@@ -5,8 +5,9 @@ set -euo pipefail

source ops/pipeline/classify-git-branch.sh
source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

-IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.cpu:main
+IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.cpu:${IMAGE_TAG}

echo "--- Build CPU code"
set -x
2 changes: 1 addition & 1 deletion ops/pipeline/build-cuda-impl.sh
@@ -16,7 +16,7 @@ then
cmake_prefix_path='/opt/grpc;/opt/rmm;/opt/rmm/lib64/rapids/cmake'
cmake_args="${cmake_args} -DPLUGIN_RMM=ON"
else
-cmake_prefix_path='/opt/grpc;/workspace/cccl'
+cmake_prefix_path='/opt/grpc'
fi

# Disable CMAKE_COMPILE_WARNING_AS_ERROR option temporarily until
12 changes: 3 additions & 9 deletions ops/pipeline/build-cuda.sh
@@ -33,10 +33,11 @@ esac

source ops/pipeline/classify-git-branch.sh
source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

WHEEL_TAG=manylinux_2_28_x86_64
-BUILD_IMAGE_URI="${DOCKER_REGISTRY_URL}/${image_repo}:main"
-MANYLINUX_IMAGE_URI="${DOCKER_REGISTRY_URL}/xgb-ci.${WHEEL_TAG}:main"
+BUILD_IMAGE_URI="${DOCKER_REGISTRY_URL}/${image_repo}:${IMAGE_TAG}"
+MANYLINUX_IMAGE_URI="${DOCKER_REGISTRY_URL}/xgb-ci.${WHEEL_TAG}:${IMAGE_TAG}"

echo "--- Build with CUDA"

@@ -47,13 +48,6 @@ else
export BUILD_ONLY_SM75=0
fi

-if [[ ${USE_RMM} == 0 ]]
-then
-# Work around https://github.com/NVIDIA/cccl/issues/1956
-# TODO(hcho3): Remove this once new CUDA version ships with CCCL 2.6.0+
-git clone https://github.com/NVIDIA/cccl.git -b v2.6.1 --quiet
-fi
-
set -x

python3 ops/docker_run.py \
3 changes: 2 additions & 1 deletion ops/pipeline/build-gpu-rpkg.sh
@@ -10,8 +10,9 @@ fi

source ops/pipeline/classify-git-branch.sh
source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

-IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.gpu_build_r_rockylinux8:main
+IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.gpu_build_r_rockylinux8:${IMAGE_TAG}

echo "--- Build XGBoost R package with CUDA"
set -x
3 changes: 2 additions & 1 deletion ops/pipeline/build-jvm-doc.sh
@@ -18,8 +18,9 @@ then
fi

source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

-IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.jvm_gpu_build:main
+IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.jvm_gpu_build:${IMAGE_TAG}

echo "--- Build JVM packages doc"
set -x
8 changes: 3 additions & 5 deletions ops/pipeline/build-jvm-gpu.sh
@@ -5,8 +5,9 @@ set -euo pipefail

source ops/pipeline/classify-git-branch.sh
source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

-IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.jvm_gpu_build:main
+IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.jvm_gpu_build:${IMAGE_TAG}

echo "--- Build libxgboost4j.so with CUDA"

@@ -20,17 +21,14 @@ fi
COMMAND=$(
cat <<-EOF
cd build-gpu/ && \
-cmake .. -DCMAKE_PREFIX_PATH=/workspace/cccl -GNinja -DUSE_CUDA=ON -DUSE_NCCL=ON \
+cmake .. -GNinja -DUSE_CUDA=ON -DUSE_NCCL=ON \
-DJVM_BINDINGS=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ${arch_flag} && \
ninja
EOF
)

set -x
mkdir -p build-gpu/
-# Work around https://github.com/NVIDIA/cccl/issues/1956
-# TODO(hcho3): Remove this once new CUDA version ships with CCCL 2.6.0+
-git clone https://github.com/NVIDIA/cccl.git -b v2.6.1 --quiet --depth 1
python3 ops/docker_run.py \
--image-uri ${IMAGE_URI} \
-- bash -c "${COMMAND}"
3 changes: 2 additions & 1 deletion ops/pipeline/build-jvm-manylinux2014.sh
@@ -14,8 +14,9 @@ image_repo="xgb-ci.manylinux2014_${arch}"

source ops/pipeline/classify-git-branch.sh
source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

-IMAGE_URI="${DOCKER_REGISTRY_URL}/${image_repo}:main"
+IMAGE_URI="${DOCKER_REGISTRY_URL}/${image_repo}:${IMAGE_TAG}"

# Build XGBoost4J binary
echo "--- Build libxgboost4j.so (targeting glibc 2.17)"
3 changes: 2 additions & 1 deletion ops/pipeline/build-manylinux2014.sh
@@ -18,10 +18,11 @@ arch="$1"

source ops/pipeline/classify-git-branch.sh
source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

WHEEL_TAG="manylinux2014_${arch}"
IMAGE_REPO="xgb-ci.${WHEEL_TAG}"
-IMAGE_URI="${DOCKER_REGISTRY_URL}/${IMAGE_REPO}:main"
+IMAGE_URI="${DOCKER_REGISTRY_URL}/${IMAGE_REPO}:${IMAGE_TAG}"
PYTHON_BIN="/opt/python/cp310-cp310/bin/python"

echo "--- Build binary wheel for ${WHEEL_TAG}"
3 changes: 2 additions & 1 deletion ops/pipeline/build-test-jvm-packages.sh
@@ -13,6 +13,7 @@ EOF
set -euo pipefail

source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

for arg in "SCALA_VERSION"
do
@@ -23,7 +24,7 @@ do
fi
done

-IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.jvm:main
+IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.jvm:${IMAGE_TAG}

set -x

3 changes: 2 additions & 1 deletion ops/pipeline/deploy-jvm-packages.sh
@@ -5,6 +5,7 @@ set -euo pipefail

source ops/pipeline/enforce-ci.sh
source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

if [[ "$#" -lt 3 ]]
then
@@ -16,7 +17,7 @@ variant="$1"
image_repo="$2"
scala_version="$3"

-IMAGE_URI="${DOCKER_REGISTRY_URL}/${image_repo}:main"
+IMAGE_URI="${DOCKER_REGISTRY_URL}/${image_repo}:${IMAGE_TAG}"

set -x

4 changes: 4 additions & 0 deletions ops/pipeline/get-image-tag.sh
@@ -0,0 +1,4 @@
+## Update the following line to test changes to CI images
+## See https://xgboost.readthedocs.io/en/latest/contrib/ci.html#making-changes-to-ci-containers
+
+IMAGE_TAG=main
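The pipeline scripts in this PR all consume the new file the same way: they `source` it and then interpolate `${IMAGE_TAG}` into the image URI. A minimal, self-contained sketch of that pattern follows. The registry host `registry.example.com` is a placeholder assumption (in the real scripts `DOCKER_REGISTRY_URL` appears to come from sourcing `ops/pipeline/get-docker-registry-details.sh`), the temporary file stands in for `get-image-tag.sh`, and the `${IMAGE_TAG:?...}` guard is an optional addition that the PR itself does not include.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for ops/pipeline/get-image-tag.sh so the sketch is self-contained.
tag_file="$(mktemp)"
echo 'IMAGE_TAG=main' > "${tag_file}"

# Placeholder registry host (assumption, not the real registry).
DOCKER_REGISTRY_URL="registry.example.com"

# Sourcing runs the file in the current shell, so IMAGE_TAG becomes visible here.
# shellcheck disable=SC1090
source "${tag_file}"

# Optional guard (not in the PR): abort with a clear message if the tag is unset or empty.
: "${IMAGE_TAG:?IMAGE_TAG must be set by get-image-tag.sh}"

IMAGE_URI="${DOCKER_REGISTRY_URL}/xgb-ci.cpu:${IMAGE_TAG}"
echo "${IMAGE_URI}"
rm -f "${tag_file}"
```

Because every script reads the tag from one place, testing a new container requires changing a single line instead of editing each `IMAGE_URI` assignment.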
3 changes: 2 additions & 1 deletion ops/pipeline/run-clang-tidy.sh
@@ -3,8 +3,9 @@
set -euo pipefail

source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

-IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.clang_tidy:main
+IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.clang_tidy:${IMAGE_TAG}

echo "--- Run clang-tidy"
set -x
7 changes: 5 additions & 2 deletions ops/pipeline/test-cpp-gpu.sh
@@ -10,26 +10,29 @@ fi
suite=$1

source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

-IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.gpu:main
+IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.gpu:${IMAGE_TAG}

case "${suite}" in
gpu)
echo "--- Run Google Tests, using a single GPU"
python3 ops/docker_run.py --image-uri ${IMAGE_URI} --use-gpus \
+--run-args='--privileged' \
-- build/testxgboost
;;

gpu-rmm)
echo "--- Run Google Tests, using a single GPU, RMM enabled"
python3 ops/docker_run.py --image-uri ${IMAGE_URI} --use-gpus \
+--run-args='--privileged' \
-- build/testxgboost --use-rmm-pool
;;

mgpu)
echo "--- Run Google Tests, using multiple GPUs"
python3 ops/docker_run.py --image-uri ${IMAGE_URI} --use-gpus \
---run-args='--shm-size=4g' \
+--run-args='--shm-size=4g --privileged' \
-- build/testxgboost --gtest_filter=*MGPU*
;;
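After this change, every suite in `test-cpp-gpu.sh` passes `--privileged` to the container, and the multi-GPU suite additionally keeps `--shm-size=4g`. The sketch below summarizes that mapping in one place. The helper `run_args_for_suite` is hypothetical and does not exist in the repository; it only prints the value each suite hands to `--run-args`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the repository): print the docker run
# arguments each test suite passes via --run-args after this PR.
run_args_for_suite() {
  case "$1" in
    gpu|gpu-rmm) echo "--privileged" ;;
    mgpu)        echo "--shm-size=4g --privileged" ;;
    *)           echo "Unrecognized suite: $1" >&2; return 1 ;;
  esac
}

gpu_args="$(run_args_for_suite gpu)"
mgpu_args="$(run_args_for_suite mgpu)"
echo "gpu:  ${gpu_args}"
echo "mgpu: ${mgpu_args}"
```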

3 changes: 2 additions & 1 deletion ops/pipeline/test-cpp-i386.sh
@@ -4,8 +4,9 @@
set -euo pipefail

source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

-IMAGE_URI="${DOCKER_REGISTRY_URL}/xgb-ci.i386:main"
+IMAGE_URI="${DOCKER_REGISTRY_URL}/xgb-ci.i386:${IMAGE_TAG}"

set -x
python3 ops/docker_run.py \
3 changes: 2 additions & 1 deletion ops/pipeline/test-jvm-gpu.sh
@@ -24,8 +24,9 @@ do
done

source ops/pipeline/get-docker-registry-details.sh
+source ops/pipeline/get-image-tag.sh

-IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.jvm_gpu_build:main
+IMAGE_URI=${DOCKER_REGISTRY_URL}/xgb-ci.jvm_gpu_build:${IMAGE_TAG}

set -x

4 changes: 3 additions & 1 deletion ops/pipeline/test-python-wheel.sh
@@ -20,7 +20,9 @@ else
fi

source ops/pipeline/get-docker-registry-details.sh
-IMAGE_URI="${DOCKER_REGISTRY_URL}/${image_repo}:main"
+source ops/pipeline/get-image-tag.sh
+
+IMAGE_URI="${DOCKER_REGISTRY_URL}/${image_repo}:${IMAGE_TAG}"

set -x
python3 ops/docker_run.py --image-uri "${IMAGE_URI}" ${gpu_option} \
12 changes: 11 additions & 1 deletion tests/test_distributed/test_gpu_with_dask/test_gpu_with_dask.py
@@ -10,6 +10,7 @@
import pytest
from hypothesis import given, note, settings, strategies
from hypothesis._settings import duration
+from packaging.version import parse as parse_version
Review comment (Member):

This is not in the stdlib. Do we need to specify it in our testing environments?

Reply from hcho3 (Collaborator, Author), Feb 8, 2025:

Currently, our CI already has packaging installed, because it is a dependency of other packages we use. However, if we were to add a [test] suite for optional dependencies in pyproject.toml, packaging should be included in that suite.


import xgboost as xgb
from xgboost import testing as tm
@@ -44,14 +45,20 @@
try:
    import cudf
    import dask.dataframe as dd
+    from dask import __version__ as dask_version
    from dask import array as da
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    from xgboost import dask as dxgb
    from xgboost.testing.dask import check_init_estimation, check_uneven_nan
except ImportError:
-    pass
+    dask_version = None
+
+
+dask_version_ge110 = dask_version and parse_version(dask_version) >= parse_version(
+    "2024.11.0"
+)


def run_with_dask_dataframe(DMatrixT: Type, client: Client) -> None:
@@ -375,6 +382,9 @@ def test_early_stopping(self, local_cuda_client: Client) -> None:
        dump = booster.get_dump(dump_format="json")
        assert len(dump) - booster.best_iteration == early_stopping_rounds + 1

+    @pytest.mark.xfail(
+        dask_version_ge110, reason="Test cannot pass with Dask 2024.11.0+"
+    )
    @pytest.mark.skipif(**tm.no_cudf())
    @pytest.mark.parametrize("model", ["boosting"])
    def test_dask_classifier(self, model: str, local_cuda_client: Client) -> None: