Commit e856af5

Merge branch 'master' into patch-1
2 parents 4f9c2a7 + a83ab17 commit e856af5

171 files changed: +2486 -711 lines changed
.github/ISSUE_TEMPLATE/deepspeed_chat_bug_report.md

+1 -1
@@ -32,7 +32,7 @@ If applicable, add screenshots to help explain your problem.
 **System info (please complete the following information):**
 - OS: [e.g. Ubuntu 18.04]
 - GPU count and types [e.g. two machines with x8 A100s each]
-- (if applicable) what [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) version are you using
+- (if applicable) what [DeepSpeed-MII](https://github.com/deepspeedai/deepspeed-mii) version are you using
 - (if applicable) Hugging Face Transformers/Accelerate/etc. versions
 - Python version
 - Any other relevant info about your setup

.github/ISSUE_TEMPLATE/inference_bug_report.md

+1 -1
@@ -29,7 +29,7 @@ If applicable, add screenshots to help explain your problem.
 **System info (please complete the following information):**
 - OS: [e.g. Ubuntu 18.04]
 - GPU count and types [e.g. two machines with x8 A100s each]
-- (if applicable) what [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) version are you using
+- (if applicable) what [DeepSpeed-MII](https://github.com/deepspeedai/deepspeed-mii) version are you using
 - (if applicable) Hugging Face Transformers/Accelerate/etc. versions
 - Python version
 - Any other relevant info about your setup

.github/workflows/cpu-torch-latest.yml

+2 -2
@@ -59,5 +59,5 @@ jobs:
 run: |
 unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
 cd tests
-HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -n 4 unit/ --torch_ver="2.5"
-HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -m 'sequential' unit/ --torch_ver="2.5"
+HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -n 4 unit/ --torch_ver="2.6"
+HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -m 'sequential' unit/ --torch_ver="2.6"

.github/workflows/nv-a6000.yml

+6 -5
@@ -23,7 +23,7 @@ jobs:
 unit-tests:
 runs-on: [self-hosted, nvidia, a6000]
 container:
-image: nvcr.io/nvidia/pytorch:24.03-py3
+image: nvcr.io/nvidia/pytorch:24.09-py3
 ports:
 - 80
 options: --gpus all --shm-size "8G"
@@ -42,7 +42,8 @@ jobs:
 run: |
 git clone https://github.com/huggingface/transformers
 cd transformers
-git checkout v4.47.1
+# if you need to use an older transformers version temporarily in case of breakage
+# git checkout v4.47.1
 git rev-parse --short HEAD
 python -m pip install .
 - name: Install deepspeed
@@ -57,16 +58,16 @@ jobs:
 run: |
 unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
 cd tests
-python -m pytest --color=yes --durations=0 --verbose -rF -m 'inference_v2' unit/ --torch_ver="2.3" --cuda_ver="12"
-python -m pytest --color=yes --durations=0 --verbose -rF -m 'inference_v2_ops' unit/ --torch_ver="2.3" --cuda_ver="12"
+python -m pytest --color=yes --durations=0 --verbose -rF -m 'inference_v2' unit/ --torch_ver="2.5" --cuda_ver="12"
+python -m pytest --color=yes --durations=0 --verbose -rF -m 'inference_v2_ops' unit/ --torch_ver="2.5" --cuda_ver="12"
 - name: MII unit tests
 run: |
 BRANCH="main"
 if [[ ! -z "${{ github.event.inputs.mii_branch }}" ]]; then
 BRANCH="${{ github.event.inputs.mii_branch }}"
 fi
 echo "Cloning DeepSpeed-MII branch: $BRANCH"
-git clone -b $BRANCH --depth=1 https://github.com/microsoft/DeepSpeed-MII.git
+git clone -b $BRANCH --depth=1 https://github.com/deepspeedai/DeepSpeed-MII.git
 cd DeepSpeed-MII
 pip install .[dev]
 cd tests

.github/workflows/nv-ds-chat.yml

+3 -2
@@ -37,7 +37,7 @@ jobs:
 
 - name: Install pytorch
 run: |
-pip3 install -U --cache-dir $TORCH_CACHE torch --index-url https://download.pytorch.org/whl/cu121
+pip install -U --cache-dir $TORCH_CACHE torch torchvision --index-url https://download.pytorch.org/whl/cu121
 python -c "import torch; print('torch:', torch.__version__, torch)"
 python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
 
@@ -54,7 +54,7 @@ jobs:
 BRANCH="${{ github.event.inputs.dse_branch }}"
 fi
 echo "DeepSpeedExamples Branch: $BRANCH"
-git clone -b $BRANCH https://github.com/microsoft/DeepSpeedExamples.git
+git clone -b $BRANCH https://github.com/deepspeedai/DeepSpeedExamples.git
 cd DeepSpeedExamples/applications/DeepSpeed-Chat
 pip install -r requirements.txt
 pip install -e .
@@ -67,6 +67,7 @@ jobs:
 run: |
 cd DeepSpeedExamples/applications/DeepSpeed-Chat
 unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
+unset NCCL_DEBUG
 cd tests
 pytest $PYTEST_OPTS ./

.github/workflows/nv-flash-attn.yml

+2 -2
@@ -18,7 +18,7 @@ jobs:
 unit-tests:
 runs-on: [self-hosted, nvidia, a6000]
 container:
-image: nvcr.io/nvidia/pytorch:24.03-py3
+image: nvcr.io/nvidia/pytorch:24.09-py3
 ports:
 - 80
 options: --gpus all --shm-size "8G"
@@ -53,7 +53,7 @@ jobs:
 run: |
 unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
 cd tests
-python -m pytest --color=yes --durations=0 --verbose -rF unit/sequence_parallelism/test_ulysses.py --torch_ver="2.3" --cuda_ver="12"
+python -m pytest --color=yes --durations=0 --verbose -rF unit/sequence_parallelism/test_ulysses.py --torch_ver="2.5" --cuda_ver="12"
 - name: Open GitHub issue if nightly CI fails
 if: ${{ failure() && (github.event_name == 'schedule') }}
 uses: JasonEtco/create-an-issue@v2

.github/workflows/nv-human-eval.yml

+2 -2
@@ -11,7 +11,7 @@ jobs:
 unit-tests:
 runs-on: [self-hosted, nvidia, a6000]
 container:
-image: nvcr.io/nvidia/pytorch:24.03-py3
+image: nvcr.io/nvidia/pytorch:24.09-py3
 ports:
 - 80
 options: --gpus all --shm-size "8G"
@@ -50,4 +50,4 @@ jobs:
 run: |
 unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
 cd tests
-python -m pytest --color=yes --durations=0 --verbose -rF -m 'evaluation' -k "test_human_eval" unit/ --torch_ver="2.3" --cuda_ver="12"
+python -m pytest --color=yes --durations=0 --verbose -rF -m 'evaluation' -k "test_human_eval" unit/ --torch_ver="2.5" --cuda_ver="12"

.github/workflows/nv-mii.yml

+1 -1
@@ -66,7 +66,7 @@ jobs:
 BRANCH="${{ github.event.inputs.mii_branch }}"
 fi
 echo "Cloning DeepSpeed-MII branch: $BRANCH"
-git clone -b $BRANCH --depth=1 https://github.com/microsoft/DeepSpeed-MII.git
+git clone -b $BRANCH --depth=1 https://github.com/deepspeedai/DeepSpeed-MII.git
 cd DeepSpeed-MII
 pip install .[dev]
 unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch

.github/workflows/xpu-max1100.yml

+13 -6
@@ -36,26 +36,31 @@ jobs:
 unit-tests:
 runs-on: [self-hosted, intel, xpu]
 container:
-image: intel/oneapi-basekit:2024.2.1-0-devel-ubuntu22.04
+image: intel/oneapi-basekit:2025.0.1-0-devel-ubuntu24.04
 ports:
 - 80
 options: --privileged -it --rm --device /dev/dri:/dev/dri -v /dev/dri/by-path:/dev/dri/by-path --ipc=host --cap-add=ALL
 
 steps:
 - uses: actions/checkout@v4
 - name: Install prerequisite
+shell: bash
 run: |
 apt-get update
-apt-get install clinfo libaio-dev python3-pip -y
-pip install torch==2.3.1 -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/torch/
-pip install intel-extension-for-pytorch==2.3.110+xpu -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/intel-extension-for-pytorch/
-pip install oneccl_bind_pt==2.3.100+xpu -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/oneccl-bind-pt/
-pip install torchvision==0.18.1 -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/torchvision/
+apt-get install clinfo libaio-dev python3-pip python3.12-venv -y
+python3 -m venv ~/ds_env
+source ~/ds_env/bin/activate
+pip install torch==2.5.1 -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/torch/
+pip install intel-extension-for-pytorch==2.5.10+xpu -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/intel-extension-for-pytorch/
+pip install oneccl_bind_pt==2.5.0+xpu -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/oneccl-bind-pt/
+pip install torchvision==0.20.1 -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/torchvision/
 pip install py-cpuinfo numpy
 pip install .[dev,autotuning]
 
 - name: Check container state
+shell: bash
 run: |
+source ~/ds_env/bin/activate
 ldd --version
 ds_report
 python3 -c "import torch; print('torch:', torch.__version__, torch)"
@@ -64,7 +69,9 @@ jobs:
 pip list
 
 - name: Unit tests
+shell: bash
 run: |
+source ~/ds_env/bin/activate
 cd tests/unit
 pytest --verbose accelerator/*
 pytest --verbose autotuning/*
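
The hunks above add `source ~/ds_env/bin/activate` to every step rather than only to the one that creates the virtual environment. A likely reason is that each `run` step in GitHub Actions starts its own shell process, so an activation done in one step does not carry over to the next. The `shell: bash` lines probably exist because `source` is a bash builtin and container jobs default to `sh`. The following is a minimal Python sketch of the first point only. `run_step` is a hypothetical helper, and the child processes stand in for workflow steps; nothing here is taken from the DeepSpeed repository.

```python
import os
import subprocess
import sys


def run_step(script: str, env: dict) -> str:
    """Run one 'step' as its own child process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", script],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The environment every step starts from (the job's base environment).
base_env = dict(os.environ)
base_env.pop("VIRTUAL_ENV", None)

# Step 1 "activates" an environment by setting a variable in its own process.
step1 = run_step(
    "import os; os.environ['VIRTUAL_ENV'] = '~/ds_env'; "
    "print(os.environ['VIRTUAL_ENV'])",
    base_env,
)

# Step 2 is a fresh process started from the same base environment,
# so the change made by step 1 is not visible to it.
step2 = run_step(
    "import os; print(os.environ.get('VIRTUAL_ENV', 'unset'))",
    base_env,
)

print(step1, step2)
```

Under that assumption, anything a step needs from the virtual environment has to be re-established at the top of the step, which is the pattern the workflow follows.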

CONTRIBUTING.md

+4 -4
@@ -23,7 +23,7 @@ and then repeat the previous `git commit` command.
 ## Testing
 DeepSpeed tracks two types of tests: unit tests and more costly model convergence tests.
 The model convergence tests train
-[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) and measure
+[DeepSpeedExamples](https://github.com/deepspeedai/DeepSpeedExamples/) and measure
 end-to-end convergence and related metrics. Unit tests are found in `tests/unit/` and
 the model convergence tests are found in `tests/model/`.
 
@@ -40,7 +40,7 @@ tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) an
 
 ### Model Tests
 To execute model tests, first [install DeepSpeed](#installation). The
-[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) repository is cloned
+[DeepSpeedExamples](https://github.com/deepspeedai/DeepSpeedExamples/) repository is cloned
 as part of this process. Next, execute the model test driver:
 ```bash
 cd tests/model/
@@ -85,8 +85,8 @@ Based on the issue we shall discuss the merit of the new feature and decide whet
 ### Step 2: implementation and verification
 Contributor will go ahead and implement the feature, and the DeepSpeed team will provide guidance/helps as needed. The required deliverables include:
 
-* A PR to [microsoft/DeepSpeed](https://github.com/microsoft/DeepSpeed) including (1) the feature implementation (2) unit tests (3) documentation (4) tutorial
-* A PR to [microsoft/DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples) or [microsoft/Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) including the examples of how to use the feature (this is related to the planned testing experiments in proposal)
+* A PR to [deepspeedai/DeepSpeed](https://github.com/deepspeedai/DeepSpeed) including (1) the feature implementation (2) unit tests (3) documentation (4) tutorial
+* A PR to [deepspeedai/DeepSpeedExamples](https://github.com/deepspeedai/DeepSpeedExamples) or [deepspeedai/Megatron-DeepSpeed](https://github.com/deepspeedai/Megatron-DeepSpeed) including the examples of how to use the feature (this is related to the planned testing experiments in proposal)
 * In the implementation (code, documentation, tutorial), we require the feature author to record their GitHub username as a contact method for future questions/maintenance.
 
 After receiving the PRs, we will review them and merge them after necessary tests/fixes.
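
Most of the hunks shown above repeat one substitution: links under `github.com/microsoft/` for the DeepSpeed repositories now point at `github.com/deepspeedai/`. As a rough illustration of that rewrite (not the tooling actually used for this commit, which the page does not show), the sketch below applies it to a string. The names `MOVED_REPOS` and `rewrite_org` are hypothetical, and the repository list covers only the names visible in these hunks.

```python
import re

# Repositories whose links move to the new organization in the hunks above.
# Illustrative only; the full commit touches 171 files and may cover more.
MOVED_REPOS = ("DeepSpeed", "DeepSpeedExamples", "DeepSpeed-MII", "Megatron-DeepSpeed")

# Longest names first, plus a lookahead, so "DeepSpeed" never matches as a
# prefix of "DeepSpeedExamples" or "DeepSpeed-MII".
_ALTERNATION = "|".join(
    re.escape(name) for name in sorted(MOVED_REPOS, key=len, reverse=True)
)
_PATTERN = re.compile(
    r"github\.com/microsoft/(" + _ALTERNATION + r")(?![\w-])",
    re.IGNORECASE,  # the issue templates use the lowercase "deepspeed-mii"
)


def rewrite_org(text: str) -> str:
    """Point URLs of the listed repositories at the deepspeedai organization."""
    return _PATTERN.sub(lambda m: "github.com/deepspeedai/" + m.group(1), text)


print(rewrite_org("git clone -b main https://github.com/microsoft/DeepSpeedExamples.git"))
```

The sketch rewrites URLs only. Link labels such as `[microsoft/DeepSpeed]`, which the commit also changes in CONTRIBUTING.md, would need a separate rule, and links to unrelated repositories under the same organization are left untouched.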
