enable tp on CPU #36299

Merged
merged 21 commits into huggingface:main on Mar 31, 2025
Conversation

jiqing-feng
Contributor

@jiqing-feng jiqing-feng commented Feb 20, 2025

The CPU device cannot take an index.

If we pass an index for the CPU device, the check will never pass.
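As a minimal illustration of the mismatch (not the PR's code; just assumed PyTorch device semantics, where equality compares both type and index):

import torch

t = torch.empty(1)
print(t.device)        # device(type='cpu') -- the index is None on CPU
print(t.device.index)  # None
# A check that compares against an indexed CPU device therefore never matches:
print(t.device == torch.device("cpu", 0))  # False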

@Rocketknight1
Member

Is there a reason we want to support TP on CPU? I assumed it would mainly be useful for multi-GPU nodes.

Signed-off-by: jiqing-feng <[email protected]>
@jiqing-feng
Contributor Author

jiqing-feng commented Feb 21, 2025

Is there a reason we want to support TP on CPU? I assumed it would mainly be useful for multi-GPU nodes.

Intel Xeon CPUs have multiple NUMA nodes, which means we can run a TP model with each shard on its own NUMA node. Currently we can enable this functionality and select the gloo backend to run a TP model on CPU.

Besides, we should always make sure that the CPU device is not assigned an index.
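For context, a minimal sketch of bringing up a CPU process group with the gloo backend (an illustration only, assuming two ranks launched with torchrun and pinned to separate NUMA nodes with numactl, as in the command further down):

import torch.distributed as dist

# gloo is PyTorch's built-in CPU-capable collective backend
dist.init_process_group(backend="gloo")
print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
dist.destroy_process_group()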

@Rocketknight1
Member

In that case this change makes sense to me, but maybe we should just raise an error saying that TP on CPU is not supported yet, rather than setting index to None? cc @ArthurZucker @Cyrilvallez

@jiqing-feng
Contributor Author

In that case this change makes sense to me, but maybe we should just raise an error saying that TP on CPU is not supported yet, rather than setting index to None? cc @ArthurZucker @Cyrilvallez

Actually, the TP functionality is already working on CPU; just run the following:

CMD:

OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 torchrun --nnodes=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 --nproc-per-node 1 tp_hf.py &
OMP_NUM_THREADS=56 numactl -C 56-111 -m 1 torchrun --nnodes=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=29500 --nproc-per-node 1 tp_hf.py &
wait

import os
import time

import torch
import torch.distributed as dist
from transformers import AutoModel, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"


def main(is_tp, rank, world_size) -> None:
    backend = "ccl"  # requires the oneCCL bindings for PyTorch; gloo also supports CPU
    print(is_tp)
    if is_tp:
        dist.init_process_group(backend)

    model_kwargs = dict(torch_dtype=torch.bfloat16)
    if is_tp:
        model_kwargs["tp_plan"] = "auto"
    else:
        model_kwargs["device_map"] = "cpu"

    # Retrieve tensor parallel model
    model = AutoModel.from_pretrained(model_id, **model_kwargs)
    print(model.dtype)

    # Prepare input tokens
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    prompt = "Can I help" * 200
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).input_ids.to(model.device)
    print(f"input shape is {inputs.shape}")

    # model = torch.compile(model)
    # warm-up (barrier only makes sense when a process group exists)
    if is_tp:
        dist.barrier()
    with torch.no_grad():
        for i in range(5):
            outputs = model(inputs)

    if is_tp:
        dist.barrier()
    for i in range(5):
        with torch.no_grad():
            start = time.time()
            outputs = model(inputs)
            end = time.time()
            print(f"time cost {(end - start) * 1000} ms")

    print(outputs)


if __name__ == "__main__":
    rank = int(os.environ["RANK"]) if "RANK" in os.environ else 0
    world_size = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
    is_tp = "RANK" in os.environ
    main(is_tp, rank, world_size)

@Cyrilvallez
Member

Hey @jiqing-feng! The TP code is going to change quite a bit in the near future as we work to improve loading efficiency, so it would be best to put this issue on hold for now and revisit afterwards 🤗
As a side note, did you experiment with it on your setup? Is it truly worth it/faster compared to having the model on cpu as usual? 🤔

@jiqing-feng
Contributor Author

jiqing-feng commented Feb 25, 2025

Hey @jiqing-feng! The TP code is going to change quite a bit in the near future as we work to improve loading efficiency, so it would be best to put this issue on hold for now and revisit afterwards 🤗 As a side note, did you experiment with it on your setup? Is it truly worth it/faster compared to having the model on cpu as usual? 🤔

OK, but I suppose this change is tiny enough not to impact the refactor. It's okay to wait for your refactor.
For now, the performance is not as good as non-TP, but the functionality is ready; we'd like to enable the functionality first and then resolve the performance issue. Thanks.

@jiqing-feng
Contributor Author

jiqing-feng commented Feb 26, 2025

Hi @SunMarc @Rocketknight1 @Cyrilvallez. As this change is really tiny, and the logic that the CPU device cannot be assigned an index is reasonable, could we merge this PR? We will optimize TP performance on CPU as our next step.

Member

@SunMarc SunMarc left a comment

Yeah, I think we can merge this without impacting the refactor. cc @Cyrilvallez

@SunMarc SunMarc requested a review from Cyrilvallez February 26, 2025 13:53

Collaborator

@ArthurZucker ArthurZucker left a comment

Super nice! it's missing:

Otherwise much welcome 🤗

@jiqing-feng
Contributor Author

Super nice! it's missing:

Otherwise much welcome 🤗

Hi @ArthurZucker ,

  1. I will enable the CPU TP doc after we fix the performance issue.
  2. I'd like to enable the CPU tests here, but the test hung when I ran it on CUDA. My command is: PYTHONPATH="src" python -m torch.distributed.run --nproc_per_node 2 ./tests/tp/test_tp.py . Do you have a more detailed guide for running the test?
  3. Done.

@jiqing-feng jiqing-feng marked this pull request as draft March 6, 2025 06:58
@jiqing-feng
Contributor Author

Converting to draft because of a new regression:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/jiqingfe/tp_hf.py", line 100, in <module>
[rank0]:     main(is_tp, rank, world_size)
[rank0]:   File "/home/jiqingfe/tp_hf.py", line 56, in main
[rank0]:     outputs = model(inputs)
[rank0]:   File "/root/miniforge3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/miniforge3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/jiqingfe/transformers/src/transformers/models/llama/modeling_llama.py", line 571, in forward
[rank0]:     position_embeddings = self.rotary_emb(hidden_states, position_ids)
[rank0]:   File "/root/miniforge3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/miniforge3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/miniforge3/envs/py310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/jiqingfe/transformers/src/transformers/models/llama/modeling_llama.py", line 131, in forward
[rank0]:     with torch.autocast(device_type=device_type, enabled=False):
[rank0]:   File "/root/miniforge3/envs/py310/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 230, in __init__
[rank0]:     dtype = torch.get_autocast_dtype(device_type)
[rank0]: RuntimeError: unsupported scalarType

@ArthurZucker
Collaborator

Sounds great! 🤗

@jiqing-feng jiqing-feng marked this pull request as ready for review March 10, 2025 09:16
Signed-off-by: jiqing-feng <[email protected]>
@jiqing-feng
Contributor Author

Sounds great! 🤗

Hi @ArthurZucker. Could you give me more details about the tests? I see tests/tp/test_tp.py is tightly bound to CUDA; could I add a new file named test_tp_cpu.py to test CPU TP and rename test_tp.py to test_tp_cuda.py? Please let me know if you have better ideas. Thanks!

@ArthurZucker
Collaborator

Hey! As you see fit; I'd rather have both in the same file, but make sure you rebase, as I renamed it to test_tensor_parallel.

@jiqing-feng
Contributor Author

Hi @ArthurZucker. I tried to enable CPU tests, but it seems the test script is out of date: I got the error AttributeError: 'LlamaModel' object has no attribute 'tensor_parallel' even when running on CUDA. Would you please clean up the CUDA tests so I can add CPU tests? Thanks!

Signed-off-by: jiqing-feng <[email protected]>
@ArthurZucker
Collaborator

Ah right, we did change the API, feel free to ignore this one! You can even remove it; we need a better one that uses from_pretrained!

Collaborator

@ArthurZucker ArthurZucker left a comment

Should be good to go!

@@ -796,14 +797,15 @@ def _load_state_dict_into_meta_model(
)

if device_mesh is not None: # In this case, the param is already on the correct device!
rank = tensor_device if isinstance(tensor_device, int) else torch.distributed.get_rank()
Collaborator

indeed
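For illustration, the way I read the changed line above (a minimal sketch, assuming the process group is already initialized): a GPU shard passes an integer device index, while on CPU there is no index, so the rank has to come from the process group instead.

import torch.distributed as dist

def resolve_rank(tensor_device):
    # an integer device index (GPU shard) is used as the rank directly;
    # on CPU there is no index, so fall back to the process-group rank
    return tensor_device if isinstance(tensor_device, int) else dist.get_rank()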

@jiqing-feng
Contributor Author

Hi @ArthurZucker. Do I need to make any change before merging?

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks! just missing a small test TBH! 🤗

@ArthurZucker
Collaborator

A test that uses from_pretrained!

@jiqing-feng
Contributor Author

Hi @ArthurZucker . I have added the TP tests by replacing the old test script. Please review it. Thanks!

Signed-off-by: jiqing-feng <[email protected]>
Signed-off-by: jiqing-feng <[email protected]>
Signed-off-by: jiqing-feng <[email protected]>
Signed-off-by: jiqing-feng <[email protected]>
@ArthurZucker
Collaborator

Nice, checked that

"tests/tensor_parallel/test_tensor_parallel.py::TestTensorParallel::test_model_forward": {
        "tests_non_model": "passed"
    },

was run!

@ArthurZucker
Collaborator

Oh the only thing missing is documentation!!!!! Adding this:

OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 torchrun --nnodes=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 --nproc-per-node 1 tp_hf.py &
OMP_NUM_THREADS=56 numactl -C 56-111 -m 1 torchrun --nnodes=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=29500 --nproc-per-node 1 tp_hf.py &
wait

in for example!

@jiqing-feng
Contributor Author

Hi @ArthurZucker. Currently, we haven't verified the performance on Intel CPU, so the performance table will not be released in this PR. I added the CPU instructions to perf_infer_gpu_multi.md. We will have a separate doc once we figure out the best performance on CPU.

@ArthurZucker
Collaborator

Awesome!!! Thanks for the great contribution! 🤗

@ArthurZucker ArthurZucker merged commit 286393f into huggingface:main Mar 31, 2025
16 of 18 checks passed
Signed-off-by: jiqing-feng <[email protected]>
Signed-off-by: jiqing-feng <[email protected]>
zhouksh pushed a commit to zhouksh/transformers that referenced this pull request Mar 31, 2025
* enable tp on CPU

Signed-off-by: jiqing-feng <[email protected]>

* get rank from cpu

Signed-off-by: jiqing-feng <[email protected]>

* update

Signed-off-by: jiqing-feng <[email protected]>

* enable TP tests

Signed-off-by: jiqing-feng <[email protected]>

* fix comment

Signed-off-by: jiqing-feng <[email protected]>

* em print

Signed-off-by: jiqing-feng <[email protected]>

* fix model id

Signed-off-by: jiqing-feng <[email protected]>

* fix conflict

Signed-off-by: jiqing-feng <[email protected]>

* fix index and add doc

Signed-off-by: jiqing-feng <[email protected]>

---------

Signed-off-by: jiqing-feng <[email protected]>
zhouksh pushed a commit to zhouksh/transformers that referenced this pull request Apr 1, 2025