enable tp on CPU #36299
Conversation
Is there a reason we want to support TP on CPU? I assumed it would mainly be useful for multi-GPU nodes.
Signed-off-by: jiqing-feng <[email protected]>
Intel Xeon CPUs have multiple NUMA nodes, which means we can run a TP model with each shard on its own NUMA node. This change enables that functionality. Besides, we should always make sure that the CPU device cannot be assigned an index.
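For example, a two-socket Xeon machine typically exposes each socket as its own NUMA node, so a 2-way TP run can pin one model shard to each socket's cores and local memory; the numactl launch command further down in this thread does exactly that.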
In that case this change makes sense to me, but maybe we should just raise an error saying that TP on CPU is not supported yet, rather than silently setting the index.
Actually, the TP functionality is ready on CPU; just run the following script (the launch command is posted further down in the thread):

```python
import os
import time

import torch
import torch.distributed as dist
from transformers import AutoModel, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"


def main(is_tp, rank, world_size) -> None:
    backend = "ccl"
    print(is_tp)
    if is_tp:
        dist.init_process_group(backend)

    model_kwargs = dict(torch_dtype=torch.bfloat16)
    if is_tp:
        model_kwargs["tp_plan"] = "auto"
    else:
        model_kwargs["device_map"] = "cpu"

    # Retrieve the (tensor-parallel) model
    model = AutoModel.from_pretrained(model_id, **model_kwargs)
    print(model.dtype)

    # Prepare input tokens
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    prompt = "Can I help" * 200
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).input_ids.to(model.device)
    print(f"input shape is {inputs.shape}")

    # model = torch.compile(model)

    # warm-up (barriers only make sense when a process group exists)
    if is_tp:
        dist.barrier()
    for i in range(5):
        outputs = model(inputs)
    if is_tp:
        dist.barrier()

    for i in range(5):
        with torch.no_grad():
            start = time.time()
            outputs = model(inputs)
            end = time.time()
            print(f"time cost {(end - start) * 1000} ms")
    print(outputs)


if __name__ == "__main__":
    rank = int(os.environ["RANK"]) if "RANK" in os.environ else 0
    world_size = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
    is_tp = "RANK" in os.environ
    main(is_tp, rank, world_size)
```
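A note on how the script above behaves: when RANK is absent from the environment, it falls back to a plain single-process run with device_map="cpu"; under torchrun, each rank initializes the oneCCL ("ccl") process group and loads its shard of the model via tp_plan="auto". The launch command the author used is posted further down in the thread.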
Hey @jiqing-feng! The TP code is going to change quite a bit in the near future as we work to improve loading efficiency, so it would be best to put this issue on hold for now and revisit afterwards 🤗
OK, but I suppose this change is small enough not to impact the refactor. It's okay to wait for your refactor.
Hi @SunMarc @Rocketknight1 @Cyrilvallez. As this change is really tiny, and the logic that the CPU device cannot be assigned an index is reasonable, could we merge this PR? We will optimize TP performance on CPU in our next step.
Yeah, I think we can merge this without impacting the refactor cc @Cyrilvallez
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Super nice! It's missing:
- some doc about the feature (I did not know we could do this so easily on CPU!)
- a simple fast test to make sure this is not broken in the future
- rebasing, as "Update from_pretrained to make TP a first class citizen" #36335 was merged!

Otherwise, most welcome 🤗
Hi @ArthurZucker ,
Converting to draft because of the new regression:
Sounds great! 🤗
Signed-off-by: jiqing-feng <[email protected]>
Hi @ArthurZucker . Could you give me more details about the tests? I see
Hey! As you see fit; I'd rather have both in the same file, but make sure you rebase. I renamed it to
Hi @ArthurZucker . I tried to enable the CPU tests, but it seems the test script is out of date, because I got an error.
Signed-off-by: jiqing-feng <[email protected]>
Ah right, we did change the API, feel free to ignore this one! You can even remove it; we need a better one that uses from_pretrained!
Should be good to go!
src/transformers/modeling_utils.py (outdated)

```diff
@@ -796,14 +797,15 @@ def _load_state_dict_into_meta_model(
     )

     if device_mesh is not None:  # In this case, the param is already on the correct device!
         rank = tensor_device if isinstance(tensor_device, int) else torch.distributed.get_rank()
```
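The line under review is the heart of the CPU enablement: on accelerators, the shard's target device carries an integer index that doubles as the TP rank, but a CPU device has no index, so the rank must come from the process group instead. A minimal sketch of that fallback, assuming an initialized process group (names here are illustrative, not the exact transformers internals):

```python
import torch.distributed as dist

def rank_for_shard(tensor_device) -> int:
    # On GPU, `tensor_device` is an int such as 0 or 1, which
    # doubles as the TP rank of the shard being loaded.
    if isinstance(tensor_device, int):
        return tensor_device
    # On CPU there is no device index, so ask the process group
    # (requires dist.init_process_group to have been called).
    return dist.get_rank()
```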
indeed
Hi @ArthurZucker . Do I need any change before merging?
Thanks! Just missing a small test TBH! 🤗 A test that uses from_pretrained!
Hi @ArthurZucker . I have added the TP tests by replacing the old test script. Please review it. Thanks!
Signed-off-by: jiqing-feng <[email protected]>
Nice, checked that the test was run!
Oh, the only thing missing is documentation!!!!! Adding this:

```bash
OMP_NUM_THREADS=56 numactl -C 0-55 -m 0 torchrun --nnodes=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 --nproc-per-node 1 tp_hf.py &
OMP_NUM_THREADS=56 numactl -C 56-111 -m 1 torchrun --nnodes=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=29500 --nproc-per-node 1 tp_hf.py &
wait
```

in the docs, for example!
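For readers following the command: it launches two independent single-process torchrun instances, one per rank, with numactl -C pinning each rank to one socket's cores (0-55 and 56-111) and -m binding its memory allocations to that socket's local NUMA node; OMP_NUM_THREADS matches the per-socket core count, and the trailing wait keeps the shell attached until both ranks finish.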
Hi @ArthurZucker . Currently we haven't verified the performance on Intel CPUs, so the performance table will not be released in this PR. I added the CPU instructions to the docs.
Awesome!!! Thanks for the great contribution! 🤗
Signed-off-by: jiqing-feng <[email protected]>
* enable tp on CPU
* get rank from cpu
* update
* enable TP tests
* fix comment
* em print
* fix model id
* fix conflict
* fix index and add doc

---------

Signed-off-by: jiqing-feng <[email protected]>
The CPU device cannot take an index: if we pass an index for the CPU device, the check will never pass.
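A minimal standalone sketch of why that is (plain PyTorch, independent of transformers): CPU tensors only accept device index -1 or 0, so deriving a per-rank device such as cpu:1 the way one would derive cuda:1 fails immediately.

```python
import torch

torch.zeros(2, device="cpu")    # fine: no index
torch.zeros(2, device="cpu:0")  # fine: 0 is the only valid CPU index

try:
    # rank-style indexing, as one would do with "cuda:1", is rejected
    torch.zeros(2, device="cpu:1")
except RuntimeError as e:
    print(e)  # e.g. "CPU device index must be -1 or zero, got 1"
```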