[V1] AsyncLLM data parallel #13923
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default; they only run a reduced set of checks. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Signed-off-by: Nick Hill <[email protected]>
How do we test this? I mean, how do we run the server? I think we need two commands, right?
This pull request has merge conflicts that must be resolved before it can be merged.
the DP-related part looks good to me.
cc @robertgshaw2-redhat I'm not familiar with the frontend processing part, maybe Robert can take a look?
@v-lmn No, for a single node you can run a single command, with …
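For anyone trying this out, here is a rough, unverified sketch of what a single-process launch could look like through the Python API; the data_parallel_size engine argument and the exact AsyncLLM entry points are assumptions based on this PR's description rather than confirmed usage. The equivalent server launch would presumably just add the corresponding data-parallel flag to a single serve command.

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM


async def main() -> None:
    # Hypothetical single-node launch: one client process fronting
    # data_parallel_size engine core processes (one per DP rank).
    engine_args = AsyncEngineArgs(
        model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model
        data_parallel_size=2,  # assumed engine arg for this PR
        tensor_parallel_size=1,
    )
    engine = AsyncLLM.from_engine_args(engine_args)

    params = SamplingParams(max_tokens=32)
    async for output in engine.generate("Hello, world", params,
                                        request_id="req-0"):
        if output.finished:
            print(output.outputs[0].text)


asyncio.run(main())
```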
…gine
# Conflicts:
#   vllm/v1/core/scheduler.py
#   vllm/v1/engine/core.py
#   vllm/v1/engine/core_client.py
Signed-off-by: Nick Hill <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Nick Hill <[email protected]>
tests/v1/test_async_llm_dp.py (outdated)
if not current_platform.is_cuda():
    pytest.skip(reason="V1 currently only supported on CUDA.",
                allow_module_level=True)
Not sure if DP works on TPU or AMD GPUs, but should we modify this reason string, since V1 works there at least experimentally?
Lines 1669 to 1675 in d0cfec7
# No support for device type other than CUDA, AMD (experimental) or
# TPU (experimental) so far.
if not (current_platform.is_cuda_alike() or current_platform.is_tpu()):
    _raise_or_fallback(
        feature_name=f"device type={current_platform.device_type}",
        recommend_to_remove=False)
    return False
We could actually use supports_v1 now that this PR has landed (probably only want to turn tests on for CUDA and ROCm though).
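For illustration, a minimal sketch of what the relaxed module-level skip could look like if the tests are only turned on for CUDA/ROCm as suggested (the reason string wording is a placeholder, not the PR's final code):

```python
import pytest

from vllm.platforms import current_platform

# Only exercise the DP tests on CUDA/ROCm for now; other platforms
# haven't been validated with data parallelism yet.
if not current_platform.is_cuda_alike():
    pytest.skip(reason="DP tests currently only run on CUDA/ROCm.",
                allow_module_level=True)
```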
tests/v1/test_async_llm_dp.py (outdated)
@pytest.mark.asyncio
async def test_load(monkeypatch, output_kind: RequestOutputKind):
    with monkeypatch.context() as m, ExitStack() as after:
        m.setenv("VLLM_USE_V1", "1")
Remove this now that V1 is on by default?
            outputs=outputs,
            scheduler_stats=self.make_stats(),
        )
        if self.include_finished_set:
+1
This pull request has merge conflicts that must be resolved before it can be merged.
Left a few minor comments. The PR still looks very good, and it would be great to get it landed soon.
Signed-off-by: Nick Hill <[email protected]>
# Conflicts:
#   examples/offline_inference/data_parallel.py
Signed-off-by: Nick Hill <[email protected]>
Thanks @tlrmchlsmth! Have addressed those comments. Also had to make some additional adjustments to ensure compatibility with @youkaichao's offline multi-node scenario added in #15484.
Signed-off-by: Nick Hill <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Nick Hill <[email protected]>
# Conflicts:
#   vllm/v1/core/sched/scheduler.py
local_dp_rank = vllm_config.parallel_config.data_parallel_rank_local

assert dp_size > 1
assert 0 <= local_dp_rank <= dp_rank < dp_size
why do we need this check?
It's not strictly needed, I just thought it might be good here to verify that the config is in a coherent state.
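As a concrete illustration of the invariant being asserted (the numbers below are made up): with 4 DP ranks spread over 2 nodes, global rank 3 has local rank 1, so 0 <= 1 <= 3 < 4 holds.

```python
# Illustrative only: relation between local and global DP ranks on a
# hypothetical 2-node deployment with 2 DP ranks per node (dp_size == 4).
dp_size = 4
ranks_per_node = 2
for dp_rank in range(dp_size):
    local_dp_rank = dp_rank % ranks_per_node
    assert 0 <= local_dp_rank <= dp_rank < dp_size  # the asserted invariant
```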
from vllm.platforms import current_platform
if current_platform.is_cuda_alike():
    from vllm.platforms.cuda import device_id_to_physical_device_id
    tp_size = vllm_config.parallel_config.tensor_parallel_size
You can use world_size to be general, not just tp_size.
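A sketch of the suggested generalization; the helper name and the device-offset arithmetic are hypothetical, but it shows world_size (TP x PP) replacing tensor_parallel_size when sizing a DP rank's device slice:

```python
from vllm.platforms import current_platform


def dp_rank_device_ids(vllm_config, local_dp_rank: int) -> list[int]:
    """Hypothetical helper: physical device ids owned by one DP rank,
    sized by world_size (TP x PP) rather than tensor_parallel_size alone."""
    world_size = vllm_config.parallel_config.world_size
    first = local_dp_rank * world_size
    if current_platform.is_cuda_alike():
        from vllm.platforms.cuda import device_id_to_physical_device_id
        return [device_id_to_physical_device_id(first + i)
                for i in range(world_size)]
    return list(range(first, first + world_size))
```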
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: xinyuxiao <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
The engine core client starts an engine core proc per DP rank and load-balances requests between them. A dummy request is sent to idle ranks when the global request count goes from 0 to 1, and when each engine finishes all of its requests it continues in an idle forward loop.
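A highly simplified, illustrative sketch of that behavior (the class names, the round-robin policy standing in for the real load balancing, and the counters are all made up for illustration):

```python
import itertools
from dataclasses import dataclass


@dataclass
class CoreEngineHandle:
    """Illustrative stand-in for one engine core process (one per DP rank)."""
    dp_rank: int
    num_requests: int = 0

    def send(self, request) -> None:
        # A real implementation would push the request over IPC to the
        # engine core process; here we just count it.
        self.num_requests += 1

    def send_dummy(self) -> None:
        # Nudge an idle rank so it joins the synchronized forward loop.
        pass


class DPClientSketch:
    """Toy model of the client: spread requests over DP ranks and send
    dummy requests to idle ranks when the global count goes 0 -> 1."""

    def __init__(self, dp_size: int):
        self.engines = [CoreEngineHandle(i) for i in range(dp_size)]
        self._next = itertools.cycle(self.engines)  # round-robin stand-in
        self.global_reqs = 0

    def add_request(self, request) -> None:
        target = next(self._next)
        if self.global_reqs == 0:
            # Global request count is about to go from 0 to 1: wake the
            # other, idle ranks so every engine steps its forward loop.
            for engine in self.engines:
                if engine is not target:
                    engine.send_dummy()
        target.send(request)
        self.global_reqs += 1
```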
Working for single node:
I aimed to keep the data-parallel logic as isolated as possible (in subclasses of the core engine and client) to avoid adding complexity/overhead to the more common default dp=1 case.
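One way to picture that isolation (class names are illustrative; the actual classes in the PR may be named and structured differently):

```python
class EngineCoreSketch:
    """Default dp=1 path: no DP coordination and no added overhead."""

    def step(self) -> None:
        ...  # schedule and execute one batch


class DPEngineCoreSketch(EngineCoreSketch):
    """All DP-specific behavior lives only in this subclass."""

    def step(self) -> None:
        super().step()
        self._maybe_idle_forward()  # stay in step with the other DP ranks

    def _maybe_idle_forward(self) -> None:
        # Run a dummy forward pass when this rank has no requests but
        # other ranks are still busy.
        ...
```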
Follow-on after this PR: