Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronize vLLM flags to support cross-node inference #897

Open
wants to merge 3 commits into
base: habana_main
Choose a base branch
from

Conversation

IT-Forrest
Copy link

@IT-Forrest IT-Forrest commented Mar 7, 2025

synchronize 12 vLLM flags to non-driver workers in Ray executor

FIX "not warmed-up" bucket issue in cross-node vLLM inference.

Root cause: the issue is caused by not synchronizing the 12 vLLM flags to all the non-driver workers within the Ray cluster

image

synchronize 12 vLLM flags to non-driver workers in Ray executor
synchronize 12 vLLM flags to non-driver workers in Ray executor
@IT-Forrest IT-Forrest force-pushed the IT-Forrest-patch-1 branch from 8ec323f to 24480c3 Compare March 8, 2025 07:13
@@ -198,11 +198,34 @@ def _init_workers_ray(self, placement_group: "PlacementGroup",
)(RayWorkerWrapper).remote(vllm_config=self.vllm_config,
rpc_rank=rank)
else:

def retain_envs(var_name):
retain_var_list = [

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add hpu specific to existing platform specific list in line 344

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @michalkuligowski, thanks for the comment. Let me know if I understand you correctly, do you want me (1) to add platform-specific comments in the commit, such as if current_platform.is_hpu() or (2) to put the fix from line 201 to line 344?

  • As for (1), this change is not HPU platform-specific, but needed for the else branch of if current_platform.ray_device_key == "GPU": at line 191. Therefore, no comment is necessary

  • As for (2), this fix should be done while the driver worker flow is creating the non-driver workers. I have tried to put the change to line 344, however, those 12 flags are not propagated to the cross node. That is the reason to keep the commit at line 191

Let me know if I get your point or provide me further feedback.

Copy link

@michalkuligowski michalkuligowski Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am basing my previous comment on "Set environment variables for the driver and workers." in line 335, and then structure that contains env vars is used to call method "update_environment_variables" in line 357. Does that call has no effect? Maybe need to overraide base class method -

def update_environment_variables(self, envs_list: List[Dict[str,
?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants