Add sleep mode feature for Ascend NPU #416
Merged
What this PR does / why we need it?
This PR adds a sleep mode feature to vllm-ascend. When the engine sleeps, we mainly do two things:
RLHF tools (such as https://github.com/volcengine/verl and https://github.com/OpenRLHF/OpenRLHF) rely heavily on sleep mode to accelerate the training process.
This PR may solve #375 and #320.
Does this PR introduce any user-facing change?
No existing user interfaces changed.
Users will have two new methods, sleep() and wake_up(), to use.
How was this patch tested?
This PR is tested with Qwen/Qwen2.5-0.5B-Instruct.
At first, we have free NPU memory M1.
After llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True) is executed, we have free NPU memory M2, with M2 < M1.
Then we call llm.sleep(level=1) and have free NPU memory M3, with M3 > M2 and M3 very close to M1.
In addition, the output tokens are the same before sleep and after wake-up, using SamplingParams(temperature=0, max_tokens=10) and the same input tokens.
This PR uses the CMake procedure of #371, thanks a lot.
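The test procedure described above could be sketched roughly as follows. This is an illustrative sketch, not the test code shipped with this PR: the helper name check_sleep_cycle, the prompt text, and the returned dictionary keys are hypothetical, and main() assumes an Ascend NPU environment where torch_npu exposes torch.npu.mem_get_info() returning (free_bytes, total_bytes). The helper takes the memory probe as a parameter so the logic can be exercised without NPU hardware.

```python
from typing import Any, Callable, Dict


def check_sleep_cycle(llm: Any, prompt: str, sampling_params: Any,
                      free_memory: Callable[[], int], m1: int) -> Dict[str, Any]:
    """Run generate -> sleep -> wake_up -> generate and collect the M1/M2/M3 checks.

    `m1` is the free memory measured before the model was loaded.
    `free_memory` is any callable returning the currently free bytes.
    """
    m2 = free_memory()  # free memory after the model is loaded
    tokens_before = list(llm.generate(prompt, sampling_params)[0].outputs[0].token_ids)

    llm.sleep(level=1)
    m3 = free_memory()  # free memory while the engine is asleep

    llm.wake_up()
    tokens_after = list(llm.generate(prompt, sampling_params)[0].outputs[0].token_ids)

    return {
        "m2_below_m1": m2 < m1,        # loading the model consumed memory
        "m3_above_m2": m3 > m2,        # sleeping released memory
        "m1_minus_m3": m1 - m3,        # how close M3 gets back to M1
        "same_tokens": tokens_before == tokens_after,
    }


def main() -> None:
    # Requires an Ascend NPU machine with vllm, vllm-ascend and torch_npu installed.
    import torch
    import torch_npu  # noqa: F401  (assumed to register the torch.npu namespace)
    from vllm import LLM, SamplingParams

    def free_npu_memory() -> int:
        # Assumption: mem_get_info() returns (free_bytes, total_bytes).
        return torch.npu.mem_get_info()[0]

    m1 = free_npu_memory()
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
    params = SamplingParams(temperature=0, max_tokens=10)
    # The prompt is an arbitrary placeholder; any fixed input works.
    print(check_sleep_cycle(llm, "How are you?", params, free_npu_memory, m1))
```

Calling main() on an NPU host should print the four checks; how small m1_minus_m3 needs to be to count as "very close" is left to the reader, since the PR description does not give a threshold.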