Benchmarks for Metis #10

Closed
YuMJie opened this issue Nov 9, 2024 · 11 comments

YuMJie commented Nov 9, 2024

"Metis: Fast Automatic Distributed Training on Heterogeneous GPUs" is excellent work; however, I have a couple of questions about the code:

  1. Why are the execution_memory values in the configuration files the same for different micro-batch sizes?
  2. Can you provide the profile files for different devices (e.g., RTX 3090)?
  3. I see that the profile file format described in the README.md includes activation memory, but this is not used in the code.

Could you provide the benchmarks for Metis?
Thank you!

@mgong-kang mgong-kang self-assigned this Nov 12, 2024
@mgong-kang
Collaborator

Thank you for your interest in this project.

  1. As you mentioned, memory usage varies depending on the micro-batch size. The files in profile_data_samples are based on a micro-batch size of 1. It appears there was an error in copying the sample data, resulting in incorrect values. I will update this with the correct values.

  2. While we do not have precise measurement data for the exact model provided in the sample, we will add some reference data that may be helpful. Additionally, as an alternative, you might regenerate the existing data so that it reflects approximate device performance.

  3. To measure pipeline communication costs more accurately, it is recommended to profile the activation size in advance and use it. The Metis code ships a GPT model and includes code for calculating its activation size, which is what the calculations use (a rough sketch of this kind of estimate is shown below).
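
For reference, here is a minimal sketch of the kind of inter-stage activation-size estimate mentioned in item 3, assuming the tensor exchanged between pipeline stages is a single (batch, seq, hidden) hidden state. The function name and formula are illustrative assumptions, not the actual Metis implementation.

```python
# Illustrative sketch, not Metis code: approximate size of the activation
# tensor sent between pipeline stages of a GPT-style model.

def gpt_boundary_activation_bytes(micro_batch_size: int,
                                  sequence_length: int,
                                  hidden_size: int,
                                  dtype_bytes: int = 2) -> int:
    """One (batch, seq, hidden) hidden-state tensor per micro-batch."""
    return micro_batch_size * sequence_length * hidden_size * dtype_bytes


if __name__ == "__main__":
    # Example: micro-batch 1, sequence length 2048, hidden size 4096, FP16.
    size = gpt_boundary_activation_bytes(1, 2048, 4096, dtype_bytes=2)
    print(f"{size / 2**20:.1f} MiB per micro-batch")  # 16.0 MiB
```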


YuMJie commented Nov 13, 2024

Thank you for your reply, but I have some questions:

  1. I have seen the code that calculates the activation size, but I cannot find any code that uses the "activation_parameters_bytes" item from the profile file.
  2. Could you provide the code for executing the config that Metis generates?
  3. The "parameters" item in the profile file means the size of the model, but with mixed-precision training there are two sets of model weights with different sizes; which weight size should be written?

Thank you!

@mgong-kang
Collaborator

  1. activation_parameter_bytes is not currently used in the code. It is a field reserved for models where calculating the activation size analytically is challenging.
  2. Are you referring to the code for generating the profile? If so, there is a generation guide in the README.md, and I kindly ask for your understanding as we cannot provide the code.
  3. Since it is important for the communication cost to reflect the actual weight size of the model, it is appropriate to use the FP32 weight size when measuring communication cost, even if some weights are converted to FP16 for computational efficiency in mixed-precision training (see the small arithmetic sketch below).
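
To make item 3 concrete, here is a small worked example, assuming the "parameters" field stores the model weight size in bytes; the helper function and the 1.3B-parameter figure are illustrative, not taken from Metis.

```python
# Illustrative sketch, not Metis code: weight size in bytes for FP32 master
# weights versus an FP16 copy, to show which value item 3 recommends.

def parameter_bytes(num_parameters: int, bytes_per_param: int) -> int:
    """Total weight size in bytes (4 bytes/param for FP32, 2 for FP16)."""
    return num_parameters * bytes_per_param


if __name__ == "__main__":
    n = 1_300_000_000  # e.g. a 1.3B-parameter GPT model
    print(f"FP32 weights: {parameter_bytes(n, 4) / 2**30:.2f} GiB")  # 4.84 GiB
    print(f"FP16 copy:    {parameter_bytes(n, 2) / 2**30:.2f} GiB")  # 2.42 GiB
```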

Thank you!


YuMJie commented Nov 13, 2024

I got it!
Thank you for your reply!
I will close this issue.

@YuMJie YuMJie closed this as completed Nov 13, 2024

YuMJie commented Nov 15, 2024

Hi, @mgong-kang
I notice that you implemented heterogeneity-aware data-parallel load balancing, but I cannot see any code or output results for it. Could you provide some information?
Also, could you explain how to use the Metis config to run Alpa?
Thank you!

@mgong-kang
Collaborator

The DataLoadBalancer is implemented at the following path:
https://github.com/SamsungLabs/Metis/blob/main/model/load_balancer.py#L147
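
For readers unfamiliar with the idea, here is an illustrative sketch of heterogeneity-aware data-parallel load balancing: a global batch is split across workers in proportion to their measured throughput. All names here are hypothetical; this is not the actual DataLoadBalancer implementation linked above.

```python
# Illustrative sketch, not the Metis DataLoadBalancer: assign per-worker batch
# sizes roughly proportional to each worker's throughput (samples/sec).

from typing import List


def split_batch_by_throughput(global_batch_size: int,
                              throughputs: List[float]) -> List[int]:
    """Return per-worker batch sizes that sum to global_batch_size."""
    total = sum(throughputs)
    shares = [global_batch_size * t / total for t in throughputs]
    sizes = [int(s) for s in shares]
    # Hand out the leftover samples to the workers with the largest
    # fractional remainders so the total stays exact.
    leftover = global_batch_size - sum(sizes)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    # Example: within one stage, one GPU about twice as fast as the other.
    print(split_batch_by_throughput(48, [2.0, 1.0]))  # [32, 16]
```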


mgong-kang commented Nov 15, 2024

@goeunee326
I would appreciate it if you could provide guidance on how to execute the Metis results in Alpa.


YuMJie commented Nov 15, 2024

The DataLoadBalancer is implemented at the following path: https://github.com/SamsungLabs/Metis/blob/main/model/load_balancer.py#L147

Thank you for your help, but it seems that the output of the Metis strategy does not reflect the per-GPU batch sizes when data parallelism is used across different GPUs.

What is more, I found that some identical strategies have different costs.

[screenshot attached]

@YuMJie YuMJie reopened this Nov 15, 2024

mgong-kang commented Nov 15, 2024

@YuMJie

  1. Data parallelism occurs when heterogeneous GPUs are allocated within a stage.

  2. If you could send the profile data you've worked on, I'll take a look.

Thank you.

@goeunee326
Collaborator

You can execute the process by modifying a specific part of the Alpa benchmark code.
https://github.com/alpa-projects/alpa/blob/main/benchmark/alpa/suite_auto_gpt.py#L31-L44

To run the results from Metis in Alpa, parameter mapping is required:

  • layer_partition -> forward_stage_layer_ids
  • device_group -> submesh_physical_shapes
  • strategies -> submesh_logical_shapes
  • node_sequence -> submesh_to_hosts (a concept not currently present in Alpa)

Since the concept of submesh_to_hosts does not exist in Alpa, you will need to add it by modifying Alpa's internal code, and make sure the GPU placement matches Metis's node_sequence by checking your Ray cluster status. With these adjustments, the run should work as expected.
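
To make the mapping above concrete, here is a rough sketch of how one might rename the Metis output fields to the Alpa-side names. The metis_result dict, the helper function, and the example values are all hypothetical; where these values are plugged into suite_auto_gpt.py, and the extra submesh_to_hosts handling, still have to be wired up by hand as described above.

```python
# Illustrative sketch only: rename Metis result fields to the Alpa-side names
# listed in the mapping above. Not an actual Alpa or Metis API.

def to_alpa_stage_config(metis_result: dict) -> dict:
    return {
        "forward_stage_layer_ids": metis_result["layer_partition"],
        "submesh_physical_shapes": metis_result["device_group"],
        "submesh_logical_shapes": metis_result["strategies"],
        # Not an existing Alpa concept; requires modifying Alpa internals.
        "submesh_to_hosts": metis_result["node_sequence"],
    }


if __name__ == "__main__":
    # Hypothetical two-stage plan: four layers per stage, one 1x2 mesh each.
    metis_result = {
        "layer_partition": [[0, 1, 2, 3], [4, 5, 6, 7]],
        "device_group": [(1, 2), (1, 2)],
        "strategies": [(1, 2), (2, 1)],
        "node_sequence": ["node-a", "node-b"],
    }
    print(to_alpa_stage_config(metis_result))
```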

@mgong-kang
Collaborator

If there are no further points to address, we will proceed to close this issue. Please feel free to reopen it at any time if further discussion or inquiries are needed, and do not hesitate to share any additional comments or questions.
