
[Feature] Support New Arguments for Expert Routing Policies. #17

Open
jacklanda opened this issue May 31, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@jacklanda
Contributor

Hi there, thanks for mergoo, an amazing codebase for MoE model construction.

A crucial feature that may need to be implemented is letting the user select the underlying routing policy when constructing the MoE layer.

Specifically, I think the forward method shown here should be refactored to support policy selection (via an argument passed by the user). As far as I can tell, the current code constructs a fully-activated MoE model, not a truly sparse MoE model.
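For illustration, I imagine the user passing something along these lines when constructing the MoE (the argument names here are only a sketch, not the existing API):

```python
# Hypothetical sketch only — these argument names are illustrative, not mergoo's existing API.
moe_routing_args = dict(
    routing_policy="topk",       # e.g. "topk" (sparse), "dense", "sequence_level"
    num_experts_per_tok=2,       # consulted only by token-level top-k routing
)
```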

I would be delighted to share my code for this feature and file a PR for it 🤗.

Would you have any thoughts to share about it?

@gitsailor5
Contributor

Hi @jacklanda,
Thank you for showing great interest in Mergoo.

  1. What do you mean by policy and policy selection w.r.t. MoE? What types of policies would you like to add?
  2. The current code does create a sparse MoE; this line is responsible for the sparse selection. It is inspired by the Mixtral MoE architecture here.

@jacklanda
Contributor Author

jacklanda commented May 31, 2024

Exactly, the code does perform the expert selection, but it seems to force every expert in the self.experts module list to run a forward pass.

In other words, forwarding every expert means it is effectively dense activation.

@gitsailor5
Contributor

Only the top K experts will undergo a forward pass, and this top K can vary for different MoE blocks. This iteration over self.experts is done to optimize the code efficiency. If there are any other tensor optimizations that are faster, we are happy to integrate them.

Here, we create indexes used for the expert forward pass. To summarize, we iterate and select an expert E, then select the token indexes that need to undergo the forward pass of expert E, and perform the forward pass.

  • Sparse: In a sparse configuration, K experts are selected from a pool of N experts, determined by the gating mechanism.
  • Dense: All the experts undergo a forward pass.
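Roughly, the per-block computation follows this pattern (a simplified sketch with illustrative names, modeled on the Mixtral block rather than copied verbatim from our code):

```python
import torch
import torch.nn.functional as F

def sparse_moe_forward(hidden, gate, experts, top_k):
    """hidden: (batch, seq, dim); gate: a Linear(dim, num_experts); experts: an nn.ModuleList."""
    batch, seq, dim = hidden.shape
    x = hidden.view(-1, dim)                                    # (tokens, dim)

    router_logits = gate(x)                                     # (tokens, num_experts)
    probs = F.softmax(router_logits, dim=-1)
    topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)     # both (tokens, top_k)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(x)
    # one-hot over experts, rearranged to (num_experts, top_k, tokens)
    expert_mask = F.one_hot(topk_idx, num_classes=len(experts)).permute(2, 1, 0)

    for e, expert in enumerate(experts):
        k_slot, tok_idx = torch.where(expert_mask[e])           # (slot, token) pairs routed to expert e
        if tok_idx.numel() == 0:
            continue                                            # expert e does no work for this batch
        expert_out = expert(x[tok_idx])                         # forward pass only for the routed tokens
        out.index_add_(0, tok_idx, expert_out * topk_probs[tok_idx, k_slot, None])

    return out.view(batch, seq, dim)
```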

@jacklanda
Contributor Author

jacklanda commented May 31, 2024

> Only the top K experts will undergo a forward pass, and this top K can vary for different MoE blocks. This iteration over self.experts is done to optimize the code efficiency. If there are any other tensor optimizations that are faster, we are happy to integrate them.
>
> Here, we create indexes used for the expert forward pass. To summarize, we iterate and select an expert E, then select the token indexes that need to undergo the forward pass of expert E, and perform the forward pass.
>
> * **Sparse:** In a sparse configuration, K experts are selected from a pool of N experts, determined by the gating mechanism.
> * **Dense:** All the experts undergo a forward pass.

Thanks for your reply!

Will this call cause extra useless computation?

I believe only the selected k experts should call their corresponding FFN modules to compute the output tensor for each input token.

[image]

For comparison, the Mixtral modeling code does the same thing as A and B.

I think it is just a tiny bug in the implementation, not an error in the design :)

@gitsailor5
Contributor

If by "useless computation" you mean extra forward passes, then no, the forward pass will only be done for the indexes that require it. Line 78 is responsible for selecting the batch IDs and token IDs that need the forward pass of a specific expert, so the `inputs` tensor is passed after indexing, not directly.

If by "useless computation" you are referring to preparing the expert mask as shown here, it could be implemented. However, I believe that indexing is not an expensive operation.

@jacklanda
Contributor Author

jacklanda commented May 31, 2024

> If by "useless computation" you mean extra forward passes, then no, the forward pass will only be done for the indexes that require it. Line 78 is responsible for selecting the batch IDs and token IDs that need the forward pass of a specific expert, so the `inputs` tensor is passed after indexing, not directly.
>
> If by "useless computation" you are referring to preparing the expert mask as shown here, it could be implemented. However, I believe that indexing is not an expensive operation.

Note that `expert(inputs[batch_idx, tok_idx])` will call the corresponding dense layer, so this operation does not just "index" but also performs real computation.

Let's break it down (illustrated below):

  1. `inputs[batch_idx, tok_idx]` performs indexing;
  2. `expert(...)` calls the internal `__call__` method and finally the `forward` method to perform the dense computation.
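A tiny self-contained illustration of those two steps (the expert module and the shapes here are just placeholders):

```python
import torch
import torch.nn as nn

hidden_dim = 16
expert = nn.Linear(hidden_dim, hidden_dim)     # stand-in for one expert FFN
inputs = torch.randn(2, 8, hidden_dim)         # (batch, seq, hidden)
batch_idx = torch.tensor([0, 1, 1])            # tokens routed to this expert
tok_idx = torch.tensor([3, 0, 7])

selected = inputs[batch_idx, tok_idx]          # step 1: pure indexing -> shape (3, hidden_dim)
routed_out = expert(selected)                  # step 2: __call__ -> forward -> the actual matmuls
```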

@jacklanda
Contributor Author

> > If by "useless computation" you mean extra forward passes, then no, the forward pass will only be done for the indexes that require it. Line 78 is responsible for selecting the batch IDs and token IDs that need the forward pass of a specific expert, so the `inputs` tensor is passed after indexing, not directly.
> > If by "useless computation" you are referring to preparing the expert mask as shown here, it could be implemented. However, I believe that indexing is not an expensive operation.
>
> Note that `expert(inputs[batch_idx, tok_idx])` will call the corresponding dense layer, so this operation does not just "index" but also performs real computation.
>
> Let's break it down:
>
> 1. `inputs[batch_idx, tok_idx]` performs indexing;
> 2. `expert(...)` calls the internal `__call__` method and finally the `forward` method to perform the dense computation.

Understood.

Initially, I was concerned that some tokens might not need computation from any expert at all. However, in that case the indexed input tensor is simply empty, so it does not cause any useless computation.
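A toy check of that with made-up shapes (an `nn.Linear` stands in for the expert FFN):

```python
import torch
import torch.nn as nn

expert = nn.Linear(16, 16)              # stand-in for an expert FFN
inputs = torch.randn(2, 8, 16)          # (batch, seq, hidden)

# suppose the router sent no tokens to this expert:
batch_idx = torch.empty(0, dtype=torch.long)
tok_idx = torch.empty(0, dtype=torch.long)

selected = inputs[batch_idx, tok_idx]   # shape (0, 16): empty selection
out = expert(selected)                  # shape (0, 16): effectively no computation
```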

Thanks for all your help.

@jacklanda
Contributor Author

jacklanda commented May 31, 2024

To come back to the title of this issue, I think it would also be helpful to allow users to select routing policies dynamically, as they see fit.

On the one hand, for the top-k scenario, users could pass an argument like num_experts_per_tok to control the model's activation behavior, even after the MoE model has been merged with the previously configured arguments.

Currently, users cannot pass num_experts_per_tok dynamically here to decide the number of activated experts after the model merging procedure is done.
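As a rough sketch of what I mean (the attribute names below are assumptions on my part, not mergoo's current API):

```python
# Hypothetical sketch — `config.num_experts_per_tok` and the per-layer `top_k`
# attribute are assumed names, not mergoo's current API.
def set_num_experts_per_tok(model, k: int):
    """Walk the merged model and update the top-k used by every MoE block."""
    model.config.num_experts_per_tok = k       # assumed config attribute
    for module in model.modules():
        if hasattr(module, "top_k"):           # assumed attribute on each MoE gate/layer
            module.top_k = k
```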

@jacklanda jacklanda reopened this May 31, 2024
@jacklanda
Contributor Author

On the other hand, as far as I know, there are many other useful routing policies, such as "sequence-level routing".

Hence, it would be great to support additional policies like that as well.
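For instance, a minimal sketch of sequence-level top-k routing (simplified, illustrative names; one routing decision per sequence instead of per token):

```python
import torch
import torch.nn.functional as F

def sequence_level_route(hidden, gate, experts, top_k):
    """hidden: (batch, seq, dim). One routing decision per sequence, shared by all of its tokens."""
    pooled = hidden.mean(dim=1)                                 # (batch, dim): summarize each sequence
    probs = F.softmax(gate(pooled), dim=-1)                     # (batch, num_experts)
    topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)     # (batch, top_k)
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)

    out = torch.zeros_like(hidden)
    for b in range(hidden.size(0)):
        # every token in sequence b goes through the same top-k experts
        for w, e in zip(topk_probs[b].tolist(), topk_idx[b].tolist()):
            out[b] += w * experts[e](hidden[b])
    return out
```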

@arshadshk arshadshk added the enhancement New feature or request label Nov 12, 2024