
[Feature] Support New Arguments for Expert Routing Policies. #17

Open
jacklanda opened this issue May 31, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@jacklanda
Contributor

Hi there, thanks for mergoo, an amazing codebase for MoE model construction.

A crucial feature that may need to be implemented is letting the user select the underlying routing policy when constructing the MoE layer.

Specifically, I think the forward method shown here should be refactored to support policy selection (via an argument passed by the user). As far as I can tell, the current code constructs a fully-activated MoE model, not a truly sparse MoE model.
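For illustration, I imagine the user passing something along these lines when constructing the MoE (the argument names here are only a sketch, not the existing API):

```python
# Hypothetical sketch only — these argument names are illustrative, not mergoo's existing API.
moe_routing_args = dict(
    routing_policy="topk",       # e.g. "topk" (sparse), "dense", "sequence_level"
    num_experts_per_tok=2,       # consulted only by token-level top-k routing
)
```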

I would be delighted to share my code for this feature and file a PR for it 🤗.

Would you have any thoughts to share about it?

@gitsailor5
Contributor

Hi @jacklanda,
Thank you for showing great interest in Mergoo.

  1. What do you mean by policy and policy selection w.r.t. MoE? What types of policies would you like to add?
  2. The current code does create a sparse MoE; this line is responsible for the sparse selection. It is inspired by the Mixtral MoE architecture here.

@jacklanda
Contributor Author

jacklanda commented May 31, 2024

Exactly, the code does perform the expert selection, but it seems to force every expert in the self.experts module list to run a forward pass.

In other words, forwarding every expert means it is effectively dense activation.

@gitsailor5
Contributor

Only the top K experts will undergo a forward pass, and this top K can vary for different MoE blocks. This iteration over self.experts is done to optimize the code efficiency. If there are any other tensor optimizations that are faster, we are happy to integrate them.

Here, we create indexes used for the expert forward pass. To summarize, we iterate and select an expert E, then select the token indexes that need to undergo the forward pass of expert E, and perform the forward pass.

  • Sparse: In a sparse configuration, K experts are selected from a pool of N experts, determined by the gating mechanism.
  • Dense: All the experts undergo a forward pass.
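Roughly, the per-block computation follows this pattern (a simplified sketch with illustrative names, modeled on the Mixtral block rather than copied verbatim from our code):

```python
import torch
import torch.nn.functional as F

def sparse_moe_forward(hidden, gate, experts, top_k):
    """hidden: (batch, seq, dim); gate: a Linear(dim, num_experts); experts: an nn.ModuleList."""
    batch, seq, dim = hidden.shape
    x = hidden.view(-1, dim)                                    # (tokens, dim)

    router_logits = gate(x)                                     # (tokens, num_experts)
    probs = F.softmax(router_logits, dim=-1)
    topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)     # both (tokens, top_k)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(x)
    # one-hot over experts, rearranged to (num_experts, top_k, tokens)
    expert_mask = F.one_hot(topk_idx, num_classes=len(experts)).permute(2, 1, 0)

    for e, expert in enumerate(experts):
        k_slot, tok_idx = torch.where(expert_mask[e])           # (slot, token) pairs routed to expert e
        if tok_idx.numel() == 0:
            continue                                            # expert e does no work for this batch
        expert_out = expert(x[tok_idx])                         # forward pass only for the routed tokens
        out.index_add_(0, tok_idx, expert_out * topk_probs[tok_idx, k_slot, None])

    return out.view(batch, seq, dim)
```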

@jacklanda
Contributor Author

jacklanda commented May 31, 2024

> Only the top K experts will undergo a forward pass, and this top K can vary for different MoE blocks. This iteration over self.experts is done to optimize the code efficiency. If there are any other tensor optimizations that are faster, we are happy to integrate them.
>
> Here, we create indexes used for the expert forward pass. To summarize, we iterate and select an expert E, then select the token indexes that need to undergo the forward pass of expert E, and perform the forward pass.
>
> * **Sparse:** In a sparse configuration, K experts are selected from a pool of N experts, determined by the gating mechanism.
> * **Dense:** All the experts undergo a forward pass.

Thanks for your reply!

Will this call cause extra useless computation?

I believe only the selected k experts should call their corresponding FFN modules to compute the output tensor for each input token.

[image]

For comparison, the Mixtral modeling code does the same thing as A and B.

I think it is just a tiny bug in the implementation, not an error in the design :)

@gitsailor5
Contributor

If by "useless computation" you mean extra forward passes, then no, the forward pass will only be done for the indexes that require it. Line 78 is responsible for selecting the batch IDs and token IDs that need the forward pass of a specific expert, so the `inputs` tensor is passed after indexing, not directly.

If by "useless computation" you are referring to preparing the expert mask as shown here, it could be implemented. However, I believe that indexing is not an expensive operation.

@jacklanda
Contributor Author

jacklanda commented May 31, 2024

> If by "useless computation" you mean extra forward passes, then no, the forward pass will only be done for the indexes that require it. Line 78 is responsible for selecting the batch IDs and token IDs that need the forward pass of a specific expert, so the `inputs` tensor is passed after indexing, not directly.
>
> If by "useless computation" you are referring to preparing the expert mask as shown here, it could be implemented. However, I believe that indexing is not an expensive operation.

Note that `expert(inputs[batch_idx, tok_idx])` will call the corresponding dense layer, so this operation does not just "index" but also performs real computation.

Let's break it down (illustrated below):

  1. `inputs[batch_idx, tok_idx]` performs indexing;
  2. `expert(...)` calls the internal `__call__` method and finally the `forward` method to perform the dense computation.
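A tiny self-contained illustration of those two steps (the expert module and the shapes here are just placeholders):

```python
import torch
import torch.nn as nn

hidden_dim = 16
expert = nn.Linear(hidden_dim, hidden_dim)     # stand-in for one expert FFN
inputs = torch.randn(2, 8, hidden_dim)         # (batch, seq, hidden)
batch_idx = torch.tensor([0, 1, 1])            # tokens routed to this expert
tok_idx = torch.tensor([3, 0, 7])

selected = inputs[batch_idx, tok_idx]          # step 1: pure indexing -> shape (3, hidden_dim)
routed_out = expert(selected)                  # step 2: __call__ -> forward -> the actual matmuls
```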

@jacklanda
Contributor Author

> > If by "useless computation" you mean extra forward passes, then no, the forward pass will only be done for the indexes that require it. Line 78 is responsible for selecting the batch IDs and token IDs that need the forward pass of a specific expert, so the `inputs` tensor is passed after indexing, not directly.
> > If by "useless computation" you are referring to preparing the expert mask as shown here, it could be implemented. However, I believe that indexing is not an expensive operation.
>
> Note that `expert(inputs[batch_idx, tok_idx])` will call the corresponding dense layer, so this operation does not just "index" but also performs real computation.
>
> Let's break it down:
>
> 1. `inputs[batch_idx, tok_idx]` performs indexing;
> 2. `expert(...)` calls the internal `__call__` method and finally the `forward` method to perform the dense computation.

Understood.

Initially, I was concerned that some tokens might not need computation from any expert at all. However, in that case the indexed input tensor is simply empty, so it does not cause any useless computation.
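A toy check of that with made-up shapes (an `nn.Linear` stands in for the expert FFN):

```python
import torch
import torch.nn as nn

expert = nn.Linear(16, 16)              # stand-in for an expert FFN
inputs = torch.randn(2, 8, 16)          # (batch, seq, hidden)

# suppose the router sent no tokens to this expert:
batch_idx = torch.empty(0, dtype=torch.long)
tok_idx = torch.empty(0, dtype=torch.long)

selected = inputs[batch_idx, tok_idx]   # shape (0, 16): empty selection
out = expert(selected)                  # shape (0, 16): effectively no computation
```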

Thanks for all your help.

@jacklanda
Contributor Author

jacklanda commented May 31, 2024

To come back to the title of this issue, I think it would also be helpful to allow users to select routing policies dynamically, as they see fit.

On the one hand, for the top-k scenario, users could pass an argument like num_experts_per_tok to control the model's activation behavior, even after the MoE model has been merged with the previously configured arguments.

Currently, users cannot pass num_experts_per_tok dynamically here to decide the number of activated experts after the model merging procedure is done.
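As a rough sketch of what I mean (the attribute names below are assumptions on my part, not mergoo's current API):

```python
# Hypothetical sketch — `config.num_experts_per_tok` and the per-layer `top_k`
# attribute are assumed names, not mergoo's current API.
def set_num_experts_per_tok(model, k: int):
    """Walk the merged model and update the top-k used by every MoE block."""
    model.config.num_experts_per_tok = k       # assumed config attribute
    for module in model.modules():
        if hasattr(module, "top_k"):           # assumed attribute on each MoE gate/layer
            module.top_k = k
```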

@jacklanda jacklanda reopened this May 31, 2024
@jacklanda
Contributor Author

On the other hand, as far as I know, there are many other useful routing policies, such as "sequence-level routing".

Hence, it would be great to support additional policies like that as well.
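For instance, a minimal sketch of sequence-level top-k routing (simplified, illustrative names; one routing decision per sequence instead of per token):

```python
import torch
import torch.nn.functional as F

def sequence_level_route(hidden, gate, experts, top_k):
    """hidden: (batch, seq, dim). One routing decision per sequence, shared by all of its tokens."""
    pooled = hidden.mean(dim=1)                                 # (batch, dim): summarize each sequence
    probs = F.softmax(gate(pooled), dim=-1)                     # (batch, num_experts)
    topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)     # (batch, top_k)
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)

    out = torch.zeros_like(hidden)
    for b in range(hidden.size(0)):
        # every token in sequence b goes through the same top-k experts
        for w, e in zip(topk_probs[b].tolist(), topk_idx[b].tolist()):
            out[b] += w * experts[e](hidden[b])
    return out
```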

@arshadshk arshadshk added the enhancement New feature or request label Nov 12, 2024