
Add Qwen Moe #2163


Open
wants to merge 23 commits into base: master

Conversation

@heyyanshuman (Collaborator) commented Mar 23, 2025

This PR adds the Qwen Mixture-of-Experts (MoE) model to KerasHub.

Hugging Face reference: link

@heyyanshuman heyyanshuman self-assigned this Mar 29, 2025
@heyyanshuman heyyanshuman marked this pull request as ready for review March 29, 2025 05:06
@mattdangerw mattdangerw removed the request for review from divyashreepathihalli March 31, 2025 16:41
@divyashreepathihalli divyashreepathihalli added the kokoro:force-run Runs Tests on GPU label Mar 31, 2025
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Mar 31, 2025
@divyashreepathihalli (Collaborator) left a comment

Just reviewed the MoE part of the code.

@@ -79,7 +79,7 @@ def build(self, decoder_sequence_shape):
        self.hidden_dim = decoder_sequence_shape[-1]

        # Self attention layer.
-       self._self_attention_layer = QwenAttention(
+       self._self_attention_layer = QwenMoeAttention(
Collaborator

I see you have forked this file into the qwen_moe folder - why is this one being edited?

Collaborator Author

Looks like this slipped in during find & replace; fixed it.


        for expert_idx in range(self.num_experts):
            expert_layer = self.experts[expert_idx]
            idx, top_x = ops.where(expert_mask[expert_idx])
Collaborator

Will ops.where return a different output shape here on different forward passes? If so, that would not work with JAX XLA.

Collaborator Author

I don't think that this should be the case. Why do you think so?

@abheesht17 (Collaborator) commented Apr 2, 2025

ops.where won't work on JAX at all, if you don't provide x and y.

ops.where calls jnp.where. Check the note here:

Because the size of the output of nonzero is data-dependent, the function is not
compatible with JIT and other transformations. The JAX version adds the optional
size argument which must be specified statically for jnp.nonzero to be used within
JAX’s transformations.

You can, however, consider passing the size argument. That might make it work.
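
For illustration, a minimal sketch of the size workaround mentioned above (the mask and sizes are made up):

import jax
import jax.numpy as jnp

@jax.jit
def select_token_indices(mask):
    # With a static `size` (and a `fill_value` for padding), the one-argument
    # form of jnp.where has a fixed output shape and traces under jit.
    (idx,) = jnp.where(mask, size=mask.shape[0], fill_value=-1)
    return idx

mask = jnp.array([True, False, True, False])
print(select_token_indices(mask))  # [ 0  2 -1 -1]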

        )
        expert_mask = ops.transpose(expert_mask, axes=[2, 1, 0])

        for expert_idx in range(self.num_experts):
@divyashreepathihalli (Collaborator) commented Mar 31, 2025

The for loop here to go over the experts is inefficient for XLA compilation; this implementation would need to be updated. I had tried out a dummy MoE implementation in JAX here: https://colab.sandbox.google.com/drive/1r0rscZK_2bNpDmFLC1POEEQoKcqWQYlQ
In order to bring this to KerasHub, we are missing a ragged_dot op.
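
For illustration, a loop-free dispatch can be written with plain einsums by running every expert on every token and weighting by the routing probabilities (dense compute, so it trades FLOPs for XLA-friendliness). The shapes and names below are illustrative, not the PR's or the colab's code:

import keras
from keras import ops

num_tokens, hidden_dim, inter_dim, num_experts = 6, 8, 16, 4
x = keras.random.normal((num_tokens, hidden_dim))
gate_kernel = keras.random.normal((num_experts, hidden_dim, inter_dim))
down_kernel = keras.random.normal((num_experts, inter_dim, hidden_dim))
# In a real MoE these are zero outside each token's top-k experts; a plain
# softmax is used here just to get the shapes right.
routing_weights = ops.softmax(keras.random.normal((num_tokens, num_experts)), axis=-1)

h = ops.silu(ops.einsum("th,ehm->etm", x, gate_kernel))      # (experts, tokens, inter)
expert_out = ops.einsum("etm,emh->eth", h, down_kernel)      # (experts, tokens, hidden)
out = ops.einsum("eth,te->th", expert_out, routing_weights)  # (tokens, hidden)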

Collaborator

I have prototyped the implementation - will add this op soon

Collaborator Author

When can we expect this to be available as part of keras.ops?

@heyyanshuman (Collaborator Author)

@divyashreepathihalli How should we accommodate the aux_loss for the CausalLM task model here?

We are specifying SparseCategoricalCrossentropy loss here:

        if optimizer == "auto":
            optimizer = keras.optimizers.Adam(2e-5)
        if loss == "auto":
            loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        if weighted_metrics == "auto":
            weighted_metrics = [keras.metrics.SparseCategoricalAccuracy()]
        super().compile(
            optimizer=optimizer,
            loss=loss,
            weighted_metrics=weighted_metrics,
            **kwargs,
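
One option (a hedged sketch, not the PR's implementation): have the sparse MoE block register the router auxiliary loss with add_loss during call, so model.fit adds it on top of the compiled cross-entropy automatically. The names and the load-balancing formula below are illustrative.

import keras
from keras import ops

class RouterAuxLossDemo(keras.layers.Layer):
    """Toy layer illustrating `add_loss` for a router auxiliary loss."""

    def __init__(self, num_experts, router_aux_loss_coefficient=0.01, **kwargs):
        super().__init__(**kwargs)
        self.num_experts = num_experts
        self.router_aux_loss_coefficient = router_aux_loss_coefficient

    def call(self, router_logits):
        # router_logits: (num_tokens, num_experts)
        probs = ops.softmax(router_logits, axis=-1)
        # Simple load-balancing penalty: squared mean probability per expert.
        mean_per_expert = ops.mean(probs, axis=0)
        aux_loss = self.num_experts * ops.sum(mean_per_expert * mean_per_expert)
        self.add_loss(self.router_aux_loss_coefficient * aux_loss)
        return probs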

@divyashreepathihalli (Collaborator) left a comment

Thanks for the updates @heyyanshuman!
I left some comments on the PR regarding tf ops.
Please add tests for the layers, backbones, and tasks.
I am curious to know if model.fit works - do you have a demo colab for inference and fine-tuning? I'm looking for the aux loss implementation.

        )
        self._query_dense.build(inputs_shape)

        self._key_dense = keras.layers.EinsumDense(
@divyashreepathihalli (Collaborator) commented Apr 10, 2025

You might want to rename these to match other KerasHub models - query_dense and value_dense. This will allow enabling LoRA on this model:

return ["query_dense", "value_dense", "query", "value"]
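
With the projection layers named to match those target names, enabling LoRA becomes a one-liner on the backbone. A hedged usage sketch - the backbone export path and preset name below are assumptions, used only to illustrate the call:

import keras_hub

# Hypothetical preset name; any QwenMoeBackbone instance works the same way.
backbone = keras_hub.models.QwenMoeBackbone.from_preset("qwen_moe_a2_7b_en")
backbone.enable_lora(rank=4)  # adapts the query/value dense layers listed above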

Collaborator Author

I don't have access to this document :(

Collaborator

Sorry, copy-paste error - updated the link:

return ["query_dense", "value_dense", "query", "value"]

@keras_hub_export(
    "keras_hub.models.QwenMoeCausalLM",
)
class QwenMoeCausalLM(CausalLM):
Collaborator

Add a docstring and an example showing model.fit and generate.
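
A hedged sketch of what such a docstring example could look like - the preset name is hypothetical:

class QwenMoeCausalLM(CausalLM):
    """An end-to-end Qwen MoE model for causal language modeling.

    Example:

        qwen_moe_lm = keras_hub.models.QwenMoeCausalLM.from_preset(
            "qwen_moe_a2_7b_en"  # hypothetical preset name
        )
        qwen_moe_lm.generate("What is Keras?", max_length=64)
        # Fine-tune on raw strings; the attached preprocessor tokenizes them.
        qwen_moe_lm.fit(x=["What is Keras?"], batch_size=1)
    """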

@divyashreepathihalli divyashreepathihalli added the kokoro:force-run Runs Tests on GPU label Apr 14, 2025
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Apr 14, 2025
@divyashreepathihalli (Collaborator) left a comment

Thanks @heyyanshuman - can you add a demo colab for inference and fit? And also provide a colab/screenshot for numerics verification?

        )
        self._query_dense.build(inputs_shape)

        self._key_dense = keras.layers.EinsumDense(
Collaborator

Sorry, copy-paste error - updated the link:

return ["query_dense", "value_dense", "query", "value"]

            return False
        if running_on_gpu():
            # GPU never supports softcap in the fused op.
            if self.logit_soft_cap is not None:
Collaborator

Does Qwen MoE use logit_soft_cap?

@heyyanshuman (Collaborator Author)

Output matching screenshot:

[image: output comparison screenshot]

@divyashreepathihalli divyashreepathihalli added the kokoro:force-run Runs Tests on GPU label Apr 15, 2025
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Apr 15, 2025
        gate_output = self._feedforward_gate_dense(x)

        # Note that we run the activation function in full 32-bit
        # precision since this is what `torch.nn.functional.silu`
Collaborator

is this torch specific? Or did you mean this is what the original implementation is doing?
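
For context, a minimal sketch of the upcast-activate-downcast pattern the comment describes (mirroring the reference implementation's numerics rather than anything torch-specific; the helper name is illustrative):

from keras import ops

def silu_in_float32(x, compute_dtype="bfloat16"):
    # Run the activation in float32, then cast back to the layer's compute dtype.
    return ops.cast(ops.silu(ops.cast(x, "float32")), compute_dtype)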



class QwenMoeExperts(keras.layers.Layer):
    """Batched feed-forward experts à-la Llama-4 (pure keras.ops)."""
@divyashreepathihalli (Collaborator) commented Apr 15, 2025

NIT: "à-la" - let's keep this in simple English - "inspired by Llama 4" or similar.



class QwenSparseMoeBlock(keras.layers.Layer):
    """Qwen-2 sparse block rewritten in Llama-4 batched style."""
Collaborator

NIT: "rewritten" -> "inspired by the Llama 4 implementation".

        shared_expert_intermediate_dim,
        num_experts,
        top_k,
        norm_topk_prob,
Collaborator

NIT: topk->top_k

        norm_topk_prob,
        kernel_initializer="glorot_uniform",
        layer_norm_epsilon=1e-5,
        router_aux_loss_coef=0.01,
Collaborator

expand coef-> coefficient

        self.intermediate_dim_shared = shared_expert_intermediate_dim
        self.num_experts = num_experts
        self.top_k = top_k
        self.norm_topk_prob = norm_topk_prob
Collaborator

topk->top_k

        shared_expert_intermediate_dim,
        num_experts,
        top_k,
        norm_topk_prob,
Collaborator

topk->top_k - here and everywhere

        self.intermediate_dim = intermediate_dim
        self.num_query_heads = num_query_heads
        self.num_key_value_heads = num_key_value_heads

Collaborator

remove unnecessary new lines



class QwenMoeBackboneTest(TestCase):
    def setUp(self):
Collaborator

add a flash attention mock test like -

def test_flash_attention_call(self):
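
A hedged sketch of one way such a test could look - a numerics comparison using Keras' global flash-attention toggle rather than a mock; self.init_kwargs and self.input_data are assumed to come from setUp:

    def test_flash_attention_call(self):
        backbone = QwenMoeBackbone(**self.init_kwargs)
        keras.config.disable_flash_attention()
        reference = backbone(self.input_data)
        keras.config.enable_flash_attention()
        try:
            flash = backbone(self.input_data)
        finally:
            keras.config.disable_flash_attention()
        self.assertAllClose(reference, flash, atol=1e-3)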

        return out, router_logits


class QwenMoeTransformerDecoder(keras.layers.Layer):
Collaborator

add a self.run_layer_test for the decoder

        return x


class QwenMoeExperts(keras.layers.Layer):
Collaborator

also add a self.run_layer_test for this layer as well
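
A hedged sketch of the requested pattern using keras_hub's TestCase.run_layer_test - the init kwargs, input shape, and expected output shape below are illustrative, not the PR's actual configuration:

    def test_experts_layer_basics(self):
        self.run_layer_test(
            cls=QwenMoeExperts,
            init_kwargs={
                "num_experts": 4,
                "hidden_dim": 8,
                "intermediate_dim": 16,
            },
            input_data=keras.random.uniform((2, 4, 8)),
            expected_output_shape=(2, 4, 8),
        )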

@divyashreepathihalli (Collaborator) left a comment

Left a few more NIT comments! Looking great overall!
Once the comments are addressed and we have the inference and fit demo, it will be ready for merge.
