Add vision for Gemma3 #2170
base: master
Conversation
Okay! @mattdangerw / @divyashreepathihalli - this is ready for review, mostly. I'm still filling in the docstrings and the unit tests, but you can review the rest. Also, it's probably a good idea to refer to #2152, because a lot of the vision components were added in that previous PR.
LGTM, a few nit comments for cleanup.
```python
# Truth be told, this is redundant, and we can infer this from
# `vision_indices_input` directly.
text_mask_input = keras.Input(
    shape=(None,), dtype="int32", name="text_mask"
)
```
maybe remove this comment?
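A hedged sketch of the redundancy the comment refers to: if vision indices give the positions of vision tokens in each sequence, a boolean text mask can be derived from them rather than passed as a separate input (names and shapes here are illustrative, not the PR's exact API):

```python
import numpy as np

# Illustrative shapes: batch of 1, sequence length 8, vision tokens at
# positions 2 and 3. `vision_indices` is a stand-in for the PR's
# `vision_indices_input`.
seq_length = 8
vision_indices = np.array([[2, 3]])

# A position is text iff it is not listed in `vision_indices`.
text_mask = np.ones((vision_indices.shape[0], seq_length), dtype=bool)
batch_idx = np.arange(vision_indices.shape[0])[:, None]
text_mask[batch_idx, vision_indices] = False
```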
```python
# == Branch: vision model, with non-`None` value for `images` ==

# Check: token IDs should not have less than 1, or more than
# `max_images_per_prompt` start of image tokens.
```
remove commented code
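The check being discussed can be sketched like this (the token id and shapes are made up for illustration; the real preprocessor works with its own tokenizer's ids):

```python
import numpy as np

START_OF_IMAGE_TOKEN_ID = 255999  # hypothetical id, for illustration only
max_images_per_prompt = 2

token_ids = np.array([
    [1, START_OF_IMAGE_TOKEN_ID, 7, 8, 2],
    [1, START_OF_IMAGE_TOKEN_ID, 7, START_OF_IMAGE_TOKEN_ID, 2],
])

# Count start-of-image tokens per prompt; each prompt must contain
# between 1 and `max_images_per_prompt` of them.
counts = np.sum(token_ids == START_OF_IMAGE_TOKEN_ID, axis=-1)
valid = (counts >= 1) & (counts <= max_images_per_prompt)
```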
```python
# == Branch: vision model, with non-`None` value for `images` ==

# Check: token IDs should not have less than 0, or more than
# `max_images_per_prompt` start of image tokens.
```
remove commented code
```python
START_OF_IMAGE_TOKEN = "<start_of_image>"
IMAGE_PLACEHOLDER_TOKEN = "<img>"
END_OF_IMAGE_TOKEN = "<end_of_image>"
```
Need to remove this
They are no longer used, right?
Yeah, no longer used
Thanks! Haven't done a full read-through of the preprocessor yet, but left some comments.
```python
# Add these for `Gemma3VITAttention`.
if not self.text_only_model:
    target_names += ["query_proj", "value_proj"]
```
Does it hurt to just always leave these as part of the targets? They just won't match, right?
Hmmm, yeah. But let's keep the `if` condition for clarity.
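The point about non-matching targets can be illustrated with a toy name filter (this is a sketch, not KerasHub's actual LoRA plumbing; all names here are illustrative):

```python
# Hypothetical layer names for a text-only model: the vision-tower
# projections never appear, so extra target names are simply inert.
layer_names = ["query_dense", "value_dense", "ffw_linear"]
target_names = ["query_dense", "value_dense", "query_proj", "value_proj"]

# Only the names present in the model actually match; unmatched
# targets are ignored rather than causing an error.
matched = [name for name in layer_names if name in target_names]
```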
```python
text_mask: Boolean tensor of shape `(batch_size, seq_length)`.
image_embeddings: tensor. Image embeddings as returned by the
    vision encoder (`Gemma3ViT`, usually). Shape:
    `(batch_size * num_images_per_prompt, num_vision_tokens_per_image,`
```
Keep this indented so the arg list reads right.
```python
def __init__(self, **kwargs):
    # Always do image preprocessing in float32.
    kwargs.pop("dtype", None)
    dtype = "float32"
```
Why is this, btw? Won't the images get converted to the compute dtype later?
I want to do standardisation and normalisation in float32, because these ops are sensitive to precision. What do you think - worth keeping?
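One concrete motivation for the float32 choice (a self-contained illustration, not code from the PR): float16 tops out at 65504, so even accumulating pixel values for per-image statistics can overflow in half precision:

```python
import numpy as np

# A 224x224 image filled with a typical pixel value.
img = np.full((224, 224), 200.0, dtype=np.float16)

# Accumulating in float16 overflows (float16 max is 65504)...
half_sum = img.sum(dtype=np.float16)  # inf

# ...while float32 has plenty of headroom.
full_sum = img.astype(np.float32).sum()  # 224 * 224 * 200 = 10035200.0
```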
```python
START_OF_IMAGE_TOKEN = "<start_of_image>"
IMAGE_PLACEHOLDER_TOKEN = "<img>"
END_OF_IMAGE_TOKEN = "<end_of_image>"


@keras_hub_export("keras_hub.models.Gemma3CausalLMPreprocessor")
```
Not for this PR, but it looks like our prompts/responses setup will not work for multi-turn conversations. We should consider how we want that to work in the future.
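For reference, a hedged sketch of what multi-turn input could look like using Gemma's turn markers (exactly how prompts/responses would map onto this template is the open question above, not something this PR implements):

```python
# Gemma-style turn markers; the mapping from a flat prompts/responses
# pair onto these turns is the unresolved design question.
turns = [
    ("user", "What is in this image? <start_of_image>"),
    ("model", "A cat sitting on a mat."),
    ("user", "What colour is it?"),
]

prompt = "".join(
    f"<start_of_turn>{role}\n{text}<end_of_turn>\n" for role, text in turns
) + "<start_of_turn>model\n"
```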
```python
    else original_image_shape[-2]
)

if keras.config.backend() == "torch" and not isinstance(
```
This seems unsafe, given that in some modes of operation `images` is user input, right? We could also get a np array with a `.cpu()` function, for example. Maybe do `images = ops.convert_to_numpy(images)`, which should handle CPU conversion and the other cases.
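The suggestion amounts to funnelling all inputs through one conversion point instead of special-casing the torch backend. A minimal duck-typed sketch of such a helper (KerasHub would use `keras.ops.convert_to_numpy`; this standalone version just mirrors the idea):

```python
import numpy as np

def to_numpy(x):
    """Hypothetical helper mirroring `ops.convert_to_numpy`."""
    if isinstance(x, np.ndarray):
        return x
    if hasattr(x, "detach"):
        # torch.Tensor: move off the accelerator before converting.
        return x.detach().cpu().numpy()
    # Lists, tuples, and other array-likes.
    return np.asarray(x)

images = to_numpy([[0.0, 1.0], [2.0, 3.0]])
```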
```python
if responses is not None:
    responses = tf.expand_dims(responses, axis=0)

# There are 8 cases, based on values of
```
Consider if there's some common utilities to refactor and share between call/generate. I see a lot of duplicated code.
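One small example of the kind of utility that could be shared between `call()` and `generate()` (a hypothetical sketch; the actual refactor is up to the PR author):

```python
import numpy as np

def ensure_batched(x):
    """Hypothetical shared helper: promote an unbatched input to a
    batch of size 1, mirroring the `expand_dims(..., axis=0)` calls
    duplicated across call/generate."""
    x = np.asarray(x)
    if x.ndim < 2:
        x = np.expand_dims(x, axis=0)
    return x
```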
Bunch of notebooks to demonstrate how the different components work:

- `CausalLMPreprocessor`: https://colab.research.google.com/drive/1fAb3Rvrw2zRZd5gfVCmB5WW2qdWl_eMN?resourcekey=0-uJD6GoFVgREFVSeMVaXAGA&usp=sharing
- `.generate()`: https://colab.research.google.com/drive/1l2atV5VNt9HYKk-BZ39YcQBcsNG4D94V?resourcekey=0-qX4Xka61fikIygXpFZ-eGA&usp=sharing
- `.fit()` on a randomly initialised model (because even the 4B one cannot fit on an A100): https://colab.research.google.com/drive/11Yi9oAtBs9VvrwJPmtY9bsr8ZajwRhl6?resourcekey=0-wKYPYAm9uw51k1T39DfOGg&usp=sharing
- JAX `generate()`: https://colab.research.google.com/drive/1AR7G3UFawqO9VsNDTtLUrJP-dRTvbkh1?resourcekey=0-4ofXHp_ko1LDzFIG9pYkdQ&usp=sharing
- JAX `.fit()`: https://colab.research.google.com/drive/15IF3wF55EIbuenRB99R_JgbYI8Z_eA9U?resourcekey=0-w0sOujM6cxO1qsvkTXYiDA&usp=sharing