Implemented Coca architecture #2371

Open: wants to merge 13 commits into base master
Changes from 1 commit
Lowercased coca model directory and added to kokoro build
VarunS1997 committed Mar 5, 2024
commit f15408f2e7713dc39a6bc6cb9dcad49ee5869edb
2 changes: 2 additions & 0 deletions .kokoro/github/ubuntu/gpu/build.sh
@@ -69,6 +69,7 @@ then
   keras_cv/models/object_detection/retinanet \
   keras_cv/models/object_detection/yolo_v8 \
   keras_cv/models/object_detection_3d \
+  keras_cv/models/feature_extractor/coca \
   keras_cv/models/segmentation \
   keras_cv/models/stable_diffusion
 else
@@ -83,6 +84,7 @@ else
   keras_cv/models/object_detection/retinanet \
   keras_cv/models/object_detection/yolo_v8 \
   keras_cv/models/object_detection_3d \
+  keras_cv/models/feature_extractor/coca \
   keras_cv/models/segmentation \
   keras_cv/models/stable_diffusion
 fi
2 changes: 1 addition & 1 deletion keras_cv/layers/attention_pooling.py
@@ -2,7 +2,7 @@


 class AttentionPooling(layers.Layer):
-    """Implements the Pooled Attention Layer used in "CoCa": Contrastive Captioners are Image-Text Foundation Models"
+    """Implements the Pooled Attention Layer used in "coca": Contrastive Captioners are Image-Text Foundation Models"
     (https://arxiv.org/pdf/2205.01917.pdf), consisting of a Multiheaded Attention followed by Layer Normalization.

     Args:
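The docstring above describes the layer as multi-head attention followed by layer normalization. As a rough illustration of that computation (not keras_cv's actual implementation, which wraps Keras attention and normalization layers), here is a single-head numpy sketch; the function name and shapes are hypothetical:

```python
import numpy as np

def attention_pool(queries, features, eps=1e-6):
    # Single-head sketch: learned queries cross-attend over patch
    # features, then the pooled output is layer-normalized.
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)      # [n_queries, n_patches]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over patches
    pooled = weights @ features                     # [n_queries, d]
    # layer normalization over the feature dimension
    mean = pooled.mean(axis=-1, keepdims=True)
    var = pooled.var(axis=-1, keepdims=True)
    return (pooled - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
features = rng.normal(size=(49, 64))  # e.g. 7x7 patch embeddings, width 64
queries = rng.normal(size=(2, 64))    # two learned pooling queries
pooled = attention_pool(queries, features)
print(pooled.shape)  # (2, 64)
```

Each query produces one pooled vector, so the layer reduces a variable-length patch sequence to a fixed number of outputs.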
(third changed file; file header not captured in this view — the hunks below edit the CoCa model definition)
@@ -21,7 +21,7 @@
 from keras_cv.layers.vit_layers import PatchingAndEmbedding


-@keras_cv_export(["keras_cv.models.CoCa"])
+@keras_cv_export(["keras_cv.models.coca"])
 class CoCa(Task):
     def __init__(self,
                  img_patch_size=18,
@@ -91,7 +91,7 @@ def __init__(self,
""" Contrastive Captioner foundational model implementation.
This model implements the "Contrastive Captioners are image-Text Foundational Models" by Yu, et al.
(https://arxiv.org/pdf/2205.01917.pdf). In short, the CoCa model combines the ideas of Contrastive techniques
(https://arxiv.org/pdf/2205.01917.pdf). In short, the coca model combines the ideas of Contrastive techniques
such as CLIP, with Generative Captioning approaches such as SimVLM.
The architecture of clip can be described as an Image Visual Transformer Encoder in parallel to self-attention-only
@@ -105,7 +105,7 @@ def __init__(self,
     images = ... # [batch_size, height, width, channel]
     text = ... # [batch_size, text_dim, sequence_length]

-    coca = CoCa()
+    coca = coca()

     # [batch_size, sequence_length, captioning_query_length]
     output = coca(images, text)
@@ -118,7 +118,7 @@ def __init__(self,
     encoder_depth: number of image encoder blocks
     encoder_heads: number of attention heads used in each image encoder block
     encoder_intermediate_dim: dimensionality of the encoder blocks' intermediate representation (MLP dimensionality)
-    encoder_width: dimensionality of the encoder's projection, consistent with wording used in CoCa paper.
+    encoder_width: dimensionality of the encoder's projection, consistent with wording used in coca paper.
     unimodal_decoder_depth: number of decoder blocks used for text self-attention/embedding
     multimodal_decoder_depth: number of decoder blocks used for image-text cross-attention and captioning
     decoder_intermediate_dim: dimensionality of the decoder blocks' MLPs
@@ -137,7 +137,7 @@ def build(self, input_shape):

         # Validate Input Shape
         if len(input_shape) < 2:
-            raise ValueError("Build arguments to CoCa expected to contain shapes of both text and image data; "
+            raise ValueError("Build arguments to coca expected to contain shapes of both text and image data; "
                              f"got {len(input_shape)} shapes.")

         images_shape = input_shape[0]
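The model docstring in this diff says CoCa pairs a CLIP-style contrastive objective with generative captioning. A minimal numpy sketch of just the contrastive half (symmetric InfoNCE over normalized image/text embeddings) may make that pairing concrete; the function names and temperature value here are illustrative, not keras_cv API:

```python
import numpy as np

def cross_entropy(logits, labels):
    # mean negative log-likelihood of the target class per row
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # CLIP-style symmetric InfoNCE: matching image/text pairs sit on
    # the diagonal of the similarity matrix and act as the targets.
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature       # [batch, batch] similarities
    labels = np.arange(logits.shape[0])
    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
matched = contrastive_loss(emb, emb)           # aligned pairs: low loss
mismatched = contrastive_loss(emb, emb[::-1])  # shuffled pairs: high loss
print(matched < mismatched)  # True
```

In the full CoCa objective this term is summed with a captioning (next-token cross-entropy) loss from the multimodal decoder, which is omitted here.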