Multimodal embeddings #260
Is there an interface planned for multimodal embeddings? We'd love to contribute one that accepts interleaved text and images, similar to how Anthropic does content blocks.
We don't have anything planned yet! So given the content blocks example, is the idea that you would accept an interleaved array of text and images, and then generate embeddings based on that content? I assume this means the model would be a multimodal embedding model, like CLIP, for example?
Yup, that's exactly what I'm thinking. The ability to accept content blocks would help a lot with RAG applications as well, since the full retrieved documents could be sent directly to the LLM.
Definitely open to exploring this. If you have a proposal for a multimodal embeddings interface, I'm definitely curious. I also realize the current structure of text-only vectorizers is a bit rigid. A better solution might be to package support for text, image, and multimodal embeddings into a single streamlined interface. Open to suggestions!
@tylerhutcherson I created a proposal for how multimodal embeddings could work, added a reference implementation with VoyageAI, and opened a draft PR: #294
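For reference, here is a rough sketch of what a content-block style interface could look like. All names below (`TextBlock`, `ImageBlock`, `MultiModalVectorizer`, `embed_content`) are hypothetical illustrations of the idea, not the actual API from PR #294:

```python
# Hypothetical sketch of a multimodal vectorizer interface that accepts
# interleaved text and image content blocks. Names are illustrative only.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextBlock:
    text: str


@dataclass
class ImageBlock:
    data: bytes       # raw image bytes
    media_type: str   # e.g. "image/png"


ContentBlock = Union[TextBlock, ImageBlock]


class MultiModalVectorizer:
    """Base interface: encode a sequence of interleaved content blocks
    into a single embedding vector."""

    def embed_content(self, blocks: List[ContentBlock]) -> List[float]:
        # A concrete subclass (e.g. backed by CLIP or VoyageAI's
        # multimodal models) would encode the blocks here.
        raise NotImplementedError

    def embed_many_content(
        self, items: List[List[ContentBlock]]
    ) -> List[List[float]]:
        # Naive default; subclasses could batch requests instead.
        return [self.embed_content(blocks) for blocks in items]


# Example usage with a hypothetical concrete vectorizer:
#
# blocks = [
#     TextBlock(text="A photo of a red bicycle"),
#     ImageBlock(data=open("bike.png", "rb").read(), media_type="image/png"),
# ]
# vector = SomeClipVectorizer().embed_content(blocks)
```

One nice property of this shape is that the same block list used for embedding could be forwarded as-is to an LLM in a RAG pipeline, which is the interleaving use case mentioned above.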