Merlin can be run by instantiating the model in PyTorch. Merlin weights are also publicly available on HuggingFace.
- Image/Text contrastive embeddings
- Image-only embeddings (functionality similar to Google CT Foundation)
For a better understanding of the phenotypes and their associated PheWAS attributes, please refer to the phenotypes file.
Please see the demo for programmatic examples.
To get the image/text contrastive embeddings for inference:
```python
import torch
from merlin import Merlin

device = torch.device("cuda")
model = Merlin()
model.eval()
model.cuda()

for batch in dataloader:  # dataloader yields CT volumes and paired report text
    outputs = model(
        batch["image"].to(device),
        batch["text"],
    )
```
where `outputs` is a tuple:
- `outputs[0]`: the contrastive image embeddings (shape: `[1, 512]`)
- `outputs[1]`: the phenotype predictions (shape: `[1, 1692]`)
- `outputs[2]`: the contrastive text embeddings (shape: `[1, 512]`)
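As a usage sketch, the paired contrastive embeddings can be scored with cosine similarity for image-text retrieval. The snippet below is illustrative and assumes the `outputs` tuple from the loop above; it is not part of the Merlin API:

```python
import torch.nn.functional as F

image_emb, phenotype_preds, text_emb = outputs

# Cosine similarity between the CT volume embedding and the report
# text embedding; higher values indicate a better image-text match.
similarity = F.cosine_similarity(image_emb, text_emb, dim=-1)  # shape: [1]
print(f"Image-text similarity: {similarity.item():.4f}")
```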
To get the image-only embeddings for inference:

```python
import torch
from merlin import Merlin

device = torch.device("cuda")
model = Merlin(ImageEmbedding=True)
model.eval()
model.cuda()

for batch in dataloader:  # dataloader yields CT volumes only
    outputs = model(
        batch["image"].to(device),
    )
```
where `outputs` is a tuple:
- `outputs[0]`: the image embeddings (shape: `[1, 2048]`)
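These embeddings can serve as frozen features for downstream tasks. For example, a linear probe can be trained on top of the 2048-dimensional embeddings; the sketch below is hypothetical, and `num_classes` is task-specific rather than defined by Merlin:

```python
import torch

# Hypothetical linear probe trained on frozen Merlin image embeddings.
num_classes = 2  # assumption: a binary downstream task
probe = torch.nn.Linear(2048, num_classes).to(device)

with torch.no_grad():  # keep the Merlin backbone frozen
    features = model(batch["image"].to(device))[0]  # shape: [1, 2048]
logits = probe(features)  # shape: [1, num_classes]
```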
Since both Merlin's model architecture and pretrained weights are provided, Merlin can be finetuned directly in PyTorch vision-language (VLM) and vision-only pipelines. Additionally, Merlin was trained on a single NVIDIA A6000 GPU (with a vision-language batch size of 18), so finetuning is feasible even in compute-constrained environments.
Merlin supports both Image/Text and Image-only finetuning. To perform finetuning, simply remove the following lines of code and train on your data:
```python
model.eval()
model.cuda()
```
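One possible shape for an image/text finetuning loop is shown below. The optimizer choice, learning rate, and `compute_loss` function are illustrative assumptions, not prescribed by Merlin:

```python
import torch
from merlin import Merlin

model = Merlin()
model.train()  # training mode, in place of model.eval()
model.cuda()   # or let your training framework handle device placement

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed hyperparameters

for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(batch["image"].cuda(), batch["text"])
    loss = compute_loss(outputs, batch)  # hypothetical task-specific loss
    loss.backward()
    optimizer.step()
```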
For compute-efficient finetuning, we recommend using mixed-precision training and gradient accumulation.
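A sketch of both techniques together using PyTorch's `torch.amp` utilities, continuing the loop above (`accumulation_steps` is an arbitrary example value):

```python
import torch

scaler = torch.amp.GradScaler("cuda")
accumulation_steps = 4  # arbitrary example value

for step, batch in enumerate(dataloader):
    # Run the forward pass in mixed precision to reduce memory use.
    with torch.amp.autocast("cuda"):
        outputs = model(batch["image"].cuda(), batch["text"])
        loss = compute_loss(outputs, batch) / accumulation_steps  # hypothetical loss

    scaler.scale(loss).backward()

    # Step the optimizer only every `accumulation_steps` batches,
    # simulating a larger effective batch size.
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```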