maestro is a tool designed to streamline and accelerate the fine-tuning process for multimodal models. It provides ready-to-use recipes for fine-tuning popular vision-language models (VLMs) such as Florence-2, PaliGemma 2, and Qwen2.5-VL on downstream vision-language tasks.
To get started with maestro, install the dependencies specific to the model you wish to fine-tune. For example, to pull in the Qwen2.5-VL dependencies:

```bash
pip install "maestro[qwen_2_5_vl]"
```
Note: Some models have clashing dependencies, so we recommend creating a separate Python environment for each model to avoid version conflicts.
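One way to keep models isolated (a minimal sketch, assuming a Unix-like shell and Python's built-in `venv` module; the environment name is purely illustrative) is a dedicated virtual environment per model:

```bash
# Create and activate an environment reserved for Qwen2.5-VL fine-tuning
python -m venv .venv-qwen_2_5_vl
source .venv-qwen_2_5_vl/bin/activate

# Install maestro with the Qwen2.5-VL extra inside this environment only
pip install "maestro[qwen_2_5_vl]"
```

With the dependencies installed, you can launch a fine-tuning run directly from the CLI: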
```bash
maestro qwen_2_5_vl train \
  --dataset "dataset/location" \
  --epochs 10 \
  --batch-size 4 \
  --optimization_strategy "qlora" \
  --metrics "edit_distance"
```
If you prefer to configure and launch training from Python, the same run can be expressed with the training API:

```python
from maestro.trainer.models.qwen_2_5_vl.core import train

# Same parameters as the CLI example above
config = {
    "dataset": "dataset/location",
    "epochs": 10,
    "batch_size": 4,
    "optimization_strategy": "qlora",
    "metrics": ["edit_distance"]
}

train(config)
```
We would love your help in making this repository even better! We are especially looking for contributors with experience in fine-tuning vision-language models (VLMs). If you notice any bugs or have suggestions for improvement, feel free to open an issue or submit a pull request.