
Latent Compression Learning (LCL)


[NeurIPS 2024] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

We introduce Latent Compression Learning (LCL) to pre-train vision models from scratch on interleaved image-text data. Compared to existing methods (e.g., CLIP, auto-regressive text generation), LCL is the first to achieve both:

  • Learning vision models from scratch
  • Training on interleaved image-text data

(Figure: overview of the LCL pre-training framework)

📈 Results

Pre-training on MMC4 Dataset

(Figure: results of pre-training on the MMC4 dataset)

Our LCL pre-training significantly outperforms all other methods on caption tasks and is on par with the best paired pre-training methods on classification and retrieval tasks.

Comparison with OpenCLIP

(Figure: transfer evaluation results compared with OpenCLIP)

(Figure: multi-modal evaluation results compared with OpenCLIP)

When both methods are trained on LAION-400M data, our LCL pre-training achieves performance similar to OpenCLIP. When MMC4 data is added, LCL pre-training outperforms OpenCLIP, especially on caption and multi-modal dialogue tasks. For a fair comparison, the total number of images seen during pre-training is kept at 13B.

📦 Pre-trained Checkpoints

| model | data | # samples | download |
| --- | --- | --- | --- |
| ViT-B/16 | LAION-400M | 13B | config / ckpt |

🛠️ Usage

Install

This code is built upon OpenCLIP; please refer to their repository for setup instructions.

Load Pre-trained Checkpoints

Here is example code for loading a pre-trained checkpoint:

import open_clip

# Name of the LCL model config registered in this repository, plus the path
# to the downloaded checkpoint.
model_name = "LCL_ViT-B-16_laion"
pretrained = "path to the `.pt` file"

model = open_clip.create_model(model_name, pretrained=pretrained)
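
Once loaded, the model can be used like a standard OpenCLIP model. Below is a minimal inference sketch; it assumes that the OpenCLIP helpers create_model_and_transforms and encode_image work with the LCL config, and the input image path is only a placeholder:

import torch
from PIL import Image
import open_clip

model_name = "LCL_ViT-B-16_laion"
pretrained = "path to the `.pt` file"

# create_model_and_transforms additionally returns the train/val image
# preprocessing pipelines defined by the model config.
model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained)
model.eval()

# "example.jpg" is a placeholder input image.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)

print(image_features.shape)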

Train LCL

Example training scripts are provided in ./scripts. You can refer to OpenCLIP for more ways to launch training.

Training on LAION-400M. Here is an example training script: ./scripts/lcl_vit_b_32_laion.sh. The corresponding model config is here.

Training on MMC4. We provide a simple dataloader that supports the original MMC4 dataset. Organize the data folder as follows:

  /path/to/mmc4/
      ├── images/
      │   └── ...
      └── data/ 
          ├── docs_shard_0_v2.jsonl.zip
          ├── docs_shard_1_v2.jsonl.zip
          └── ...
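
For reference, the sketch below shows how a single shard can be inspected. It is not the dataloader shipped in this repository; it only assumes the public MMC4 schema, in which each docs_shard_*_v2.jsonl.zip archive contains one JSONL file whose records carry text_list and image_info fields:

import json
import zipfile
from pathlib import Path

shard_path = Path("/path/to/mmc4/data/docs_shard_0_v2.jsonl.zip")

# Each shard is a zip archive holding a single JSONL file; every line is one
# interleaved image-text document.
with zipfile.ZipFile(shard_path) as zf:
    inner_name = zf.namelist()[0]
    with zf.open(inner_name) as f:
        for line in f:
            doc = json.loads(line)
            texts = doc["text_list"]  # list of text segments
            images = [img["image_name"] for img in doc["image_info"]]  # image files under images/
            print(f"{len(texts)} text segments, {len(images)} images")
            break  # inspect only the first document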

Here is an example training script: ./scripts/lcl_vit_b_32_mmc4.sh. The corresponding model config is here.

More training scripts can be found under ./scripts.

NOTE: We conducted large-scale pre-training with an internal, more efficient codebase, which will not be released for intellectual property reasons. This released version has been verified to reproduce the results of ViT-B/16 on the LAION-400M dataset.

📅 Schedule

  • basic code of LCL
  • checkpoints of more models and datasets
  • transfer evaluation code

🖊️ Citation

If you find this work helpful in your research, please consider citing:

@article{yang2024vision,
  title={Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning},
  author={Yang, Chenyu and Zhu, Xizhou and Zhu, Jinguo and Su, Weijie and Wang, Junjie and Dong, Xuan and Wang, Wenhai and Li, Bin and Zhou, Jie and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2406.07543},
  year={2024}
}

📃 License

This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.

🙏 Acknowledgements

Our code is built with reference to the following project: OpenCLIP.
