Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models


Abstract

This paper presents Vary, an efficient and effective method to scale up the vision vocabulary of Large Vision-Language Models (LVLMs). The procedure naturally divides into two phases: generating a new vision vocabulary and integrating it. In the first phase, we devise a "vocabulary network" along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the second, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling LVLMs to quickly acquire new features. Compared to the popular BLIP-2, MiniGPT4, and LLaVA, Vary retains their general capabilities while significantly improving fine-grained perception and understanding.
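As a rough illustration of the second phase, the sketch below shows one plausible reading of "merging" two vision vocabularies: concatenating the per-token features of two encoders along the channel axis before they reach the language model. The abstract does not specify the merge operation, so the function name `merge_vocabularies`, the toy shapes, and the concatenation itself are assumptions made for illustration only.

```python
from typing import List

# One feature vector per image token (hypothetical representation).
Tokens = List[List[float]]


def merge_vocabularies(clip_tokens: Tokens, new_tokens: Tokens) -> Tokens:
    """Concatenate per-token features from two vision encoders channel-wise.

    Assumption: both encoders emit the same number of tokens per image.
    """
    if len(clip_tokens) != len(new_tokens):
        raise ValueError("both vocabularies must produce the same number of tokens")
    # List concatenation joins the two channel vectors of each token.
    return [c + n for c, n in zip(clip_tokens, new_tokens)]


# Toy example: 4 image tokens, 3 channels from each (hypothetical) encoder.
clip_tokens = [[0.1 * i] * 3 for i in range(4)]
new_tokens = [[1.0 + i] * 3 for i in range(4)]
merged = merge_vocabularies(clip_tokens, new_tokens)
print(len(merged), len(merged[0]))  # 4 tokens, 6 channels each
```

Under this assumption the token count stays fixed while the channel width grows, so the original CLIP features are passed through unchanged alongside the new ones.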


Multimodal dialogue examples


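What is strange about this picture?
The strange thing about this picture is that a person is standing at the back of a yellow van parked on the street, ironing a blue piece of clothing. This is unusual because ironing is normally done indoors, for example on an ironing board, at home, or in a professional laundry. In addition, it is unsafe for a person to stand on a moving vehicle while ironing clothes, as this could lead to an accident or injury.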
Describe this image in about 100 words.
The image features a bowl of soup placed on a dining table. The soup is a green and white mixture, which resembles a yin and yang symbol. The bowl is filled with soup, showcasing a creative and artistic presentation. A spoon can be seen on the table, likely for enjoying the soup. The soup's vibrant colors and unique design make it an appealing and visually pleasing meal.



Citation

@article{wei2023vary,
  author = {Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  title = {Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  journal = {arXiv preprint arXiv:2312.06109},
  year = {2023},
}