Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models



This paper presents Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two stages: generating a new vision vocabulary and integrating it. In the first stage, we devise a "vocabulary network" along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the second, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling LVLMs to quickly acquire new features. Compared to the popular BLIP-2, MiniGPT4, and LLaVA, Vary maintains their universal capabilities while significantly improving fine-grained perception and understanding ability.
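The merging step can be pictured as running both vision encoders on the same image and fusing their features before they enter the LLM. The sketch below is a minimal illustration of that idea, not the authors' implementation: the encoder classes, dimensions, and the channel-wise concatenation followed by a linear map into the LLM embedding space are all assumptions for demonstration.

```python
import numpy as np

class VisionEncoder:
    """Stand-in for a frozen vision encoder (either the original CLIP
    vocabulary or the new one). Returns patch features of shape
    (num_patches, dim). A random projection replaces learned weights."""
    def __init__(self, dim, seed):
        self.rng = np.random.default_rng(seed)
        # 16x16 RGB patches flattened to 768 values per patch (assumption).
        self.proj = self.rng.standard_normal((3 * 16 * 16, dim)) / 100.0

    def __call__(self, image_patches):
        # image_patches: (num_patches, 3*16*16)
        return image_patches @ self.proj

def merged_vision_tokens(image_patches, clip_enc, new_enc, w):
    """Hypothetical merge: concatenate the two vocabularies' features
    channel-wise, then map into the LLM embedding space with a single
    linear head `w` (an illustrative stand-in for a trained projector)."""
    f_clip = clip_enc(image_patches)                  # (N, d)
    f_new = new_enc(image_patches)                    # (N, d)
    fused = np.concatenate([f_clip, f_new], axis=1)   # (N, 2d)
    return fused @ w                                  # (N, llm_dim)

# Usage with illustrative sizes: 256 patches, d=1024, llm_dim=2048.
rng = np.random.default_rng(0)
patches = rng.standard_normal((256, 3 * 16 * 16))
clip_enc = VisionEncoder(dim=1024, seed=1)
new_enc = VisionEncoder(dim=1024, seed=2)
w = rng.standard_normal((2048, 2048)) / 100.0
tokens = merged_vision_tokens(patches, clip_enc, new_enc, w)
```

The key design point sketched here is that the LLM never has to be aware of two separate encoders; it only sees one fused token stream per image.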

Multimodal dialogue examples

Describe this image in about 100 words.
The image features a bowl of soup placed on a dining table. The soup is a green and white mixture, which resembles a yin and yang symbol. The bowl is filled with soup, showcasing a creative and artistic presentation. A spoon can be seen on the table, likely for enjoying the soup. The soup's vibrant colors and unique design make it an appealing and visually pleasing meal.


Citation

@article{wei2023vary,
  author  = {Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  title   = {Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  journal = {arXiv preprint arXiv:2312.06109},
  year    = {2023}
}