This paper presents Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two stages: the generation and the integration of a new vision vocabulary. In the first stage, we devise a "vocabulary network" along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the second, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling LVLMs to quickly acquire new features. Compared to the popular BLIP-2, MiniGPT4, and LLaVA, Vary maintains their universal capabilities while significantly improving fine-grained perception and understanding ability.
@article{wei2023vary,
author = {Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
title = {Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
journal = {arXiv preprint arXiv:2312.06109},
year = {2023},
}