Harnessing the strong language understanding and generation capabilities of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) have been developed in recent years for vision-and-language understanding tasks. MLLMs have shown promising results on general images by aligning a pre-trained visual encoder (e.g., a Vision Transformer) with the LLM through a Vision-to-Text (V2T) module. However, these models still struggle to understand and extract text from images rich in textual information, such as documents, webpages, tables, and charts. The main reason is that the visual encoder and V2T module are trained on general image-text pairs and are not specifically optimized to represent the textual and structural information in text-rich images.
To enhance visual document understanding with MLLMs, prior works such as mPLUG-DocOwl, DocPedia, and UReader designed text-reading tasks to strengthen text recognition ability, but they pay little attention to structure comprehension or cover only limited domains of text-rich images, such as webpages or documents.
Researchers from Alibaba Group and Renmin University of China have introduced DocOwl 1.5, which applies Unified Structure Learning to boost the document understanding performance of MLLMs. Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across five domains: document, webpage, table, chart, and natural image. To better encode structure information, they designed a simple and effective vision-to-text module, H-Reducer, which not only preserves layout information but also reduces the length of visual features by merging horizontally adjacent patches through convolution, enabling the LLM to process high-resolution images more efficiently.
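To make the two task families concrete, the snippet below sketches what a structure-aware parsing sample and a text localization sample might look like as instruction-target pairs; the prompt wording, table notation, and bounding-box format are illustrative assumptions, not the exact DocStruct4M schema.

```python
# Illustrative Unified Structure Learning samples (formats are assumptions,
# not the released DocStruct4M schema).
parsing_sample = {
    # Structure-aware parsing: read all text while preserving layout,
    # e.g., encoding a table with Markdown-style rows.
    "instruction": "Convert the table in the image into structured text.",
    "target": "| Item | Qty | Price |\n| Pen | 2 | 1.50 |\n| Notebook | 1 | 3.20 |",
}
localization_sample = {
    # Multi-grained text localization: map a text span to its position.
    "instruction": "Locate the word <ref>Total</ref> in the image.",
    "target": "<bbox>x1, y1, x2, y2</bbox>",  # placeholder coordinates
}
```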
DocOwl 1.5 follows the typical MLLM architecture: a visual encoder, a vision-to-text module, and an LLM as the decoder.
High-resolution Image Encoding: High-resolution inputs ensure that the decoder can use the rich text information in document images. The model uses a parameter-free Shape-adaptive Cropping Module to crop a shape-variable high-resolution image into multiple fixed-size sub-images.
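A minimal sketch of such a cropping step is shown below, assuming 448×448 sub-images and a simple aspect-ratio-based grid selection; the exact grid-matching rule and resolution are assumptions, not the released implementation.

```python
from PIL import Image

def shape_adaptive_crop(img: Image.Image, cell: int = 448, max_cells: int = 9):
    """Minimal sketch of shape-adaptive cropping (grid choice is an assumption).

    Pick the grid (rows x cols) whose aspect ratio best matches the image,
    resize the image to fill that grid, then cut it into fixed-size sub-images.
    """
    w, h = img.size
    # Candidate grids with at most `max_cells` sub-images.
    grids = [(r, c) for r in range(1, max_cells + 1)
             for c in range(1, max_cells + 1) if r * c <= max_cells]
    # Choose the grid whose aspect ratio is closest to the image's.
    rows, cols = min(grids, key=lambda rc: abs((rc[1] / rc[0]) - (w / h)))
    resized = img.resize((cols * cell, rows * cell))
    crops = [resized.crop((c * cell, r * cell, (c + 1) * cell, (r + 1) * cell))
             for r in range(rows) for c in range(cols)]
    # A global low-resolution view is also commonly kept alongside the crops.
    return crops, img.resize((cell, cell))
```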
Spatial-aware Vision-to-Text Module (H-Reducer): They designed a vision-to-text module better suited to Visual Document Understanding, named H-Reducer, which reduces the visual sequence length while keeping spatial information. The H-Reducer consists of a convolution layer that shortens the sequence and a fully connected layer that projects visual features into the language embedding space. Since most text in document images is arranged from left to right, horizontal text information is usually semantically coherent. The kernel and stride sizes of the convolution layer are therefore set to 1×4, merging four horizontally adjacent visual features, and the output channel count is kept equal to the input channel count.
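The PyTorch sketch below shows one way such a module could be written; the feature and embedding dimensions are assumptions, and this illustrates the described 1×4 convolution plus projection rather than the authors' released code.

```python
import torch
import torch.nn as nn

class HReducer(nn.Module):
    """Sketch of an H-Reducer-style vision-to-text module.

    Assumptions (not from the released code): visual features arrive as a
    (batch, channels, height, width) grid from the ViT encoder, and the LLM
    hidden size is `llm_dim`.
    """
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # 1x4 convolution merges four horizontally adjacent patch features,
        # cutting the visual sequence length to 1/4 while keeping row layout.
        self.conv = nn.Conv2d(vis_dim, vis_dim, kernel_size=(1, 4), stride=(1, 4))
        # Fully connected layer projects into the LLM embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, feat_grid: torch.Tensor) -> torch.Tensor:
        x = self.conv(feat_grid)             # (B, C, H, W/4)
        x = x.flatten(2).transpose(1, 2)     # (B, H*W/4, C), row-major order
        return self.proj(x)                  # (B, H*W/4, llm_dim)

# Example: a 32x32 grid of patch features becomes 256 visual tokens.
tokens = HReducer()(torch.randn(1, 1024, 32, 32))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```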
Multimodal Modeling with the LLM as the Decoder: They apply the Modality-Adaptive Module (MAM) in the LLM to better distinguish visual and textual inputs. During self-attention, MAM uses two sets of linear projection layers to perform the key/value projections for visual features and textual features separately.
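A simplified, single-head sketch of this modality-adaptive attention follows; the shared query projection, hidden size, and boolean masking scheme are assumptions for illustration, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAMAttention(nn.Module):
    """Sketch of modality-adaptive self-attention (simplified, single head).

    Keys and values are produced by separate linear layers for visual and
    textual tokens; the query projection is shared here as an assumption.
    """
    def __init__(self, dim: int = 4096):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k_vis, self.v_vis = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k_txt, self.v_txt = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # is_visual: (B, L) boolean mask marking visual tokens in the sequence.
        q = self.q(hidden)
        mask = is_visual.unsqueeze(-1)
        # Route each token through the projection for its modality.
        k = torch.where(mask, self.k_vis(hidden), self.k_txt(hidden))
        v = torch.where(mask, self.v_vis(hidden), self.v_txt(hidden))
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v
```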
The researchers evaluated DocOwl 1.5 on ten benchmarks spanning documents, tables, charts, and webpage screenshots. DocOwl 1.5 surpasses other OCR-free models, even those with far larger parameter counts. It also outperforms CogAgent on InfoVQA and ChartQA and achieves comparable performance on DocVQA, suggesting that Unified Structure Learning with DocStruct4M is more efficient at learning printed text recognition and document analysis.
Major components that helped the researchers achieve state-of-the-art performance:
Effectiveness of H-Reducer: With the Shape-adaptive Cropping Module, the image resolution supported by the MLLM is the product of the cropping number and each crop's basic resolution. With the Abstractor as the vision-to-text module, reducing the cropping number causes an obvious performance drop on documents (r4 vs. r3). However, with a smaller cropping number, the H-Reducer outperforms the Abstractor (r5 vs. r3), showing that it is better at preserving rich text information during vision-and-language feature alignment.
Effectiveness of Unified Structure Learning: With only the structure-aware parsing tasks, there is already a significant improvement across different domains (r10 vs. r5). This validates that fine-tuning the visual encoder and H-Reducer with structure-aware parsing tasks greatly helps MLLMs understand text-rich images.
Effectiveness of Two-stage Training: With joint training, the model improves significantly on DocVQA as the number of Unified Structure Learning samples increases up to 1M. Beyond that, the improvement becomes subtle, and performance remains below that of the model trained with two-stage training. This shows that two-stage training better builds basic text recognition and structure parsing abilities and is more beneficial and efficient for downstream document understanding.
In conclusion, researchers from Alibaba Group and Renmin University of China have proposed DocOwl 1.5, trained with Unified Structure Learning across five domains of text-rich images, covering both structure-aware parsing tasks and multi-grained text localization tasks. To better maintain structure and spatial information during vision-and-language feature alignment, they designed a simple and effective vision-to-text module named H-Reducer, which mainly uses a convolution layer to aggregate horizontally neighboring visual features. DocOwl 1.5 achieves state-of-the-art OCR-free performance on ten visual document understanding benchmarks.
Check out the Paper and Github. All credit for this research goes to the researchers of this project.