In the fast-paced global economy, companies may use generative AI models to quickly interpret and summarize the charts found in market reports and financial documents. However, even cutting-edge vision-language models can struggle with this task, because it requires them to understand visual, numerical, and linguistic elements together. As a result, companies may still end up with inaccurate or incomplete data.
To address these challenges, researchers from MIT and the MIT-IBM Computing Research Lab have created a comprehensive resource for AI users aimed at improving vision-language models’ (VLMs) ability to interpret charts effectively. They developed a new data generation technique to create a cutting-edge dataset containing over a million diverse charts. This dataset includes numerous visual, linguistic, and numerical components, enabling models to accurately analyze chart information.
This dataset, named ChartNet, was used to train several open-source VLMs. Many of these smaller models outperformed much larger commercial models in tasks such as data extraction and chart summarization. ChartNet’s success with open-source models could help smaller companies with limited resources access AI technology more easily. The open-source dataset enhances AI models’ capabilities for tasks like analyzing business trends and interpreting scientific figures.
“ChartNet is designed as a comprehensive tool for chart understanding, meeting the needs of AI models and practitioners training those models. We aim to inspire researchers to achieve top performance with smaller models that don’t require vast computational resources,” said Jovana Kondic, an MIT graduate student and lead author of the ChartNet paper.
Kondic collaborated with co-authors from MIT, the MIT-IBM Computing Research Lab, and IBM Research, including Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Aude Oliva, and Rogerio Feris. The findings are set to be presented at the IEEE Computer Vision and Pattern Recognition Conference.
Despite progress in developing generative AI models for natural language processing and reasoning about images, interpreting complex multimodal data in charts has received less attention, Kondic noted. Yet, chart understanding is vital for businesses across various industries.
“The finance industry heavily relies on charts. If vision-language models can extract information from charts, such as trend descriptions, it streamlines many downstream workflows,” Joshi explained.
A significant obstacle in developing VLMs capable of accurately interpreting charts is the lack of high-quality training data. Existing datasets often consist of a limited number of chart images pulled from the internet, and they lack the scale and additional information needed for effective model training.
Kondic emphasized that unlike humans, vision-language models require exposure to thousands of examples during training to consistently recognize elements like line charts. To overcome these limitations, the researchers generated synthetic data that mimics the statistical properties of real data.
The ChartNet dataset comprises over a million high-quality chart images, with the code used to generate each chart, textual descriptions, and numerical data tables. Each data point also includes question-and-answer pairs to teach models how to correctly respond to questions about chart images.
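To make the structure described above concrete, here is a minimal Python sketch of what a single entry might look like. The class and field names are assumptions chosen for illustration, not ChartNet's actual schema, and the sample values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class QAPair:
    """One question about a chart and its expected answer."""
    question: str
    answer: str


@dataclass
class ChartRecord:
    """Hypothetical layout for one dataset entry (names are illustrative)."""
    image_path: str                     # rendered chart image
    plotting_code: str                  # code used to generate the chart
    description: str                    # textual description of the chart
    data_table: list[dict[str, object]]  # numerical data behind the chart
    qa_pairs: list[QAPair] = field(default_factory=list)


# Made-up example entry; none of these values come from the real dataset.
record = ChartRecord(
    image_path="charts/example_0001.png",
    plotting_code="plt.bar(['Q1', 'Q2'], [10.0, 12.5])",
    description="Bar chart of two quarterly values; Q2 is higher than Q1.",
    data_table=[
        {"quarter": "Q1", "value": 10.0},
        {"quarter": "Q2", "value": 12.5},
    ],
    qa_pairs=[QAPair("Which quarter has the higher value?", "Q2")],
)
```

Keeping the image, code, description, table, and questions in one record is what allows a model to be trained on several tasks from the same chart.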
“These additional data modes help the model connect and align the different pieces of information encoded in the chart image,” Kondic said.
The creation of ChartNet involved a two-step synthetic data generation process. The researchers’ automated system first translated existing chart images into code, then iteratively modified the code to alter various chart aspects, such as type, data values, topic, and colors.
“Starting with a single chart as a seed, we created hundreds of variations, enabling us to build a dataset with over a million diverse images,” Kondic explained. An automated quality check then filtered the synthetic data, verifying that each chart’s code executes and that the resulting image is clean and accurate.
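The vary-then-filter idea can be illustrated with a short Python sketch. This is not the researchers' pipeline: the function names, the dictionary-based chart "code", and the specific checks are all assumptions. It also only checks that the code runs and that the data is consistent, and does not render or inspect any image.

```python
import itertools


def make_variations(labels, values):
    """Yield code strings that vary chart type, color, and data scale.

    Hypothetical stand-in for the code-modification step: each string is a
    small Python snippet that builds a chart specification dictionary.
    """
    chart_types = ["bar", "line", "scatter"]
    colors = ["tab:blue", "tab:orange"]
    scales = [1.0, 2.0]
    for chart_type, color, scale in itertools.product(chart_types, colors, scales):
        scaled = [v * scale for v in values]
        yield (
            f"chart = {{'type': {chart_type!r}, 'color': {color!r}, "
            f"'labels': {labels!r}, 'values': {scaled!r}}}\n"
        )


def passes_quality_check(code):
    """Return True if the code executes and yields a consistent chart spec."""
    namespace = {}
    try:
        # Generated code should be run in a sandbox in any real system.
        exec(compile(code, "<chart>", "exec"), namespace)
    except Exception:
        return False
    chart = namespace.get("chart")
    if not isinstance(chart, dict):
        return False
    labels = chart.get("labels", [])
    values = chart.get("values", [])
    return len(values) > 0 and len(labels) == len(values)


# One seed chart (made-up data) expanded into several candidate variations.
candidates = list(make_variations(["Q1", "Q2", "Q3"], [10.0, 12.5, 9.0]))
# A deliberately broken candidate to show the filter rejecting bad code.
candidates.append("chart = {'type': 'bar', 'labels': ['Q1'], 'values': [")
kept = [code for code in candidates if passes_quality_check(code)]
print(len(candidates), len(kept))
```

Filtering by execution is a cheap first gate: anything that fails to run is dropped before any more expensive check on the rendered image.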
“We aim for meaningful presentation of information, not just diversity in samples,” Kondic added. ChartNet also features chart data points annotated by human experts, offering access to additional charts and data with validity guarantees.
Joshi mentioned that practitioners could use the annotated data to fine-tune existing VLMs, enhancing specific application performance. The researchers tested ChartNet by training IBM’s Granite Vision models and other open-source models, which improved accuracy in chart reconstruction, data extraction, summarization, and question answering.
With ChartNet, smaller open-source models consistently surpassed much larger commercial models. “Previous datasets focused on simple chart questions. We expanded ChartNet to support all aspects of comprehensive chart understanding,” Kondic stated.
Looking ahead, the researchers plan to expand ChartNet by incorporating data with additional levels of complexity, and they intend to seek feedback from the research community. The MIT-IBM Computing Research Lab partially funded this research.
Original Source: news.mit.edu
