Chart captions that explain complex trends and patterns help readers understand and retain the data presented. And for people with visual impairments, the information in a caption is often their only means of understanding a chart.
However, writing effective, detailed captions is a labor-intensive process. Automatic captioning techniques can ease this burden, but they often struggle to describe cognitive features that provide additional context.
To help people create high-quality chart captions, MIT researchers developed a dataset to improve automatic captioning systems. Using this tool, researchers could teach a machine learning model to vary the level of complexity and type of content included in a chart caption based on users' needs.
The researchers found that machine learning models trained for auto-captioning with their dataset consistently generated captions that were precise, semantically rich, and described data trends and complex patterns. Quantitative and qualitative analyses revealed that their models captioned charts more effectively than other auto-captioning systems.
The team’s goal is to provide the dataset, called VisText, as a tool researchers can use as they tackle the thorny problem of automatic chart captioning. These automated systems could help caption the many online charts that lack captions, improving accessibility for people with visual impairments, says co-first author Angie Boggust, a graduate student in electrical engineering and computer science at MIT and a member of the Visualization Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL).
“As we and other researchers build automated chart captioning systems, we’ve tried to embed many human values into our dataset so that we won’t end up with a model that people don’t want or need,” she says.
Boggust is joined on the paper by co-first author and fellow graduate student Benny J. Tang and senior author Arvind Satyanarayan, associate professor of computer science at MIT who leads CSAIL’s Visualization Group. The research will be presented at the Annual Meeting of the Association for Computational Linguistics.
Human-centric analysis
The researchers were inspired to develop VisText by prior work in the Visualization Group that explored what makes a good chart caption. In that study, the researchers found that sighted users and users with visual impairments had different preferences for the complexity of semantic content in a caption.
The group wanted to bring that human-centered analysis into automatic captioning research. To that end, they developed VisText, a dataset of charts and associated captions that can be used to train machine learning models to generate accurate, semantically rich, and customizable captions.
Developing an effective auto-captioning system is no easy task. Existing machine learning methods often try to caption charts the way they caption images, but people perceive and interpret natural images differently from how they read charts. Other techniques skip the visual content entirely and caption a chart using its underlying data table. However, such data tables are often unavailable after a chart is published.
Given the drawbacks of using images and data tables, VisText also represents charts as scene graphs. Scene graphs, which can be extracted from a chart image, contain all the chart’s data while also capturing additional image context.
“Scene graphs are like the best of both worlds: they contain almost all the information present in an image while being easier to extract from images than data tables, so we can take advantage of the latest advances in large language models for captioning,” explains Tang.
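To make the scene-graph idea concrete, here is a minimal sketch of what such a structure might look like. The field names and hierarchy below are illustrative assumptions, not the actual VisText schema; the point is that the structure mirrors the rendered image (axes, titles, marks) while still carrying the underlying data values.

```python
# Hypothetical scene graph for a simple bar chart. A real one would be
# extracted from the chart image; this hand-written example just shows
# how visual context (axis titles, ticks) and data coexist in one tree.
scene_graph = {
    "type": "chart",
    "children": [
        {"type": "x-axis", "title": "Year", "ticks": [2018, 2019, 2020]},
        {"type": "y-axis", "title": "Sales (units)", "ticks": [0, 50, 100]},
        {"type": "marks", "children": [
            {"type": "bar", "x": 2018, "y": 40},
            {"type": "bar", "x": 2019, "y": 65},
            {"type": "bar", "x": 2020, "y": 90},
        ]},
    ],
}

def extract_values(node):
    """Collect (x, y) pairs from mark nodes, depth-first."""
    if node.get("type") == "bar":
        return [(node["x"], node["y"])]
    return [p for c in node.get("children", []) for p in extract_values(c)]

print(extract_values(scene_graph))  # [(2018, 40), (2019, 65), (2020, 90)]
```

Unlike a flat data table, the same tree also tells a captioning model what the axes are called and how they are scaled, which is exactly the visual context a low-level caption describes.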
They compiled a dataset containing more than 12,000 charts, each represented as a data table, an image, and a scene graph, along with associated captions. Each chart has two separate captions: a low-level caption that describes the chart’s construction (such as its axis ranges) and a high-level caption that describes statistics, relationships in the data, and complex trends.
The researchers used an automated system to generate low-level captions and crowdsourced high-level captions from human workers.
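Putting the pieces above together, one example in such a dataset might look like the record below. The field names are assumptions for illustration, not the released VisText schema; what matters is that each chart carries three parallel representations plus captions at two semantic levels.

```python
# Illustrative shape of one VisText-style example (field names are
# hypothetical). Low-level captions describe construction; high-level
# captions describe statistics and trends.
example = {
    "image_path": "charts/0001.png",          # rendered chart image
    "data_table": [("2018", 40), ("2019", 65), ("2020", 90)],
    "scene_graph": "<serialized hierarchy>",  # extracted from the image
    "captions": {
        "low_level": "A bar chart titled Sales, with years 2018 to 2020 "
                     "on the x-axis and units from 0 to 100 on the y-axis.",
        "high_level": "Sales rose steadily, more than doubling between "
                      "2018 and 2020.",
    },
}

print(sorted(example["captions"]))  # ['high_level', 'low_level']
```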
“Our captions were informed by two key pieces of prior research: existing guidelines for accessible descriptions of visual media and a conceptual model from our group for categorizing semantic content. This ensures that our captions feature important low-level chart elements, like axes, scales, and units, for readers with visual disabilities, while retaining human variability in how captions can be written,” says Tang.
Chart translation
After collecting the chart images and captions, the researchers used VisText to train five machine learning models for auto-captioning. They wanted to see how each representation (image, data table, and scene graph), and combinations of those representations, affected caption quality.
“You can think of a chart captioning model like a language translation model. But instead of translating German text into English, we are translating ‘chart language’ into English,” says Boggust.
The results showed that models trained on scene graphs performed as well as or better than those trained on data tables. Because scene graphs are easier to extract from existing charts, the researchers argue they may be the more useful representation.
They also trained models with low-level and high-level captions separately. This technique, known as semantic prefix tuning, enabled them to teach the model to vary the complexity of a caption’s content.
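The prefix-tuning idea described above can be sketched very simply: the desired caption level is prepended to the model’s input as a control prefix, so a single model learns to emit either kind of caption on demand. The prefix strings and input format below are assumptions for illustration, not the paper’s exact tokens.

```python
# Minimal sketch of semantic prefix tuning as described in the article:
# one model, two caption styles, selected by a prefix on the input.
# Prefix names ("low-level", "high-level") are hypothetical.

VALID_LEVELS = ("low-level", "high-level")

def build_input(serialized_chart: str, level: str) -> str:
    """Prepend the target caption level to the serialized chart."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown caption level: {level}")
    return f"translate chart to {level} caption: {serialized_chart}"

# At training time, each example is paired with the prefix matching its
# caption; at inference time, swapping the prefix switches the output style.
low = build_input("<scene graph tokens>", "low-level")
high = build_input("<scene graph tokens>", "high-level")
print(low)
```

This is the same control-token pattern used by sequence-to-sequence models such as T5, where a short task prefix steers the decoder without any architectural change.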
Additionally, they conducted a qualitative examination of captions generated by their best-performing method and categorized six common types of error. A directional error, for example, occurs when a model says a trend is decreasing when it is actually increasing.
“This fine-grained, robust qualitative evaluation was important for understanding how the model was failing. For example, quantitative methods might penalize a directional error the same way as a repetition error, in which the model repeats the same word or phrase. But a directional error can be far more misleading to a user than a repetition error. The qualitative analysis helped us understand these kinds of subtleties,” says Boggust.
These kinds of errors also expose the limitations of current models and raise ethical considerations that researchers must weigh as they work to develop auto-captioning systems, she adds.
Generative machine learning models, such as those that power ChatGPT, have been shown to hallucinate, meaning they can produce incorrect and potentially misleading information. While there is clear benefit in using these models to auto-caption existing charts, misinformation could spread if charts are captioned incorrectly.
“Maybe this means we don’t just caption everything in sight with AI. Instead, perhaps we provide these auto-captioning systems as authorship tools for people to edit. It’s important to think about these ethical implications throughout the research process, not just at the end,” she says.
Going forward, Boggust, Tang, and their colleagues want to continue optimizing the models to reduce some common errors. They also want to expand the VisText dataset to include more charts, as well as more complex charts, such as those with stacked bars or multiple lines. And they would like to gain insight into what these auto-captioning models are actually learning about chart data.
This work was supported in part by a Google Research Scholar Award, the National Science Foundation, the MLA@CSAIL Initiative, and the United States Air Force Research Laboratory.
