Researchers teach AI to write better chart captions | Massachusetts Institute of Technology News

Chart captions that explain complex trends and patterns are important for improving a reader’s ability to understand and retain the data being presented. For people with visual disabilities, the information in a caption is often their only means of understanding a chart.

However, writing effective, detailed captions is a labor-intensive process. Automatic captioning techniques can alleviate this burden, but they often struggle to describe the cognitive features that provide additional context.

To help people create high-quality chart captions, MIT researchers developed a dataset to improve automatic captioning systems. With this tool, researchers can teach machine learning models to vary the level of complexity and the type of content included in a chart caption based on users’ needs.

The MIT researchers found that machine learning models trained for automatic captioning with their dataset consistently produced captions that were accurate and semantically rich, and that described trends and complex patterns in the data. Quantitative and qualitative analyses revealed that their models captioned charts more effectively than other automated captioning systems.

The team’s goal is to provide the dataset, called VisText, as a tool that researchers can use to tackle the thorny problem of automatic chart captioning. These automated systems could help provide captions for online charts that lack them, improving accessibility for people with visual impairments, says co-first author Angie Boggust, a graduate student in electrical engineering and computer science at MIT and a member of the Visualization Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

“We’ve tried to embed many human values into our dataset so that, as we and other researchers build automated chart captioning systems, we don’t end up with models that are not what people want or need,” she says.

Boggust is joined by co-first author and fellow graduate student Benny J. Tang and senior author Arvind Satyanarayan, an associate professor of computer science at MIT who leads CSAIL’s Visualization Group. The research will be presented at the Annual Meeting of the Association for Computational Linguistics.

Human-centric analysis

The researchers were inspired to develop VisText by previous work in the Visualization Group that explored what makes a good chart caption. In that study, researchers found that sighted users and visually impaired users had different preferences for the complexity of semantic content in a caption.

The group wanted to bring that human-centered analysis into automatic captioning research. To that end, they developed VisText, a dataset of charts and associated captions that can be used to train machine learning models to generate accurate, semantically rich, and customizable captions.

Developing an effective automated captioning system is no easy task. Existing machine learning methods often attempt to caption charts in the same way that they caption images, but people and models interpret natural images differently from how we read charts. Other techniques skip the visual content entirely and caption a chart using its underlying data table. However, such data tables are often unavailable after the chart is published.

Due to the drawbacks of using images and data tables, VisText also represents charts as scene graphs. Scene graphs, which can be extracted from chart images, include all chart data, but also include additional image context.

“Scene graphs are kind of the best of both worlds: they contain almost all the information that exists in an image, while at the same time being easier to extract from an image than a data table. So we can take advantage of the advances in modern large language models for captioning,” explains Tang.
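One way to picture why a scene graph suits a language model is that a nested structure can be flattened into a single text string. The sketch below is only an illustration under assumptions: the `scene_graph` dictionary, its keys, and the `linearize` helper are hypothetical and much simpler than anything in VisText.

```python
# Hypothetical, heavily simplified scene graph for a two-bar chart.
# The node types and keys are invented for illustration only.
scene_graph = {
    "type": "chart",
    "children": [
        {"type": "x-axis", "title": "Year", "ticks": ["2020", "2021"]},
        {"type": "y-axis", "title": "Sales", "ticks": ["0", "50", "100"]},
        {"type": "mark", "shape": "bar", "x": "2020", "y": "40"},
        {"type": "mark", "shape": "bar", "x": "2021", "y": "90"},
    ],
}


def linearize(node: dict) -> str:
    """Flatten a nested node depth-first into one text string."""
    parts = [node["type"]]
    for key, value in node.items():
        if key in ("type", "children"):
            continue
        if isinstance(value, list):
            value = " ".join(value)
        parts.append(f"{key} {value}")
    # Recurse into child nodes after the node's own attributes.
    for child in node.get("children", []):
        parts.append(linearize(child))
    return " | ".join(parts)


text = linearize(scene_graph)
print(text)
```

The resulting string keeps both the data values (the bar heights) and the visual context (axis titles and ticks), which is the combination the researchers describe.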

They compiled a dataset containing more than 12,000 charts (each represented as a data table, an image, and a scene graph) and their associated captions. Each chart has two separate captions: a low-level caption that describes the structure of the chart (such as its axis ranges) and a high-level caption that describes statistics, relationships within the data, and complex trends.

The researchers used an automated system to generate low-level captions and crowdsourced high-level captions from human workers.

“Our captions were informed by two important prior studies: existing guidelines for accessible descriptions of visual media and our group’s conceptual model for classifying semantic content. This allowed us to feature important low-level chart elements such as axes, tick marks, and units in captions for visually impaired readers, while maintaining human variability in caption writing styles,” says Tang.

Chart translation

After collecting the chart images and captions, the researchers used VisText to train five machine learning models for automatic captioning. They wanted to see how each representation (image, data table, and scene graph) and combinations of those representations affected caption quality.

“You can think of a chart captioning model in the same way as a model for language translation,” says Boggust.

The results show that models trained on scene graphs perform as well as or better than models trained on data tables. Because scene graphs are easier to extract from existing charts, the researchers argue that they may be the more useful representation.

They also trained models with low-level and high-level captions separately. This technique, known as semantic prefix tuning, allowed them to teach the model to vary the complexity of a caption’s content.
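A rough sketch of how such a setup might look: each chart yields two training pairs, and a short prefix on the input tells the model which level of caption to produce. The `ChartExample` class, the prefix strings, and the `make_training_pairs` helper are assumptions made for illustration, not the researchers’ actual code or prefixes.

```python
from dataclasses import dataclass


@dataclass
class ChartExample:
    chart_text: str   # textual chart representation, e.g. a flattened scene graph
    low_level: str    # caption describing the chart's structure
    high_level: str   # caption describing statistics and trends


# Hypothetical prefixes; the exact wording used for VisText is not given here.
PREFIXES = {
    "low": "caption chart (low-level): ",
    "high": "caption chart (high-level): ",
}


def make_training_pairs(example: ChartExample) -> list[tuple[str, str]]:
    """Return (input, target) pairs, one per semantic level."""
    return [
        (PREFIXES["low"] + example.chart_text, example.low_level),
        (PREFIXES["high"] + example.chart_text, example.high_level),
    ]


example = ChartExample(
    chart_text="chart | x-axis | title Year | mark | x 2020 | y 40",
    low_level="A bar chart with Year on the x-axis.",
    high_level="Sales rose sharply between 2020 and 2021.",
)
pairs = make_training_pairs(example)
for source, target in pairs:
    print(source, "->", target)
```

Under this assumption, a user would choose the prefix at generation time to request either a structural caption or a trend-focused one from the same model.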

Additionally, they conducted a qualitative examination of the captions generated by their best-performing model and categorized six types of common errors. For example, a directional error occurs when the model says a trend is decreasing when it is in fact increasing.

This fine-grained, robust qualitative assessment was critical to understanding how the model was making its errors. For example, under quantitative methods, a directional error might incur the same penalty as a repetition error, in which the model repeats the same word or phrase. However, a directional error can be far more misleading to users than a repetition error. The qualitative analysis helped the researchers understand these kinds of subtleties, says Boggust.
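The point about equal penalties can be illustrated with a toy word-overlap score. This is a minimal sketch, not one of the metrics the researchers used: the `unigram_f1` function and the example captions are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy overlap score: F1 over clipped word counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the trend is increasing over time"
directional_error = "the trend is decreasing over time"   # wrong direction
repetition_error = "the trend is increasing over over"    # repeated word

directional_score = unigram_f1(directional_error, reference)
repetition_score = unigram_f1(repetition_error, reference)
print(directional_score, repetition_score)
```

Both captions differ from the reference by one word, so this toy score rates them identically, even though only the first one tells the reader the opposite of what the chart shows.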

These kinds of errors also expose the limitations of current models and raise ethical considerations that researchers must take into account as they work to develop automated captioning systems, she adds.

Generative machine learning models, like those powering ChatGPT, have been shown to hallucinate, producing incorrect information that can be misleading. While there are obvious advantages to using these models to auto-caption existing charts, misinformation could spread if charts are captioned incorrectly.

“Perhaps this means that we don’t just use AI to caption everything we see. Instead, perhaps we provide these automated captioning systems as authorship tools for people to edit. It’s important to think about these ethical implications throughout the research process, not just in the final stages of model deployment,” she says.

Boggust, Tang, and their colleagues would like to continue optimizing the models to reduce some common errors. They also want to extend the VisText dataset to include more charts and more complex charts, such as those with stacked bars or multiple lines. In addition, they want to gain insight into what these automatic captioning models are actually learning about chart data.

This work was supported in part by a Google Research Scholar Award, the National Science Foundation, the MLA@CSAIL Initiative, and the United States Air Force Research Laboratory.
