Meets a summary and action item extraction with Amazon Nova

Meetings play an important role in decision-making, project coordination and collaboration, and remote meetings are common in many organizations. However, capturing and building important points from these conversations is often inefficient and inconsistent. Manually summarizing meetings and pulling out action items requires considerable effort and tends to be omitted and misleading.

Large-scale Language Models (LLMS) provide a more robust solution by transforming unstructured conference transcripts into structured summaries and action items. This feature is particularly useful for project management, customer support and sales calls, legal and compliance, and enterprise knowledge management.

This post presents benchmarks of various understanding models of the Amazon Nova family available on Amazon Bedrock, providing insights on how to choose the best model for your meeting summary task.

LLM to generate meeting insights

Modern LLMs are extremely effective in extracting summaries and action items thanks to their ability to understand context, infer topic relationships, and generate structured output. In these use cases, rapid engineering offers a more efficient and scalable approach compared to tweaking or customizing traditional models. Rather than modifying the underlying model architecture or training on a large labeled dataset, prompt engineering uses carefully crafted input queries to guide the behavior of the model, directly affecting the output format and content. This method allows for rapid, domain-specific customization without the need for a resource-intensive retraining process. For tasks such as meeting summary or action item extraction, rapid engineering allows you to accurately control the generated output and ensure that it meets specific business requirements. Flexible tuning of prompts for evolving use cases makes it an ideal solution for dynamic environments where model behavior needs to be reoriented quickly without the overhead of fine-tuning the model.

Amazon Nova Models and Amazon Bedrock

Announcing AWS Re:Invent, the Amazon Nova model is built to deliver frontier intelligence with industry-leading price performance. They are one of the fastest and most cost-effective models in their respective intelligence tiers, and are optimized for enterprise-generated AI applications in a reliable, secure and cost-effective way.

The understanding model family comes in four models: Nova Micro (text only, super efficient for using edges), Nova Lite (multimodal, balance of versatility), Nova Pro (multimodal, balance of speed and intelligence, perfect for most company needs), and Nova Premier (multimodal, the most capable Nova model for model dissimilation). The Amazon Nova model can be used for a variety of tasks, from summaries to structured text generation. Distillation of Amazon Bedrock models also allows customers to make Nova Premier intelligence a faster and more cost-effective model, such as Nova Pro or Nova Lite in use cases or domains. This can be achieved through Amazon Bedrock console and APIs such as the Converse and Inved APIs.

Solution overview

This post shows how to use the Amazon Nova understanding model available from Amazon Bedrock for automated insight extraction using rapid engineering. It focuses on two important outputs.

Meeting summary – A high-level abstract summary distilling key discussion points, decisions, and important updates from meeting transcripts
Action Items – A structured list of practical tasks derived from meetings that apply to a team or project as a whole

The following diagram illustrates the solution workflow.

Prerequisites

To follow this post, you're expected to be familiar with using Amazon Bedrock to invoke LLMS. For detailed instructions on using Amazon Bedrock for text summary tasks, see Building an AI Text Summary App with Amazon Bedrock. For more information about calling LLMS, see Using the API Invoke and Converse API Reference Documentation.

Solution Components

We developed two core features of the solution: summaries and action items extraction) by using popular models available from Amazon Bedrock. In the next section, we will look at the prompts used for these important tasks.

For meeting summary tasks, use persona assignments to encourage LLM to generate an overview.

A one-shot approach by giving LLM one example to reduce redundant opening and closing sentences, and making sure LLM consistently follows the proper format for summary generation. As part of the system prompt, we provide clear and concise rules that emphasize the correct tone, style, length and fidelity for the provided transcript.

For the Action Item Extraction task, we provided specific instructions on the generation of action items for the prompt and used a chain designed to improve the quality of the generated action items. In the assistant message, prefixes Tags are provided as prills to fine-tune model generation in the right direction and avoid redundant opening and closing statements.

It is important that different model families respond to the same prompt differently and follow the prompt guide defined in a particular model. For more information about Amazon Nova prompt best practices, see Promoting Best Practices for the Amazon Nova Understanding Model.

Dataset

Samples were used in the public QMSUM dataset to evaluate the solution. The QMSUM dataset is a benchmark for fulfilling summary with manually annotated summary of English transcripts from academic, business, and governance discussions. Evaluating LLM by generating structured and consistent summaries from complex, multi-speaker conversations, making it a valuable resource for understanding abstract summaries and discourse. For testing, we used 30 randomly sampled meetings from the QMSUM dataset. Each meeting included transcripts for each of 2-5 topics, with an average of around 8,600 tokens for each transcript.

Evaluation Framework

Achieving high quality output from LLMS in meeting summary and action item extraction can be a challenging task. Traditional evaluation metrics such as Rouge, Bleu, and Meteor focus on surface-level similarity between generated text and reference summaries, but often fail to capture nuances such as fact correctness, consistency, and behaviorality. Human ratings are gold standard, but they are expensive, time-consuming, and not scalable. You can use LLM-as-a-judge to address these challenges. This Judge allows you to use another LLM to systematically evaluate the quality of the output generated based on well-defined criteria. This approach provides a scalable and cost-effective way to automate evaluations while maintaining high accuracy. In this example, Anthropic's Claude 3.5 Sonnet V1 was used as the judge model. This is because we found it to be most in line with human judgment. We used LLM judges to obtain responses generated with three main metrics: fidelity, summary, and question answers (QA).

Fidelity scores measure the fidelity of the generated summary by measuring with a supported summary regarding the total number of statements in a summary supported by a specific context (for example, a meeting transcript).

A summary score is a concise combination of the QA score with the same weight (0.5). The QA score measures the coverage of summaries generated from the conference transcript. First, we generate a list of question-answer pairs from the meeting transcript and measure the portion of questions that are correctly asked when the summary is used as a context instead of the meeting transcript. QA scores complement the faithful scores because they do not measure coverage of the generated summary. QA scores were used only to measure the quality of the generated summaries. Because action items are not supposed to cover all aspects of the meeting transcript. A brief score measures the ratio of the length of the generated summary divided by the length of the total meeting transcript.

We used a modified version of the faithful score and a summary score with much lower latency than the original implementation.

result

The Amazon Nova model evaluation and action item extraction task across the meeting revealed clear performance latency patterns. For a summary, Nova Premier achieved the highest loyal score (1.0) with a processing time of 5.34 seconds, while Nova Pro brought 0.94 loyalty in 2.9 seconds. The small Nova Lite and Nova Micro models provided faithful scores of 0.86 and 0.83, respectively, increasing processing times of 2.13 and 1.52 seconds. In the action item extraction, Nova Premier again led the fidelity (0.83) with 4.94S processing time, followed by Nova Pro (0.8 fidelity, 2.03 seconds). Interestingly, the Nova Micro (0.7 fidelity, 1.43 s) outperformed Nova Lite (0.63 fidelity, 1.53 s) in this particular task despite its small size. These measurements provide valuable insight into the performance speed characteristics of the entire Amazon Nova model family for text processing applications. The following graph shows these results: The following screenshot shows sample output for the summary task, including a summary of the meetings generated by LLM and a list of action items.

Meeting summary results

Faithful scores for action items summary

Conclusion

In this post, we demonstrated how to use the prompt to satisfy insights such as summaries and action items using the Amazon Nova model available from Amazon Bedrock. Optimizing delay, cost, and accuracy is essential for large AI-driven meeting summaries. The Amazon Nova Family Understanding Models (Nova Micro, Nova Lite, Nova Pro, Nova Premier) offer practical alternatives to high-end models, significantly improving inference speeds while reducing operational costs. These factors make Amazon Nova an attractive option for businesses that process large volumes of conference data.

For more information about Amazon Bedrock and the latest Amazon Nova models, see the Amazon Bedrock User Guide and Amazon Nova User Guide, respectively. The AWS Generic AI Innovation Center has a group of AWS Science and Strategy experts with comprehensive expertise across the generative AI journey, helping customers prioritize use cases, build roadmap and move solutions to production. For more information about our latest work and customer success stories, see Generated AI Innovation Center.

About the author

Baishali Chaudhury I am an applied scientist at AWS Generation AI Innovation Center and focuses on driving generation AI solutions for real-world applications. She has a strong background in computer vision, machine learning, and healthcare AI. Baishali holds a PhD in Computer Science from the University of South Florida and holds a postdoc at the Moffitt Cancer Center.

Sungmin Hong He is a senior applied scientist at Amazon Generic AI Innovation Center, helping AWS customers accelerate their various use cases. Before joining Amazon, Sungmin was a postdoctoral researcher at Harvard Medical School. He holds a PhD. Computer Science at New York University. Outside of work, he takes pride in keeping his indoor plants alive for more than three years.

Mengdie (Flora) Wang I'm a data scientist at AWS Generic AI Innovation Center, working with customers to create and implement scalable, generator AI solutions that address unique business challenges. She specializes in model customization techniques and agent-based AI systems, helping organizations to make the most of the possibilities of generative AI technology. Before AWS, Flora received her Masters in Computer Science from the University of Minnesota, where she developed her expertise in machine learning and artificial intelligence.

Anila Joshi We have over 10 years of experience building AI solutions. As AWS GEO Leader at AWS Generic AI Innovation Center, Anila Pioneers AI's innovative applications accelerate the adoption of AWS services by pushing the boundaries of possibilities and helping customers to eye-catch, identify and implement secure, generate AI solutions.