Anthropic Claude 3.5 Sonnet Ranks #1 in Business and Finance in Kensho's S&P AI Benchmark

AI For Business


Anthropic Claude 3.5 Sonnet currently ranks first on Kensho's S&P AI Benchmark, which evaluates large language models (LLMs) for finance and business. Kensho is the AI innovation hub for S&P Global. Using Amazon Bedrock, Kensho was able to quickly run Anthropic Claude 3.5 Sonnet on a set of challenging business and finance tasks. In this post, we discuss these tasks and the capabilities of Anthropic Claude 3.5 Sonnet.

Limitations of LLM Assessment

To evaluate LLMs, it is common to use standardized tests such as Massive Multitask Language Understanding (MMLU, a multiple-choice test covering 57 subjects including mathematics, philosophy, and medicine) and HumanEval (a code-generation test). While these evaluations give LLM users a sense of the relative performance of different models, they have limitations. For example, questions and answers from a benchmark dataset may leak into the training data. Additionally, today's LLMs are well suited to general tasks such as question answering and code generation, but those capabilities do not necessarily transfer to domain-specific tasks. In the financial services industry, customers ask us which models to choose for generative artificial intelligence (AI) applications in the financial domain. These applications require LLMs that have the necessary domain knowledge and can reason over numerical data to calculate metrics and extract insights. We also hear from customers that LLMs ranked highly on general benchmarks do not necessarily deliver the best performance on their specific financial and business applications.

Clients often ask us whether there are LLM benchmarks specific to the finance industry that can help them choose the right LLM faster.

S&P AI Benchmark by Kensho

When Kensho's R&D Lab began researching and developing challenging datasets useful for finance and business, it quickly became apparent that the finance industry lacked such realistic assessments. To address this challenge, the Lab created the S&P AI Benchmark, which aims to become the industry standard for benchmarking models in finance and business.

“By providing a robust, independent benchmarking solution, we hope to help the financial services industry make smart decisions about which models to implement for which use cases.”

– Bhavesh Dayalji, Chief AI Officer at S&P Global and CEO of Kensho.

The S&P AI Benchmark focuses on measuring a model's ability to perform feature- and knowledge-centric tasks in three categories: domain knowledge, quantity extraction, and quantitative reasoning (for more information, see this white paper). This public resource includes a corresponding leaderboard, allowing anyone to see the performance of all state-of-the-art language models evaluated on these rigorous tasks. Anthropic Claude 3.5 Sonnet is currently ranked #1 (as of July 2024), demonstrating Anthropic's strength in the business and finance domain.

Kensho chose to test its benchmarks using Amazon Bedrock because of its ease of use and enterprise-ready security and privacy controls.

Evaluation Tasks

The S&P AI Benchmark assesses LLMs using a wide range of finance and business questions. The assessment consists of 600 questions across three categories: domain knowledge, quantity extraction, and quantitative reasoning. Each question is validated by domain experts and finance professionals with over 5 years of experience.
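The overall shape of such an evaluation can be sketched as a simple per-category scoring loop. This is a minimal illustration, not Kensho's actual harness; the sample items and the oracle `predict` function are hypothetical, and a real run would call the model (for example, via Amazon Bedrock) instead:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    category: str      # "domain knowledge", "quantity extraction", or "quantitative reasoning"
    question: str
    gold_answer: str

def score(items, predict):
    """Return accuracy per category, given a predict(question) -> answer function."""
    totals, correct = {}, {}
    for item in items:
        totals[item.category] = totals.get(item.category, 0) + 1
        if predict(item.question).strip() == item.gold_answer.strip():
            correct[item.category] = correct.get(item.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in totals.items()}

# Hypothetical sample items; a real benchmark run has 600 validated questions
items = [
    BenchmarkItem("quantity extraction",
                  "Total for the Americas in 2019 (in thousands)?", "1,296,660"),
    BenchmarkItem("quantitative reasoning",
                  "Theoretical value of the warrant?", "$1.20"),
]
oracle = {i.question: i.gold_answer for i in items}
print(score(items, lambda q: oracle[q]))  # a perfect oracle scores 1.0 in each category
```

In practice, answer matching for quantitative questions is usually more forgiving than exact string equality (for example, normalizing currency symbols and rounding), but the aggregation logic is the same.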

Quantitative Reasoning

The task is to determine whether, given a question and a long document, the model can perform complex calculations, make correct inferences, and generate accurate answers. The questions are formulated by financial experts using real data and financial knowledge, so they are closer to the kinds of questions business and finance experts would ask in their generative AI applications. For example:

Question: KT-Lew Corporation's common stock has a market price of $60 per share, and each share entitles the holder to one warrant. Four warrants are needed to purchase an additional common share at $54 per share. If the common stock is currently being sold with warrants attached, what is the theoretical value of a warrant? Answer to the nearest cent.

To answer this question, the LLM must resolve complex references to quantities and apply implicit financial background knowledge. For example, terms in the preceding question such as "warrant", "sold with warrants attached", and the $54 subscription price require financial background knowledge to interpret. To generate the answer, the LLM must also know how to calculate the theoretical value of the warrants.
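As a sketch of the arithmetic involved, the standard textbook formula for the theoretical value of a right or warrant while the stock still trades with it attached ("rights-on") is (market price − subscription price) / (number of warrants required + 1). This worked example is our illustration of that formula, not part of the benchmark itself:

```python
def theoretical_warrant_value(market_price, subscription_price, warrants_per_share):
    """Textbook 'rights-on' valuation: value = (M - S) / (N + 1),
    where the market price M still includes the attached warrant."""
    return (market_price - subscription_price) / (warrants_per_share + 1)

# KT-Lew example: $60 market price, $54 subscription price, 4 warrants per new share
value = theoretical_warrant_value(market_price=60.0, subscription_price=54.0,
                                  warrants_per_share=4)
print(f"${value:.2f}")  # $1.20
```

The model must recover all three inputs from the question text and recall the formula before it can do this one-line calculation.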

Quantity Extraction

Given a financial report, the LLM must extract relevant numerical information. Many business and financial workflows require quantity extraction with high accuracy. In the following example, to answer the question correctly, the LLM needs to understand that the rows of the table represent locations and the columns represent years, and then extract the correct quantity (the total amount) from the table for the location and year asked about.

Question: What was the total amount for the Americas in 2019 (in thousands)?

Given context: The Company’s top ten clients accounted for 42.2%, 44.2% and 46.9% of its consolidated revenues during the years ended December 31, 2019, 2018 and 2017, respectively. The following table represents a disaggregation of revenue from contracts with customers by delivery location (in thousands):

                      Year ended December 31
                      2019         2018         2017
Americas:
  United States       $614,493     $668,580     $644,870
  Philippines         250,888      231,966      241,211
  Costa Rica          127,078      127,963      132,542
  Canada              99,037       102,353      112,367
  El Salvador         81,195       81,156       75,800
  Other               123,969      118,620      118,853
Americas Total        1,296,660    1,330,638    1,325,643
EMEA:
  Germany             94,166       91,703       81,634
  Other               223,847      203,251      178,649
EMEA Total            318,013      294,954      260,283
Other Total           89           95           82
Total                 $1,614,762   $1,625,687   $1,586,008
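Once the table has been parsed into a structured form, the lookup itself is simple. The following is a minimal sketch of that final step; the parsing into a DataFrame and the row/column labels are our assumptions, and in the benchmark the model must perform the whole extraction from raw text:

```python
import pandas as pd

# Regional revenue totals, in thousands (values transcribed from the table above)
revenue = pd.DataFrame(
    {
        "2019": [1_296_660, 318_013, 89],
        "2018": [1_330_638, 294_954, 95],
        "2017": [1_325_643, 260_283, 82],
    },
    index=["Americas Total", "EMEA Total", "Other Total"],
)

def total_for(region, year):
    """Rows are locations, columns are years: select the row, then the year."""
    return revenue.loc[f"{region} Total", str(year)]

print(total_for("Americas", 2019))  # 1296660
```

The benchmark measures exactly the part this sketch skips: mapping the natural-language question ("total amount for the Americas in 2019") onto the right row and column of a table embedded in unstructured report text.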

Domain Knowledge

Models must demonstrate an understanding of business and finance terminology, practices, and formulas. The task is to answer multiple-choice questions collected from CFA practice exams and from the Business Ethics, Microeconomics, and Professional Accounting exams in the MMLU dataset. In the following example question, the LLM must understand what a fixed exchange rate system is:

Question: A fixed exchange rate system is characterized by:
A: An explicit legislative commitment to maintain a specified parity.
B: Monetary independence that is conditional on maintaining the fixed exchange rate.
C: A target level of foreign exchange reserves that is directly related to the amount of domestic currency.

Anthropic Claude 3.5 Sonnet on Amazon Bedrock

In addition to ranking first on the S&P AI Benchmark, Anthropic Claude 3.5 Sonnet delivers state-of-the-art performance across a range of tasks, including undergraduate-level expert knowledge (MMLU), graduate-level expert reasoning (GPQA), and coding (HumanEval). As noted in the announcement that Anthropic's Claude 3.5 Sonnet model is now available on Amazon Bedrock, Anthropic Claude 3.5 Sonnet is more intelligent than Anthropic Claude 3 Opus at one-fifth the cost, and delivers important improvements in visual processing and understanding, text composition and content generation, natural language processing, coding, and insight generation.

Get started with Anthropic Claude 3.5 Sonnet on Amazon Bedrock

Anthropic Claude 3.5 Sonnet is generally available on Amazon Bedrock as part of the Anthropic Claude model family. Amazon Bedrock is a fully managed service that provides quick access to foundation models from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon. It also offers a wide range of capabilities for building generative AI applications, simplifying development while supporting privacy and security. Tens of thousands of customers have already chosen Amazon Bedrock as the foundation for their generative AI strategies. Financial industry customers including Nasdaq, NYSE, Broadridge, Jefferies, and NatWest are using Amazon Bedrock to build generative AI applications.

“The Kensho team uses Amazon Bedrock to quickly evaluate models from different providers. In fact, with access to Amazon Bedrock, the team was able to benchmark Anthropic Claude 3.5 Sonnet within 24 hours.”

– Diana Mingels, Head of Machine Learning at Kensho.

Conclusion

In this post, we provided details on the S&P AI Benchmark tasks for business and finance. The benchmark shows that Anthropic Claude 3.5 Sonnet is a top performer on these tasks. To get started with this model, see the Anthropic Claude models in Amazon Bedrock. Amazon Bedrock provides a fully managed service with access to leading AI models from companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad range of capabilities for building generative AI applications. Learn more and get started today with Amazon Bedrock.


About the Authors

Lee Ching Wei is a Machine Learning Specialist at Amazon Web Services. He earned his PhD in Operations Research after he broke his advisor's research grant account and failed to deliver the Nobel Prize he promised. He currently helps customers in the financial services and insurance industries build machine learning solutions on AWS. In his spare time, he enjoys reading and teaching.

Joe Dunn is an AWS Principal Solutions Architect in Financial Services with over 20 years of experience in infrastructure architecture and migrating business-critical workloads to AWS. He helps Financial Services customers innovate on the AWS Cloud by delivering solutions using AWS products and services.

Raghuvender Arni (Arni) is part of the AWS Generative AI GTM team and leads the cross-portfolio team, a multi-disciplinary group of AI specialists dedicated to accelerating and optimizing generative AI adoption across industries.

Simon Zamarin is an AI/ML Solutions Architect with a focus on helping customers derive value from their data assets. In his spare time, he enjoys spending time with his family, reading science fiction, and working on various DIY home projects.

Scott Mullins is the Managing Director and General Manager of AWS' Worldwide Financial Services organization. In this role, Scott is responsible for AWS' relationships with systemically important financial institutions and leads the development and execution of AWS' strategic initiatives across banking, payments, capital markets, and insurance worldwide. Prior to joining AWS in 2014, Scott spent 28 years in financial services, holding positions at JPMorgan Chase, Nasdaq, Merrill Lynch, and Penson Worldwide. At Nasdaq, he was the product manager responsible for building FinQloud, the exchange's first cloud-based solution. Prior to Nasdaq, Scott worked in surveillance and trading compliance at one of the largest clearing broker-dealers in the United States, where he was responsible for regulatory response, emerging regulatory initiatives, and compliance issues related to the company's trading and execution services division. Prior to his regulatory compliance role, Scott spent 10 years as an equity trader. A graduate of Texas A&M University, Scott is a subject matter expert cited in industry media and a frequent speaker at industry events.
