Real-world AI productivity benchmarks – Samsung Global Newsroom

A proprietary benchmark supports multilingual productivity scenarios and addresses gaps in existing AI benchmarks

Samsung Electronics today announced TrueBench (Trustworthy Real-world Usage Evaluation Benchmark), a proprietary benchmark developed by Samsung Research to evaluate AI productivity.

TrueBench provides a comprehensive set of metrics for measuring how large language models (LLMs) perform in real-world workplace productivity applications. It incorporates a variety of dialogue scenarios and multilingual conditions to ensure realistic evaluation.

Drawing on Samsung's in-house use of AI for productivity, TrueBench evaluates frequently used enterprise tasks, including content generation, data analysis, summarization, and translation, across 10 categories and 46 subcategories. The benchmark ensures reliable scoring through AI-driven automated evaluation based on criteria that are collaboratively designed and refined by both humans and AI.

“Samsung Research brings deep expertise and competitiveness through real-world AI experiences,” said Paul (Kyungwhoon) Cheun, CTO of Samsung Electronics' DX division and head of Samsung Research. “We look forward to TrueBench establishing evaluation criteria for productivity and solidifying Samsung's technical leadership.”

As businesses increasingly adopt AI for everyday tasks, demand for measuring LLM productivity has grown. However, existing benchmarks primarily measure overall performance, are mostly English-centric, and are limited to single-turn question-and-answer structures. This restricts their ability to reflect actual working environments.

To address these limitations, TrueBench comprises a total of 2,485 test sets across 10 categories and 12 languages¹, and it also supports cross-lingual scenarios. The test sets examine what AI models can actually solve. Samsung Research applies test sets ranging from as few as 8 characters to more than 20,000 characters, reflecting tasks from simple requests to long document summarization.

To assess the performance of an AI model, it is important to have clear criteria for determining whether an AI response is correct. In real-world use, not all user intent is explicitly stated in the instructions. TrueBench is designed to allow realistic assessment by considering not only the accuracy of the answers but also detailed conditions that meet the user's implicit needs.

Samsung Research verified the evaluation items through collaboration between humans and AI. First, human annotators create the evaluation criteria, and the AI then reviews them for errors, contradictions, or unnecessary constraints. The annotators revise the criteria accordingly, and the process repeats so that the criteria become progressively more precise. The AI model is then evaluated automatically against these cross-verified criteria, which minimizes subjective bias and ensures consistency. In addition, a model passes a test only if it meets every condition for that test, which allows more detailed and precise scoring across tasks.
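The all-conditions-must-pass rule described above can be illustrated with a minimal Python sketch. This is not Samsung's implementation: the names `TestCase`, `passes`, and `pass_rate` are hypothetical, and simple string predicates stand in for the AI-driven checks that TrueBench actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in: each condition is a predicate over the model response.
# In TrueBench the conditions are judged by an automated AI evaluator instead.
Condition = Callable[[str], bool]


@dataclass
class TestCase:
    prompt: str
    conditions: List[Condition]


def passes(response: str, case: TestCase) -> bool:
    """A response passes only if every condition of the test case holds."""
    return all(cond(response) for cond in case.conditions)


def pass_rate(responses: List[str], cases: List[TestCase]) -> float:
    """Fraction of test cases whose response meets all of its conditions."""
    if len(responses) != len(cases):
        raise ValueError("responses and cases must have the same length")
    if not cases:
        return 0.0
    return sum(passes(r, c) for r, c in zip(responses, cases)) / len(cases)


# Illustrative test cases (invented for this sketch, not taken from TrueBench).
case_a = TestCase(
    prompt="Summarize the memo in one sentence and mention the budget.",
    conditions=[
        lambda r: r.strip().count(".") <= 1,  # at most one sentence
        lambda r: "budget" in r.lower(),      # mentions the budget
    ],
)
case_b = TestCase(
    prompt="Translate 'hello' into French.",
    conditions=[lambda r: "bonjour" in r.lower()],
)

print(pass_rate(["The budget was approved.", "Hola"], [case_a, case_b]))
```

Under this rule, a response that satisfies only some of a test's conditions counts as a failure, so the score rewards complete rather than partial fulfillment of a request.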

TrueBench data samples and leaderboards are available on Hugging Face, the global open-source platform. The leaderboard lets users compare up to five models at once, giving an at-a-glance view of overall AI model performance. Data on average response length is also published, so performance and efficiency can be compared side by side. For more information, please visit the TrueBench Hugging Face page at https://huggingface.co/spaces/samsungresearch/truebench.

¹ Chinese, English, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Vietnamese
