Google Stax aims to give developers access to AI model evaluations


Google Stax is a framework designed to replace subjective evaluation of AI models with an objective, data-driven, and reproducible process for measuring model output quality. Google says this lets AI developers tailor the evaluation process to their specific use case rather than relying on general-purpose benchmarks.

According to Google, evaluation is the key to choosing the right model for a given solution by comparing quality, latency, and cost. It is also essential for assessing how effective prompt engineering and fine-tuning efforts actually are at improving results. Another area where repeatable benchmarks are valuable is agent orchestration, where they help ensure that agents and other components work together reliably.

Stax provides data and tools to build benchmarks that combine human judgment with automated evaluators. Developers can import production-ready datasets or create their own, either by uploading existing data or by generating synthetic datasets with an LLM. Similarly, Stax includes a default suite of evaluators for common metrics such as verbosity and summarization, while allowing custom evaluators to be created for more specific or fine-grained criteria.

Custom evaluators can be created in a few steps, starting with choosing a base LLM that will act as a judge. The judge is given a prompt that tells it how to evaluate the output of the tested model. The prompt must define the categories the judge will use for grading, each associated with a numerical score between 0.0 and 1.0. Additionally, it must include instructions on the preferred response format, and it can use the variables {{output}}, {{input}}, {{history}}, {{expected_output}}, and {{metadata.key}}. To ensure the evaluator's reliability, it must be calibrated against trusted human ratings using classic supervised learning approaches. The evaluator prompt can then be refined iteratively to improve the consistency between its ratings and those of the trusted human raters.
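The steps above can be sketched in a few lines of Python. This is an illustration only and does not use or reproduce the Stax API: the prompt wording, the category names, and the helper functions render, score, and agreement are all hypothetical, and a simple agreement rate stands in for the supervised calibration the article mentions.

```python
import re

# Hypothetical rubric: each category the judge may assign maps to a
# score between 0.0 and 1.0, as the article describes.
CATEGORIES = {"poor": 0.0, "acceptable": 0.5, "excellent": 1.0}

# Hypothetical judge prompt using the variables named in the article.
JUDGE_PROMPT = """You are grading a model response for conciseness.
User input: {{input}}
Model output: {{output}}
Reference answer: {{expected_output}}
Audience: {{metadata.audience}}
Reply with exactly one category: poor, acceptable, or excellent."""

_VARIABLE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, record):
    """Fill {{name}} and {{metadata.key}} placeholders from a record dict."""
    def lookup(match):
        value = record
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError if the variable is missing
        return str(value)
    return _VARIABLE.sub(lookup, template)


def score(judge_reply):
    """Map the judge's category label to its numerical score."""
    label = judge_reply.strip().lower()
    if label not in CATEGORIES:
        raise ValueError(f"unexpected category: {judge_reply!r}")
    return CATEGORIES[label]


def agreement(judge_labels, human_labels):
    """Fraction of examples where the judge matches the trusted human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    record = {
        "input": "What is the capital of France?",
        "output": "Paris.",
        "expected_output": "Paris",
        "metadata": {"audience": "students"},
    }
    print(render(JUDGE_PROMPT, record))
    print(score("excellent"))
```

In this sketch, a low agreement value on a set of human-labeled examples would be the signal to revise the judge prompt and measure again.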

Google Stax is not the only solution available for AI model evaluation. Its competitors, which vary greatly in approach and capabilities, include OpenAI Evals, DeepEval, MLflow LLM Evaluate, and many others.

Currently, Stax supports benchmarking a growing list of model providers, including OpenAI, Anthropic, Mistral, Grok, DeepSeek, and Google itself. Additionally, it can be used with custom model endpoints. It is free to use while in beta, but Google says it may introduce a pricing model afterwards.

A final note on data privacy: Google says it does not claim ownership of user data such as prompts, custom datasets, or evaluators, and does not use that data to train its language models. However, when users work with other providers, those providers' data policies also apply.




