This AI paper from Microsoft introduces RUBICON, a machine learning technique for evaluating domain-specific human-AI conversations.



https://www.microsoft.com/en-us/research/publication/rubicon-rubric-based-evaluation-of-domain-specific-human-ai-conversations/

Evaluating conversational AI assistants like GitHub Copilot Chat is difficult because they rely on language models and chat-based interfaces. Existing measures of conversation quality are not tailored to domain-specific interactions, making it hard for software developers to judge how effective these tools actually are. Techniques such as SPUR use large language models to analyze user satisfaction but can miss domain-specific nuances. This research focuses on automatically generating high-quality, task-aware rubrics for evaluating task-oriented conversational AI assistants and highlights the importance of context and task progression for improving evaluation accuracy.

Microsoft researchers present RUBICON, a method for evaluating domain-specific human-AI conversations using large language models. RUBICON generates candidate metrics for assessing conversation quality and selects the best-performing set. It builds on SPUR by incorporating domain-specific signals and Gricean maxims to create a pool of candidate rubrics that are iteratively evaluated. RUBICON was tested on 100 conversations between developers and a chat-based assistant for C# debugging, using GPT-4 for rubric generation and evaluation. It outperformed alternative sets of metrics, achieved high accuracy in predicting conversation quality, and demonstrated the contribution of each component through ablation studies.
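At a high level, this generate-then-select loop can be pictured as in the Python sketch below. The helper names (`llm`, `score_fn`) and the prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a generate-then-select rubric loop (hypothetical helper
# names; the actual GPT-4 prompts used by RUBICON are abstracted behind `llm`).
from typing import Callable, List, Tuple

def generate_candidate_rubrics(
    llm: Callable[[str], str],
    labeled_conversations: List[Tuple[str, bool]],  # (transcript, is_satisfactory)
    n_candidates: int = 20,
) -> List[str]:
    """Ask the LLM to propose natural-language rubrics from labeled examples."""
    rubrics = []
    for transcript, is_sat in labeled_conversations[:n_candidates]:
        label = "satisfactory" if is_sat else "unsatisfactory"
        prompt = (
            f"The following developer-assistant conversation was {label}.\n"
            f"{transcript}\n"
            "Write one concise rubric (a natural-language assertion) that "
            "explains what made it so."
        )
        rubrics.append(llm(prompt))
    return rubrics

def select_best_rubric_set(
    candidate_sets: List[List[str]],
    score_fn: Callable[[List[str]], float],  # e.g., agreement with human labels
) -> List[str]:
    """Keep the candidate rubric set that best separates good from bad conversations."""
    return max(candidate_sets, key=score_fn)
```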

Natural language conversations are central to modern AI applications, but traditional NLP metrics such as BLEU and perplexity are insufficient for evaluating long-form conversations, especially those driven by LLMs. User satisfaction is an important signal, but manual analysis of conversations is resource-intensive and privacy-invasive. Recent approaches use language models to evaluate conversation quality through natural language assertions, capturing themes of engagement and user experience. Techniques such as SPUR generate open-domain conversation rubrics but lack domain-specific context. This work argues for a holistic approach that integrates user expectations and dialogue progression, and it uses bandit-style methods to explore optimal prompt selection and improve evaluation accuracy.
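The article does not spell out the exact exploration strategy, but a generic epsilon-greedy bandit over candidate evaluation prompts, sketched below, conveys the idea; the reward function and prompt pool are assumptions rather than the paper's procedure.

```python
# Generic epsilon-greedy bandit sketch for choosing among candidate evaluation
# prompts; the reward function is assumed to measure agreement with human
# labels on a batch of conversations.
import random
from typing import Callable, List

def epsilon_greedy_prompt_selection(
    prompts: List[str],
    reward_fn: Callable[[str], float],
    rounds: int = 200,
    epsilon: float = 0.1,
) -> str:
    counts = [0] * len(prompts)
    values = [0.0] * len(prompts)
    for _ in range(rounds):
        if random.random() < epsilon:
            i = random.randrange(len(prompts))                     # explore
        else:
            i = max(range(len(prompts)), key=lambda j: values[j])  # exploit
        r = reward_fn(prompts[i])
        counts[i] += 1
        values[i] += (r - values[i]) / counts[i]                   # running mean
    return prompts[max(range(len(prompts)), key=lambda j: values[j])]
```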

RUBICON estimates the conversation quality of a domain-specific assistant by learning satisfaction (SAT) and dissatisfaction (DSAT) rubrics from labeled conversations. It involves three steps: generating diverse rubrics, selecting an optimized subset of rubrics, and scoring conversations against that subset. Rubrics are natural language assertions that capture attributes of the conversation. Each rubric is rated on a 5-point Likert scale, and the ratings are aggregated into an overall score on a [0, 10] scale that indicates whether the conversation is satisfactory. Rubric generation involves supervised extraction and summarization, while selection optimizes the precision and scope of the rubric set: the optimal subset is chosen using precision and clarity losses to ensure an effective and accurate assessment of conversation quality.
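A minimal sketch of the scoring step is shown below, assuming an LLM judge (`rate_fn`) that returns a 1-5 Likert rating for a rubric against a transcript; mapping the SAT/DSAT difference onto a [0, 10] scale this way is an illustrative assumption, not the paper's exact formula.

```python
# Hedged sketch of rubric-based scoring: each SAT and DSAT rubric is rated on
# a 1-5 Likert scale by an LLM judge (abstracted as `rate_fn`), and the
# ratings are aggregated into a NetSAT-style score on [0, 10] (assumed
# normalization, not the paper's definition).
from typing import Callable, List

def net_sat_score(
    transcript: str,
    sat_rubrics: List[str],
    dsat_rubrics: List[str],
    rate_fn: Callable[[str, str], int],  # returns a 1-5 Likert rating
) -> float:
    sat = sum(rate_fn(transcript, r) for r in sat_rubrics) / len(sat_rubrics)
    dsat = sum(rate_fn(transcript, r) for r in dsat_rubrics) / len(dsat_rubrics)
    # The SAT-minus-DSAT difference lies in [-4, 4]; rescale it to [0, 10].
    return (sat - dsat + 4.0) * 10.0 / 8.0
```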

The evaluation of RUBICON addresses three key questions: its effectiveness compared to other methods, the impact of Domain Sensitization (DS) and Conversation Design Principles (CDP), and the performance of the selection policy. Conversation data from the C# Debugger Copilot Assistant was filtered, annotated by an experienced developer, and split 50:50 between training and test sets. Performance was measured with accuracy, precision, recall, F1 score, ΔNetSAT score, and yield rate. Results show that RUBICON outperforms the baselines at separating positive from negative conversations and classifies conversations with high accuracy, underscoring the importance of the DS and CDP instructions.
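For reference, the classification metrics listed above can be computed as in the sketch below; defining ΔNetSAT as the gap between the mean NetSAT scores of positive and negative conversations is an assumption about the metric, not a definition taken from the paper.

```python
# Standard classification metrics plus an assumed ΔNetSAT definition
# (mean score of positive conversations minus mean score of negative ones).
from typing import List

def classification_metrics(y_true: List[int], y_pred: List[int]) -> dict:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def delta_net_sat(scores: List[float], labels: List[int]) -> float:
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)
```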

Internal validity is threatened by the subjective nature of the manually assigned ground-truth labels, despite high agreement between annotators. External validity is limited by the dataset's lack of diversity and its specificity to C# debugging tasks within a single software company, which may affect generalizability to other domains. Construct validity concerns include the reliance on an automated scoring system and on Likert-scale responses mapped to a [0, 10] scale; the authors plan to explore alternative ways of calculating NetSAT scores in future work. Overall, RUBICON has been successful in improving rubric quality and differentiating conversation effectiveness, proving valuable in real-world deployments.


Check out the paper for more details. All credit for this research goes to the researchers of this project.


Sana Hassan, a Consulting Intern at Marktechpost and a dual degree student at the Indian Institute of Technology Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-world solutions.
