Applied AI Summit
Free online conference | October 14-16, 2025

A Framework for Comparative Evaluation of Large Language Models in Conversational Agents
This talk presents a robust framework for evaluating the performance of Large Language Model (LLM)-powered conversational agents by analyzing and replaying historical interaction data. The methodology involves capturing real-world user-agent conversation traces, transforming them into a standardized format, and systematically re-executing key LLM-driven decision steps—specifically topic selection and tool invocation—with different models. Performance is assessed through a suite of granular metrics, including Topic Agreement with the original interaction, Function Name Matching, and Function Schema Conformance. To evaluate the semantic quality of generated outputs, the framework employs an LLM-as-a-judge to score the contextual accuracy of function arguments and the relevance of free-text responses against the original outcomes. We provide a comprehensive comparative analysis of multiple leading LLMs (including models from OpenAI, Anthropic, and Google) across several specialized agent roles. The results demonstrate the framework’s effectiveness in providing quantitative benchmarks for regression testing, characterizing model-specific behaviors, and enabling data-driven decision-making for agent development and deployment.
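To make the methodology concrete, here is a minimal sketch of the replay-and-score loop in Python. It is purely illustrative: the names (ReplayStep, call_model, judge_score, schema_registry) and the simplified schema check are assumptions for the sketch, not the framework's actual implementation.

    from dataclasses import dataclass

    @dataclass
    class ReplayStep:
        context: str        # conversation history up to this decision point
        orig_topic: str     # topic the original agent selected
        orig_function: str  # function the original agent invoked
        orig_args: dict     # arguments the original agent passed

    def conforms(args, schema):
        # Minimal JSON-schema-style check: required keys present, no unexpected keys.
        if schema is None:
            return False
        required = set(schema.get("required", []))
        allowed = set(schema.get("properties", {}))
        return required <= set(args) and set(args) <= allowed

    def evaluate(steps, call_model, schema_registry, judge_score):
        # Replay each historical decision step with a candidate model and
        # aggregate the metrics named in the abstract.
        n = len(steps)
        topic_agree = name_match = schema_ok = 0
        arg_scores = []
        for step in steps:
            # call_model is a hypothetical adapter: given the replayed context,
            # it returns the candidate model's (topic, function name, arguments).
            topic, fn, args = call_model(step.context)
            topic_agree += topic == step.orig_topic               # Topic Agreement
            name_match += fn == step.orig_function                # Function Name Matching
            schema_ok += conforms(args, schema_registry.get(fn))  # Function Schema Conformance
            # judge_score stands in for an LLM-as-a-judge call that rates the
            # contextual accuracy of the arguments against the original outcome.
            arg_scores.append(judge_score(step.context, step.orig_args, args))
        return {
            "topic_agreement": topic_agree / n,
            "function_name_match": name_match / n,
            "schema_conformance": schema_ok / n,
            "judged_argument_accuracy": sum(arg_scores) / n,
        }

Each replayed step is scored against the original trace, so a candidate model's aggregate metrics can be compared side by side across models or against an earlier version of the same agent.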
About the speaker
Monojit Banerjee
Lead at Salesforce
Monojit Banerjee is a technology leader who builds AI platforms, including the first LLM Benchmark for CRM, setting new standards for trustworthy enterprise AI. He has authored a book chapter on scalable ChatGPT solutions, published research on distributed systems and AI DevOps, and built open-source tools like EinsteinPlayground. An IEEE Senior Member and active ADPList mentor, Monojit has been featured as a thought leader by CNCF, InfoWorld, and other leading industry platforms.