Applied AI Summit
Free online conference | October 14-16, 2025

Principles of AI Agent Evaluation
The rapid adoption of AI agents across diverse applications highlights the critical need for robust evaluation frameworks. Drawing on hands-on experience building more than 30 agents, this talk explores principles of AI agent evaluation that balance rigor, practicality, and adaptability. Traditional machine learning and recommendation-system metrics, such as precision, recall, and ranking quality, remain essential benchmarks and offer a foundation for measuring agent performance. Yet in agent-driven systems, content generation quality emerges as the top priority, calling for human-centered measures of coherence, creativity, and utility. A central challenge lies in generating and curating a golden dataset that is both representative and scalable, enabling reproducible and meaningful evaluation. Finally, we present strategies for building iterative evaluation loops, including the design of “evaluation agents” that autonomously test, critique, and refine AI outputs. Together, these principles offer a roadmap for practitioners building trustworthy, high-performing AI agents that continuously improve in real-world settings.
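To make the loop concrete, here is a minimal Python sketch of an iterative evaluation loop over a golden dataset, combining set-based precision and recall with a critique step from an “evaluation agent”. The run_agent and critique_agent callables, the GoldenExample fields, and the 0.9 quality thresholds are hypothetical stand-ins chosen for illustration, not the framework presented in the talk.

from dataclasses import dataclass

@dataclass
class GoldenExample:
    """One entry in a curated golden dataset: a prompt plus the
    facts a good answer should contain."""
    prompt: str
    expected_facts: set[str]

def precision_recall(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Classic set-based precision and recall, the traditional
    ML/retrieval baseline mentioned in the abstract."""
    if not predicted:
        return 0.0, 0.0
    hits = len(predicted & expected)
    precision = hits / len(predicted)
    recall = hits / len(expected) if expected else 0.0
    return precision, recall

def evaluation_loop(run_agent, critique_agent, golden_set, max_rounds=3):
    """Score each output, have an evaluation agent critique it, and
    feed the critique back to the agent under test for a revision."""
    for example in golden_set:
        output = run_agent(example.prompt)
        for _ in range(max_rounds):
            precision, recall = precision_recall(output, example.expected_facts)
            if precision >= 0.9 and recall >= 0.9:  # hypothetical "good enough" bar
                break
            critique = critique_agent(example.prompt, output, example.expected_facts)
            output = run_agent(example.prompt, critique=critique)
        yield example.prompt, precision, recall

if __name__ == "__main__":
    golden = [GoldenExample("Name two HTTP verbs.", {"GET", "POST"})]

    # Stub callables standing in for real LLM-backed agents.
    def stub_agent(prompt, critique=None):
        return {"GET", "DELETE"} if critique is None else {"GET", "POST"}

    def stub_critic(prompt, got, want):
        return f"Missing facts: {want - got}; extraneous: {got - want}"

    for prompt, p, r in evaluation_loop(stub_agent, stub_critic, golden):
        print(f"{prompt!r}: precision={p:.2f}, recall={r:.2f}")

In a real system the stubs would be replaced with model calls, and the set-based fact comparison with whatever scoring the golden dataset supports, but the shape of the loop (score, critique, revise, re-score) is the point of the sketch.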
About the speaker
Harsh Nilesh Pathak
Tech Lead ML/AI at GoDaddy
Harsh is an Applied AI researcher and tech lead at GoDaddy, driving enterprise-ready generative AI solutions. With over 10 years of ML/DL experience, he has developed several recommendation systems and AI workflows, including both LLM and agentic applications. He has also published 25+ papers and built projects that generate millions in enterprise value.