Beyond Vibes: Evaluation Strategies for Safe Multi-Turn AI Agents

Evaluating whether an AI agent completes a task is hard. Evaluating whether it does so safely is harder — and most teams have no systematic way to catch failures before users do. What breaks when you try to measure safe behavior across multi-step, non-deterministic agent workflows? Why aren’t classic eval pipelines built for this? And what practical evaluation strategies actually work? This talk tackles all three, offering concrete patterns — from trajectory-level assertions to adversarial scenario generation to safety-aware scoring rubrics — for teams shipping agents today. Drawing on experience building LLM evaluation frameworks and production safety systems at scale, we go beyond surface-level pass/fail metrics to show how you can build real, repeatable confidence that your agent is behaving as intended.

About the speaker

Eti Rastogi

Sr. Applied Scientist at Amazon

Eti Rastogi is an Applied Scientist at Amazon Web Services, where she leads the science direction for Amazon Bedrock Guardrails, the AI safety system protecting tens of thousands of enterprises globally. Her work focuses on making AI trustworthy enough for real-world deployment at scale, spanning content safety, prompt attack detection, and multimodal safety systems.

Before Amazon, Eti was the first AI engineer hired at DeepScribe, a healthcare AI startup. She built the entire AI function from the ground up: the team, the infrastructure, the training pipelines, and the core medical LLM that now powers documentation for over 3 million cancer care visits annually across thousands of oncologists in the United States. Her work on hallucination reduction in clinical notes addressed one of the most critical patient safety challenges in medical AI. At Scale AI, she built evaluation systems for frontier language models, working on the data quality and bias detection infrastructure behind models used by hundreds of millions of people.

Eti has published peer-reviewed research at ACM WSDM and NAACL, with over 180 citations. Her HealAI model is ranked on the PubMedQA leaderboard, a widely used benchmark for biomedical question answering. She holds an M.S. in Electrical and Computer Engineering from Carnegie Mellon University.