Applied AI Summit

Free online conference | October 14-16, 2025

Evaluating AI Scribes: Frameworks for Safe and Reliable Summarization [Keynote]

Testing summarization performed by LLMs is a complex task, and it becomes even more complex when the source data is temporally non-linear. This discussion explores the challenges and opportunities in test design and innovation that can lead to greater success in Responsible AI. As a key example, ambient AI scribes pose one of the most complex summarization challenges and also one of the highest-risk, with lives potentially on the line.

Ambient AI scribes are emerging as a transformative application of large language models (LLMs), automatically converting doctor–patient conversations into structured clinical documentation and prose summaries. While these systems promise to reduce administrative burden and enhance clinician focus on patient care, they also raise critical safety concerns. In healthcare, even subtle transcription or summarization errors, such as misstating a medication, adding a fabricated symptom, or omitting vital history, can have direct consequences for treatment decisions, billing compliance, and patient safety.

This lecture explores frameworks, methods, and challenges in testing LLM-based summarization systems, with a specific emphasis on clinical accuracy. We will examine why conventional summarization metrics fall short, present approaches for clinician-in-the-loop and structured fact-based evaluations, and highlight an error taxonomy specific to clinical notes. We will also discuss opportunities for improvement, including hybrid evaluation pipelines, continuous monitoring, and alignment with regulatory expectations. A concrete case study, showing how a small model error could lead to a clinically dangerous misrepresentation, will illustrate the stakes. The session concludes by framing testing not only as a technical necessity but also as a patient safety imperative.

Many of the lessons from testing ambient AI scribes can be applied to other forms of LLM-driven summarization.

About the speaker

David Rivkin

CTO at GovernanceLabs.ai and Sr. Director at the Responsible AI Center of Excellence at UnitedHealth Group