Towards Reliable Clinical AI: Evaluating Factuality, Robustness, and Real-World Performance of LLMs
Large language models are increasingly deployed in clinical settings, but their reliability remains uncertain: they hallucinate facts, behave inconsistently across instruction phrasings, and struggle with evolving medical terminology. In this talk, I present methods for systematically evaluating clinical LLM reliability across four dimensions aligned with how healthcare professionals actually work, including verifying concrete facts (FactEHR), ensuring stable guidance across instruction variations (instruction-sensitivity studies show large variation in model performance), and assessing LLMs on real patient conversations. These contributions establish evaluation standards spanning factuality, robustness, and patient-centered communication, charting a path toward clinical AI that is safer, more equitable, and more trustworthy.
About the speaker
Monica Munnangi
PhD Student at Northeastern University
Monica Munnangi is a Computer Science PhD candidate at Northeastern University, advised by Dr. Saiph Savage. Her doctoral work sits at the intersection of machine learning, human–AI interaction, and healthcare, with a focus on building and evaluating reliable AI systems for clinical use. Her research has appeared in leading AI and NLP venues, including NAACL, Machine Learning for Healthcare, and BioNLP at ACL. Monica has also brought her expertise to practice through research roles at the Allen Institute for AI and Stanford Medicine.