Applied AI Summit Healthcare

Free online conference | April 14-15, 2026

Extracting NSCLC Diagnoses, Clinical Attributes, and Pain Scores from Unstructured Clinical Notes Using Named Entity Recognition (NER) and Agentic AI

Abstract 1: Extracting NSCLC Diagnoses, Clinical Attributes, and Pain Scores from Unstructured Clinical Notes Using Named Entity Recognition and Agentic AI
Challenge Statement: Real-world data from structured sources such as claims and electronic health records (EHR) often lack the clinical granularity required to support advanced oncology research. For instance, the absence of diagnostic codes that distinguish NSCLC patients within broader lung cancer populations significantly limits research opportunities. Moreover, structured datasets rarely capture clinical context, such as laboratory values and signs and symptoms, that is essential for assessing overall patient health. In contrast, clinical notes are rich in such information but remain largely underutilized because they are unstructured. This gap underscores the need for advanced approaches such as Natural Language Processing (NLP) and Agentic AI to extract and leverage insights from unstructured clinical text.
Proposed Solution: In this proof-of-concept study, we aim to develop Named Entity Recognition (NER) and Agentic AI models to identify NSCLC patients and extract key attributes such as histology, tumor location, stage, pain scores, and biomarker results from clinical notes. We also aim to use the biomarker data extracted by the Agentic AI models to enable personalized treatment recommendations.
Methodology: The study used Optum’s de-identified clinical notes from 2016 to 2023. The notes were analyzed using an NER model built on a Char CNN – BiLSTM – CRF architecture (character-level Convolutional Neural Networks, Bidirectional Long Short-Term Memory networks, and a Conditional Random Field). Pain scores were extracted from a representative subset of the cohort using GPT-4, and their relationship to disease stage was analyzed. Agentic AI using GPT-4 was then applied to extract biomarker results and recommend treatments based on NSCLC stage and those results. Model performance was evaluated using precision, recall, and F1-score.
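To make the evaluation step concrete, the following is a minimal sketch of entity-level precision, recall, and F1 computed from BIO-tagged sequences. It is an illustration only: the abstract does not state the tagging scheme or matching rule, so the BIO scheme, exact-span matching, the label names (STAGE, HISTOLOGY, LOCATION), and the example tags are all assumptions.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            # "O" or a stray "I-" tag: no span is open.
            start, label = None, None
    return spans


def entity_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical tags for "Stage IIIA adenocarcinoma of right upper lobe".
gold = ["B-STAGE", "I-STAGE", "B-HISTOLOGY", "O",
        "B-LOCATION", "I-LOCATION", "I-LOCATION"]
pred = ["B-STAGE", "I-STAGE", "B-HISTOLOGY", "O",
        "O", "B-LOCATION", "I-LOCATION"]

precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, f1)
```

In this toy case the predicted location span starts one token late, so under exact matching it counts as both a false positive and a false negative; a relaxed (overlap-based) matching rule would score it differently.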
Results: The NER model performed strongly in identifying NSCLC-related entities (precision 89%, recall 95%, F1-score 92%). Higher pain scores were observed in advanced-stage patients (stages 3 and 4), and biomarker results were extracted with a 95% confidence score. The Agentic AI framework was also able to suggest treatment approaches based on stage and biomarker results on a smaller subset of the data.
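As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two reported values; the function name is illustrative and not taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Precision and recall as reported in the abstract.
reported_f1 = f1_score(0.89, 0.95)
print(f"{reported_f1:.3f}")  # 0.919, which rounds to the reported 92%
```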
Potential Impact: This study demonstrates the ability of NLP and Agentic AI to unlock clinically meaningful insights from unstructured clinical notes, enabling personalized oncology research. By bridging key gaps in real-world data, this approach can support improved clinical decision-making, more precise patient stratification, and accelerated generation of high-quality real-world evidence in oncology.

About the speaker

Vikash Verma

Sr Director, AI/ML Engineering at Optum

Vikash is an AI/ML and Real-World Evidence leader with 18+ years of experience enabling regulatory-grade evidence generation for pharmaceutical and healthcare stakeholders. He leads a 120+ member multidisciplinary team of physicians, statisticians, and AI/NLP engineers focused on transforming unstructured clinical notes into high-fidelity, structured real-world datasets. His work centers on forecasting, commercial analytics, and scalable NLP frameworks for extracting oncology and complex disease attributes from EHR data, improving the completeness, traceability, and reproducibility of real-world evidence. Vikash has co-authored 150+ scientific presentations at global forums including PMSA, ISPOR, and AMCP, and actively contributes to advancing AI-driven data standards and evidence transparency. He has completed executive education in Data Science at IIM Calcutta, XLRI, and Carnegie Mellon University.