Turning Health Metadata into Searchable Insight with Generative AI

Health metadata is foundational to discovery, analysis, interoperability, and decision-making across healthcare, biomedical research, and public health. In practice, however, it is often distributed across spreadsheets, semi-structured documents, and disconnected legacy systems, making it difficult to search, standardize, and operationalize. This talk highlights a generative AI project designed to improve how health metadata is extracted, analyzed, retrieved, and interpreted across fragmented data environments.

The project followed a two-stage implementation strategy. The first stage used a rapid no-code approach to support entity extraction, metadata classification, and natural-language interaction over unstructured spreadsheet-based content. In this phase, customized models extracted key entities with an F1 score of 0.86 and supported metadata classification with an average confidence of 75%. Development time was reduced to approximately 10 hours per model, and prompt-based workflows could be configured in about 30 minutes. Manual preprocessing that previously required an estimated 40 to 80 hours per dataset was reduced to near zero, significantly accelerating access to usable metadata for analysis.
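For readers less familiar with the F1 metric cited for entity extraction, the sketch below shows one common way to compute an entity-level F1 score from predicted and reference entities. It is an illustration only: the function name `entity_f1`, the exact-match scoring rule, and the sample entities are assumptions, not the project's evaluation code or data.

```python
def entity_f1(predicted, gold):
    """Return (precision, recall, F1) for exact-match entity extraction.

    Each entity is a (label, text) tuple; a prediction counts as correct
    only if both the label and the text match a reference entity.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference and predicted entities for one spreadsheet.
gold = [
    ("variable", "patient_age"),
    ("unit", "years"),
    ("dataset", "survey_2020"),
    ("variable", "zip_code"),
]
predicted = [
    ("variable", "patient_age"),
    ("unit", "years"),
    ("dataset", "survey_2020"),
    ("variable", "county"),  # incorrect extraction
]

precision, recall, f1 = entity_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a score of 0.86 implies that both missed entities and spurious extractions were relatively uncommon.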

The second stage extended these capabilities into an enterprise-grade environment, where an LLM-based application enabled prompt-driven retrieval, multi-source metadata integration, customizable visual outputs, and automated generation of standardized ontology structures. This architecture supported sub-second query latency, improved consistency across siloed metadata sources, and enabled users to generate summaries and analytical outputs through plain-language prompts rather than manual scripting or ad hoc data pulls.
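To make the idea of prompt-driven retrieval over metadata concrete, here is a minimal sketch that ranks metadata records against a plain-language query. It is a deliberately simplified stand-in: it uses word-count cosine similarity rather than an LLM or embedding model, and the record IDs, descriptions, and the `search` function are hypothetical, not part of the application described above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, records, top_k=2):
    """Return up to top_k (record id, score) pairs with a nonzero score."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(r["description"]))), r["id"])
        for r in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(rid, round(score, 3)) for score, rid in scored[:top_k] if score > 0]


# Hypothetical metadata records drawn from separate sources.
RECORDS = [
    {"id": "ds-001", "description": "Weekly influenza hospitalization counts by state"},
    {"id": "ds-002", "description": "Childhood vaccination coverage survey by county"},
    {"id": "ds-003", "description": "Laboratory test codes for respiratory pathogens"},
]

results = search("influenza hospitalization by state", RECORDS)
print(results)
```

In a production setting, the word-count vectors would typically be replaced by embeddings from a language model so that queries can match records that use different wording for the same concept.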

Technically, the talk focuses on how large language models were applied to support entity extraction, semantic retrieval, metadata interpretation, and user-facing query workflows while reducing manual effort and improving scalability. The project provides a practical framework for applying generative AI to health metadata in ways that are technically robust, operationally useful, and adaptable across clinical, biomedical, and public health contexts.

About the speaker

Anindita Nath

Independent Health AI Researcher

Accomplished interdisciplinary Data Scientist with 10+ years of research and industry experience building AI/ML solutions that turn complex data into actionable insights. Their work spans generative AI, large language models (LLMs), NLP, deep learning, and scalable data platforms, with 2+ years focused on biomedical and public health informatics and work in the federal health space. They specialize in developing reliable, measurable systems that reduce manual effort, improve analysis, and support better decisions in high-stakes environments.

A defining achievement in their recent work has been co-leading the first agentic deep-research evaluation initiative in the federal public health space. They engineered prompts and designed evaluation frameworks and metrics across multiple public health domains, helping establish an early foundation for responsible agentic AI assessment and adoption. This work earned service recognition and contributed to the development of an agentic AI adoption strategy across CDC, now publicly reflected on the CDC website.

Since August 2024, they have led AI-driven data modernization for public health surveillance and built LLM-enabled capabilities for metadata retrieval, analysis, and visualization at scale. They developed an AI application that automated analysis of large, siloed metadata and reduced manual analysis time by more than 50%. They also designed an LLM-powered natural language interface using Palantir Foundry and co-developed an Azure OpenAI assistant in Databricks for semantic metadata search. Previously, they led GENEVIC, an Azure OpenAI-enabled copilot for genomic data exploration. They also advanced dbGaP data representations for downstream ML readiness, conducted proteomics research in UK Biobank Alzheimer’s cohorts, and developed foundation model approaches for single-cell RNA sequencing and multi-omics integration.

Their work has been recognized with an Executive Director’s Citation from the American Public Health Association (APHA), and competitive honors including the Speech Prosody Student Travel Grant, the International Phonetics Association Student Award, and the Generation Google Scholar distinction. They publish and present at the intersection of health AI, bioinformatics, and speech and multimodal interaction, including first-author work for ACM venues, Bioinformatics, and AMIA. They are also committed to mentorship, STEM advocacy, and digital literacy through work with Microsoft TEALS, Microsoft TechSpark, Girls Who Code, and related initiatives that expand access to computing for women and underrepresented communities.