Advisor: Hu, Xia (Ben)
Date available: 2025-01-17
Date issued: 2024-12
Date submitted: 2024-12-06
Degree date: December 2024
URI: https://hdl.handle.net/1911/118242

Abstract:
Medical RAG systems and long-context models like Med-PaLM face distinct yet interconnected challenges in processing complex medical information. While RAG systems struggle with hallucinations caused by noisy retrievals and incomplete fact verification, long-context models, despite their ability to process extended inputs, suffer from attention dilution and poor context retention. Current BioNLP RAG systems have particularly overlooked the critical balance between retrieved context and parametric knowledge, which often leads to hallucinations from over-reliance on retrieved information. Our HALO-MMedRAG framework addresses these challenges through a comprehensive multi-agent architecture. The system's effectiveness stems from seven innovative components: a query-intent parser agent, a multi-query generation agent, a coarse retrieval agent, a fine-grained hallucination-aware retrieval agent with perplexity- and NLL-based hybrid scoring of retrieved chunks, a generation agent, a lightweight fact-verification agent, and an orchestrator agent that manages a chain-of-thought (CoT) reasoning debate among three agents to produce a final, hallucination-free response grounded in factuality.

Retrieval-augmented generation for multimodal medical LLMs has not been given due consideration through the lens of hallucination mitigation. Further, existing approaches cover only a narrow range of medical domains, often restricted to X-ray imaging. When used for multimodal retrieval-augmented generation, medical multimodal LLMs face critical challenges in maintaining factual accuracy while integrating complex visual and textual information. Our approach addresses these challenges through a unified Triple Preference Optimization framework with three-stage preference dataset curation, focusing on cross-modal alignment, retrieval balance, and a dual-staged visual feedback agent. Unlike existing solutions, our method employs a single-step optimization process that simultaneously handles multiple aspects of alignment while maintaining computational efficiency. Through careful curation of preference datasets that capture different levels of alignment quality, combined with a visual feedback agent that supplies precise visual grounding as visual prompts to the vision-language model, our approach significantly reduces hallucinations and improves the accuracy of medical responses. Extensive evaluation across diverse medical domains, including radiology, ophthalmology, pathology, magnetic resonance imaging, and CT scans, demonstrates superior performance over existing multimodal medical RAG methods, making our solution, Align-MedRAG-VL, both practical and reliable for real-world medical applications where hallucination mitigation is paramount.

Format: application/pdf
Language: en
Keywords: Medical RAG; Hallucination Mitigation; Multi-Agent Orchestration; Visual Feedback; LLMs; VLMs; Preference Alignment
Title: Reliable Medical LLM and Vision-Language RAG through Multi-Agent Orchestration and Single-Step Preference Alignment
Type: Thesis
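The abstract gives no implementation details for the perplexity- and NLL-based hybrid chunk scoring, so the following is only a minimal sketch of one plausible reading: score each retrieved chunk by interpolating a chunk-conditioned NLL of the query (a relevance proxy) with the chunk's own log-perplexity (a fluency filter for noisy retrievals). The scoring model, `alpha`, and all helper names are hypothetical, not the thesis's actual method.

```python
# Hypothetical sketch of a perplexity/NLL hybrid chunk scorer; the actual
# HALO-MMedRAG scoring function is not public. Token-boundary effects at the
# context/continuation seam are ignored for brevity.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in scoring LM
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def mean_nll(context: str, continuation: str) -> float:
    """Mean negative log-likelihood of `continuation` tokens given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100              # score only the continuation
    return lm(full_ids, labels=labels).loss.item()

def hybrid_chunk_score(query: str, chunk: str, alpha: float = 0.7) -> float:
    """Lower is better: alpha * relevance NLL + (1 - alpha) * log-perplexity.

    Relevance: NLL of the query conditioned on the chunk (does the chunk
    "explain" the query?). Fluency: the chunk's standalone perplexity, kept
    explicit here even though log(exp(NLL)) == NLL, to mirror the hybrid.
    """
    relevance_nll = mean_nll(chunk + "\n", query)
    chunk_ppl = math.exp(mean_nll("", chunk))         # perplexity = exp(mean NLL)
    return alpha * relevance_nll + (1 - alpha) * math.log(chunk_ppl)
```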
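Similarly, the orchestrator's CoT reasoning debate among three agents is named but not specified. Below is a generic sketch of how such a loop could be wired; the agent roles, prompts, round count, and injected `ask` callable are all assumptions, and the LLM backend is deliberately left abstract so any chat-completion client can be passed in.

```python
# Illustrative three-agent CoT debate loop under an orchestrator; every role
# name and prompt here is a hypothetical stand-in, not the thesis's design.
from typing import Callable

Ask = Callable[[str, str], str]  # (agent role, prompt) -> agent reply

def debate(ask: Ask, question: str, evidence: list[str], rounds: int = 2) -> str:
    """Run a multi-round CoT debate among role-conditioned agents, then have
    the orchestrator synthesize one evidence-grounded final answer."""
    ctx = "\n".join(evidence)
    roles = ("clinician", "skeptic", "verifier")
    drafts = {
        r: ask(r, f"Question: {question}\nEvidence:\n{ctx}\n"
                  "Reason step by step, then state your answer.")
        for r in roles
    }
    for _ in range(rounds):
        transcript = "\n\n".join(f"[{r}]\n{d}" for r, d in drafts.items())
        drafts = {
            r: ask(r, f"{transcript}\n\nRevise your answer; keep only claims "
                      "supported by the evidence above.")
            for r in roles
        }
    return ask("orchestrator",
               "Synthesize one final, evidence-grounded answer from:\n\n"
               + "\n\n".join(f"[{r}]\n{d}" for r, d in drafts.items()))
```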
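For the second contribution, the abstract describes a single-step Triple Preference Optimization over curated preference levels. A minimal sketch of one way to read that: a DPO-style objective extended from a preference pair to a ranked triple (best > middle > worst), summing the two adjacent-pair terms so all three alignment levels are optimized in one step. The adjacent-pair decomposition and `beta` are assumptions, not the thesis's actual loss.

```python
# Hypothetical single-step ranked-triple preference loss (DPO-style).
import torch
import torch.nn.functional as F

def triple_preference_loss(
    policy_logps: torch.Tensor,  # (batch, 3) log pi_theta(y|x), ranked best->worst
    ref_logps: torch.Tensor,     # (batch, 3) same responses under frozen reference
    beta: float = 0.1,
) -> torch.Tensor:
    """Sum of sigmoid-ranking terms over the two adjacent pairs in the triple."""
    ratios = beta * (policy_logps - ref_logps)        # DPO implicit rewards
    best, mid, worst = ratios[:, 0], ratios[:, 1], ratios[:, 2]
    loss = -(F.logsigmoid(best - mid) + F.logsigmoid(mid - worst))
    return loss.mean()
```

One property worth noting about this reading: because both ranking terms share one forward pass over the three responses, the triple is handled in a single optimization step rather than two sequential pairwise stages, which matches the abstract's claim of computational efficiency.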