
Add Zero-Shot LLM Evidence Retrieval Pipeline (Ahsan et al. 2024) #1135

abhiseksinha-r1 wants to merge 1 commit into sunlabuiuc:master from abhiseksinha-r1:Add-EHR-LLM

Contributor: Abhisek Sinha (abhisek5@illinois.edu)
Type of Contribution: Dataset + Task + Model
Paper Reference: https://arxiv.org/abs/2309.04550

Summary

This PR implements the zero-shot LLM pipeline for EHR evidence retrieval as proposed in Ahsan et al. (2024) "Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges" (CHIL 2024, PMLR 248:489-505). The implementation follows PyHealth's modular architecture with new dataset, task, and model components.


Project Details

1. MIMIC3NoteDataset - New Dataset Class

A specialized MIMIC-III data loader optimized for NLP and evidence retrieval tasks:

  • Loads noteevents and diagnoses_icd tables by default
  • Dedicated YAML config (mimic3_note.yaml) exposing the iserror flag for filtering erroneous notes
  • Applies preprocessing to fill missing charttime values from chartdate
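The `charttime`-filling step can be sketched as below. Column names follow MIMIC-III's NOTEEVENTS table, but the exact fill rule (here, midnight of the chart date) is an assumption, not necessarily the PR's implementation:

```python
import pandas as pd

def fill_missing_charttime(notes: pd.DataFrame) -> pd.DataFrame:
    """Fill missing charttime values with midnight of chartdate."""
    notes = notes.copy()
    missing = notes["charttime"].isna()
    notes.loc[missing, "charttime"] = pd.to_datetime(
        notes.loc[missing, "chartdate"]
    )
    return notes

# Toy NOTEEVENTS-like frame: one row has a charttime, one does not.
notes = pd.DataFrame({
    "chartdate": ["2150-01-01", "2150-01-02"],
    "charttime": [pd.Timestamp("2150-01-01 13:30"), pd.NaT],
})
filled = fill_missing_charttime(notes)
```

This keeps every note usable for time-ordered concatenation even when only the coarser `chartdate` is recorded.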

2. EHREvidenceRetrievalTask - New Task Definition

Binary classification task: do a patient's clinical notes support a given query diagnosis?

  • Pairs concatenated clinical notes with a free-text query diagnosis
  • Uses ICD-9 codes as computable proxy labels for ground truth
  • Configurable note categories, max notes per sample, and custom separators
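The ICD-9 proxy-labeling idea reduces to a set intersection. A minimal sketch (function name is illustrative, not the PR's actual API):

```python
def proxy_label(patient_icd_codes, condition_icd_codes):
    """Binary proxy label: 1 if any of the patient's billed ICD-9
    codes matches the query condition's code set, else 0."""
    return int(bool(set(patient_icd_codes) & set(condition_icd_codes)))

# Codes for "small vessel disease", as in the usage example below.
condition = ["437.3", "437.30", "437.31"]
pos = proxy_label(["401.9", "437.30"], condition)  # matching code present
neg = proxy_label(["401.9", "428.0"], condition)   # no match
```

Because billing codes are noisy, these labels are a computable proxy rather than true ground truth, which is why the paper pairs them with manual evidence review.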

3. ZeroShotEvidenceLLM - New Model Implementation

Implements the two-step zero-shot prompting strategy from the paper:

  1. Classification Prompt → Determines yes/no with a token-level confidence score
  2. Summarization Prompt → Extracts supporting evidence (only when step 1 is "yes")
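The control flow of the two steps can be sketched as follows. The prompt wording and the `llm` callable are stand-ins for illustration; the real prompts come from the paper and the model wrapper below:

```python
def two_step_pipeline(llm, notes: str, diagnosis: str) -> dict:
    """Step 1 classifies; step 2 (evidence extraction) runs only on 'yes'."""
    cls_prompt = (
        f"Do the following clinical notes support a diagnosis of "
        f"{diagnosis}? Answer yes or no.\n\n{notes}"
    )
    answer = llm(cls_prompt).strip().lower()
    if not answer.startswith("yes"):
        return {"prediction": 0, "evidence": None}
    sum_prompt = (
        f"Quote the sentences from the notes that support a diagnosis "
        f"of {diagnosis}.\n\n{notes}"
    )
    return {"prediction": 1, "evidence": llm(sum_prompt)}

# Stub LLM so the sketch runs without a real model:
def fake_llm(prompt: str) -> str:
    if "Answer yes or no" in prompt:
        return "yes"
    return "patient has lacunar infarcts"

out = two_step_pipeline(fake_llm, "…notes…", "small vessel disease")
```

Gating the summarization step on a "yes" avoids prompting the model to fabricate evidence for conditions it already judged unsupported.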

Key features:

  • Supports encoder-decoder (Flan-T5) and decoder-only (Mistral) architectures
  • Confidence scoring via normalized token-level probability (AUC > 0.9 for hallucination prediction)
  • Clinical-BERT dense-retrieval baseline (use_cbert_baseline=True)
  • Batch processing for efficient inference
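One common way to turn first-token logits into the normalized yes/no confidence mentioned above is a softmax restricted to the two answer tokens. A sketch under that assumption (the paper's exact scoring may differ):

```python
import math

def yes_confidence(logit_yes: float, logit_no: float) -> float:
    """Probability mass on the 'yes' token, renormalized over
    just the yes/no pair (softmax restricted to two tokens)."""
    p_yes = math.exp(logit_yes)
    return p_yes / (p_yes + math.exp(logit_no))

conf = yes_confidence(2.0, -1.0)  # strongly favors "yes"
```

Thresholding this score is what enables abstention (ablation A2 below) and, per the paper, predicts unfaithful outputs well.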

Files Added/Modified

| File | Description |
| --- | --- |
| `pyhealth/datasets/mimic3.py` | Added `MIMIC3NoteDataset` class |
| `pyhealth/datasets/configs/mimic3_note.yaml` | New YAML config for note-focused loading |
| `pyhealth/datasets/__init__.py` | Export new dataset class |
| `pyhealth/tasks/ehr_evidence_retrieval.py` | New `EHREvidenceRetrievalTask` |
| `pyhealth/tasks/__init__.py` | Export new task class |
| `pyhealth/models/ehr_evidence_llm.py` | New `ZeroShotEvidenceLLM` model |
| `pyhealth/models/__init__.py` | Export new model class |
| `examples/clinical_tasks/mimic3_note_ehr_evidence_retrieval_llm.py` | Full working example with ablations |
| `docs/api/datasets/pyhealth.datasets.MIMIC3NoteDataset.rst` | API documentation |
| `docs/api/tasks/pyhealth.tasks.EHREvidenceRetrievalTask.rst` | API documentation |
| `docs/api/models/pyhealth.models.ZeroShotEvidenceLLM.rst` | API documentation |
| `tests/core/test_mimic3_note_dataset.py` | Unit tests for dataset |
| `tests/core/test_ehr_evidence_llm.py` | Unit tests for model |


Ablation Experiments

The example script includes four ablation experiments from the paper:

| ID | Experiment | Description |
| --- | --- | --- |
| A1 | Prompt Format | Two-step vs. single-step vs. chain-of-thought |
| A2 | Confidence Threshold | Precision/recall trade-off for abstention |
| A3 | BM25 Pre-retrieval | Reduce note length; measure recall vs. faithfulness |
| A4 | LLM Judge | Mistral-7B vs. GPT-3.5 auto-evaluator agreement |
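For A2, the abstention trade-off can be measured by discarding predictions below a confidence threshold and scoring the rest. A small sketch (illustrative helper, not the example script's code; recall here is computed over the non-abstained examples only):

```python
def precision_recall_at_threshold(preds, labels, confs, tau):
    """Abstain when confidence < tau; score the remaining predictions."""
    kept = [(p, y) for p, y, c in zip(preds, labels, confs) if c >= tau]
    tp = sum(1 for p, y in kept if p == 1 and y == 1)
    fp = sum(1 for p, y in kept if p == 1 and y == 0)
    fn = sum(1 for p, y in kept if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy predictions: the low-confidence false positive is dropped at tau=0.7.
preds = [1, 1, 1, 0]
labels = [1, 0, 1, 1]
confs = [0.9, 0.55, 0.8, 0.6]
p_lo, r_lo = precision_recall_at_threshold(preds, labels, confs, 0.5)
p_hi, r_hi = precision_recall_at_threshold(preds, labels, confs, 0.7)
```

Sweeping `tau` traces out the precision/coverage curve the ablation reports.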

Usage

Quick Demo (No MIMIC Access Required)

```bash
python examples/clinical_tasks/mimic3_note_ehr_evidence_retrieval_llm.py --demo
```

Full Pipeline with MIMIC-III

```python
from pyhealth.datasets import MIMIC3NoteDataset
from pyhealth.tasks import EHREvidenceRetrievalTask
from pyhealth.models import ZeroShotEvidenceLLM

# Load dataset
dataset = MIMIC3NoteDataset(root="/path/to/mimic-iii/1.4")

# Define task
task = EHREvidenceRetrievalTask(
    query_diagnosis="small vessel disease",
    condition_icd_codes=["437.3", "437.30", "437.31"],
)
samples = dataset.set_task(task)

# Run inference
model = ZeroShotEvidenceLLM(
    dataset=None,
    model_name="google/flan-t5-xxl",
)
result = model.inference(samples[0])
print(result)
# {'prediction': 1, 'confidence': 0.92, 'evidence': '...'}
```

Testing

```bash
# Run all new tests
pytest tests/core/test_mimic3_note_dataset.py -v
pytest tests/core/test_ehr_evidence_llm.py -v
```

Related Work

  • Paper: Ahsan et al. (2024) "Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges" CHIL 2024
  • Dataset: MIMIC-III v1.4 (PhysioNet credentialed access required)
  • Models Tested: Flan-T5-XXL, Mistral-7B-Instruct, Clinical-BERT

Checklist

  • New dataset class (MIMIC3NoteDataset)
  • New task definition (EHREvidenceRetrievalTask)
  • New model implementation (ZeroShotEvidenceLLM)
  • Clinical-BERT baseline included
  • Comprehensive example script with ablations
  • Unit tests for dataset and model
  • API documentation (RST files)
  • Demo mode for testing without MIMIC access
