
Add Zero-Shot LLM Evidence Retrieval Pipeline (Ahsan et al. 2024) #1135

abhiseksinha-r1 wants to merge 1 commit into sunlabuiuc:master from abhiseksinha-r1:Add-EHR-LLM

Contributor: Abhisek Sinha (abhisek5@illinois.edu)
Type of Contribution: Dataset + Task + Model
Paper Reference: https://arxiv.org/abs/2309.04550

Summary

This PR implements the zero-shot LLM pipeline for EHR evidence retrieval as proposed in Ahsan et al. (2024) "Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges" (CHIL 2024, PMLR 248:489-505). The implementation follows PyHealth's modular architecture with new dataset, task, and model components.


Project Details

1. MIMIC3NoteDataset - New Dataset Class

A specialized MIMIC-III data loader optimized for NLP and evidence retrieval tasks:

  • Loads noteevents and diagnoses_icd tables by default
  • Dedicated YAML config (mimic3_note.yaml) exposing the iserror flag for filtering erroneous notes
  • Applies preprocessing to fill missing charttime values from chartdate
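The `charttime`-filling step can be sketched as below. Column names follow MIMIC-III's NOTEEVENTS table, but the exact fill rule (here, midnight of the chart date) is an assumption, not necessarily the PR's implementation:

```python
import pandas as pd

def fill_missing_charttime(notes: pd.DataFrame) -> pd.DataFrame:
    """Fill missing charttime values with midnight of chartdate."""
    notes = notes.copy()
    missing = notes["charttime"].isna()
    notes.loc[missing, "charttime"] = pd.to_datetime(
        notes.loc[missing, "chartdate"]
    )
    return notes

# Toy NOTEEVENTS-like frame: one row has a charttime, one does not.
notes = pd.DataFrame({
    "chartdate": ["2150-01-01", "2150-01-02"],
    "charttime": [pd.Timestamp("2150-01-01 13:30"), pd.NaT],
})
filled = fill_missing_charttime(notes)
```

This keeps every note usable for time-ordered concatenation even when only the coarser `chartdate` is recorded.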

2. EHREvidenceRetrievalTask - New Task Definition

Binary classification task: do a patient's clinical notes support a given query diagnosis?

  • Pairs concatenated clinical notes with a free-text query diagnosis
  • Uses ICD-9 codes as computable proxy labels for ground truth
  • Configurable note categories, max notes per sample, and custom separators
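The ICD-9 proxy-labeling idea reduces to a set intersection. A minimal sketch (function name is illustrative, not the PR's actual API):

```python
def proxy_label(patient_icd_codes, condition_icd_codes):
    """Binary proxy label: 1 if any of the patient's billed ICD-9
    codes matches the query condition's code set, else 0."""
    return int(bool(set(patient_icd_codes) & set(condition_icd_codes)))

# Codes for "small vessel disease", as in the usage example below.
condition = ["437.3", "437.30", "437.31"]
pos = proxy_label(["401.9", "437.30"], condition)  # matching code present
neg = proxy_label(["401.9", "428.0"], condition)   # no match
```

Because billing codes are noisy, these labels are a computable proxy rather than true ground truth, which is why the paper pairs them with manual evidence review.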

3. ZeroShotEvidenceLLM - New Model Implementation

Implements the two-step zero-shot prompting strategy from the paper:

  1. Classification Prompt → Determines yes/no with a token-level confidence score
  2. Summarization Prompt → Extracts supporting evidence (only when step 1 is "yes")
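The control flow of the two steps can be sketched as follows. The prompt wording and the `llm` callable are stand-ins for illustration; the real prompts come from the paper and the model wrapper below:

```python
def two_step_pipeline(llm, notes: str, diagnosis: str) -> dict:
    """Step 1 classifies; step 2 (evidence extraction) runs only on 'yes'."""
    cls_prompt = (
        f"Do the following clinical notes support a diagnosis of "
        f"{diagnosis}? Answer yes or no.\n\n{notes}"
    )
    answer = llm(cls_prompt).strip().lower()
    if not answer.startswith("yes"):
        return {"prediction": 0, "evidence": None}
    sum_prompt = (
        f"Quote the sentences from the notes that support a diagnosis "
        f"of {diagnosis}.\n\n{notes}"
    )
    return {"prediction": 1, "evidence": llm(sum_prompt)}

# Stub LLM so the sketch runs without a real model:
def fake_llm(prompt: str) -> str:
    if "Answer yes or no" in prompt:
        return "yes"
    return "patient has lacunar infarcts"

out = two_step_pipeline(fake_llm, "…notes…", "small vessel disease")
```

Gating the summarization step on a "yes" avoids prompting the model to fabricate evidence for conditions it already judged unsupported.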

Key features:

  • Supports encoder-decoder (Flan-T5) and decoder-only (Mistral) architectures
  • Confidence scoring via normalized token-level probability (AUC > 0.9 for hallucination prediction)
  • Clinical-BERT dense-retrieval baseline (use_cbert_baseline=True)
  • Batch processing for efficient inference
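One common way to turn first-token logits into the normalized yes/no confidence mentioned above is a softmax restricted to the two answer tokens. A sketch under that assumption (the paper's exact scoring may differ):

```python
import math

def yes_confidence(logit_yes: float, logit_no: float) -> float:
    """Probability mass on the 'yes' token, renormalized over
    just the yes/no pair (softmax restricted to two tokens)."""
    p_yes = math.exp(logit_yes)
    return p_yes / (p_yes + math.exp(logit_no))

conf = yes_confidence(2.0, -1.0)  # strongly favors "yes"
```

Thresholding this score is what enables abstention (ablation A2 below) and, per the paper, predicts unfaithful outputs well.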

Files Added/Modified

| File | Description |
| --- | --- |
| `pyhealth/datasets/mimic3.py` | Added `MIMIC3NoteDataset` class |
| `pyhealth/datasets/configs/mimic3_note.yaml` | New YAML config for note-focused loading |
| `pyhealth/datasets/__init__.py` | Export new dataset class |
| `pyhealth/tasks/ehr_evidence_retrieval.py` | New `EHREvidenceRetrievalTask` |
| `pyhealth/tasks/__init__.py` | Export new task class |
| `pyhealth/models/ehr_evidence_llm.py` | New `ZeroShotEvidenceLLM` model |
| `pyhealth/models/__init__.py` | Export new model class |
| `examples/clinical_tasks/mimic3_note_ehr_evidence_retrieval_llm.py` | Full working example with ablations |
| `docs/api/datasets/pyhealth.datasets.MIMIC3NoteDataset.rst` | API documentation |
| `docs/api/tasks/pyhealth.tasks.EHREvidenceRetrievalTask.rst` | API documentation |
| `docs/api/models/pyhealth.models.ZeroShotEvidenceLLM.rst` | API documentation |
| `tests/core/test_mimic3_note_dataset.py` | Unit tests for dataset |
| `tests/core/test_ehr_evidence_llm.py` | Unit tests for model |


Ablation Experiments

The example script includes four ablation experiments from the paper:

| ID | Experiment | Description |
| --- | --- | --- |
| A1 | Prompt Format | Two-step vs. single-step vs. chain-of-thought |
| A2 | Confidence Threshold | Precision/recall trade-off for abstention |
| A3 | BM25 Pre-retrieval | Reduce note length; measure recall vs. faithfulness |
| A4 | LLM Judge | Mistral-7B vs. GPT-3.5 auto-evaluator agreement |
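For A2, the abstention trade-off can be measured by discarding predictions below a confidence threshold and scoring the rest. A small sketch (illustrative helper, not the example script's code; recall here is computed over the non-abstained examples only):

```python
def precision_recall_at_threshold(preds, labels, confs, tau):
    """Abstain when confidence < tau; score the remaining predictions."""
    kept = [(p, y) for p, y, c in zip(preds, labels, confs) if c >= tau]
    tp = sum(1 for p, y in kept if p == 1 and y == 1)
    fp = sum(1 for p, y in kept if p == 1 and y == 0)
    fn = sum(1 for p, y in kept if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy predictions: the low-confidence false positive is dropped at tau=0.7.
preds = [1, 1, 1, 0]
labels = [1, 0, 1, 1]
confs = [0.9, 0.55, 0.8, 0.6]
p_lo, r_lo = precision_recall_at_threshold(preds, labels, confs, 0.5)
p_hi, r_hi = precision_recall_at_threshold(preds, labels, confs, 0.7)
```

Sweeping `tau` traces out the precision/coverage curve the ablation reports.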

Usage

Quick Demo (No MIMIC Access Required)

```bash
python examples/clinical_tasks/mimic3_note_ehr_evidence_retrieval_llm.py --demo
```

Full Pipeline with MIMIC-III

```python
from pyhealth.datasets import MIMIC3NoteDataset
from pyhealth.tasks import EHREvidenceRetrievalTask
from pyhealth.models import ZeroShotEvidenceLLM

# Load dataset
dataset = MIMIC3NoteDataset(root="/path/to/mimic-iii/1.4")

# Define task
task = EHREvidenceRetrievalTask(
    query_diagnosis="small vessel disease",
    condition_icd_codes=["437.3", "437.30", "437.31"],
)
samples = dataset.set_task(task)

# Run inference
model = ZeroShotEvidenceLLM(
    dataset=None,
    model_name="google/flan-t5-xxl",
)
result = model.inference(samples[0])
print(result)
# {'prediction': 1, 'confidence': 0.92, 'evidence': '...'}
```

Testing

```bash
# Run all new tests
pytest tests/core/test_mimic3_note_dataset.py -v
pytest tests/core/test_ehr_evidence_llm.py -v
```

Related Work

  • Paper: Ahsan et al. (2024) "Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges" CHIL 2024
  • Dataset: MIMIC-III v1.4 (PhysioNet credentialed access required)
  • Models Tested: Flan-T5-XXL, Mistral-7B-Instruct, Clinical-BERT

Checklist

  • New dataset class (MIMIC3NoteDataset)
  • New task definition (EHREvidenceRetrievalTask)
  • New model implementation (ZeroShotEvidenceLLM)
  • Clinical-BERT baseline included
  • Comprehensive example script with ablations
  • Unit tests for dataset and model
  • API documentation (RST files)
  • Demo mode for testing without MIMIC access
