LLMs can construct powerful representations and streamline
sample-efficient supervised learning

Ilker Demirel1 Larry Shi1 Zeshan Hussain1,2 David Sontag1
1 Massachusetts Institute of Technology 2 Brigham and Women's Hospital & Harvard Medical School
Figure 1. Performance averaged over all 15 clinical prediction tasks. Rubric representations outperform naive text serialization (NaiveText), a clinical foundation model pretrained on 2.57M patients (CLMBR-T), and a count-feature gradient boosting machine (Count-GBM).

Method

Our agentic pipeline uses LLMs to construct rubrics — structured specifications that transform raw, heterogeneous data into powerful representations for downstream supervised learning.

Global Rubric Pipeline

A Diverse Cohort Selection

Label-stratified k-means clustering in embedding space selects 40 diverse, representative examples (20 per class) as medoids from the training set.
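This selection step can be sketched as follows. A minimal illustration, assuming embeddings are already computed; the function name `select_medoids` and the exact tie-breaking are assumptions, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_medoids(embeddings, labels, k_per_class=20, seed=0):
    """Pick k_per_class diverse examples per class: run k-means within
    each label stratum, then take the training point closest to each
    cluster centroid (the medoid)."""
    selected = []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]          # indices of this class
        km = KMeans(n_clusters=k_per_class, n_init=10, random_state=seed)
        km.fit(embeddings[idx])
        for center in km.cluster_centers_:
            dists = np.linalg.norm(embeddings[idx] - center, axis=1)
            selected.append(int(idx[np.argmin(dists)]))
    return sorted(set(selected))                  # dedupe, stable order
```

With two classes and `k_per_class=20`, this yields up to 40 medoid examples, matching the cohort size described above.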

B Rubric Synthesis

An LLM agent analyzes the 40 medoid EHRs in-context and synthesizes a task-specific rubric — a structured template defining what evidence to extract.

C Task-Specific Rubric

The output is a systematic rubric with sections like Demographics, CV Risk Factors, Comorbidities, Temporal Trends, and Alert Flags.

Three ways to apply the rubric:
D LLM Application

An LLM fills in every rubric field for each patient using only data from their EHR. High fidelity, but requires one LLM call per example.
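A per-patient call of this kind boils down to a prompt that pairs the shared rubric with one patient's record. The wrapper below is a hedged sketch; the actual prompt wording used in the paper is not shown, so the instructions here are an assumption:

```python
def build_fill_prompt(rubric: str, ehr_text: str) -> str:
    """Assemble a prompt asking an LLM to complete every rubric field
    using only evidence in this patient's EHR. Illustrative only: the
    real system prompt is an assumption, not the authors' exact text."""
    return (
        "You are given a clinical rubric and one patient's EHR.\n"
        "Fill in every rubric field using ONLY information from the EHR; "
        "write 'No data' when the record contains no evidence.\n\n"
        f"## Rubric\n{rubric}\n\n"
        f"## EHR\n{ehr_text}\n\n"
        "## Completed Rubric\n"
    )
```

One such prompt is sent per example, which is why this variant is high fidelity but costs one LLM call per patient.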

E Parser Script

An LLM generates a deterministic Python parser that applies the rubric via string/regex matching — no LLM calls needed at inference time.
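A generated parser of this kind might look like the toy sketch below: pure string/regex matching over the serialized EHR, with no model calls at inference. The field names and patterns are illustrative assumptions, not the pipeline's actual output:

```python
import re

def parse_rubric_fields(ehr_text: str) -> dict:
    """Toy deterministic rubric parser: fill a few hypothetical rubric
    fields via regex matching alone. Patterns and field names are
    illustrative, not the LLM-generated script from the paper."""
    fields = {}
    m = re.search(r"Patient age:\s*(\d+)", ehr_text)
    fields["age"] = int(m.group(1)) if m else None
    # Symptom flag: simple case-insensitive keyword match
    fields["dyspnea"] = "Yes" if re.search(r"\bdyspnea\b", ehr_text, re.I) else "No"
    m = re.search(r"Creatinine:\s*([\d.]+)", ehr_text)
    fields["creatinine"] = float(m.group(1)) if m else None
    return fields
```

Because the script is deterministic, it is auditable and free to run over the full training set.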

F Tabularization

An LLM generates a script converting rubric outputs into numeric feature vectors, enabling standard ML models like XGBoost.
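The conversion step can be as simple as mapping each rubric field to a fixed position in a vector, with a sentinel for missing values. A minimal sketch under an assumed three-field schema (not the generated script itself):

```python
def tabularize(record: dict) -> list:
    """Turn a parsed rubric record into a fixed-order numeric feature
    vector for a standard ML model such as XGBoost. The schema (age,
    dyspnea flag, creatinine) and the -1 missing-value sentinel are
    illustrative assumptions."""
    def num(x):
        return -1.0 if x is None else float(x)   # -1 marks "No data"
    def flag(x):
        return 1.0 if x == "Yes" else 0.0        # yes/no -> 1/0
    return [
        num(record.get("age")),
        flag(record.get("dyspnea")),
        num(record.get("creatinine")),
    ]
```

Stacking these vectors across patients yields an ordinary feature matrix, so any tabular learner can consume the rubric outputs.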

Two Types of Rubrics

Global Rubrics

A single shared rubric is synthesized from a small subset of examples and applied uniformly to all inputs.

  • Fixed schema — easy to audit and reproduce
  • Can be applied via deterministic parser scripts (zero LLM cost at inference)
  • Convertible to tabular features for standard ML

Local Rubrics

A task-conditioned summary is generated per example by the LLM, producing structured sections like Patient Snapshot, Risk Factors, and Protective Factors.

  • More flexible, potentially higher fidelity
  • Captures patient-specific nuances
  • Requires an LLM call per example

Representation Comparison

The rubric transforms a raw, noisy text serialization into a structured, evidence-organized representation:

Naive Text Serialization

## Patient Demographics

- Patient age: 78, FEMALE [...]

## Detailed Past Medical Visits

### Inpatient Visit (14 days to pred. time)

Conditions: Acute posthemorrhagic anemia, pH: 7.25, 7.31 [...]

Medications: furosemide 20 MG, pantoprazole 20 MG [...]

Procedures: Chest x-ray, Electrocardiogram [...]

### ER Visit (87 days before)

Conditions: Benign essential hypertension, Chest pain [...]

Medications: ondansetron, nitroglycerin [...]

Local Rubric

1. Patient Snapshot

27 yo hispanic male. Recurrent cardiology visits for congenital anomaly of coronary artery [...]

2. Main Risk Factors

- Congenital coronary artery anomaly (structural predisposition to ischemia)

- Tobacco exposure (smokeless) [...]

3. Protective Factors

- Young age (27), Normal BMI (21-22)

- No diabetes or renal impairment [...]

6. Overall Risk Impression

Elevated risk of acute MI despite favorable metabolic parameters [...]

Global Rubric

§3. Demographics

55 | FEMALE | [...]

§6. Recent Cardiac Symptoms (last 365d)

- Chest pain/angina: No

- Dyspnea: Yes [...]

§12. Other Relevant Labs

- Creatinine: 1.12 (2023-12-02)

- eGFR: No data [...]

§17. Known Risk Factors

- Diabetes: No, Hyperlipidemia: Yes

- Family hx of premature CAD: Unknown [...]

Results

We evaluate on the EHRSHOT benchmark: 15 clinical prediction tasks across 4 categories with 6,739 patients. Rubric methods are compared against count-feature models, naive text embeddings, zero-shot chain-of-thought prompting, and CLMBR-T, a clinical foundation model pretrained on 2.57M patients.

Overall Performance (15-task average)

| Method | AUROC (n=40) | AUPRC (n=40) | AUROC (n=All) | AUPRC (n=All) |
|---|---:|---:|---:|---:|
| Local-Rubric | 0.717 | 0.406 | 0.772 | 0.452 |
| Global-Rubric | 0.700 | 0.400 | 0.763 | 0.459 |
| Global-Rubric-Auto | 0.690 | 0.382 | 0.751 | 0.445 |
| CLMBR-T (2.57M patients) | 0.657 | 0.356 | 0.727 | 0.432 |
| NaiveText | 0.638 | 0.343 | 0.699 | 0.391 |
| Count-GBM | 0.608 | 0.311 | 0.679 | 0.387 |

Performance by Task Category

Figure 2. Average performance by task group. Rubric methods show the largest gains on new diagnosis prediction and lab result anticipation tasks, while CLMBR-T remains slightly stronger on operational outcome tasks.

Per-Task Breakdown

Figure 3. AUROC per task (full training data).
Figure 4. AUPRC per task (full training data).