LLMs can construct powerful representations and streamline sample-efficient supervised learning

1 Massachusetts Institute of Technology 2 Brigham and Women's Hospital & Harvard Medical School

Our agentic pipeline uses LLMs to construct rubrics — structured specifications for automatically transforming raw, heterogeneous inputs into powerful representations for efficient downstream supervised learning.

Figure 1. Performance averaged over all 15 clinical prediction tasks in the EHRSHOT benchmark. Rubric representations outperform naive text serialization (NaiveText), a clinical foundation model pretrained on 2.57M patients (CLMBR-T), and a count-feature gradient boosting machine (Count-GBM).

Global Rubric Pipeline

A Diverse Cohort Selection

Label-stratified k-means clustering in embedding space selects 40 diverse, representative examples (20 per class) as medoids from the training set.
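
A minimal sketch of this selection step, assuming per-patient embeddings and binary labels are already available (select_medoids and its arguments are illustrative names, not the paper's code):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_medoids(embeddings, labels, n_per_class=20, seed=0):
    """Label-stratified k-means: cluster each class separately and keep,
    per cluster, the training example closest to the centroid (the medoid)."""
    medoid_idx = []
    for cls in np.unique(labels):
        cls_idx = np.where(labels == cls)[0]
        km = KMeans(n_clusters=n_per_class, n_init=10, random_state=seed)
        km.fit(embeddings[cls_idx])
        for center in km.cluster_centers_:
            dists = np.linalg.norm(embeddings[cls_idx] - center, axis=1)
            medoid_idx.append(cls_idx[np.argmin(dists)])
    return np.array(medoid_idx)

# 20 positive + 20 negative medoids from the training split
# cohort_idx = select_medoids(train_embeddings, train_labels, n_per_class=20)
```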

B Rubric Synthesis

An LLM agent analyzes the 40 medoid EHRs in-context and synthesizes a task-specific rubric — a structured template defining what evidence to extract.

C Task-Specific Rubric

The output is a systematic rubric with sections like Demographics, CV Risk Factors, Comorbidities, Temporal Trends, and Alert Flags.

Three ways to apply the rubric ↓
D LLM Application

An LLM fills in every rubric field for each patient using only data from their EHR. High fidelity, but requires one LLM call per example.
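
A sketch of this application step, assuming an OpenAI-compatible chat client; the model name and prompt wording are placeholders, not the paper's exact implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def apply_rubric(rubric: str, ehr_text: str, model: str = "gpt-4o") -> str:
    """Fill in every rubric field using only evidence from this patient's EHR."""
    prompt = (
        "Fill in the rubric below for the patient EHR that follows. "
        "Extract facts only; do not predict, assign risk levels, or draw conclusions.\n\n"
        f"## Rubric\n{rubric}\n\n## Patient EHR\n{ehr_text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```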

E Parser Script

An LLM generates a deterministic Python parser that applies the rubric via string/regex matching — no LLM calls needed at inference time.
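
An illustrative fragment of what such a generated parser might look like; the field names follow the example global rubric shown further down and are not the actual generated script:

```python
import re

def parse_rubric_fields(ehr_text: str) -> dict:
    """Deterministic rubric application: plain string/regex matching over the
    raw EHR serialization, with no LLM calls at inference time."""
    fields = {}
    age = re.search(r"Patient age:\s*(\d+)", ehr_text)
    fields["age"] = int(age.group(1)) if age else None
    fields["sex"] = "FEMALE" if "FEMALE" in ehr_text else ("MALE" if "MALE" in ehr_text else None)
    # Known risk factors (cf. section 17 of the example global rubric)
    fields["hyperlipidemia"] = bool(re.search(r"hyperlipidemia", ehr_text, re.IGNORECASE))
    fields["diabetes"] = bool(re.search(r"\bdiabetes\b", ehr_text, re.IGNORECASE))
    return fields
```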

F Tabularization

An LLM generates a script converting rubric outputs into numeric feature vectors, enabling standard ML models like XGBoost.
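
A minimal sketch of the tabularization step, assuming rubric outputs have already been parsed into dictionaries like the one above; the feature list and model settings are placeholders:

```python
import numpy as np
import xgboost as xgb

FEATURES = ["age", "hyperlipidemia", "diabetes"]  # illustrative feature list

def to_matrix(parsed_rubrics):
    """Map parsed rubric fields to a fixed-order numeric matrix.
    Missing fields become NaN, which XGBoost handles natively."""
    return np.array([
        [float(r[f]) if r.get(f) is not None else np.nan for f in FEATURES]
        for r in parsed_rubrics
    ])

# X_train = to_matrix(train_rubrics)
# clf = xgb.XGBClassifier(n_estimators=200, max_depth=4)
# clf.fit(X_train, train_labels)
```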

Full prompt for rubric synthesis (Panel B)

You are a medical expert designing a structured rubric for a clinical prediction task.

## Task

- Name: {task_name}
- Query: {task_query}

## Context

You will be given {40} labeled patient EHR examples ({20} positive, {20} negative). Another model will later use your rubric to transform new patient EHRs into structured summaries, which will then serve as input to a supervised classifier.

## What You Must Do

Study the examples below. Combine what you observe in them with your medical knowledge to design a rubric template — a set of named fields that, when filled in for any patient, produce a structured summary optimized for this prediction task.

The rubric should:

  1. Be data-driven and discriminative. Identify which features, patterns, and interactions actually separate the positive and negative cases.
  2. Be structured and consistent. Every rubricified output must follow the same field names and order.
  3. Extract facts only. The evaluator filling in the rubric must NOT make predictions, assign risk levels, or draw conclusions.
  4. Be concise. Focus on extracting information relevant to the task, not reproducing the entire EHR.

## Examples

{Positive and negative EHR serializations}

## Output

Output ONLY the rubric template itself — the instructions another model will follow to transform a patient EHR. No preamble, no explanation. The template must be self-contained and directly usable.
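
For illustration, a small helper that could populate the template slots above before the synthesis call; the named placeholders (task_name, task_query, n_pos, n_neg, examples) are an assumed rewriting of the slots shown in the prompt, not the paper's exact format:

```python
def build_synthesis_prompt(template: str, task_name: str, task_query: str,
                           pos_texts: list[str], neg_texts: list[str]) -> str:
    """Fill the rubric-synthesis template with the task description and the
    serialized medoid EHRs (e.g. 20 positive and 20 negative examples)."""
    examples = "\n\n".join(
        [f"### Positive example\n{t}" for t in pos_texts]
        + [f"### Negative example\n{t}" for t in neg_texts]
    )
    return template.format(
        task_name=task_name,
        task_query=task_query,
        n_total=len(pos_texts) + len(neg_texts),
        n_pos=len(pos_texts),
        n_neg=len(neg_texts),
        examples=examples,
    )
```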

Two Types of Rubrics

Global Rubrics

A single shared rubric is synthesized from a small subset of examples and applied uniformly to all inputs.

  • Fixed schema — easy to audit and reproduce
  • Can be applied via deterministic parser scripts (zero LLM cost at inference)
  • Convertible to tabular features for standard ML

Local Rubrics

A task-conditioned summary is generated per example by the LLM, producing structured sections like Patient Snapshot, Risk Factors, and Protective Factors.

  • More flexible, potentially higher fidelity
  • Captures patient-specific nuances
  • Requires an LLM call per example

Representation Comparison

The rubric transforms a raw, noisy text serialization into a structured, evidence-organized representation:

Naive Text Serialization

## Patient Demographics

- Patient age: 78, FEMALE [...]

### Inpatient Visit (14 days to pred. time)

Conditions: Acute posthemorrhagic anemia [...]

Medications: furosemide 20 MG [...]

### ER Visit (87 days before)

Conditions: Benign essential hypertension [...]

Medications: ondansetron, nitroglycerin [...]

Local Rubric Text Serialization

1. Patient Snapshot

27 yo hispanic male. Recurrent cardiology visits [...]

2. Main Risk Factors

- Congenital coronary artery anomaly [...]

- Tobacco exposure (smokeless) [...]

3. Protective Factors

- Young age (27), Normal BMI (21-22) [...]

6. Overall Risk Impression

Elevated risk of acute MI [...]

Global Rubric Text Serialization

§3. Demographics

55 | FEMALE | [...]

§6. Recent Cardiac Symptoms (last 365d)

- Chest pain/angina: No

- Dyspnea: Yes [...]

§12. Other Relevant Labs

- Creatinine: 1.12 (2023-12-02)

- eGFR: No data [...]

§17. Known Risk Factors

- Diabetes: No, Hyperlipidemia: Yes

- Family hx of premature CAD: Unknown [...]

Results

We evaluate on the EHRSHOT benchmark: 15 clinical prediction tasks across 4 categories with 6,739 patients. Rubric methods are compared against count-feature models, naive text embeddings, zero-shot chain-of-thought prompting, and CLMBR-T, a clinical foundation model pretrained on 2.57M patients.
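
For reference, the two metrics can be computed per task with scikit-learn and then macro-averaged over the 15 tasks; this sketch assumes held-out labels and prediction scores per task are already collected:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def macro_metrics(task_results):
    """task_results: dict mapping task name -> (y_true, y_score) on the test split."""
    aurocs = [roc_auc_score(y, s) for y, s in task_results.values()]
    auprcs = [average_precision_score(y, s) for y, s in task_results.values()]
    return float(np.mean(aurocs)), float(np.mean(auprcs))
```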

Overall Performance (15-task average)

| Method | AUROC (n=40) | AUPRC (n=40) | AUROC (n=All) | AUPRC (n=All) |
|---|---|---|---|---|
| Local-Rubric | 0.717 | 0.406 | 0.772 | 0.452 |
| Global-Rubric | 0.700 | 0.400 | 0.763 | 0.459 |
| Global-Rubric-Auto | 0.690 | 0.382 | 0.751 | 0.445 |
| CLMBR-T (pretrained on 2.57M patients) | 0.657 | 0.356 | 0.727 | 0.432 |
| NaiveText | 0.638 | 0.343 | 0.699 | 0.391 |
| Count-GBM | 0.608 | 0.311 | 0.679 | 0.387 |

Performance by Task Category

Figure 2. Average performance by task group. Rubric methods show the largest gains on new diagnosis prediction and lab result anticipation tasks, while operational outcome tasks remain slightly stronger for CLMBR-T.

Per-Task Breakdown

Figure 3. AUROC per task (full training data).
Figure 4. AUPRC per task (full training data).

Citation

If you find this work useful, please cite our paper:

@article{demirel2026llms,
  title={LLMs can construct powerful representations and streamline
         sample-efficient supervised learning},
  author={Demirel, Ilker and Shi, Lawrence and Hussain, Zeshan and Sontag, David},
  journal={arXiv preprint arXiv:2603.11679},
  year={2026}
}