NLP & UX Research

Measuring Patient-Centered Communication with NLP

Role: Data Scientist Co-op
Team: I/O Psychologists, Data Scientist, and AI/ML Engineer
Timeline: Co-Op (6 months)
Overview

Why Communication Measurement Matters — and Why It's Hard

Skill Constellation

Primary

NLP Pipeline Engineering · Transformer Fine-Tuning (BERT) · Rubric Design

Supporting

Prompt Engineering · LLM-Assisted Labeling · Data Curation

Emerging

Research Ethics & AI Governance · PHI De-identification

The American Board of Internal Medicine (ABIM) exists to certify physicians who demonstrate knowledge, skills, and attitudes essential for excellent patient care. Communication behaviors — the "skills and attitudes in action" — are the hardest competency to quantify at scale.

ABMS standards require boards to assess Interpersonal & Communication Skills and Professionalism as core competencies. But even "gold standard" human rating systems derived from Calgary-Cambridge show measurable rater bias, drift, and order effects. Manual measurement simply doesn't scale — and it isn't purely objective.

Our mission: build a scalable, rubric-grounded NLP pipeline that measures patient-centered communication behaviors across real clinical transcripts — framed as formative feedback, never as deterministic scoring.

Approach

Rubric-grounded annotation → LLM-assisted labeling → BERT classifier training, aligned to Calgary-Cambridge and NURSE frameworks.

Frameworks

Calgary-Cambridge Guide (6-step consultation model) and NURSE (Naming, Understanding, Respecting, Supporting, Exploring) for empathic communication.

Manual annotation consistency

Baseline: Variable ICC
Target: Rubric-anchored

Labeling scalability

Baseline: ~50 convos/week
Target: 985 convos labeled

Cost per inference

Baseline: LLM API costs
Target: BERT (near-zero)
The Craft

From Rubric to Scalable Classifier

We started with simulated and role-play transcript sources because they're accessible and commonly used in clinical NLP benchmarking. But we pivoted toward real clinical encounters to improve ground truth realism and reduce "too-clean" synthetic behavior patterns.

Skill Spotlight

Rubric Design

Translated two clinical communication frameworks into construct-level scoring rubrics with inclusion/exclusion criteria and borderline examples.

Evidence: 11 distinct construct rubrics reducing subjective interpretation.

The Pipeline

Each step protects a specific standard: the rubric protects construct validity, seed labels protect interpretability, LLM pass provides scale, and BERT provides reproducibility at near-zero cost.

1

Rubric Design

Construct-specific scoring criteria aligned to Calgary-Cambridge and NURSE frameworks.

2

Seed Labels

8–10 exemplar excerpts per construct, manually annotated in Label Studio.

3

Prompt Engineering

Construct-specific prompts with rubric definition, inclusion/exclusion criteria, and structured output.

4

LLM Labeling

Scale annotation across ~985 conversations, one construct at a time to reduce contamination.

5

BERT Training

Fine-tuned encoder for cheap, scalable inference with audit-friendly probability outputs.
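The "audit-friendly probability outputs" in step 5 can be sketched as follows. This is a minimal, hypothetical illustration of turning a per-construct classifier's raw logits into probabilities plus a human-review flag — the construct name, labels, and 0.7 threshold are illustrative choices, not the project's actual values:

```python
import math

def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def audit_record(construct, logits, labels=("absent", "present")):
    """Build an audit-friendly record: probabilities, not a verdict."""
    probs = softmax(logits)
    return {
        "construct": construct,
        "probabilities": dict(zip(labels, probs)),
        # Formative framing: flag low-confidence cases for human review
        "needs_review": max(probs) < 0.7,
    }

record = audit_record("NURSE: Naming", [0.2, 1.4])
```

Exposing probabilities rather than a single score keeps the output formative: reviewers can see how confident the model was and route borderline cases back to humans.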

Dataset Sources

We aggregated ~985 conversations from four complementary sources, balancing real-world realism with benchmark coverage:

| Source | Real vs Simulated | Contribution | Labeling Implication |
| --- | --- | --- | --- |
| ACI-BENCH | Mixed simulation / role-play | Benchmark-style transcripts with clinical documentation behaviors | Can introduce "dictation" language that confounds Calgary constructs |
| VHA 4C Lineage | Real recorded encounters | Real-world primary care dynamics and patient context clues | Strongest realism anchor; supports patient-centered measurement framing |
| OSCE Simulated Interviews | Simulated (audio + transcripts) | High-quality respiratory-focus transcripts, domain-labeled | Good for NER and general NLP; limited vs real-world nuance |
| PriMock57 | Simulated mock primary care | Multi-artifact dataset (audio, transcripts, notes, eval) | Useful for benchmarking communication-to-note pipelines |
Skill Spotlight

Data Curation & Source Evaluation

Evaluated 4 transcript corpora across realism, domain coverage, and labeling implications. Made strategic call to anchor on real-encounter data.

Evidence: Dataset sources table; simulated→real pivot decision documented.

Key Methodological Decisions

🎯

One Construct at a Time

We labeled each Calgary or NURSE construct independently across all encounters — simpler prompts, less cross-construct contamination.

📐

Prompts as Measurement Tools

Each prompt included: rubric definition, inclusion criteria, exclusion criteria, borderline examples, and required structured output (score + rationale).
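A prompt of this shape can be assembled programmatically from rubric components. The sketch below is illustrative — the rubric wording, field names, and output schema are placeholders, not the project's actual prompts:

```python
PROMPT_TEMPLATE = """You are scoring one communication construct in a clinical transcript.

Construct: {name}
Definition: {definition}
Include (counts as the behavior): {inclusion}
Exclude (does NOT count): {exclusion}
Borderline example: {borderline}

Transcript excerpt:
{excerpt}

Respond with JSON only: {{"score": 0 or 1, "rationale": "<one sentence citing the excerpt>"}}"""

def build_prompt(rubric: dict, excerpt: str) -> str:
    """Fill the measurement prompt with one construct's rubric and one excerpt."""
    return PROMPT_TEMPLATE.format(excerpt=excerpt, **rubric)

# Illustrative rubric entry (not the project's actual wording)
rubric = {
    "name": "NURSE: Naming",
    "definition": "Clinician explicitly names the patient's emotion.",
    "inclusion": "Direct emotion labels, e.g. 'You seem worried.'",
    "exclusion": "Generic reassurance without naming an emotion.",
    "borderline": "'That sounds hard' (names difficulty, not a specific emotion).",
}
prompt = build_prompt(rubric, "Patient: I can't stop thinking about the results.")
```

Requiring a rationale alongside the score makes each LLM label auditable, and the fixed JSON schema makes the output parseable at scale.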

🔒

PHI De-identification First

NER-style PHI masking applied before scaling to real transcripts — a prerequisite to reuse and ethical scaling.
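As a simplified sketch of the masking step, a few identifier patterns can be replaced with category placeholders before any transcript leaves the pipeline. The project used NER-style masking; the regexes below are a toy stand-in and do not cover all HIPAA Safe Harbor identifier categories:

```python
import re

# Simplified sketch: production de-identification should use a trained clinical
# NER model and cover every HIPAA Safe Harbor identifier category.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def mask_phi(text: str) -> str:
    """Replace matched identifiers with category placeholders before labeling."""
    for pattern, placeholder in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

masked = mask_phi("Call me at 555-867-5309 before 3/14/2024.")
```

Keeping category placeholders (rather than deleting spans outright) preserves sentence structure, which matters when the downstream task is scoring communication behaviors.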

🔄

Simulated → Real Pivot

Synthetic scripts contain explicit structure that real encounters lack. We incorporated VHA 4C real encounters as the realism anchor.

Skill Spotlight

Prompt Engineering for Measurement

Designed prompts that function as measurement instruments, not generation prompts. Each included rubric, boundary conditions, and required structured output.

Evidence: Scalable labeling of 985 conversations with construct-level precision.

Skill Spotlight

LLM-Assisted Labeling at Scale

Used LLMs as scalable annotators constrained by rubric-grounded prompts.

Evidence: 985 conversations × 11 constructs labeled, replacing months of manual work.

Skill Spotlight

Transformer Fine-Tuning (BERT)

Fine-tuned BERT encoder for per-construct classification with probability outputs.

Evidence: Near-zero inference cost replacing expensive LLM API calls.

The Evidence

What We Built and Learned

The project produced a reusable labeling pipeline, a construct-wise labeled dataset, and a trained classifier — while surfacing critical lessons about the limits of LLM-assisted annotation.

985 Conversations Labeled

Construct-wise labels (Calgary 6-step + NURSE 5-step) across aggregated corpus.

11 Communication Constructs

Each with explicit scoring criteria, rubric boundaries, and exemplar annotations.

Pipeline: Reusable Framework

End-to-end rubric → seed → LLM → BERT pipeline transferable to other domains.

BERT: Trained Classifier

Per-construct probabilities enabling audit-friendly formative feedback.

Successes

We operationalized two widely used communication frameworks into measurable constructs with explicit scoring criteria — reducing subjectivity and enabling scalable, reusable labeling.

Engineering + Measurement Achievement

Rubric Operationalization

Translated Calgary-Cambridge and NURSE into construct-level scoring rubrics with borderline examples — reducing subjective interpretation.

🔁

Scalable Pipeline

Created a labeling pipeline that can be reused by future co-ops and extended to other communication domains beyond internal medicine.

Known Gaps

🔍

Gold-Standard Validation

Stronger human-annotation validation sets and slice-based fairness checks are needed before any higher-stakes use.

📊

Cross-Source Generalization

Per-construct performance across simulated vs. real data sources needs explicit evaluation and reporting.
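The evaluation suggested here amounts to slicing metrics by data source as well as by construct. A minimal sketch with illustrative records (source, construct, predicted, gold) — not project data:

```python
from collections import defaultdict

def slice_accuracy(records):
    """Accuracy per (source, construct) slice to expose generalization gaps."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for source, construct, predicted, gold in records:
        key = (source, construct)
        totals[key] += 1
        hits[key] += int(predicted == gold)
    return {key: hits[key] / totals[key] for key in totals}

# Illustrative records only: a model that looks fine on simulated data
# can still degrade on real encounters for the same construct.
records = [
    ("simulated", "Naming", 1, 1),
    ("simulated", "Naming", 0, 0),
    ("real", "Naming", 1, 0),
    ("real", "Naming", 1, 1),
]
slices = slice_accuracy(records)
```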

The Growth

What We Recommend Not Doing

ACGME explicitly warns that Milestones are an educational, formative assessment tool and were not designed for high-stakes external decisions (credentialing / licensing). Automation can introduce false precision, biased measurement, and misuse risks. These governance constraints shaped our entire framing.

Skill Spotlight

PHI De-identification

Implemented NER-style PHI masking before scaling to real clinical transcripts.

Evidence: Pipeline processes real clinical data while maintaining HIPAA compliance.

🚫

Don't Present Output as a "Milestone Score"

Never make automated output determinative for advancement or credentialing without full governance review and validation. ACGME Milestones were not designed for external high-stakes use.

⚠️

Don't Rely Only on Simulated Data

Simulated/roleplay corpora cannot fully represent clinical realism — the datasets themselves acknowledge this limitation. Real-encounter data is essential for validity.

🔬

Don't Skip Label Quality Monitoring

Training BERT on raw LLM labels can cause instability and training plateaus. Monitor training variance and consider entropy filtering, ensembling, or human fallback for ambiguous samples.
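The entropy filter mentioned above can be sketched as routing high-entropy (ambiguous) LLM label distributions to human review instead of the BERT training set. The 0.9-bit threshold below is an illustrative choice, not a project parameter:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a label probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def split_by_confidence(samples, max_bits=0.9):
    """Keep confident LLM labels for training; send ambiguous ones to humans."""
    train, review = [], []
    for text, label_probs in samples:
        (review if entropy(label_probs) > max_bits else train).append(text)
    return train, review

samples = [
    ("Clear empathic statement", [0.95, 0.05]),  # low entropy -> train
    ("Ambiguous small talk", [0.55, 0.45]),      # high entropy -> review
]
train, review = split_by_confidence(samples)
```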

📉

Don't Report Only "Overall Accuracy"

Communication constructs are sparse and imbalanced. Report per-construct precision/recall/F1, calibration metrics (ECE), and run-to-run variance to detect instability.
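The calibration metric named above, Expected Calibration Error, bins predictions by confidence and averages the gap between each bin's accuracy and its mean confidence. A minimal sketch with toy data (10 bins is a common default):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |accuracy - confidence| across confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece, n = 0.0, len(confidences)
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Toy perfectly calibrated case: 0.75-confidence predictions, 75% correct
ece = expected_calibration_error([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0])
```

Reporting ECE per construct, alongside precision/recall/F1, helps detect a model that is accurate on average but overconfident on the sparse constructs.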

Core Limitations

🎙️

Text-Only Signal

Transcripts omit nonverbal cues, physical examination actions, and many contextual behaviors critical to communication quality.

⚖️

Fairness Risks

Language style varies across cultures, dialects, and literacy levels. Automated scoring must be audited by demographic slices and clinical context.

🔄

Label Noise Propagation

LLM labels are not gold. Error propagation from LLM→BERT can cause instability; mitigation only partially closes the performance gap.

What I'd Do Differently

Build human validation sets earlier, in parallel with LLM labeling rather than after it.

Evolved thinking: "The rubric IS the product, not the classifier" — weeks of rubric iteration paid off more than any model tuning.

This case study documents NLP research conducted during an ABIM Co-Op. The project is framed as formative, developmental research — not as a validated clinical assessment tool. All governance recommendations follow ACGME Milestone usage guidelines.