Measuring Patient-Centered Communication with NLP

Why Communication Measurement Matters — and Why It's Hard
The American Board of Internal Medicine (ABIM) exists to certify physicians who demonstrate knowledge, skills, and attitudes essential for excellent patient care. Communication behaviors — the "skills and attitudes in action" — are the hardest competency to quantify at scale.
ABMS standards require boards to assess Interpersonal & Communication Skills and Professionalism as core competencies. But even "gold standard" human rating systems derived from Calgary-Cambridge show measurable rater bias, drift, and order effects. Manual measurement simply doesn't scale — and it isn't purely objective.
Our mission: build a scalable, rubric-grounded NLP pipeline that measures patient-centered communication behaviors across real clinical transcripts — framed as formative feedback, never as deterministic scoring.
Rubric-grounded annotation → LLM-assisted labeling → BERT classifier training, aligned to Calgary-Cambridge and NURSE frameworks.
Calgary-Cambridge Guide (6-step consultation model) and NURSE (Naming, Understanding, Respecting, Supporting, Exploring) for empathic communication.
Key bottlenecks this work targets: manual annotation consistency, labeling scalability, and cost per inference.
From Rubric to Scalable Classifier
We started with simulated and role-play transcript sources because they're accessible and commonly used in clinical NLP benchmarking. But we pivoted toward real clinical encounters to improve ground truth realism and reduce "too-clean" synthetic behavior patterns.
Rubric Design
Translated two clinical communication frameworks into construct-level scoring rubrics with inclusion/exclusion criteria and borderline examples.
Evidence: 11 distinct construct rubrics reducing subjective interpretation.
The Pipeline
Each step protects a specific standard: the rubric protects construct validity, seed labels protect interpretability, LLM pass provides scale, and BERT provides reproducibility at near-zero cost.
Rubric Design
Construct-specific scoring criteria aligned to Calgary-Cambridge and NURSE frameworks.
Seed Labels
8–10 exemplar excerpts per construct, manually annotated in Label Studio.
Prompt Engineering
Construct-specific prompts with rubric definition, inclusion/exclusion criteria, and structured output.
LLM Labeling
Scale annotation across ~985 conversations, one construct at a time to reduce contamination.
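As a sketch, the "one construct at a time" pass amounts to an outer loop over constructs rather than over excerpts; everything below, including the `llm_label` stub, is illustrative stand-in code, not the project's actual implementation:

```python
def llm_label(excerpt: str, construct: str) -> dict:
    # Stand-in for a rubric-constrained LLM call that returns structured output.
    return {"score": 1, "rationale": "stub rationale"}

def label_corpus(excerpts, constructs):
    """Label one construct at a time across the whole corpus: simpler prompts,
    and no cross-construct contamination within a single pass."""
    labels = {}
    for construct in constructs:  # outer loop over constructs, not excerpts
        labels[construct] = [llm_label(e, construct) for e in excerpts]
    return labels

labels = label_corpus(["excerpt A", "excerpt B"], ["naming", "understanding"])
```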
BERT Training
Fine-tuned encoder for cheap, scalable inference with audit-friendly probability outputs.
Dataset Sources
We aggregated ~985 conversations from four complementary sources, balancing real-world realism with benchmark coverage:
Data Curation & Source Evaluation
Evaluated 4 transcript corpora across realism, domain coverage, and labeling implications. Made strategic call to anchor on real-encounter data.
Evidence: Dataset sources table; simulated→real pivot decision documented.
Key Methodological Decisions
One Construct at a Time
We labeled each Calgary or NURSE construct independently across all encounters — simpler prompts, less cross-construct contamination.
Prompts as Measurement Tools
Each prompt included: rubric definition, inclusion criteria, exclusion criteria, borderline examples, and required structured output (score + rationale).
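A minimal sketch of a prompt builder enforcing that structure; the construct, criteria, and JSON output schema below are hypothetical examples, not the project's actual rubrics:

```python
def build_labeling_prompt(construct: str, definition: str,
                          include: list, exclude: list,
                          borderline: list, excerpt: str) -> str:
    """Assemble a rubric-grounded labeling prompt that forces structured output."""
    lines = [
        f"Construct: {construct}",
        f"Definition: {definition}",
        "Inclusion criteria:",
        *[f"- {c}" for c in include],
        "Exclusion criteria:",
        *[f"- {c}" for c in exclude],
        "Borderline examples:",
        *[f"- {b}" for b in borderline],
        "Transcript excerpt:",
        excerpt,
        'Respond ONLY with JSON: {"score": 0 or 1, "rationale": "<one sentence>"}',
    ]
    return "\n".join(lines)

prompt = build_labeling_prompt(
    construct="NURSE: Naming",
    definition="Clinician explicitly names the patient's emotion.",
    include=["Direct emotion words ('you seem worried')"],
    exclude=["Generic reassurance without naming the emotion"],
    borderline=["'That sounds hard' (names difficulty, not an emotion)"],
    excerpt="PT: I can't stop thinking about the biopsy. DR: You sound anxious.",
)
```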
PHI De-identification First
NER-style PHI masking was applied before scaling to real transcripts — a prerequisite for ethical reuse and further scaling.
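A minimal regex-based sketch of the masking step; the patterns below are illustrative stand-ins for the NER-style approach, not a complete de-identification solution:

```python
import re

# Illustrative PHI patterns only; a production pipeline would use a clinical
# NER de-identification model rather than regexes alone.
PHI_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def mask_phi(text: str) -> str:
    """Replace each matched PHI span with a bracketed category placeholder."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

masked = mask_phi("Pt seen 03/14/2024, MRN 12345678, call 617-555-0199.")
# masked == "Pt seen [DATE], [MRN], call [PHONE]."
```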
Simulated → Real Pivot
Synthetic scripts contain explicit structure that real encounters lack. We incorporated VHA 4C real encounters as the realism anchor.
Prompt Engineering for Measurement
Designed prompts that function as measurement instruments, not generation prompts. Each included rubric, boundary conditions, and required structured output.
Evidence: Scalable labeling of ~985 conversations with construct-level precision.
LLM-Assisted Labeling at Scale
Used LLMs as scalable annotators constrained by rubric-grounded prompts.
Evidence: ~985 conversations × 11 constructs labeled, replacing months of manual work.
Transformer Fine-Tuning (BERT)
Fine-tuned BERT encoder for per-construct classification with probability outputs.
Evidence: Near-zero inference cost replacing expensive LLM API calls.
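The "audit-friendly probability outputs" can be sketched as post-processing over per-construct logits; the construct names, threshold, and borderline band below are assumptions for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def audit_record(per_construct_logits, threshold=0.5):
    """Emit probability + decision per construct, flagging borderline calls
    so a human reviewer can audit them instead of trusting a hard label."""
    record = {}
    for construct, logits in per_construct_logits.items():
        p_present = softmax(logits)[1]  # index 1 = "behavior present"
        record[construct] = {
            "p_present": round(p_present, 3),
            "label": int(p_present >= threshold),
            "borderline": abs(p_present - threshold) < 0.1,
        }
    return record

record = audit_record({"naming": [0.0, 0.0], "supporting": [-2.0, 2.0]})
```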
What We Built and Learned
The project produced a reusable labeling pipeline, a construct-wise labeled dataset, and a trained classifier — while surfacing critical lessons about the limits of LLM-assisted annotation.
Construct-wise labels (Calgary 6-step + NURSE 5-step) across aggregated corpus.
Each with explicit scoring criteria, rubric boundaries, and exemplar annotations.
End-to-end rubric → seed → LLM → BERT pipeline transferable to other domains.
Per-construct probabilities enabling audit-friendly formative feedback.
Successes
We operationalized two widely used communication frameworks into measurable constructs with explicit scoring criteria — reducing subjectivity and enabling scalable, reusable labeling.
— Engineering + Measurement Achievement
Rubric Operationalization
Translated Calgary-Cambridge and NURSE into construct-level scoring rubrics with borderline examples — reducing subjective interpretation.
Scalable Pipeline
Created a labeling pipeline that can be reused by future co-ops and extended to other communication domains beyond internal medicine.
Known Gaps
Gold-Standard Validation
Stronger human-annotation validation sets and slice-based fairness checks are needed before any higher-stakes use.
Cross-Source Generalization
Per-construct performance across simulated vs. real data sources needs explicit evaluation and reporting.
What We Recommend Not Doing
ACGME explicitly warns that Milestones are an educational, formative assessment tool and were not designed for high-stakes external decisions (credentialing / licensing). Automation can introduce false precision, biased measurement, and misuse risks. These governance constraints shaped our entire framing.
PHI De-identification
Implemented NER-style PHI masking before scaling to real clinical transcripts.
Evidence: Pipeline processes real clinical data while maintaining HIPAA compliance.
Don't Present Output as a "Milestone Score"
Never make automated output determinative for advancement or credentialing without full governance review and validation. ACGME Milestones were not designed for external high-stakes use.
Don't Rely Only on Simulated Data
Simulated/role-play corpora cannot fully represent clinical realism — the dataset authors themselves acknowledge this limitation. Real-encounter data is essential for validity.
Don't Skip Label Quality Monitoring
Training BERT on raw LLM labels can cause instability and training plateaus. Monitor training variance and consider entropy filtering, ensembling, or human fallback for ambiguous samples.
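A sketch of the entropy-filtering idea, under the assumption that each sample carries an estimated P(label = 1) (e.g., from ensemble votes); the cutoff is an illustrative choice, not a project setting:

```python
import math

def label_entropy(p: float) -> float:
    """Binary entropy (in bits) of the estimated P(label = 1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def filter_confident(samples, max_entropy=0.7):
    """Keep samples whose label confidence passes the entropy cutoff;
    route the rest to a human-annotation fallback queue."""
    keep, fallback = [], []
    for text, p in samples:
        (keep if label_entropy(p) <= max_entropy else fallback).append((text, p))
    return keep, fallback

keep, fallback = filter_confident([("clear case", 0.95), ("ambiguous case", 0.55)])
```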
Don't Report Only "Overall Accuracy"
Communication constructs are sparse and imbalanced. Report per-construct precision/recall/F1, calibration metrics (ECE), and run-to-run variance to detect instability.
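A minimal sketch of the recommended reporting: per-construct precision/recall/F1 plus a simple expected calibration error (ECE), in plain Python; the bin count and 0.5 threshold are conventional defaults, not project settings:

```python
def prf(y_true, y_pred):
    """Precision, recall, and F1 for a single binary construct."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(p == 1 and t == 0 for t, p in zip(y_true, y_pred))
    fn = sum(p == 0 and t == 1 for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def ece(y_true, y_prob, n_bins=10):
    """Expected calibration error: |accuracy - confidence| weighted by bin size,
    where confidence is the probability of the predicted class."""
    conf = [p if p >= 0.5 else 1 - p for p in y_prob]
    pred = [int(p >= 0.5) for p in y_prob]
    total, err = len(y_true), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(conf)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(conf[i] for i in idx) / len(idx)
        acc = sum(y_true[i] == pred[i] for i in idx) / len(idx)
        err += len(idx) / total * abs(acc - avg_conf)
    return err
```

Reporting these per construct (rather than one pooled accuracy) is what exposes sparse-class failures and miscalibration.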
Core Limitations
Text-Only Signal
Transcripts omit nonverbal cues, physical examination actions, and many contextual behaviors critical to communication quality.
Fairness Risks
Language style varies across cultures, dialects, and literacy levels. Automated scoring must be audited by demographic slices and clinical context.
Label Noise Propagation
LLM labels are not gold. Error propagation from LLM→BERT can cause instability; mitigation only partially closes the performance gap.
What I'd Do Differently
Build human validation sets earlier.
Evolved thinking: "The rubric IS the product, not the classifier" — weeks of rubric iteration paid off more than any model tuning.