NLP & UX Research

Measuring Patient-Centered Communication with NLP

Role: Data Scientist Co-op
Team: I/O Psychologists, Data Scientist, and AI/ML Engineer
Timeline: Co-Op (6 months)
Overview

Why Communication Measurement Matters — and Why It's Hard

Skill Constellation

Primary

NLP Pipeline Engineering · Transformer Fine-Tuning (BERT) · Rubric Design

Supporting

Prompt Engineering · LLM-Assisted Labeling · Data Curation

Emerging

Research Ethics & AI Governance · PHI De-identification

The American Board of Internal Medicine (ABIM) exists to certify physicians who demonstrate knowledge, skills, and attitudes essential for excellent patient care. Communication behaviors — the "skills and attitudes in action" — are the hardest competency to quantify at scale.

ABMS standards require boards to assess Interpersonal & Communication Skills and Professionalism as core competencies. But even "gold standard" human rating systems derived from Calgary-Cambridge show measurable rater bias, drift, and order effects. Manual measurement simply doesn't scale — and it isn't purely objective.

Our mission: build a scalable, rubric-grounded NLP pipeline that measures patient-centered communication behaviors across real clinical transcripts — framed as formative feedback, never as deterministic scoring.

Approach

Rubric-grounded annotation → LLM-assisted labeling → BERT classifier training, aligned to Calgary-Cambridge and NURSE frameworks.

Frameworks

Calgary-Cambridge Guide (6-step consultation model) and NURSE (Naming, Understanding, Respecting, Supporting, Exploring) for empathic communication.

Manual annotation consistency

Baseline: Variable ICC
Target: Rubric-anchored

Labeling scalability

Baseline: ~50 convos/week
Target: 985 convos labeled

Cost per inference

Baseline: LLM API costs
Target: BERT (near-zero)
The Craft

From Rubric to Scalable Classifier

We started with simulated and role-play transcript sources because they're accessible and commonly used in clinical NLP benchmarking. But we pivoted toward real clinical encounters to improve ground truth realism and reduce "too-clean" synthetic behavior patterns.

Skill Spotlight

Rubric Design

Translated two clinical communication frameworks into construct-level scoring rubrics with inclusion/exclusion criteria and borderline examples.

Evidence: 11 distinct construct rubrics reducing subjective interpretation.

The Pipeline

Each step protects a specific standard: the rubric protects construct validity, seed labels protect interpretability, LLM pass provides scale, and BERT provides reproducibility at near-zero cost.

1

Rubric Design

Construct-specific scoring criteria aligned to Calgary-Cambridge and NURSE frameworks.

2

Seed Labels

8–10 exemplar excerpts per construct, manually annotated in Label Studio.

3

Prompt Engineering

Construct-specific prompts with rubric definition, inclusion/exclusion criteria, and structured output.

4

LLM Labeling

Scale annotation across ~985 conversations, one construct at a time to reduce contamination.

5

BERT Training

Fine-tuned encoder for cheap, scalable inference with audit-friendly probability outputs.
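The "audit-friendly probability outputs" in step 5 can be sketched as follows. This is a minimal, hypothetical illustration of turning a per-construct classifier's raw logits into probabilities plus a human-review flag — the construct name, labels, and 0.7 threshold are illustrative choices, not the project's actual values:

```python
import math

def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def audit_record(construct, logits, labels=("absent", "present")):
    """Build an audit-friendly record: probabilities, not a verdict."""
    probs = softmax(logits)
    return {
        "construct": construct,
        "probabilities": dict(zip(labels, probs)),
        # Formative framing: flag low-confidence cases for human review
        "needs_review": max(probs) < 0.7,
    }

record = audit_record("NURSE: Naming", [0.2, 1.4])
```

Exposing probabilities rather than a single score keeps the output formative: reviewers can see how confident the model was and route borderline cases back to humans.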

Dataset Sources

We aggregated ~985 conversations from four complementary sources, balancing real-world realism with benchmark coverage:

| Source | Real vs Simulated | Contribution | Labeling Implication |
| --- | --- | --- | --- |
| ACI-BENCH | Mixed simulation / role-play | Benchmark-style transcripts with clinical documentation behaviors | Can introduce "dictation" language that confounds Calgary constructs |
| VHA 4C Lineage | Real recorded encounters | Real-world primary care dynamics and patient context clues | Strongest realism anchor; supports patient-centered measurement framing |
| OSCE Simulated Interviews | Simulated (audio + transcripts) | High-quality respiratory-focus transcripts, domain-labeled | Good for NER and general NLP; limited vs real-world nuance |
| PriMock57 | Simulated mock primary care | Multi-artifact dataset (audio, transcripts, notes, eval) | Useful for benchmarking communication-to-note pipelines |
Skill Spotlight

Data Curation & Source Evaluation

Evaluated 4 transcript corpora across realism, domain coverage, and labeling implications. Made strategic call to anchor on real-encounter data.

Evidence: Dataset sources table; simulated→real pivot decision documented.

Key Methodological Decisions

🎯

One Construct at a Time

We labeled each Calgary or NURSE construct independently across all encounters — simpler prompts, less cross-construct contamination.

📐

Prompts as Measurement Tools

Each prompt included: rubric definition, inclusion criteria, exclusion criteria, borderline examples, and required structured output (score + rationale).
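A prompt of this shape can be assembled programmatically from rubric components. The sketch below is illustrative — the rubric wording, field names, and output schema are placeholders, not the project's actual prompts:

```python
PROMPT_TEMPLATE = """You are scoring one communication construct in a clinical transcript.

Construct: {name}
Definition: {definition}
Include (counts as the behavior): {inclusion}
Exclude (does NOT count): {exclusion}
Borderline example: {borderline}

Transcript excerpt:
{excerpt}

Respond with JSON only: {{"score": 0 or 1, "rationale": "<one sentence citing the excerpt>"}}"""

def build_prompt(rubric: dict, excerpt: str) -> str:
    """Fill the measurement prompt with one construct's rubric and one excerpt."""
    return PROMPT_TEMPLATE.format(excerpt=excerpt, **rubric)

# Illustrative rubric entry (not the project's actual wording)
rubric = {
    "name": "NURSE: Naming",
    "definition": "Clinician explicitly names the patient's emotion.",
    "inclusion": "Direct emotion labels, e.g. 'You seem worried.'",
    "exclusion": "Generic reassurance without naming an emotion.",
    "borderline": "'That sounds hard' (names difficulty, not a specific emotion).",
}
prompt = build_prompt(rubric, "Patient: I can't stop thinking about the results.")
```

Requiring a rationale alongside the score makes each LLM label auditable, and the fixed JSON schema makes the output parseable at scale.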

🔒

PHI De-identification First

NER-style PHI masking applied before scaling to real transcripts — a prerequisite to reuse and ethical scaling.
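As a simplified sketch of the masking step, a few identifier patterns can be replaced with category placeholders before any transcript leaves the pipeline. The project used NER-style masking; the regexes below are a toy stand-in and do not cover all HIPAA Safe Harbor identifier categories:

```python
import re

# Simplified sketch: production de-identification should use a trained clinical
# NER model and cover every HIPAA Safe Harbor identifier category.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def mask_phi(text: str) -> str:
    """Replace matched identifiers with category placeholders before labeling."""
    for pattern, placeholder in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

masked = mask_phi("Call me at 555-867-5309 before 3/14/2024.")
```

Keeping category placeholders (rather than deleting spans outright) preserves sentence structure, which matters when the downstream task is scoring communication behaviors.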

🔄

Simulated → Real Pivot

Synthetic scripts contain explicit structure that real encounters lack. We incorporated VHA 4C real encounters as the realism anchor.

Skill Spotlight

Prompt Engineering for Measurement

Designed prompts that function as measurement instruments, not generation prompts. Each included rubric, boundary conditions, and required structured output.

Evidence: Scalable labeling of 985 conversations with construct-level precision.

Skill Spotlight

LLM-Assisted Labeling at Scale

Used LLMs as scalable annotators constrained by rubric-grounded prompts.

Evidence: 985 conversations × 11 constructs labeled, replacing months of manual work.

Skill Spotlight

Transformer Fine-Tuning (BERT)

Fine-tuned BERT encoder for per-construct classification with probability outputs.

Evidence: Near-zero inference cost replacing expensive LLM API calls.

The Evidence

What We Built and Learned

The project produced a reusable labeling pipeline, a construct-wise labeled dataset, and a trained classifier — while surfacing critical lessons about the limits of LLM-assisted annotation.

985 Conversations Labeled

Construct-wise labels (Calgary 6-step + NURSE 5-step) across aggregated corpus.

11 Communication Constructs

Each with explicit scoring criteria, rubric boundaries, and exemplar annotations.

Pipeline: Reusable Framework

End-to-end rubric → seed → LLM → BERT pipeline transferable to other domains.

BERT: Trained Classifier

Per-construct probabilities enabling audit-friendly formative feedback.

Successes

We operationalized two widely used communication frameworks into measurable constructs with explicit scoring criteria — reducing subjectivity and enabling scalable, reusable labeling.

Engineering + Measurement Achievement

Rubric Operationalization

Translated Calgary-Cambridge and NURSE into construct-level scoring rubrics with borderline examples — reducing subjective interpretation.

🔁

Scalable Pipeline

Created a labeling pipeline that can be reused by future co-ops and extended to other communication domains beyond internal medicine.

Known Gaps

🔍

Gold-Standard Validation

Stronger human-annotation validation sets and slice-based fairness checks are needed before any higher-stakes use.

📊

Cross-Source Generalization

Per-construct performance across simulated vs. real data sources needs explicit evaluation and reporting.
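The evaluation suggested here amounts to slicing metrics by data source as well as by construct. A minimal sketch with illustrative records (source, construct, predicted, gold) — not project data:

```python
from collections import defaultdict

def slice_accuracy(records):
    """Accuracy per (source, construct) slice to expose generalization gaps."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for source, construct, predicted, gold in records:
        key = (source, construct)
        totals[key] += 1
        hits[key] += int(predicted == gold)
    return {key: hits[key] / totals[key] for key in totals}

# Illustrative records only: a model that looks fine on simulated data
# can still degrade on real encounters for the same construct.
records = [
    ("simulated", "Naming", 1, 1),
    ("simulated", "Naming", 0, 0),
    ("real", "Naming", 1, 0),
    ("real", "Naming", 1, 1),
]
slices = slice_accuracy(records)
```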

The Growth

What We Recommend Not Doing

ACGME explicitly warns that Milestones are an educational, formative assessment tool and were not designed for high-stakes external decisions (credentialing / licensing). Automation can introduce false precision, biased measurement, and misuse risks. These governance constraints shaped our entire framing.

Skill Spotlight

PHI De-identification

Implemented NER-style PHI masking before scaling to real clinical transcripts.

Evidence: Pipeline processes real clinical data while maintaining HIPAA compliance.

🚫

Don't Present Output as a "Milestone Score"

Never make automated output determinative for advancement or credentialing without full governance review and validation. ACGME Milestones were not designed for external high-stakes use.

⚠️

Don't Rely Only on Simulated Data

Simulated/roleplay corpora cannot fully represent clinical realism — the datasets themselves acknowledge this limitation. Real-encounter data is essential for validity.

🔬

Don't Skip Label Quality Monitoring

Training BERT on raw LLM labels can cause instability and training plateaus. Monitor training variance and consider entropy filtering, ensembling, or human fallback for ambiguous samples.
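The entropy filter mentioned above can be sketched as routing high-entropy (ambiguous) LLM label distributions to human review instead of the BERT training set. The 0.9-bit threshold below is an illustrative choice, not a project parameter:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a label probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def split_by_confidence(samples, max_bits=0.9):
    """Keep confident LLM labels for training; send ambiguous ones to humans."""
    train, review = [], []
    for text, label_probs in samples:
        (review if entropy(label_probs) > max_bits else train).append(text)
    return train, review

samples = [
    ("Clear empathic statement", [0.95, 0.05]),  # low entropy -> train
    ("Ambiguous small talk", [0.55, 0.45]),      # high entropy -> review
]
train, review = split_by_confidence(samples)
```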

📉

Don't Report Only "Overall Accuracy"

Communication constructs are sparse and imbalanced. Report per-construct precision/recall/F1, calibration metrics (ECE), and run-to-run variance to detect instability.
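The calibration metric named above, Expected Calibration Error, bins predictions by confidence and averages the gap between each bin's accuracy and its mean confidence. A minimal sketch with toy data (10 bins is a common default):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |accuracy - confidence| across confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece, n = 0.0, len(confidences)
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Toy perfectly calibrated case: 0.75-confidence predictions, 75% correct
ece = expected_calibration_error([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0])
```

Reporting ECE per construct, alongside precision/recall/F1, helps detect a model that is accurate on average but overconfident on the sparse constructs.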

Core Limitations

🎙️

Text-Only Signal

Transcripts omit nonverbal cues, physical examination actions, and many contextual behaviors critical to communication quality.

⚖️

Fairness Risks

Language style varies across cultures, dialects, and literacy levels. Automated scoring must be audited by demographic slices and clinical context.

🔄

Label Noise Propagation

LLM labels are not gold. Error propagation from LLM→BERT can cause instability; mitigation only partially closes the performance gap.

What I'd Do Differently

Build human validation sets earlier, in parallel with LLM labeling rather than after it.

Evolved thinking: "The rubric IS the product, not the classifier" — weeks of rubric iteration paid off more than any model tuning.

This case study documents NLP research conducted during an ABIM Co-Op. The project is framed as formative, developmental research — not as a validated clinical assessment tool. All governance recommendations follow ACGME Milestone usage guidelines.