Detecting Healthcare Bias with NLP

Building Machine Learning Pipelines to Surface What Clinicians Can't See at Scale
The American Board of Internal Medicine certifies over 300,000 physicians. Part of its mission involves evaluating how physicians communicate, reason, and make decisions. Human reviewers, however, are susceptible to unconscious biases: structural bias, clinical stigma, and assumptions baked into medical training.
The challenge is that these biases are nearly invisible at an individual level. It's only by analyzing thousands of records across hundreds of clinicians that patterns emerge. I built a system to detect these patterns reliably, automatically, and at scale.
Skills: NLP Pipeline Engineering, Transformer Fine-Tuning, Gold-Standard Annotation, LLM Evaluation, Synthetic Data Generation
Stack: Python, PyTorch, Hugging Face, Label Studio, ClinicalBERT, RoBERTa, GPT-family LLMs
Why This Project Exists
Assessment of clinical communication quality previously relied on manual review. This process was subjective, slow, and inconsistent. Without a shared classification system, tracking patterns or measuring the prevalence of bias was nearly impossible.
The question wasn't whether bias exists in healthcare communication—it was: can we build a system that detects it reliably, automatically, and at the scale ABIM needs to act on it?
What I Built and Why
Experiment 1: NLP Bias Detection
We defined a 4-label classification framework, developed iteratively with domain experts and validated against the clinical literature.
Bias Taxonomy Design
Created domain-specific 4-label framework operationalizing abstract bias concepts into measurable text signals.
Evidence: Categories not found in existing NLP bias benchmarks.
| Label | What It Captures | Example Signal |
|---|---|---|
| Structural Bias | Systemic patterns reflecting institutional or socioeconomic inequities | Assumptions about treatment adherence based on patient demographics |
| Clinical Stigma | Language or framing that reflects prejudice toward specific diagnoses or patient populations | Dismissive tone toward patients with substance use disorders or mental health conditions |
| Diagnostic Framing Bias | Asymmetric language when describing similar clinical presentations across different patient groups | Different urgency or thoroughness in workup descriptions based on patient characteristics |
| No Bias Detected | Communication that meets equitable standards | Consistent, patient-centred language regardless of demographics |
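The taxonomy above maps directly onto a classifier's label schema. A minimal sketch, assuming the snake_case names and integer ids shown here (only the label names come from the table; the ids and helper are illustrative):

```python
# The 4-label bias taxonomy as the id/label mappings a classification
# head expects. Integer ids are arbitrary but must stay fixed across
# training, evaluation, and annotation tooling.
LABELS = [
    "structural_bias",
    "clinical_stigma",
    "diagnostic_framing_bias",
    "no_bias_detected",
]

label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for name, i in label2id.items()}

def encode(label: str) -> int:
    """Map a human-readable label to its training id."""
    return label2id[label]
```

Pinning the schema in one place keeps annotation exports, model configs, and evaluation scripts in agreement.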
Model Selection: ClinicalBERT vs. RoBERTa
I fine-tuned and compared ClinicalBERT and RoBERTa. Both performed well on binary tasks; the differentiation appeared in granular 4-label classification. ClinicalBERT handled dense medical terminology better, while RoBERTa was more consistently robust across categories.
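That comparison turns on per-class rather than aggregate metrics. A minimal sketch of the per-class F1 computation used to compare two models on the 4-label task (the label values and predictions below are toy data, not project results):

```python
def per_class_f1(y_true, y_pred, labels):
    """Per-class F1 for a single-label classification task."""
    scores = {}
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores[lab] = 2 * tp / denom if denom else 0.0
    return scores

# Toy comparison: identical accuracy can hide very different per-class
# behaviour, which is exactly where the two models diverged.
gold = ["stigma", "stigma", "structural", "structural"]
model_a = ["stigma", "structural", "structural", "structural"]
f1_a = per_class_f1(gold, model_a, ["stigma", "structural"])
```

In practice the same computation is available as `sklearn.metrics.f1_score(..., average=None)`; it is spelled out here to make the per-class framing explicit.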
Gold-Standard Annotation Protocol
Built annotation guidelines with inclusion/exclusion criteria, borderline examples, and inter-annotator agreement measures.
Evidence: Protocol designed for reproducibility by future researchers.
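Inter-annotator agreement was the gate on guideline quality. A self-contained sketch of Cohen's kappa for two annotators (pure Python; `sklearn.metrics.cohen_kappa_score` gives the same result):

```python
def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    who labelled the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Expected agreement from each annotator's label marginals.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa near zero despite high raw agreement is the classic sign of a skewed label distribution, which is why it is preferred over raw percent agreement for protocols like this one.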
Experiment 2: Communication Behaviour Detection
Beyond bias, we needed to detect evidence-based communication behaviours defined by frameworks like the Calgary-Cambridge Framework, NURSE Protocol, and SHARE Approach. I built supervised models to identify these structural patterns in doctor-patient dialogues.
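A rule-based cue baseline is a common sanity check before training supervised models for this kind of task. A sketch with two NURSE-style empathy behaviours; the cue phrases and behaviour names are hypothetical, and the real system used fine-tuned transformers, not rules:

```python
import re

# Hypothetical cue phrases for two empathy behaviours. A keyword baseline
# like this sets the floor that supervised models must beat.
BEHAVIOUR_CUES = {
    "naming_emotion": [
        r"\byou seem (worried|upset|anxious)\b",
        r"\bthat sounds (really )?(hard|frustrating|scary)\b",
    ],
    "support": [
        r"\bwe('ll| will) (get through|work on) this together\b",
        r"\bI('m| am) here (for you|to help)\b",
    ],
}

def detect_behaviours(utterance: str) -> list:
    """Return the behaviours whose cue patterns match the utterance."""
    return [
        behaviour
        for behaviour, patterns in BEHAVIOUR_CUES.items()
        if any(re.search(p, utterance, re.IGNORECASE) for p in patterns)
    ]
```

The baseline's weakness is also the motivation for the supervised models: empathy and framing rarely reduce to fixed phrases.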
Experiment 3: LLM Evaluation Framework
Before trusting LLMs with internal tasks like SQL generation, we needed to quantify their accuracy. I built a pipeline measuring hallucination rates, logic errors, and schema fidelity across varying query complexities.
LLM Evaluation
Systematically evaluated LLM outputs against human gold-standard labels, measuring category-specific failure modes.
Evidence: Quantified accuracy revealing where automated labeling fails.
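Schema fidelity is the most mechanical of those checks: does a generated SQL query only reference objects that exist? A deliberately shallow sketch (the table names are illustrative, not ABIM's; the real pipeline also verified columns, joins, and result correctness against gold queries):

```python
import re

# Toy schema: table name -> set of column names.
SCHEMA = {
    "encounters": {"encounter_id", "physician_id", "note_text"},
    "physicians": {"physician_id", "specialty"},
}

def schema_violations(sql: str) -> list:
    """Flag tables referenced after FROM/JOIN that are not in the schema.

    A hallucinated table name is an unambiguous, cheap-to-detect LLM
    failure mode, which makes it a useful first evaluation axis.
    """
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]+)", sql, re.IGNORECASE)
    return [t for t in tables if t.lower() not in SCHEMA]

good = ("SELECT note_text FROM encounters "
        "JOIN physicians ON encounters.physician_id = physicians.physician_id")
bad = "SELECT * FROM patient_scores"  # hallucinated table
```

Running the same check over queries binned by complexity is what surfaces where hallucination rates climb.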
Experiment 4: Synthetic Data Engineering
To work within healthcare privacy constraints (HIPAA), I designed an automated synthetic-data loop that generates high-quality clinical dialogues for training without touching real PHI.
Synthetic Data Generation
Engineered synthetic clinical transcripts to augment scarce training data.
Evidence: Addressed class imbalance without touching real PHI.
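The loop's structure is generate, filter, keep. A runnable skeleton in which the LLM call and quality filter are stubs (in the real loop these were a prompted GPT-family model and classifier/PHI-scan filters; everything named here is a placeholder):

```python
import random

def generate_dialogue(label: str, seed: int) -> str:
    """Stand-in for the LLM call that drafts a synthetic dialogue
    conditioned on a target taxonomy label."""
    rng = random.Random(seed)
    turns = rng.randint(3, 12)
    return f"[{label}] synthetic dialogue with {turns} turns"

def passes_quality_filter(dialogue: str) -> bool:
    """Placeholder for the real filters: length checks, label-consistency
    classification, and PHI pattern scans."""
    return "synthetic" in dialogue

def synthetic_loop(label: str, n_needed: int) -> list:
    """Generate candidates until n_needed pass the quality filter.

    Conditioning on the minority label is how the loop addresses
    class imbalance."""
    kept, seed = [], 0
    while len(kept) < n_needed:
        candidate = generate_dialogue(label, seed)
        seed += 1
        if passes_quality_filter(candidate):
            kept.append(candidate)
    return kept
```

Because generation is conditioned on the label, the loop can be pointed at whichever class is scarcest.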
The Interconnected Infrastructure
Ensemble Model Architecture
Combined domain-specific ClinicalBERT with general-purpose RoBERTa.
Evidence: 4-experiment design isolating detection, behavior, LLM quality, data.
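One minimal way to combine the two models is late fusion over their output logits. A sketch assuming aligned 4-class logit vectors from each model; the 0.5 weight is illustrative and would be tuned on a validation split:

```python
def ensemble_predict(logits_clinicalbert, logits_roberta, weight=0.5):
    """Weighted average of two models' logits; argmax gives the label id.

    `weight` is the share given to ClinicalBERT. Averaging raw logits is
    the simplest fusion; averaging softmax probabilities is a common
    alternative when the models' logit scales differ.
    """
    combined = [
        weight * a + (1 - weight) * b
        for a, b in zip(logits_clinicalbert, logits_roberta)
    ]
    return max(range(len(combined)), key=combined.__getitem__)
```

Late fusion keeps each model's strengths intact: ClinicalBERT's vote dominates on terminology-dense passages only when its logits are confident.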
[Architecture diagram: Bias Detection → Behaviour Classifier → Evaluation Engine → Synthetic Loop]
Trust Through Transparency
Every design decision adhered to HIPAA Compliance, Common Rule/IRB standards, and Belmont Principles. The system is designed to recommend and surface patterns for human review, not to act as a black-box decision maker.
Evidence-Based Impact
Thousands of records processed automatically versus ~50 manually.
Reproducible classifications across all cohorts.
Unlimited iteration via synthetic data engineering.
Documented error and hallucination rates for LLMs.
Key Learnings
Taxonomy design is the hardest part. Weeks of iteration with domain experts on the framework paid off more than model tuning did.
Healthcare AI must earn trust. Confidence scores and human-in-the-loop review are not optional features; they are foundational requirements for clinical adoption.
This project documents research conducted at ABIM. All publicly shared artifacts use synthetic data.