Detecting Healthcare Bias with NLP

Building Machine Learning Pipelines to Surface What Clinicians Can't See at Scale
The American Board of Internal Medicine certifies over 300,000 physicians. Part of its mission involves evaluating how physicians communicate, reason, and make decisions. Human reviewers, however, are susceptible to unconscious biases: structural bias, clinical stigma, and assumptions baked into medical training.
The challenge is that these biases are nearly invisible at an individual level. It's only by analyzing thousands of records across hundreds of clinicians that patterns emerge. I built a system to detect these patterns reliably, automatically, and at scale.
Skills: NLP Pipeline Engineering, Transformer Fine-Tuning, Gold-Standard Annotation, LLM Evaluation, Synthetic Data Generation
Stack: Python, PyTorch, Hugging Face, Label Studio, ClinicalBERT, RoBERTa, GPT-family LLMs
Why This Project Exists
Assessment of clinical communication quality previously relied on manual review. This process was subjective, slow, and inconsistent. Without a shared classification system, tracking patterns or measuring the prevalence of bias was nearly impossible.
The question wasn't whether bias exists in healthcare communication—it was: can we build a system that detects it reliably, automatically, and at the scale ABIM needs to act on it?
What I Built and Why
Experiment 1: NLP Bias Detection
We defined a 4-label classification framework, developed iteratively with domain experts and validated against the clinical literature.
Bias Taxonomy Design
Created domain-specific 4-label framework operationalizing abstract bias concepts into measurable text signals.
Evidence: Categories not found in existing NLP bias benchmarks.
| Label | What It Captures | Example Signal |
|---|---|---|
| Structural Bias | Systemic patterns reflecting institutional or socioeconomic inequities | Assumptions about treatment adherence based on patient demographics |
| Clinical Stigma | Language or framing that reflects prejudice toward specific diagnoses or patient populations | Dismissive tone toward patients with substance use disorders or mental health conditions |
| Diagnostic Framing Bias | Asymmetric language when describing similar clinical presentations across different patient groups | Different urgency or thoroughness in workup descriptions based on patient characteristics |
| No Bias Detected | Communication that meets equitable standards | Consistent, patient-centred language regardless of demographics |
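The taxonomy above maps directly onto a classifier's label schema. A minimal sketch, assuming the snake_case names and integer ids shown here (only the label names come from the table; the ids and helper are illustrative):

```python
# The 4-label bias taxonomy as the id/label mappings a classification
# head expects. Integer ids are arbitrary but must stay fixed across
# training, evaluation, and annotation tooling.
LABELS = [
    "structural_bias",
    "clinical_stigma",
    "diagnostic_framing_bias",
    "no_bias_detected",
]

label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for name, i in label2id.items()}

def encode(label: str) -> int:
    """Map a human-readable label to its training id."""
    return label2id[label]
```

Pinning the schema in one place keeps annotation exports, model configs, and evaluation scripts in agreement.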
Model Selection: ClinicalBERT vs. RoBERTa
I fine-tuned and compared ClinicalBERT and RoBERTa. Both performed well on binary tasks; the differentiation appeared in granular 4-label classification. ClinicalBERT handled dense medical terminology better, while RoBERTa was more consistently robust across categories.
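That comparison turns on per-class rather than aggregate metrics. A minimal sketch of the per-class F1 computation used to compare two models on the 4-label task (the label values and predictions below are toy data, not project results):

```python
def per_class_f1(y_true, y_pred, labels):
    """Per-class F1 for a single-label classification task."""
    scores = {}
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores[lab] = 2 * tp / denom if denom else 0.0
    return scores

# Toy comparison: identical accuracy can hide very different per-class
# behaviour, which is exactly where the two models diverged.
gold = ["stigma", "stigma", "structural", "structural"]
model_a = ["stigma", "structural", "structural", "structural"]
f1_a = per_class_f1(gold, model_a, ["stigma", "structural"])
```

In practice the same computation is available as `sklearn.metrics.f1_score(..., average=None)`; it is spelled out here to make the per-class framing explicit.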
Gold-Standard Annotation Protocol
Built annotation guidelines with inclusion/exclusion criteria, borderline examples, and inter-annotator agreement measures.
Evidence: Protocol designed for reproducibility by future researchers.
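Inter-annotator agreement was the gate on guideline quality. A self-contained sketch of Cohen's kappa for two annotators (pure Python; `sklearn.metrics.cohen_kappa_score` gives the same result):

```python
def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    who labelled the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Expected agreement from each annotator's label marginals.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa near zero despite high raw agreement is the classic sign of a skewed label distribution, which is why it is preferred over raw percent agreement for protocols like this one.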
Experiment 2: Communication Behaviour Detection
Beyond bias, we needed to detect evidence-based communication behaviours defined by frameworks like the Calgary-Cambridge Framework, NURSE Protocol, and SHARE Approach. I built supervised models to identify these structural patterns in doctor-patient dialogues.
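A rule-based cue baseline is a common sanity check before training supervised models for this kind of task. A sketch with two NURSE-style empathy behaviours; the cue phrases and behaviour names are hypothetical, and the real system used fine-tuned transformers, not rules:

```python
import re

# Hypothetical cue phrases for two empathy behaviours. A keyword baseline
# like this sets the floor that supervised models must beat.
BEHAVIOUR_CUES = {
    "naming_emotion": [
        r"\byou seem (worried|upset|anxious)\b",
        r"\bthat sounds (really )?(hard|frustrating|scary)\b",
    ],
    "support": [
        r"\bwe('ll| will) (get through|work on) this together\b",
        r"\bI('m| am) here (for you|to help)\b",
    ],
}

def detect_behaviours(utterance: str) -> list:
    """Return the behaviours whose cue patterns match the utterance."""
    return [
        behaviour
        for behaviour, patterns in BEHAVIOUR_CUES.items()
        if any(re.search(p, utterance, re.IGNORECASE) for p in patterns)
    ]
```

The baseline's weakness is also the motivation for the supervised models: empathy and framing rarely reduce to fixed phrases.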
Experiment 3: LLM Evaluation Framework
Before trusting LLMs with internal tasks like SQL generation, we needed to quantify their accuracy. I built a pipeline measuring hallucination rates, logic errors, and schema fidelity across varying query complexities.
LLM Evaluation
Systematically evaluated LLM outputs against human gold-standard labels, measuring category-specific failure modes.
Evidence: Quantified accuracy revealing where automated labeling fails.
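Schema fidelity is the most mechanical of those checks: does a generated SQL query only reference objects that exist? A deliberately shallow sketch (the table names are illustrative, not ABIM's; the real pipeline also verified columns, joins, and result correctness against gold queries):

```python
import re

# Toy schema: table name -> set of column names.
SCHEMA = {
    "encounters": {"encounter_id", "physician_id", "note_text"},
    "physicians": {"physician_id", "specialty"},
}

def schema_violations(sql: str) -> list:
    """Flag tables referenced after FROM/JOIN that are not in the schema.

    A hallucinated table name is an unambiguous, cheap-to-detect LLM
    failure mode, which makes it a useful first evaluation axis.
    """
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]+)", sql, re.IGNORECASE)
    return [t for t in tables if t.lower() not in SCHEMA]

good = ("SELECT note_text FROM encounters "
        "JOIN physicians ON encounters.physician_id = physicians.physician_id")
bad = "SELECT * FROM patient_scores"  # hallucinated table
```

Running the same check over queries binned by complexity is what surfaces where hallucination rates climb.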
Experiment 4: Synthetic Data Engineering
To work within healthcare privacy constraints (HIPAA), I designed an automated synthetic-data loop that generates high-quality clinical dialogues for training without touching real PHI.
Synthetic Data Generation
Engineered synthetic clinical transcripts to augment scarce training data.
Evidence: Addressed class imbalance without touching real PHI.
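The loop's structure is generate, filter, keep. A runnable skeleton in which the LLM call and quality filter are stubs (in the real loop these were a prompted GPT-family model and classifier/PHI-scan filters; everything named here is a placeholder):

```python
import random

def generate_dialogue(label: str, seed: int) -> str:
    """Stand-in for the LLM call that drafts a synthetic dialogue
    conditioned on a target taxonomy label."""
    rng = random.Random(seed)
    turns = rng.randint(3, 12)
    return f"[{label}] synthetic dialogue with {turns} turns"

def passes_quality_filter(dialogue: str) -> bool:
    """Placeholder for the real filters: length checks, label-consistency
    classification, and PHI pattern scans."""
    return "synthetic" in dialogue

def synthetic_loop(label: str, n_needed: int) -> list:
    """Generate candidates until n_needed pass the quality filter.

    Conditioning on the minority label is how the loop addresses
    class imbalance."""
    kept, seed = [], 0
    while len(kept) < n_needed:
        candidate = generate_dialogue(label, seed)
        seed += 1
        if passes_quality_filter(candidate):
            kept.append(candidate)
    return kept
```

Because generation is conditioned on the label, the loop can be pointed at whichever class is scarcest.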
The Interconnected Infrastructure
Ensemble Model Architecture
Combined domain-specific ClinicalBERT with general-purpose RoBERTa.
Evidence: 4-experiment design isolating detection, behavior, LLM quality, data.
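One minimal way to combine the two models is late fusion over their output logits. A sketch assuming aligned 4-class logit vectors from each model; the 0.5 weight is illustrative and would be tuned on a validation split:

```python
def ensemble_predict(logits_clinicalbert, logits_roberta, weight=0.5):
    """Weighted average of two models' logits; argmax gives the label id.

    `weight` is the share given to ClinicalBERT. Averaging raw logits is
    the simplest fusion; averaging softmax probabilities is a common
    alternative when the models' logit scales differ.
    """
    combined = [
        weight * a + (1 - weight) * b
        for a, b in zip(logits_clinicalbert, logits_roberta)
    ]
    return max(range(len(combined)), key=combined.__getitem__)
```

Late fusion keeps each model's strengths intact: ClinicalBERT's vote dominates on terminology-dense passages only when its logits are confident.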
[Architecture diagram: Bias Detection → Behaviour Classifier → Evaluation Engine → Synthetic Loop]
Trust Through Transparency
Every design decision adhered to HIPAA Compliance, Common Rule/IRB standards, and Belmont Principles. The system is designed to recommend and surface patterns for human review, not to act as a black-box decision maker.
Evidence-Based Impact
Thousands of records processed automatically versus ~50 manually.
Reproducible classifications across all cohorts.
Unlimited iteration via synthetic data engineering.
Documented error and hallucination rates for LLMs.
Key Learnings
Taxonomy design is the hardest part. Weeks of iteration with domain experts on the framework paid off more than model tuning did.
Healthcare AI must earn trust. Confidence scores and human-in-the-loop review are not optional features; they are foundational requirements for clinical adoption.
This project documents research conducted at ABIM. All publicly shared artifacts use synthetic data.