Data Science & ML Engineering

Detecting Healthcare Bias with NLP

Role: Data Scientist Co-op
Team: 1 Data Scientist Co-op, 1 Advisor
Timeline: Sept 2025 – Present
Overview

Building Machine Learning Pipelines to Surface What Clinicians Can't See at Scale

Skill Constellation

Primary

NLP Pipeline Design, Gold-Standard Annotation, LLM Evaluation

Supporting

Transformer Fine-Tuning (ClinicalBERT, RoBERTa), Synthetic Data Generation

Emerging

Ethics in AI, HIPAA Compliance, Bias Taxonomy Design

The American Board of Internal Medicine (ABIM) certifies over 300,000 physicians. Part of its mission involves evaluating how physicians communicate, reason, and make decisions. But physicians are human, and human communication carries unconscious biases: structural inequities, clinical stigma, and assumptions baked into medical training.

The challenge is that these biases are nearly invisible at an individual level. It's only by analyzing thousands of records across hundreds of clinicians that patterns emerge. I built a system to detect these patterns reliably, automatically, and at scale.

Methods

NLP Pipeline Engineering, Transformer Fine-Tuning, Gold-Standard Annotation, LLM Evaluation, Synthetic Data Generation

Tools

Python, PyTorch, Hugging Face, Label Studio, ClinicalBERT, RoBERTa, GPT-family LLMs

The Tension

Why This Project Exists

Assessment of clinical communication quality previously relied on manual review. This process was subjective, slow, and inconsistent. Without a shared classification system, tracking patterns or measuring the prevalence of bias was nearly impossible.

The question wasn't whether bias exists in healthcare communication—it was: can we build a system that detects it reliably, automatically, and at the scale ABIM needs to act on it?

The Craft

What I Built and Why

Experiment 1: NLP Bias Detection

We defined a 4-label classification framework, developed iteratively with domain experts and validated against the clinical literature.

Skill Spotlight

Bias Taxonomy Design

Created domain-specific 4-label framework operationalizing abstract bias concepts into measurable text signals.

Evidence: Categories not found in existing NLP bias benchmarks.

Structural Bias
What it captures: Systemic patterns reflecting institutional or socioeconomic inequities
Example signal: Assumptions about treatment adherence based on patient demographics

Clinical Stigma
What it captures: Language or framing that reflects prejudice toward specific diagnoses or patient populations
Example signal: Dismissive tone toward patients with substance use disorders or mental health conditions

Diagnostic Framing Bias
What it captures: Asymmetric language when describing similar clinical presentations across different patient groups
Example signal: Different urgency or thoroughness in workup descriptions based on patient characteristics

No Bias Detected
What it captures: Communication that meets equitable standards
Example signal: Consistent, patient-centred language regardless of demographics
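
One minimal way to encode this taxonomy for model training, in the style of a Hugging Face classifier config. The exact label strings used internally are assumptions; only the four categories come from the framework above.

```python
# Hypothetical machine-readable encoding of the 4-label bias taxonomy.
# The category names mirror the framework; the identifier strings are illustrative.
BIAS_LABELS = [
    "structural_bias",
    "clinical_stigma",
    "diagnostic_framing_bias",
    "no_bias_detected",
]

# Forward and reverse mappings, as expected by most sequence-classification heads.
label2id = {label: i for i, label in enumerate(BIAS_LABELS)}
id2label = {i: label for label, i in label2id.items()}
```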

Model Selection: ClinicalBERT vs. RoBERTa

I fine-tuned and compared ClinicalBERT and RoBERTa. Both performed well on binary bias detection, but they diverged on the granular four-class task: ClinicalBERT excelled on dense medical terminology, while RoBERTa was more consistently robust.
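
A sketch of that head-to-head comparison, assuming a standard Hugging Face fine-tuning setup with pre-tokenized datasets. Checkpoint names, hyperparameters, and the helper name `finetune` are illustrative, not the project's exact configuration.

```python
def finetune(checkpoint: str, train_ds, eval_ds, num_labels: int = 4):
    """Fine-tune one checkpoint on the 4-class bias task and return eval
    metrics. `train_ds`/`eval_ds` are assumed to be pre-tokenized datasets."""
    # Imported inside the function so this sketch can be loaded without
    # transformers installed.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    args = TrainingArguments(
        output_dir=f"out/{checkpoint.replace('/', '_')}",
        learning_rate=2e-5,              # typical BERT-family fine-tuning rate
        num_train_epochs=3,
        per_device_train_batch_size=16,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer.evaluate()

# Compared like-for-like, e.g.:
# finetune("emilyalsentzer/Bio_ClinicalBERT", train_ds, eval_ds)
# finetune("roberta-base", train_ds, eval_ds)
```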

Skill Spotlight

Gold-Standard Annotation Protocol

Built annotation guidelines with inclusion/exclusion criteria, borderline examples, and inter-annotator agreement measures.
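
As a concrete example of the agreement measure, Cohen's kappa between two annotators can be computed directly. This is a self-contained sketch of the statistic itself; the project's annotation tooling (e.g. Label Studio exports) is not shown.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is chance agreement from each annotator's label marginals."""
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    if p_e == 1:          # both annotators used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Example: agreement on 3 of 4 items, chance agreement 0.5 → kappa = 0.5
# cohens_kappa(["bias", "bias", "none", "none"],
#              ["bias", "none", "none", "none"])  # → 0.5
```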

Evidence: Protocol designed for reproducibility by future researchers.

Experiment 2: Communication Behaviour Detection

Beyond bias, we needed to detect evidence-based communication behaviours defined by frameworks such as the Calgary-Cambridge Framework, the NURSE protocol, and the SHARE approach. I built supervised models to identify these structural patterns in doctor-patient dialogues.
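
To show the shape of turn-level detection, here is a toy turn-splitter plus a keyword tagger for a few NURSE-style empathy cues. The production system used supervised models; these rules, cue phrases, and the `DOCTOR:`/`PATIENT:` transcript format are purely illustrative.

```python
import re

# Illustrative cue phrases for a few NURSE categories (not the trained model).
NURSE_CUES = {
    "naming": ("sounds like", "you seem"),
    "understanding": ("i can understand", "that makes sense"),
    "supporting": ("i'm here for you", "we will get through"),
}

def split_turns(transcript: str):
    """Split 'DOCTOR: ... / PATIENT: ...' text into (speaker, utterance) pairs."""
    pattern = re.compile(r"(DOCTOR|PATIENT):\s*(.*)")
    return [(m.group(1), m.group(2).strip())
            for m in map(pattern.match, transcript.splitlines()) if m]

def tag_turn(utterance: str):
    """Return every NURSE category whose cue phrases appear in the turn."""
    low = utterance.lower()
    return [label for label, cues in NURSE_CUES.items()
            if any(cue in low for cue in cues)]
```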

Experiment 3: LLM Evaluation Framework

Before trusting LLMs with internal tasks like SQL generation, we needed to quantify their accuracy. I built a pipeline measuring hallucination rates, logic errors, and schema fidelity across varying query complexities.
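
The internal benchmarks are not public, but one common metric for SQL generation is execution accuracy: run the gold and generated queries against the same database and compare result sets, with syntax errors counted as failures. A minimal sketch, using a hypothetical schema:

```python
import sqlite3

def execution_match(gold_sql: str, pred_sql: str, db: sqlite3.Connection) -> bool:
    """Execution-accuracy check: two queries are equivalent when they return
    the same multiset of rows. Queries that fail to execute count as errors."""
    def rows(sql):
        try:
            return sorted(db.execute(sql).fetchall())
        except sqlite3.Error:
            return None  # syntax/schema error → automatic mismatch
    g, p = rows(gold_sql), rows(pred_sql)
    return g is not None and g == p
```

Row order is deliberately ignored (results are sorted), so semantically equivalent queries with different `ORDER BY` clauses still match.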

Skill Spotlight

LLM Evaluation

Systematically evaluated LLM outputs against human gold-standard labels, measuring category-specific failure modes.

Evidence: Quantified accuracy revealing where automated labeling fails.

Experiment 4: Synthetic Data Engineering

To satisfy healthcare privacy constraints (HIPAA), I designed an automated synthetic-data loop that generates high-quality dialogues for training without touching real PHI.
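
The loop's shape, as a minimal sketch: a stub `generate` stands in for the LLM call and a stub `validate` for the real quality and schema checks, both of which are project-specific.

```python
def synthetic_loop(generate, validate, n_needed: int, max_tries: int = 100):
    """Generate → validate → keep. `generate()` returns one candidate dialogue
    (an LLM call in production); `validate` rejects candidates that fail
    quality checks. Only accepted, fully synthetic dialogues enter training."""
    kept = []
    for _ in range(max_tries):
        if len(kept) >= n_needed:
            break
        candidate = generate()
        if validate(candidate):
            kept.append(candidate)
    return kept

def has_both_speakers(dialogue: str) -> bool:
    # One toy validity check: a usable dialogue must contain both roles.
    return "DOCTOR:" in dialogue and "PATIENT:" in dialogue
```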

Skill Spotlight

Synthetic Data Generation

Engineered synthetic clinical transcripts to augment scarce training data.

Evidence: Addressed class imbalance without touching real PHI.

Technical Architecture

The Interconnected Infrastructure

Skill Spotlight

Ensemble Model Architecture

Combined domain-specific ClinicalBERT with general-purpose RoBERTa.

Evidence: 4-experiment design isolating detection, behavior, LLM quality, data.

Bias Detection

ClinicalBERT + RoBERTa ensemble
4-class bias classification
Confidence scores
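
One way the two models' outputs could be combined into a 4-class prediction with a confidence score is soft voting over their class probabilities. Equal weights are an assumption here, not the project's tuned values.

```python
def ensemble_predict(probs_a, probs_b, weights=(0.5, 0.5)):
    """Soft-voting ensemble over per-example class-probability lists:
    average the two models' distributions, then take the argmax as the
    predicted class and the averaged probability as the confidence."""
    preds, confs = [], []
    for pa, pb in zip(probs_a, probs_b):
        avg = [weights[0] * a + weights[1] * b for a, b in zip(pa, pb)]
        best = max(avg)
        preds.append(avg.index(best))
        confs.append(best)
    return preds, confs
```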

Behaviour Classifier

Supervised NLP model
Turn-level detection

Evaluation Engine

SQL generation benchmarks
Hallucination metrics
Deployment confidence

Synthetic Loop

LLM generation + validation
PHI-free training data
Reproducibility datasets

Ethics & Privacy

Trust Through Transparency

Every design decision adhered to HIPAA Compliance, Common Rule/IRB standards, and Belmont Principles. The system is designed to recommend and surface patterns for human review, not to act as a black-box decision maker.

The Evidence

Evidence-Based Impact

Scalable Throughput

Thousands of records processed automatically versus ~50 manually.

Deterministic Consistency

Reproducible classifications across all cohorts.

Privacy-First Data Pipeline

Unlimited iteration via synthetic data engineering.

Quantified Accuracy

Documented error and hallucination rates for LLMs.

The Growth

Key Learnings

Taxonomy design is the hardest part. Weeks of iteration with domain experts on the framework paid off more than model tuning did.

Healthcare AI must earn trust. Confidence scores and human-in-the-loop review are not optional features; they are foundational requirements for clinical adoption.

This project documents research conducted at ABIM. All publicly shared artifacts use synthetic data.