About
The goal of this randomized controlled trial is to evaluate whether behavioral nudges can reduce automation bias (the uncritical acceptance of automated output) in physicians using large language models (LLMs) such as ChatGPT-5.1 for clinical decision-making.
The main question it aims to answer is: Does a dual-mechanism behavioral nudge intervention (baseline accuracy anchoring plus case-specific color-coded confidence signals) reduce physicians' uncritical acceptance of incorrect LLM recommendations?
Researchers will compare physicians who receive LLM recommendations together with a behavioral nudge to those who receive the same recommendations without the nudge, to assess whether the nudge reduces automation bias.
Participants will evaluate six clinical vignettes with pre-generated LLM recommendations, three containing deliberately flawed recommendations and three containing correct ones, during a single proctored session lasting approximately 75 minutes.
Full description
Automation bias represents a critical challenge in modern clinical practice, particularly as artificial intelligence (AI) tools become increasingly embedded in healthcare workflows. This cognitive phenomenon describes the tendency of clinicians to favor suggestions from automated decision-making systems, even when those suggestions are incorrect. As large language models (LLMs) such as ChatGPT-5.1 gain traction in medical settings, their potential to reduce errors and improve efficiency must be weighed against a significant concern: these models lack rigorous medical validation and may amplify existing cognitive biases through incorrect or misleading recommendations.
The emergence of automation bias in medical contexts reflects a complex interplay of environmental and psychological factors. Time constraints in high-volume clinical settings create pressure to accept AI-generated recommendations without adequate scrutiny. Financial incentives that prioritize efficiency over thoroughness may further discourage the critical evaluation necessary for sound clinical judgment. Cognitive fatigue during extended shifts diminishes physicians' capacity for sustained analytical thinking. These pressures interact with psychological mechanisms including diffusion of responsibility, overconfidence in technological solutions, and cognitive offloading, collectively creating conditions where uncritical acceptance of AI-generated recommendations becomes more likely.
This randomized controlled trial evaluates the effectiveness of a behavioral nudge intervention designed to mitigate automation bias among medical doctors utilizing LLM-generated diagnostic recommendations. The primary objective is to determine whether this intervention improves diagnostic reasoning performance scores when evaluating clinical vignettes that include deliberately flawed LLM recommendations. Secondary objectives include assessing whether physician experience level, gender, and prior LLM experience moderate the intervention's effectiveness, and determining whether effectiveness differs across the confidence-signal levels.
This study employs a single-blind, randomized controlled trial with two parallel arms. Participants will be randomly assigned 1:1 to either the intervention or control arm. To eliminate variability from differences in prompting skills, participants will not interact directly with a live LLM interface. Instead, all participants will use a custom-built web platform displaying clinical vignettes with pre-generated LLM recommendations, ensuring identical LLM-generated content for each vignette.
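As a minimal illustration of the 1:1 allocation described above, the sketch below randomly splits a participant roster into the two arms. The function name, fixed seed, and simple shuffle procedure are illustrative assumptions; the protocol does not specify the actual randomization mechanics (e.g., block size or stratification).

```python
# Minimal sketch of 1:1 randomization (illustrative; the trial's actual
# procedure, block size, and stratification are not specified).
import random

def allocate(participant_ids: list[str], seed: int = 42) -> dict[str, str]:
    """Randomly assign participants 1:1 to intervention or control."""
    rng = random.Random(seed)  # fixed seed only for a reproducible example
    ids = participant_ids[:]
    rng.shuffle(ids)
    half = len(ids) // 2
    return {pid: ("intervention" if i < half else "control")
            for i, pid in enumerate(ids)}

# Example: 50 hypothetical participant IDs, 25 per arm.
arms = allocate([f"P{n:02d}" for n in range(1, 51)])
```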
All participants will evaluate six clinical vignettes during a single, proctored session lasting approximately 75 minutes. Three vignettes will contain deliberately introduced clinical reasoning flaws in the LLM recommendations, while three will contain correct recommendations. Vignettes will be presented in randomized order to prevent pattern detection.
Control arm participants will evaluate clinical vignettes with LLM diagnostic recommendations generated by ChatGPT, presented in a standard, neutral text format without additional contextual information. Intervention arm participants will evaluate the same vignettes alongside a behavioral nudge. This intervention consists of two synchronized cognitive cues: (1) an anchoring cue displaying ChatGPT's baseline diagnostic accuracy on standard medical datasets at the top of the interface panel, explicitly anchoring expectations to the model's fallibility, and (2) a selective attention cue displaying the LLM recommendation alongside a color-coded confidence signal generated through an ensemble assessment: three independent state-of-the-art LLMs (Claude Sonnet 4.5, Gemini 2.5 Pro Thinking, and GPT-5.1) each provide a confidence rating for the recommendation, and the mean of those ratings determines the signal color, mitigating single-model miscalibration.
The color-coded confidence signals are categorized into three distinct levels based on the ensemble's mean confidence relative to baseline diagnostic accuracy. Red signals are triggered when the mean confidence falls below ChatGPT's established baseline accuracy, explicitly flagging high-uncertainty cases that demand heightened critical scrutiny. Orange signals indicate that while the mean confidence exceeds the baseline average, it remains below 100%, signaling the need for continued clinical vigilance and the avoidance of complacency. Finally, green signals are reserved for instances of 100% ensemble consensus; however, even at this level of confidence, standard AI safety warnings remain present to guard against over-reliance on the system's output.
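To make the signal logic concrete, here is a minimal sketch in Python. The baseline-accuracy constant and function name are illustrative assumptions (the study does not publish its baseline figure), and confidence ratings are assumed to be expressed as percentages.

```python
# Illustrative sketch of the ensemble confidence-signal logic.
from statistics import mean

BASELINE_ACCURACY = 70.0  # hypothetical ChatGPT baseline accuracy (%)

def confidence_signal(ratings: list[float]) -> str:
    """Map per-model confidence ratings (0-100) to a signal color.

    ratings: one rating per ensemble model (Claude Sonnet 4.5,
    Gemini 2.5 Pro Thinking, GPT-5.1); their mean decides the color.
    """
    avg = mean(ratings)
    if avg < BASELINE_ACCURACY:
        return "red"     # below baseline: flag for heightened scrutiny
    if avg < 100.0:
        return "orange"  # above baseline but short of full consensus
    return "green"       # 100% ensemble consensus; safety warnings persist

# Example: one skeptical model pulls the mean below the baseline.
print(confidence_signal([55.0, 80.0, 60.0]))  # -> "red" (mean 65.0)
```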
Participants will be presented with six clinical vignettes specifically designed to measure automation bias, sourced and modified from real cases representing a range of diagnostic difficulty and common medical specialties. Each vignette follows a standardized format including chief complaint, history of present illness, relevant past medical/social/family history, physical examination findings, and initial laboratory results.
The primary outcome is the Diagnostic Reasoning Performance Score, a composite percentage score based on a structured rubric evaluating: quality of differential diagnoses, supporting findings, opposing findings, final diagnosis accuracy, and appropriateness of next steps. Secondary outcomes include top-choice diagnosis accuracy (incorrect, partially correct, or correct). All responses will be evaluated by blinded reviewers using the assessment rubric.
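As an illustration of how such a composite percentage might be computed, the sketch below assumes equal weighting across the five rubric dimensions; the dimension names and weights are assumptions, since the actual rubric is not published in this registry entry.

```python
# Illustrative composite-score sketch. Equal weighting and the dimension
# names are assumptions; the study's actual rubric is not public.

RUBRIC_DIMENSIONS = (
    "differential_quality",
    "supporting_findings",
    "opposing_findings",
    "final_diagnosis_accuracy",
    "next_steps_appropriateness",
)

def diagnostic_reasoning_score(ratings: dict[str, float]) -> float:
    """Average per-dimension ratings (each 0-1) into a percentage score."""
    total = sum(ratings[d] for d in RUBRIC_DIMENSIONS)
    return 100.0 * total / len(RUBRIC_DIMENSIONS)

# Example: hypothetical reviewer ratings for one vignette response.
example = {
    "differential_quality": 0.75,
    "supporting_findings": 1.0,
    "opposing_findings": 0.5,
    "final_diagnosis_accuracy": 1.0,
    "next_steps_appropriateness": 0.75,
}
print(diagnostic_reasoning_score(example))  # -> 80.0
```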
Enrollment
50 participants in 2 parallel groups
Allocation
Randomized (1:1)
Interventional model
Parallel assignment
Masking
Single-blind
Central trial contact
Ayesha Ali, PhD; Ihsan Ayyub Qazi, PhD
Data sourced from clinicaltrials.gov