Construction of a Benchmark for Breast Ultrasound AI Interpretation and Performance Evaluation of Multimodal AI Models (BUST-AI Bench)

Chinese Academy of Medical Sciences & Peking Union Medical College

Status

Enrolling

Conditions

Breast Diseases

Ultrasonography

Breast Neoplasms

Treatments

Diagnostic Test: Multimodal AI Model Diagnostic Evaluation

Study type

Observational

Funder types

Other

Identifiers

NCT07500428

I-26PJ0568 (Other Identifier)

2024-I2M-CT-B-035 (Other Grant/Funding Number)

K10349

Details and patient eligibility

About

This single-center, retrospective, observational study aims to construct a standardized benchmark evaluation system for intelligent breast ultrasound image interpretation and to systematically assess the diagnostic performance of current mainstream multimodal artificial intelligence (AI) models.

De-identified B-mode breast ultrasound images with confirmed pathological diagnoses will be retrospectively collected from the institutional archive (2018-2025) and supplemented with images from published open-access datasets. Expert radiologists with varying experience levels will independently annotate all images according to the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) v2025 criteria, including glandular tissue composition, lesion characterization (mass vs. non-mass lesion), morphological descriptors, and final BI-RADS classification.

Baseline deep learning models (CNN-based ResNet-50 and Transformer-based USFM) will be trained to establish performance baselines and to stratify cases by diagnostic difficulty through cross-architecture consensus. Multiple multimodal large language models (MLLMs), including both general-purpose and medical-domain models, will then be evaluated via standardized API calls using BI-RADS-guided chain-of-thought prompts at temperature 0 for reproducibility.
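As a minimal sketch of the cross-architecture consensus idea, the snippet below assigns a difficulty tier from the predictions of the two baseline models, following the three tiers defined in the full description (both correct, one correct, both incorrect). The function name and the string labels are illustrative assumptions, not part of the registered protocol, and the senior expert validation required for Tier 3 is not modeled.

```python
def assign_tier(truth: str, cnn_pred: str, transformer_pred: str) -> int:
    """Return a difficulty tier (1, 2, or 3) from two baseline predictions.

    Hypothetical helper: Tier 1 = both models correct, Tier 2 = exactly one
    correct, Tier 3 = both incorrect (expert validation is handled elsewhere).
    """
    n_correct = int(cnn_pred == truth) + int(transformer_pred == truth)
    return {2: 1, 1: 2, 0: 3}[n_correct]


# Illustrative cases: (reference label, CNN prediction, Transformer prediction)
cases = [
    ("benign", "benign", "benign"),
    ("malignant", "malignant", "benign"),
    ("malignant", "benign", "benign"),
]
tiers = [assign_tier(*case) for case in cases]
print(tiers)  # [1, 2, 3]
```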

Primary endpoints include BI-RADS classification accuracy and diagnostic AUC for benign-malignant differentiation. Model robustness and safety will be assessed through out-of-distribution rejection testing, temperature-stability experiments, and thinking-mode ablation studies. This study adheres to the FLAIR and TRIPOD-LLM reporting guidelines.
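For illustration, the diagnostic AUC for benign-malignant differentiation can be computed as the probability that a randomly chosen malignant case receives a higher malignancy score than a randomly chosen benign case. The sketch below assumes each model yields a numeric malignancy score per image; how such a score would be derived from an MLLM response is not specified in this record, and the example data are invented.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: fraction of (malignant, benign) pairs ranked correctly.

    labels: 1 for malignant, 0 for benign. Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example scores, for demonstration only
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
print(round(roc_auc(labels, scores), 3))  # 0.889
```

The pairwise loop is quadratic in the number of cases, which is adequate for an evaluation set of roughly 1,200 images; a library routine would normally be used in practice.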

Full description

Background: Breast cancer is the most prevalent malignancy among women worldwide. Ultrasound is a first-line screening modality, particularly in Asian populations with dense breast tissue where mammographic sensitivity is limited. However, ultrasound interpretation is highly operator-dependent, with substantial inter-observer variability in BI-RADS classification, especially for category 4A-4B lesions. Multimodal large language models (MLLMs) have emerged as a promising tool for medical image analysis due to their zero-shot diagnostic capability, interpretable chain-of-thought reasoning, and structured report generation. Nevertheless, there is currently no standardized benchmark for evaluating AI performance in breast ultrasound interpretation.

Study Design: Approximately 1,380 breast ultrasound images will be curated (1,200 in the evaluation set, 150 in the out-of-distribution safety test set, and 30 in the prompt development set). The images span three diagnostic categories: normal breast, benign lesions (BI-RADS 2-4B), and malignant lesions (BI-RADS 3-5). Two junior radiologists (<5 years of experience) and two senior radiologists (>15 years of experience) will independently annotate the images per ACR BI-RADS v2025, with a fifth expert arbitrating discordant cases.

Diagnostic difficulty will be stratified into three tiers using cross-architecture deep learning consensus: Tier 1 (straightforward, both models correct), Tier 2 (equivocal, one correct/one incorrect), and Tier 3 (difficult, both incorrect, with senior expert validation). MLLMs will be evaluated across multiple dimensions: classification accuracy, sensitivity, specificity, F1 score, AUC, Cohen's kappa agreement with expert consensus, expected calibration error (ECE), morphological feature description accuracy, and chain-of-thought reasoning quality.
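Two of the metrics listed above, Cohen's kappa and expected calibration error, can be sketched as follows. This is a minimal illustration under stated assumptions: kappa is computed unweighted, ECE uses 10 equal-width confidence bins, and each model is assumed to report a confidence in [0, 1] per prediction. None of these choices is specified in this record, and the example data are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("sequences must be non-empty and of equal length")
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Made-up example: model output vs. expert consensus BI-RADS categories
model = ["4A", "4A", "3", "5"]
expert = ["4A", "4B", "3", "5"]
print(round(cohens_kappa(model, expert), 3))  # 0.667

confidences = [0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 1, 0]
print(round(expected_calibration_error(confidences, correct), 3))  # 0.1
```

Because BI-RADS categories are ordinal, a weighted kappa may be more appropriate for the final analysis; the unweighted form is shown only because it is the simplest instance.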

Safety Assessment: (1) Out-of-distribution rejection test using 150 non-diagnostic images (degraded images, non-breast ultrasound images, and images from other imaging modalities); (2) temperature-stability pre-experiment across temperature settings; (3) thinking-mode ablation comparing standard vs. chain-of-thought reasoning modes. All experiments use fixed model snapshots, system fingerprint monitoring, and complete logging for reproducibility.
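One way to summarize the out-of-distribution rejection test is to report how often a model declines to interpret non-diagnostic images alongside how often it wrongly declines valid ones. The sketch below assumes each model response has already been reduced to a boolean "refused" flag; the rule for deciding what counts as a refusal is not given in this record, so that flag and the example records are hypothetical.

```python
def rejection_rates(records):
    """Return (correct_rejection_rate, false_rejection_rate).

    records: iterable of (is_out_of_distribution, model_refused) boolean pairs.
    correct rejection = refusals among out-of-distribution images;
    false rejection   = refusals among valid in-distribution images.
    """
    ood = [refused for is_ood, refused in records if is_ood]
    valid = [refused for is_ood, refused in records if not is_ood]
    if not ood or not valid:
        raise ValueError("need both out-of-distribution and valid images")
    return sum(ood) / len(ood), sum(valid) / len(valid)


# Made-up example: 3 out-of-distribution images and 4 valid images
records = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False), (False, False),
]
correct_rej, false_rej = rejection_rates(records)
print(round(correct_rej, 3), round(false_rej, 3))  # 0.667 0.25
```

Reporting both rates together guards against a model that scores well on rejection simply by refusing most inputs.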

Enrollment

1,380 estimated patients

Sex

Female

Ages

18 to 75 years old

Volunteers

Accepts Healthy Volunteers

Inclusion criteria

B-mode breast ultrasound grayscale images from the institutional PACS database or from published open-access breast ultrasound datasets with documented original institutional ethics approval
Image quality adequate for clinical diagnosis with clear visualization of the region of interest
Pathological diagnosis confirmed (for benign and malignant lesion groups), or normal breast status confirmed by a senior radiologist with >15 years of breast ultrasound experience (for the normal group)
Complete de-identification with removal of all personally identifiable information

Exclusion criteria

Severely degraded image quality precluding meaningful BI-RADS assessment
Duplicate images from the same patient (only the most representative image retained per lesion)
Images with residual personally identifiable information after de-identification processing
Cases with ambiguous, disputed, or unavailable pathological results
Non-B-mode ultrasound images, including elastography, contrast-enhanced ultrasound, and Doppler imaging

Trial design

1,380 participants in 3 patient groups

Normal Breast

Description:

Breast ultrasound images showing normal glandular tissue across different tissue composition types, with no focal lesions identified. Confirmed by senior radiologist review.

Treatment:

Diagnostic Test: Multimodal AI Model Diagnostic Evaluation

Benign Lesion

Description:

Breast ultrasound images containing pathologically confirmed benign lesions (BI-RADS 2-4B), including fibroadenoma, cyst, lipoma, sclerosing adenosis, intraductal papilloma, and selected non-mass lesions (NML).

Treatment:

Diagnostic Test: Multimodal AI Model Diagnostic Evaluation

Malignant Lesion

Description:

Breast ultrasound images containing pathologically confirmed malignant lesions (BI-RADS 3-5), including invasive ductal carcinoma, invasive lobular carcinoma, mucinous carcinoma, and selected non-mass lesions (NML).

Treatment:

Diagnostic Test: Multimodal AI Model Diagnostic Evaluation

Trial contacts and locations

Central trial contact

Qingli Zhu, MD; Yinglan Wu, MD

Data sourced from clinicaltrials.gov
