About
OpenEvidence is an online tool that aggregates and synthesizes data from peer-reviewed medical studies, then generates a response to a user's question using generative AI. While the tool is in use by a number of clinicians (including residents) today, there is little to no published data on whether its outputs are accurate and whether this information appropriately informs clinical decision making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision making when providing clinical care. While a number of studies have been published on the accuracy of these LLMs' responses to medical board questions or clinical vignettes, few studies to date have examined their performance in a real-world clinical setting, and fewer still have compared that performance across tools.
In this study, investigators have two goals: (1) determine whether the use of OpenEvidence leads to clinically appropriate decisions by residents in the course of clinical practice in a community health setting, and (2) determine how the output of OpenEvidence compares with three other commonly used, publicly available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in accuracy, completeness, and bias when addressing those clinical questions.
To accomplish study goal #1, investigators have enlisted residents in the above specialties to use the OpenEvidence tool in the course of clinical practice. To mitigate safety risks, the residents will also consult a typical reference tool for their question, referred to as the "Gold Standard" tool; these tools include PubMed and UpToDate. The residents will record their clinical questions, query both OpenEvidence and the Gold Standard tool, and document their resulting clinical conclusions.
Attending physician Subject Matter Experts (SMEs), matched by specialty and with at least 5 years of post-training clinical experience, will then evaluate the residents' responses. Five years was chosen based on the book "Outliers" by Malcolm Gladwell, in which he asserts that 10,000 hours of focused practice is needed to achieve expertise in a field.
SMEs will be asked to evaluate the residents' initial clinical questions and the conclusions they reached based only on OpenEvidence, rating the clinical appropriateness of those conclusions on a scale of 1-10. For questions where the SMEs rate the clinical appropriateness of the residents' conclusions poorly (< 5/10), they will be asked to review the OpenEvidence output and answer an additional question as to whether the output itself was incorrect or the resident misinterpreted the output from the tool.
To accomplish goal #2, the initial prompts entered by the residents into OpenEvidence will be copied by the research team into ChatGPT, Gemini, and Claude. The outputs from each tool (including OpenEvidence) will be surfaced to SMEs, who will be asked to rate each output for accuracy, completeness, and bias using Likert scales. SMEs will also be asked an open-ended question to identify any patient safety issues in any of the outputs.
Full description
OpenEvidence is an online tool built out of the Mayo Clinic Platform Accelerate [OpenEvidence], designed to aggregate and synthesize data from peer-reviewed clinical studies and to generate responses to user inquiries using generative AI. Although increasingly utilized by both seasoned clinicians and trainees, there is a notable absence of published data regarding the accuracy of the tool's outputs and their safety and efficacy in appropriately informing clinical decision-making. Concurrently, a growing number of clinicians are leveraging other publicly available large language models (LLMs) to support decision-making in clinical care. While a number of studies have examined the accuracy of LLM responses to medical board questions or clinical vignettes, there is limited research on their performance in real-world clinical settings, and even fewer studies offer comparative analyses of this performance.
In a review of the literature, one article shows LLMs may be better at detecting anxiety than practitioners, but this was based on clinical vignettes. [Levkovich et al.] Another looked at the diagnostic sensitivity of LLMs using patient-reported outcome measures in a structured questionnaire. [Pagano et al.] An additional study comparing LLMs for oncology also used fictional vignettes. [Benary et al.] A randomized controlled trial using clinical vignettes did not show any clinical improvement for providers who had access to LLMs. [Goh et al.] One case study explored the integration of ChatGPT 3.5 into daily rounds and evaluated its use qualitatively, but did not compare it with other LLMs or gold standard reference tools. [Skryd et al.] Another compared ChatGPT's responses to American College of Radiology appropriateness criteria for breast pain and breast cancer screening, but again did not compare it with other LLMs. [Rao et al.] In our review, only one study evaluated LLMs in a real-world clinical setting: a series of papers examining their use for complex decision making in breast cancer care, using a small number of actual cases and a standardized prompt template. [Griewing, Knitza et al.; Griewing, Gremke et al.] That work found issues with consistency and deterioration of accuracy (particularly with GPT 3.5), leading the authors to conclude that the clinical use of LLMs for that purpose was not yet feasible at the time of publication. Still, health system leaders see the use of these tools rapidly accelerating in clinical practice. For this reason, investigators believe it is imperative to study their safety and the clinical appropriateness of the decisions clinicians are making as a result of their use.
Cambridge Health Alliance (CHA) is a public, academic safety-net health system in the Boston area serving a diverse patient population. CHA has a robust primary care and outpatient psychiatry footprint and supports a large graduate medical education program through both Harvard Medical School and Tufts University School of Medicine. Investigators chose residents as the primary study participants because many trainees are already using OpenEvidence, and because residents were more incentivized to participate in the study if given access to the tool at CHA (where it is otherwise blocked on network services and prohibited by policy until the results of this study can be determined).
Study outcomes are as follows:
Determine whether the use of OpenEvidence leads to clinically appropriate decisions by residents in the course of clinical practice in a community health setting.
Determine how the output of OpenEvidence compares with three other commonly used, publicly available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in accuracy, completeness, and bias when addressing clinical questions residents have in the course of clinical practice in a community health setting.
Methods:
Data collection is planned to take place over a 6-month period in order to minimize vendor version upgrades during the study period. Residents are grouped by specialty into "medicine" (internal medicine/family medicine) and "psychiatry" (adult/child psychiatry). In order to simplify matching to appropriate specialty subject matter experts, medicine residents are asked to use OpenEvidence only for adult primary care cases (excluding OB/GYN-related issues). Psychiatry residents are asked to use OpenEvidence only for adult psychiatry cases.
Before being accepted as participants, trainees were all asked to agree to the following:
All residents will be given brief training in prompt engineering for healthcare before data collection begins. Standardized prompts will not be used, as one of the subgoals of the study is to understand what types of queries residents submit to OpenEvidence in a real world setting.
All residents will be educated on the definition of PHI, as follows:
Queries should not include any PHI, as defined by the Safe Harbor identifiers [HHS]. Queries may include patient age in years (or days/weeks/months for pediatric patients) and legal sex; for patients age 89 or older, the user must instead use the term "over age 89" to comply with Safe Harbor standards (see the sketch after this list).
Queries should not concern patients suspected of having extremely rare conditions, as defined by the National Organization for Rare Disorders, as these are also prone to reidentification [NORD]. If a rare condition is not initially suspected but becomes suspected while using the AI tool, the user will be asked to stop their query at that point.
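As a minimal illustration of how the age rule above could be enforced at data entry (a sketch only; the helper name and pediatric handling are assumptions, not part of the protocol):

```python
from typing import Optional

def format_age_for_query(age_years: int,
                         pediatric_detail: Optional[str] = None) -> str:
    """Return an age string consistent with the PHI guidance above.

    pediatric_detail: optional finer-grained age for pediatric cases,
    e.g. "6 weeks" or "18 months", which the protocol permits.
    """
    if age_years >= 89:
        # Per the protocol, substitute this phrase for patients age 89 or older.
        return "over age 89"
    if pediatric_detail:
        return pediatric_detail
    return f"{age_years} years old"

# Example usage
print(format_age_for_query(45))             # "45 years old"
print(format_age_for_query(0, "6 weeks"))   # "6 weeks"
print(format_age_for_query(92))             # "over age 89"
```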
Data collection will involve the use of a HIPAA-compliant Google Form within CHA's enterprise Google Workspace for Health cloud infrastructure. The data collection form will ask trainees to do the following:
Queries will be sorted by specialty (medicine vs. psychiatry), and each query will receive a sequential study number.
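A minimal sketch of how queries might be sorted and numbered, assuming the de-identified query log is exported to a pandas DataFrame (column names and ID prefixes are illustrative, not specified by the protocol):

```python
import pandas as pd

# Hypothetical de-identified query log; column names are illustrative.
queries = pd.DataFrame({
    "specialty": ["medicine", "psychiatry", "medicine", "psychiatry", "medicine"],
    "query_text": ["q1", "q2", "q3", "q4", "q5"],
})

# Sort by specialty, then assign a sequential study number within each specialty.
queries = queries.sort_values("specialty", kind="stable").reset_index(drop=True)
seq = queries.groupby("specialty").cumcount().add(1)
prefix = queries["specialty"].map({"medicine": "MED", "psychiatry": "PSY"})
queries["study_number"] = prefix + "-" + seq.map("{:03d}".format)

print(queries[["study_number", "specialty", "query_text"]])
```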
Attending physician Subject Matter Experts (SMEs) Board Certified in Internal Medicine, Family Medicine, or Psychiatry, with at least 5 years of post-training clinical experience, were recruited. Five years of post-training clinical experience was chosen because Malcolm Gladwell, in his book Outliers, asserts that 10,000 hours of focused practice is needed to achieve expertise in a field; at roughly 2,000 clinical hours per year, five years approximates 10,000 hours.
SMEs will be asked to evaluate the residents' initial clinical questions and the conclusions they reached based only on OpenEvidence, rating the clinical appropriateness of those conclusions on a 10-point Likert scale. SMEs will also be provided with the OpenEvidence output for each query; where the SME rates the clinical appropriateness of the residents' conclusions poorly (< 5/10), the SME will additionally be asked a follow-up question to assess whether the tool's output itself was clinically inappropriate, in order to ascertain whether the trainee may have misinterpreted the tool's output. SME review will include a 2.5-5% overlap between reviewers to calculate a kappa score for interrater reliability.
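A minimal sketch of the interrater reliability calculation on the overlap set, assuming scikit-learn is available; quadratically weighted Cohen's kappa is shown as one reasonable choice for ordinal 10-point ratings, though the protocol does not specify the weighting:

```python
from sklearn.metrics import cohen_kappa_score

# Appropriateness ratings for the ~2.5-5% of queries independently scored by
# two SMEs. The values below are illustrative placeholders, not study data.
sme_a = [8, 9, 4, 7, 10, 6, 3, 8]
sme_b = [7, 9, 5, 7, 9, 6, 4, 8]

# Quadratic weighting penalizes large disagreements more than near-misses.
kappa = cohen_kappa_score(sme_a, sme_b, weights="quadratic")
print(f"Weighted kappa on overlap set: {kappa:.2f}")
```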
In part two of the study, the research team will sort the OpenEvidence queries into themes and choose a random sample of queries from each specialty and theme for comparison between LLMs. The research team will confirm that the prompts do not include any PHI according to the study protocol, then copy the OpenEvidence prompts entered by residents for the selected queries and paste them verbatim into ChatGPT, Gemini, and Claude.
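A minimal sketch of the stratified random sampling step, assuming the themed query log lives in a pandas DataFrame (column names, themes, and the per-stratum sample size are illustrative assumptions):

```python
import pandas as pd

# Hypothetical themed query log; values are placeholders.
queries = pd.DataFrame({
    "study_number": ["MED-001", "MED-002", "MED-003", "PSY-001", "PSY-002", "PSY-003"],
    "specialty":    ["medicine", "medicine", "medicine",
                     "psychiatry", "psychiatry", "psychiatry"],
    "theme":        ["hypertension", "hypertension", "diabetes",
                     "mood", "mood", "anxiety"],
})

N_PER_STRATUM = 2  # illustrative; the protocol does not fix this number

# Randomly sample up to N_PER_STRATUM queries from each specialty/theme stratum.
sampled = (
    queries.groupby(["specialty", "theme"], group_keys=False)
           .apply(lambda g: g.sample(n=min(N_PER_STRATUM, len(g)), random_state=42))
)
print(sampled)
```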
The outputs of each of the four tools (OpenEvidence, ChatGPT, Gemini, and Claude) will be surfaced in a Google web form. SMEs will be asked to rate each output on a Likert scale for accuracy, completeness, and bias, and to answer a qualitative question identifying any patient safety issues in the output.
Results:
Primary outcome results will be reported as follows:
Clinical appropriateness of decision made by residents using OpenEvidence (mean with SD, median), by specialty
Secondary outcome results will be reported as follows:
For each specialty and each variable (accuracy, completeness, and bias), investigators will report summary statistics (mean with SD, median) for each tool's outputs, as sketched below.
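A minimal sketch of these descriptive statistics (column names and values are illustrative placeholders; the same aggregation applied to the appropriateness scores, grouped by specialty alone, would cover the primary outcome):

```python
import pandas as pd

# Hypothetical long-format table of SME ratings; values are placeholders.
ratings = pd.DataFrame({
    "specialty":    ["medicine", "medicine", "psychiatry", "psychiatry"],
    "tool":         ["OpenEvidence", "ChatGPT", "OpenEvidence", "Claude"],
    "accuracy":     [8, 7, 9, 6],
    "completeness": [7, 6, 8, 7],
    "bias":         [2, 3, 1, 2],
})

# Mean (with SD) and median for each variable, by specialty and tool.
summary = (
    ratings.groupby(["specialty", "tool"])[["accuracy", "completeness", "bias"]]
           .agg(["mean", "std", "median"])
)
print(summary)
```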
Enrollment
20 participants in 2 patient groups
Central trial contact
Hannah K Galvin, MD; Nobantu Mabuza-Frantzis
Data sourced from clinicaltrials.gov