
Large language models (LLMs) demonstrated higher error rates compared to humans in a clinical oncology question bank


1. A comparative evaluation tested five publicly available LLMs on 2044 multiple-choice oncology questions covering a comprehensive range of topics in the field. The responses were compared with a human benchmark.

2. Only one of the five models tested performed above the 50th percentile, with worse performance observed in clinical oncology subcategories and female-predominant malignancies.

Evidence Rating Level: 2 (Good)

Study Rundown: Many medical professionals have begun to use large language models (LLMs), such as ChatGPT, as augmented search engines for medical information. LLMs have demonstrated high performance on subspecialty medical examinations across multiple specialties, but their utility in clinical oncology remains unexplored.

Rydzewski and colleagues compared the performance of five LLMs on a set of multiple-choice clinical oncology questions against a random-guess algorithm and the performance of radiation oncology trainees. The authors assessed the accuracy of the models, their self-appraised confidence, and the consistency of responses across three independent replicates. Each LLM was prompted to provide an answer to the question, a confidence score, and an explanation of its response, and was evaluated on 2044 unique questions across the three replicates.

The study found that only one of the five LLMs (GPT-4) scored above the 50th percentile relative to human trainees, despite all of the models expressing high self-appraised confidence. The remaining LLMs were substantially less accurate, with some performing no better than the random-guess strategy. LLMs scored higher on foundational topics and worse on clinical oncology topics, especially those related to female-predominant malignancies. The authors found that combining model selection, self-appraised confidence, and output consistency helped identify more reliable outputs. Overall, this study demonstrated the need to assess the safety of LLMs before clinical implementation and pointed to training bias, in the form of medical misinformation related to female-predominant malignancies.

Click here to read the study in NEJM AI

Click to read an accompanying editorial in NEJM AI

Relevant Reading: Performance of ChatGPT on a primary FRCA multiple choice question bank

In-Depth [cross-sectional study]: In this study, Rydzewski and colleagues assessed the accuracy of five LLMs (LLaMA, PaLM 2, Claude-v1, GPT-3.5, and GPT-4) on 2044 multiple-choice questions and aimed to identify strategies to help end users recognize reliable LLM outputs. The questions were sourced from the American College of Radiology in-training radiation oncology examinations from 2013-2017, 2020, and 2021. Each question was repeated across three independent replicates. The authors compared LLM performance with a random-guessing strategy and with human scores for the questions sourced from the 2013 and 2014 examinations. The authors also assessed the self-appraised confidence of the LLMs by prompting for a confidence score ranging from 1 to 4, with 1 indicating a random guess and 4 indicating maximal confidence.

The five LLMs had mean accuracies ranging from 25.6% to 68.7%, compared with 25.2% for the random-guess strategy. Against the human benchmark, only GPT-4 scored above the 50th percentile, achieving the 69th and 89th percentiles. Each model's overall performance was positively correlated with its performance on individual topics (Pearson's r = 0.630; p < 0.001). Other than LLaMA 65B, all LLMs performed better on foundational topics (e.g., medical statistics, cancer biology) than on clinical subcategories (p < 0.02). LLMs performed worst on subjects involving breast and gynecologic malignancies. All LLMs produced a confidence score of 3 or 4 in more than 94% of responses. Finally, by combining self-assessed confidence and output consistency, the authors achieved accuracies of 81.7% and 81.1% with Claude-v1 and GPT-4, respectively.
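To make the confidence-and-consistency filter concrete, the minimal sketch below shows one way such a rule could be implemented. This is a hypothetical reconstruction, not the authors' code: the specific criteria used here (all three replicates agreeing on the same answer, and every replicate reporting a confidence score at or above a chosen threshold) are assumptions, since the study's exact thresholds are not reproduced in this summary.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Replicate:
    answer: str       # chosen option, e.g. "A" through "E"
    confidence: int   # self-appraised confidence, 1 (random guess) to 4 (maximal)

def reliable_answer(replicates, min_confidence=4):
    """Return a consensus answer only when all replicates agree and report
    high self-appraised confidence; otherwise return None (unreliable)."""
    answers = [r.answer for r in replicates]
    consensus, count = Counter(answers).most_common(1)[0]
    consistent = count == len(answers)                       # all replicates concur
    confident = all(r.confidence >= min_confidence for r in replicates)
    return consensus if (consistent and confident) else None

# Example: three replicates of one question (values are illustrative only)
reps = [Replicate("C", 4), Replicate("C", 4), Replicate("C", 3)]
print(reliable_answer(reps, min_confidence=3))  # -> "C"
print(reliable_answer(reps, min_confidence=4))  # -> None (one replicate below threshold)
```

Under rules like this, questions that fail the filter would simply be flagged as unreliable rather than answered, which is consistent with the study's finding that restricting to high-confidence, consistent outputs raised accuracy for the better-performing models.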

In conclusion, the authors assessed the ability of five LLMs to answer clinical oncology examination questions. This work demonstrated the need for further safety evaluations of LLMs before routine clinical implementation and provided insight into a potential strategy for more reliably using LLM output.

Image: PD

©2024 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.


