Doctors still outperform AI in clinical reasoning, study shows

AI may ace multiple-choice medical exams, but it still stumbles when faced with changing clinical information, according to new research in the New England Journal of Medicine.

University of Alberta neurology resident Liam McCoy evaluated how well large language models perform clinical reasoning — the ability to sort through symptoms, order the right tests, evaluate new information and come to the correct conclusion about what’s wrong with a patient.

He found that advanced AI models struggle to update their judgment in response to new or uncertain information, and often fail to recognize when a piece of information is completely irrelevant. In fact, some recent improvements designed to make AI reasoning better have actually made this overconfidence problem worse.

It all means that while AI may do really well on medical licensing exams, there’s a lot more to being a good doctor than instantly recalling facts, says McCoy.

“Large language models have superhuman performance on multiple choice questions, but we’re still not at a stage where a patient can safely walk into a room, turn on their language model assistant, and have that do the entire visit,” says McCoy, who is also a research affiliate at the Massachusetts Institute of Technology and a research student intern with Harvard’s Beth Israel Deaconess Medical Center. 

The use of AI in medicine has grown by leaps and bounds in the past five years — from writing up doctors’ notes to looking for patterns in disease data and advising physicians on what to look for in medical images — but it’s not yet ready to take over from doctors when it comes to giving a diagnosis, he says. 

McCoy and colleagues from Harvard, MIT and elsewhere took a page from medical education to develop their benchmark test to measure this flexibility in clinical reasoning for AI models. Their tool, called concor.dance, is based on script concordance testing, a common method of assessing the skills of medical and nursing students. 

“As a clinician you develop a script of how an illness looks and how it goes, what to do next. The simplest example would be if somebody has chest pain to think, ‘OK, the heart might possibly be involved so we should do an ECG and some blood work to look for the markers of a heart attack,’” McCoy explains. 

As medical learners gain experience, their diagnostic “scripts” become more sophisticated and they are better able to sort through which symptoms are most relevant and come up with a proper diagnosis. 

“As you get more advanced, you might say, ‘But it’s also possible this chest pain could be due to pneumonia or a hole in the lung lining,’” he explains. “You develop more complex scripts and become nimble enough to switch between different scripts based on what is happening to your patient.”

In medical education, script concordance testing awards students points for how well they do this nuanced human reasoning in comparison with the most experienced experts in each field. 
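The article doesn't spell out how those points are assigned, but a common approach in script concordance testing is the "aggregate" scoring method: each answer earns credit in proportion to how many expert panelists chose it, with the panel's most popular answer worth full credit. The sketch below illustrates that general method only; the function name, panel data and scale are hypothetical and not taken from McCoy's benchmark.

```python
# Illustrative sketch of aggregate script concordance test (SCT) scoring.
# All names and data here are hypothetical examples, not from the study.
from collections import Counter

def sct_item_score(panel_answers, examinee_answer):
    """Credit for one item: votes the panel gave the examinee's answer,
    divided by the votes for the panel's modal (most popular) answer."""
    votes = Counter(panel_answers)
    modal_votes = max(votes.values())
    return votes.get(examinee_answer, 0) / modal_votes

# Example item on a -2..+2 scale ("the new finding makes the diagnosis...":
# -2 = ruled out, 0 = unchanged, +2 = almost certain).
panel = [0, 0, 0, 1, -1]            # five hypothetical experts; mode is 0
print(sct_item_score(panel, 0))     # matches the majority -> 1.0
print(sct_item_score(panel, 1))     # minority view -> partial credit (1/3)
print(sct_item_score(panel, 2))     # no panelist chose this -> 0.0
```

Partial credit is the point: unlike a multiple-choice key, the test rewards answers that fall within the range of legitimate expert disagreement rather than a single "correct" option.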

McCoy’s test for AI models used med school scripts for surgery, pediatrics, obstetrics, psychiatry, emergency medicine, neurology and internal medicine from Canada, the United States, Singapore and Australia.

McCoy tested 10 of the most popular AI models from Google, OpenAI, DeepSeek, Anthropic and others. While the models generally performed similarly to first- or second-year medical students, they often failed to reach the standard set by senior residents or attending physicians.

In the script concordance tests used, McCoy says, about 30 per cent of the time, the new information given in the question is a red herring that doesn’t change the diagnosis or management plan. For example, you may learn that our hypothetical chest pain patient stubbed their toe last week. That’s probably not relevant to our case, but the AI models were terrible at figuring that out. 

Instead, the most advanced models tried to explain why the irrelevant facts were relevant, botching the diagnosis. 

“One of our biggest concerns about large language models is that they have been fine-tuned to be very helpful, giving frequent answers that inspire confidence in humans,” says McCoy. “They will explain a mistake in a way that makes you agree with them. There are a lot of ways the models can output a very convincing but wrong answer.”

Interestingly, human medical students who do well on multiple-choice exams don’t always do as well on script concordance because it’s a very different skill. “It’s important to realize that performance on a task like clinical reasoning is very complicated and task-specific,” McCoy points out.

That doesn’t mean AI models can’t be improved to do better at it. In fact, McCoy figures the technology is here to stay, so it’s incumbent on researchers such as himself to keep pushing to make it better.

“This technology is coming one way or another, so I think as physicians we need to make sure it is effective, equitable and aligned with what patients need, rather than allowing it to just be driven by external actors,” he says.

McCoy was a promising student back in 2015 when he started his education by winning the President’s Centenary Citation, the U of A’s most generous undergraduate scholarship, valued at $50,000. Now, with a University of Toronto medical degree and stops along the way at MIT and Harvard, he’s more than halfway through his residency to become a fully trained neurologist. 

McCoy intends to continue testing AI systems that could help clinicians do their jobs better. He’s already had questions about his benchmark tool from researchers at Google and Microsoft Research, and he hopes to collaborate with them to improve AI intended for clinical settings.

To explain why he is so motivated to make AI more useful to doctors and patients, McCoy invokes Facebook founder Mark Zuckerberg's famous motto, only to reject it.

“You can’t ‘move fast and break things’ in medicine, because human lives are on the line and it is important to have that appropriate caution,” McCoy states. “But at the same time, those stakes inform the urgency for me.” 

“We have a moral responsibility to use the best of technology that is available, whether it’s a new type of MRI scan, a new radiation machine or a new type of surgical tool. Eventually, that new best technology might actually be a clinical reasoning tool.”
