Good with the answer, poor with the question
What two new studies show about artificial intelligence and clinical reasoning

Large language models are increasingly used in medicine, and two studies published this month tested how well they actually perform. The results appear to conflict, but they measure different things.

Two studies, two different results

The first study, published in JAMA Network Open, tested 21 large language models on 29 standardised clinical cases. Each case was given to the model in stages, as a doctor would meet it. The models reached the correct final diagnosis more than 90% of the time. However, when asked to produce a differential diagnosis (the list of conditions that could explain the presentation), they failed in 80% to 100% of cases. The models tended to settle on a single answer early rather than hold several possibilities open.

The second study, published in Nature Medicine, compared three general-purpose models (GPT-5.2, Gemini 3.1 Pro and Claude Opus 4.6) with two products built specifically for doctors, OpenEvidence and UpToDate Expert AI. The general models performed better than the specialist tools on medical knowledge, on agreement with expert clinicians, and on 100 real questions asked by physicians during clinical work, which were scored blind by 12 clinicians. The specialist tools performed no better than the free Google search summary that appears above ordinary results. There was no meaningful difference between any of the models in the rate of unsafe answers.

Why the two results are not in conflict

At first glance these findings seem contradictory. One study reports that the models fail; the other reports that they outperform the tools designed for the job. The explanation is that they measured different things. The JAMA study tested the models on their own, with their reasoning settings and any web access switched off, and gave full marks only when the differential was complete and correct. The Nature Medicine study allowed the models their normal tools and compared them with commercial products rather than with a fixed standard. Both findings can therefore be true. The models are weak at building a differential from incomplete information, and they are still better at it than the specialist products, because those products are weak at the same task.

What the Nature Medicine study shows in practice is that ready access to medical knowledge is no longer something a specialist tool can charge for. Looking up a guideline, summarising the evidence on a drug, or answering a clearly worded clinical question is now done about as well by a general model, or by a free search summary, as by an expensive medical product.

The weakness can be reduced

The weakness in differential diagnosis reported in the JAMA study is not a permanent limitation. When the same models are placed inside a system that forces them to work in steps (list the possible diagnoses, choose the next test, and revise as results come back), their accuracy on difficult cases improves. A recent study from Microsoft reported accuracy of around 80% to 85% on a set of hard published cases using this approach. That study has not yet been peer reviewed, and its claim that the system performed four times better than doctors should be treated with caution, because the doctors in the comparison worked alone, without colleagues or references. The point is simply that the models do better when they are made to reason in sequence rather than answer in one step.

A randomised trial in cardiology published earlier this year points the same way. A model adapted for the task, used alongside general cardiologists, produced assessments that specialists preferred to those of the cardiologists working without it. The model did not replace the clinician. It supported the clinician.

A model can find the facts. It cannot decide who should have an operation.

What this means for surgical practice

None of these studies measured the decision that takes up most of a thoracic surgeon's clinic. Reaching a diagnosis, retrieving a fact, or answering a question is not the same as deciding whether a particular patient should have an operation. A small nodule found on a screening scan may be cancer, infection, or old scarring, and a proportion of these will never harm the patient. The task is to keep these possibilities open, to resolve them in turn with imaging, sampling and time, and then to judge whether an operation will help this patient or harm them. A good proportion of these assessments end, correctly, in a decision not to operate.

The same judgement applies when the decision runs the other way. Some patients are at high risk from surgery (for example, those with poor lung function and impaired renal function), yet the alternatives may be more harmful still. Chemotherapy, in a patient with limited reserve, can itself cause serious harm. The operative risk then has to be weighed against the harm of the alternatives for that individual patient. Where the balance favours surgery, the risks are explained in full, discussed with the patient and accepted, and the decision to proceed is a shared one. A model can inform that conversation, but it cannot weigh these competing harms for a particular person, or take responsibility for the decision.

This judgement depends on knowing the individual patient (their lung function, their other conditions, and what they want), and it cannot be reduced to looking up information. That is why making medical information cheap and easy to obtain does not make the surgical decision any easier. The two are different kinds of work. The same point applies along the lung cancer pathway, where AI is increasingly used to find and flag disease but the decision to operate still rests with the surgeon and the patient.

In my view the sensible position is neither that these tools are unreliable nor that they will soon replace clinical judgement. They are now a good way to find and explain medical information, and they should be used for that, provided the source is checked. They cannot yet do the harder reasoning, under uncertainty, that decides which patient benefits from surgery. That judgement remains with the clinician.

Mr Lawrence Okiror is a Consultant Thoracic and Robotic Surgeon at Guy's and St Thomas' NHS Foundation Trust.

Declared interests: I have no industry honoraria, advisory roles or speaker engagements relevant to this piece.

Views are my own and do not necessarily represent Guy's and St Thomas' NHS Foundation Trust.
