AI Outperforms Doctors in ER Diagnosis Study

Harvard Study Shows AI Beats Human Doctors in Emergency Room Triage

Television has long celebrated the emergency room doctor as a hero. From George Clooney to Noah Wyle, the stars of the iconic series ER captured the public imagination. But a landmark new study now challenges human dominance in emergency medicine: artificial intelligence has outperformed experienced physicians in high-pressure diagnostic tests.

A Harvard-led research team published the findings in the journal Science. The study involved collaborators from Stanford University and Beth Israel Deaconess Medical Center in Boston. Researchers tested OpenAI’s o1 reasoning model against hundreds of human doctors. The results delivered what independent experts called “a genuine step forward” in AI clinical reasoning.

What the Study Actually Tested

Researchers designed several demanding experiments to evaluate AI performance. They asked the large language model to produce accurate patient diagnoses. They also tested its ability to recommend appropriate diagnostic tests. Finally, they assessed its skill in managing complex long-term care plans.

Researchers tested the AI at three distinct moments in the emergency care process. These stages included initial triage at arrival, first contact with a physician, and admission to a medical floor or intensive care unit. Two independent doctors evaluated all assessments. Those reviewers did not know whether the AI or human physicians had produced each result.

AI Identifies Diagnoses More Accurately Than Physicians

One key experiment focused on 76 real patients who arrived at the emergency room of a Boston hospital. The AI and two human doctors each received the same electronic health records, which typically contained vital signs, demographic data, and brief nursing notes.

The AI correctly identified the exact or near-exact diagnosis in 67% of cases. Human doctors achieved accuracy rates of only 50% to 55%. The AI’s advantage proved especially strong during initial triage. At that stage, doctors must act quickly with very limited information available.

When researchers provided the AI with more detailed patient information, accuracy climbed to 82%. Human expert accuracy reached between 70% and 79% under the same conditions. Researchers noted this difference was not statistically significant at the more detailed stage. However, the AI’s early-stage advantage remained a clear and consistent finding.

Long-Term Treatment Planning Also Favoured the AI

The study extended beyond initial diagnosis into longer-term care planning. Researchers asked the AI and 46 human doctors to examine five clinical case studies. These cases required planning treatments like antibiotic regimens and end-of-life care. The AI scored 89% accuracy on these tasks.

Human doctors using conventional resources, such as search engines, scored only 34%. That gap represented a statistically significant difference in performance. The AI model demonstrated strong capability across a wide range of clinical scenarios. Researchers described this as one of the study’s most striking findings.

The AI Excelled in Rare and Complex Cases

The AI performed particularly well in cases involving rare diseases. Researchers used real clinical scenarios published in The New England Journal of Medicine. These cases, drawn from Massachusetts General Hospital, are renowned for their difficulty and often contain distracting or arcane medical details spanning many specialties.

Senior co-author Arjun Manrai said the AI’s performance on these cases “shocked a lot of folks.” Manrai serves as an assistant professor of Biomedical Informatics at Harvard Medical School. He told reporters the AI achieved nearly optimal diagnosis on this set of challenging cases. Thomas Buckley, a doctoral student at Harvard Medical School, also contributed to this area of the research.

One vivid example involved a patient with a pulmonary embolism, a blood clot in the lungs. After initial improvement, the patient’s condition began to worsen, and the medical team struggled to explain the deterioration. The AI scanned the medical records and flagged the patient’s history of lupus as a potential cause, a suggestion that proved correct.

Real-World Emergency Data Proved Key to Findings

Dr. Adam Rodman is a clinical researcher at Beth Israel Deaconess Medical Center. He served as one of the study’s authors and highlighted a central conclusion. According to Dr. Rodman, the AI’s ability to work with real-world emergency department data marks a significant finding. The AI handled the messy, incomplete data typical of genuine emergency settings.

This real-world testing distinguished the study from purely theoretical benchmarks. The AI worked only with electronic health records and the limited information physicians had at the time. It did not receive any additional data or special inputs. Yet it still outperformed the two experienced attending physicians across all three evaluation stages.

Critical Limits Remain Before AI Replaces Doctors

Researchers were careful to frame the findings in proper context. The study tested only information that doctors can communicate through text. The AI never observed a patient directly or assessed visual cues. It could not evaluate a patient’s level of distress or physical appearance.

That limitation means the AI functioned more like a second-opinion tool. It reviewed paperwork rather than sitting beside the patient’s bed. Researchers acknowledged that real clinical care demands far more than textual analysis. Sounds, images, and nonverbal signals all play vital roles in genuine diagnosis.

“I don’t think our findings mean that AI replaces doctors,” said Manrai. He stressed the need for controlled clinical trials of the technology. Those trials would help determine how AI can integrate most effectively into medical practice. The study authors called this next step urgent and necessary.

A Call for Urgent Controlled Trials

The research team argued the results demand immediate action from the medical community. They called for structured, real-world trials to test AI deployment in clinical settings. These trials would explore how AI tools can best support, rather than replace, human physicians. The goal remains better patient outcomes through smarter collaboration.

Large language models have clearly advanced beyond simple clinical benchmarks. The study authors wrote that such models “have eclipsed most benchmarks of clinical reasoning.” That conclusion carries significant weight for healthcare systems worldwide. It signals that AI may soon become a standard part of emergency medical decision-making.