MedArena: Comparing LLMs for Medicine in the Wild

The use of large language models (LLMs) in the medical domain holds transformative potential, promising advancements in areas ranging from clinical decision support and medical education to patient communication. This increasing relevance is highlighted by recent reports indicating that up to two-thirds of American physicians now utilize AI tools in their practice.

Realizing this potential safely and effectively hinges on the development of rigorous and clinically relevant evaluation methodologies. Currently, the predominant approaches for assessing the medical capabilities of LLMs, namely benchmark datasets derived from MMLU (Massive Multitask Language Understanding) and MedQA (Medical Question Answering), primarily rely on static, multiple-choice question (MCQ) formats. While valuable for gauging foundational knowledge, these evaluation paradigms suffer from significant limitations that restrict their applicability to real-world clinical contexts.

Firstly, they typically assess a narrow spectrum of medical knowledge, often neglecting other critical and common LLM use cases in healthcare, such as patient communication, clinical documentation generation, or summarizing medical literature. Secondly, their static nature means they fail to reflect the most recent medical knowledge, such as the latest drug approvals or recently updated clinical guidelines. Furthermore, the reliance on MCQ formats oversimplifies the complexities of clinical reasoning and practice. Clinicians are seldom presented with pre-defined options when diagnosing a patient or formulating a treatment plan. Evaluations focused solely on identifying the single ‘correct’ answer overlook the critical importance of the diagnostic process itself. Particularly in clinical diagnosis, the method of information synthesis and the overall presentation of the reasoning are often as important, if not more so, than the final conclusion. Existing methods fail to capture these nuances, offering an incomplete picture of an LLM’s true clinical utility.

This highlights a critical need for evaluation frameworks that move beyond these limitations. Specifically, medical LLM evaluations must become more dynamic, reflecting the most current medical questions and adapting to the iterative nature of clinical questioning, and more holistic, assessing overall response quality, including reasoning, multi-turn conversation, and clinical appropriateness, rather than merely scoring factual accuracy against a fixed answer set. How can we move toward an evaluation built on real-world clinician questions, with model responses judged by clinicians themselves?

To this end, we introduce MedArena.ai, a novel LLM evaluation platform specifically designed for clinical medicine. MedArena provides a free, interactive arena for clinicians to test and compare top-performing LLMs on their medical queries. 

How MedArena works

MedArena is open to clinicians only. To authenticate users, we partnered with Doximity, a networking service for medical professionals. Clinicians can sign in with their Doximity account or, alternatively, provide their National Provider Identifier (NPI) number. For an input query, the user is presented with responses from two randomly chosen LLMs and asked to specify which model they prefer (Figure 1). Our platform then aggregates these preferences into a leaderboard (Figure 2), ranking the LLMs against each other. To help clinicians see which LLMs they personally prefer, we also provide personal rankings computed from a user's own votes, once a minimum number of preferences has been submitted.
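To make the aggregation step concrete, the sketch below shows one way pairwise preferences could be turned into a ranking. The post does not specify MedArena's actual ranking method, so the Elo-style update, the K factor, and the model names here are illustrative assumptions only.

```python
# Minimal sketch: aggregating pairwise clinician preferences into a leaderboard.
# Assumes an Elo-style rating update; MedArena's real method may differ.

from collections import defaultdict

K = 32  # assumed update step size (hypothetical)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(preferences, initial_rating: float = 1000.0):
    """Fold a list of (preferred, rejected) votes into per-model ratings."""
    ratings = defaultdict(lambda: initial_rating)
    for winner, loser in preferences:
        exp_win = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - exp_win)
        ratings[loser] -= K * (1.0 - exp_win)
    return dict(ratings)

# Example: three votes between two hypothetical models.
votes = [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]
leaderboard = sorted(update_ratings(votes).items(), key=lambda kv: -kv[1])
print(leaderboard)
```

The same machinery works for personal rankings: restrict the vote list to a single clinician's preferences before computing ratings, and only display the result once enough votes have accumulated.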

