Synthetic medical education in dermatology leveraging generative artificial intelligence

Given recent scrutiny of standardized medical examinations, an abundant supply of high-quality educational material for standardized assessments such as the USMLE is more important than ever. However, creating new questions is resource-intensive, requiring experienced physicians to write clinical vignettes and multiple test administrations to evaluate how well question performance generalizes. Novel methods for producing large numbers of unique clinical vignettes are therefore needed.

In this study, we present promising evidence for the feasibility and effectiveness of using a large language model, GPT-4, as a source of synthetic medical education, offering the potential for accessible, customizable, and scalable educational resources. We provide the first analysis of GPT-4’s utility for clinical vignette generation, demonstrating that GPT-4’s inherent clinical knowledge extends to the creation of representative and accurate patient descriptions. For diseases tested in the Skin & Soft Tissue section of the USMLE Step 2 CK exam, GPT-4 generated vignettes that were highly accurate, highlighting the potential of LLMs to design vignettes that could eventually be incorporated into standardized medical examinations.
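
The study does not reproduce its exact prompt or generation pipeline in this summary, but a minimal sketch of the general approach, assuming the current OpenAI Python client and illustrative prompt wording and parameters, might look like this:

```python
# Minimal sketch (not the authors' exact prompt or pipeline): generating a
# USMLE-style vignette for a given dermatologic diagnosis with the OpenAI
# Python client. Prompt wording, temperature, and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_vignette(diagnosis: str) -> str:
    """Ask GPT-4 for a board-style clinical vignette plus a short explanation."""
    prompt = (
        f"Write a USMLE Step 2 CK-style clinical vignette for {diagnosis}. "
        "Include age, sex, presenting complaint, relevant history, and exam "
        "findings, then a brief explanation of the key teaching points. "
        "Do not name the diagnosis in the vignette itself."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(generate_vignette("psoriasis vulgaris"))
```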

Analysis of the generated vignettes revealed high ratings for alignment with scientific consensus, comprehensiveness, and overall quality, coupled with low ratings for potential clinical harm and demographic bias. Comprehensiveness and overall quality were strongly correlated (r = 0.83), underscoring the importance of thorough and detailed case presentations in medical education and highlighting the ability of LLMs to provide contextually relevant and complete scenarios for clinical reasoning.
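
For illustration only, a Pearson correlation of this kind is computed from per-vignette rating pairs; the scores below are placeholders rather than the study's data:

```python
# Illustrative only: Pearson correlation between per-vignette comprehensiveness
# and overall-quality ratings. The arrays below are placeholder values.
import numpy as np
from scipy.stats import pearsonr

comprehensiveness = np.array([4.0, 4.5, 3.5, 5.0, 4.0, 3.0])  # hypothetical 1-5 ratings
overall_quality = np.array([4.5, 4.5, 3.0, 5.0, 4.0, 3.5])

r, p_value = pearsonr(comprehensiveness, overall_quality)
print(f"r = {r:.2f}, p = {p_value:.3f}")
```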

The average vignette length was 145.79 ± 26.97 words. This is well within typical USMLE vignette length; examinees have on average 90 seconds to answer each question on the USMLE Step 1, Step 2 CK, and Step 3 exams. Vignettes were accompanied by longer explanations, showcasing the ability of LLMs to generate not just patient descriptions but also useful didactic material.
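
The reported length statistic is simply the mean and standard deviation of per-vignette word counts; a hedged sketch, using placeholder strings in place of the study's generated vignettes, is below:

```python
# Illustrative summary (not the study's code): mean +/- sample standard
# deviation of vignette word counts. The vignette strings are placeholders.
import numpy as np

vignettes = [
    "A 34-year-old woman presents with ...",  # placeholder text
    "An 8-year-old boy is brought in for ...",  # placeholder text
]
word_counts = np.array([len(v.split()) for v in vignettes])

print(f"{word_counts.mean():.2f} +/- {word_counts.std(ddof=1):.2f} words")
```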

While vignettes received overall low ratings from evaluators for the possibility of demographic bias (1.52, 95% CI: 1.31–1.72), the limited variety in patient demographics, highlighted by predominantly male patients and limited racial diversity, suggests a need for more deliberate efforts to include diverse patient representations. Specific inclusion of such efforts in prompt engineering and model training datasets is crucial to preparing students to serve as physicians in an inclusive healthcare environment. Our study did not account for patient diversity in LLM prompts; further, since a new chat session was started for each prompt, overall demographic balance could not be controlled by the LLM. Future iterations of this work should also investigate sources and manifestations of systemic bias in model output.
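
As one hypothetical form such prompt-level effort could take (not part of this study), a generation script could cycle through an explicit grid of patient attributes; the attribute lists, function name, and wording below are illustrative assumptions:

```python
# Hypothetical sketch: surfacing demographic diversity at the prompt level by
# cycling through an explicit grid of patient attributes. Attribute values and
# prompt wording are illustrative only, not specified by the study.
import itertools
import random

sexes = ["female", "male"]
ages = ["8-year-old", "34-year-old", "67-year-old"]
skin_types = ["Fitzpatrick type II", "Fitzpatrick type IV", "Fitzpatrick type VI"]

demographic_grid = list(itertools.product(sexes, ages, skin_types))
random.shuffle(demographic_grid)  # avoid always starting with the same profile

def diversity_prompt(diagnosis: str, sex: str, age: str, skin_type: str) -> str:
    return (
        f"Write a USMLE-style clinical vignette for {diagnosis} in a "
        f"{age} {sex} patient with {skin_type} skin, describing how the "
        "lesion appears on that skin type. Do not name the diagnosis."
    )

for sex, age, skin_type in demographic_grid[:3]:
    print(diversity_prompt("atopic dermatitis", sex, age, skin_type))
```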

While our initial pilot shows that GPT-4–generated vignettes display high clinical accuracy as judged by expert raters, LLM hallucinations may still produce inconsistencies when generation is deployed at scale. Additionally, LLMs are trained on the entire breadth of content available on the internet, which may not reflect the standard of care and could yield inaccurate responses. Deployment and widespread adoption of these models therefore necessitate careful screening; clinical experts may be employed, as in this study, to evaluate vignettes prior to use. Training data curated for the diagnoses of interest from expert-recommended content may also help refine model output and could facilitate the development of custom models.
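
Such a screening step could be as simple as a rating gate before a vignette enters a question bank; the sketch below assumes hypothetical per-rater accuracy scores on a 1–5 scale and an arbitrary cutoff, neither of which is specified by the study:

```python
# Minimal sketch of an expert-screening gate (rating scale, cutoff, and data
# layout are assumptions): keep a vignette only if every rater scores its
# clinical accuracy at or above the cutoff.
from statistics import mean

ACCURACY_CUTOFF = 4  # hypothetical threshold on a 1-5 accuracy scale

# vignette id -> list of per-rater accuracy scores (placeholder data)
ratings = {
    "vignette_01": [5, 4, 5],
    "vignette_02": [3, 4, 2],
}

approved = [vid for vid, scores in ratings.items() if min(scores) >= ACCURACY_CUTOFF]
mean_scores = {vid: round(mean(ratings[vid]), 2) for vid in approved}
print("Approved for use:", approved, "| mean scores:", mean_scores)
```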

A key limitation of this study is the composition of our expert rater panel, which included only one dermatologist alongside two attending physicians from internal medicine and emergency medicine. While these non-dermatologist raters frequently diagnose and manage common skin conditions in their specialties, and are familiar with the standard presentations of the conditions evaluated here through their inclusion on national board examinations, their expertise may not encompass the full spectrum of dermatologic disease. As a result, their assessments were likely most reliable for clear-cut, board-style presentations but may have been less sensitive to subtle diagnostic nuances. Future studies would benefit from a larger proportion of dermatologists to ensure a more specialized evaluation of AI-generated cases.

Overall, this work demonstrates that off-the-shelf LLMs like GPT-4 hold great potential for clinical vignette generation for standardized examination and teaching purposes. Fit-for-purpose LLMs trained on more specific datasets may further enhance these capabilities. The high accuracy and efficiency of “synthetic education” make it a promising answer to the current limitations of traditional approaches to generating medical educational materials.
