CASE STUDIES
Rubrics to Prompts: Assessing Medical Student Post-Encounter Notes with AI
A.R. Jamieson and Others
Abstract
This case study, conducted at UT Southwestern Medical Center’s Simulation Center, describes the first successful prospective deployment of a generative artificial intelligence (AI)–based automated grading system for medical student post-encounter Objective Structured Clinical Examination (OSCE) notes. The OSCE is a standard approach to measuring the competence of medical students through their participation in live-action, simulated patient encounters with human actors. The post-encounter learner note is a vital element of the OSCE, and accurate assessment of student performance requires specially trained human evaluators, which demands a significant investment of labor and time. The Simulation Center at UT Southwestern provides a compelling platform for observing the benefits and challenges of AI-based enhancements in medical education at scale. To that end, we prospectively activated a first-pass AI grading system at the center for 245 (preclerkship) medical students participating in a 10-station fall 2023 OSCE session. Our inaugural deployment of the AI notes grading system reduced human effort by an estimated 91% (as measured by gradable items) and dramatically reduced turnaround time (from weeks to days). Conceived as a zero-shot large language model architecture with minimal prompt engineering, the system requires no prior domain-specific training data and can be readily adapted for new evaluation rubrics, opening the door to scaling this approach to other institutions. Confidence in our zero-shot Generative Pretrained Transformer 4 (GPT-4) framework was established through pre-deployment retrospective evaluations. On OSCE data from prior years, the system achieved up to 89.7% agreement with human expert graders at the rubric item level (Cohen’s kappa, 0.79) and a Spearman’s correlation of 0.86 with the total examination score.
We also demonstrate that local, smaller, open-source models (such as Llama-2-7B) can be fine-tuned via knowledge distillation from frontier models like GPT-4 to achieve similar performance, which has important operational implications for scalability, data privacy, security, and model control. These achievements were the result of a strategic, multiyear effort to pivot toward AI that began before ChatGPT’s release. In addition to highlighting the model’s performance and capabilities (including a retrospective analysis of 1124 students, 10,175 post-encounter notes, and 156,978 scored items), we share observations on the development and pre-launch sign-off of an AI deployment protocol for our program. (Funded by UT Southwestern institutional funds and others.)
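The abstract reports three agreement statistics between AI and human graders: item-level percent agreement, Cohen's kappa, and Spearman's correlation on total scores. The following is a minimal sketch of how such statistics are commonly computed, not the authors' evaluation code. All function names and all data values are hypothetical and chosen only for illustration; they do not come from the study.

```python
from collections import Counter
from math import sqrt


def percent_agreement(a, b):
    """Fraction of items on which the two graders gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) if both graders use a single identical label.
    """
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each grader's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    return cov / sqrt(var_x * var_y)


# Made-up item-level grades (1 = credit, 0 = no credit), for illustration only.
human_items = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
ai_items = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]

# Made-up total scores for five notes, for illustration only.
human_totals = [12, 15, 9, 14, 11]
ai_totals = [11, 15, 10, 13, 12]

print(f"agreement = {percent_agreement(human_items, ai_items):.2f}")
print(f"kappa     = {cohens_kappa(human_items, ai_items):.2f}")
print(f"spearman  = {spearman(human_totals, ai_totals):.2f}")
```

Kappa is typically reported alongside raw agreement because raw agreement can look high when one label dominates; kappa discounts the agreement expected by chance from each grader's label frequencies.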
DOI: 10.1056/AIcs2400631
Full text: https://ai.nejm.org/doi/abs/10.1056/AIcs2400631
NEJM AI, Volume 1 No. 12 December 2024