Large Language Models (LLMs) hold great promise for transforming current clinical systems, owing to their strong performance on medical text processing tasks and medical licensing exams. However, traditional machine learning (ML) models such as SVM and XGBoost remain the dominant choice for clinical prediction tasks. This raises the question: Can LLMs beat traditional ML models in clinical prediction? To address this, we introduce ClinicalBench, a new benchmark designed to comprehensively evaluate the clinical predictive modeling capabilities of both general-purpose and medical LLMs and to compare them against traditional ML models. ClinicalBench covers three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Our extensive empirical investigation reveals that both general-purpose and medical LLMs, regardless of model scale and despite diverse prompting and fine-tuning strategies, still cannot surpass traditional ML models in clinical prediction, exposing potential deficiencies in their clinical reasoning and decision-making. We therefore urge practitioners to exercise caution when adopting LLMs in clinical applications. ClinicalBench aims to bridge the gap between LLM development for healthcare and real-world clinical practice.
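For intuition, below is a minimal Python sketch of the kind of head-to-head comparison the benchmark formalizes: a traditional ML baseline trained on structured features versus an LLM prompted on the same records. The synthetic data, the outcome label, and the `llm_predict` helper are illustrative assumptions for exposition, not the authors' actual pipeline or data.

```python
# A minimal, hypothetical sketch (not the ClinicalBench code) of comparing
# a traditional ML model against a prompted LLM on one prediction task.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for structured EHR features (e.g., labs, vitals)
# with a binary outcome such as in-hospital mortality.
X = rng.normal(size=(1000, 16))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Traditional ML baseline: XGBoost trained directly on the features.
xgb = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
xgb.fit(X_tr, y_tr)
print("XGBoost macro-F1:", f1_score(y_te, xgb.predict(X_te), average="macro"))

def llm_predict(features: np.ndarray) -> int:
    """Hypothetical LLM call: serialize a patient record into a prompt and
    parse the model's yes/no answer. Replace with a real API client."""
    prompt = ("Patient features: "
              + ", ".join(f"{v:.2f}" for v in features)
              + ". Will this patient die in hospital? Answer yes or no.")
    answer = "no"  # placeholder for the LLM's response to `prompt`
    return int(answer.strip().lower().startswith("yes"))

llm_preds = [llm_predict(row) for row in X_te]
print("LLM macro-F1:", f1_score(y_te, llm_preds, average="macro"))
```

In this setup, both model families are scored with the same metric on the same held-out split, which is the comparison underlying the benchmark's central question.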
@article{chen2024clinicalbench,
  title   = {ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?},
  author  = {Canyu Chen and Jian Yu and Shan Chen and Che Liu and Zhongwei Wan and Danielle Bitterman and Fei Wang and Kai Shu},
  year    = {2024},
  journal = {arXiv preprint arXiv:2411.06469}
}