ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

TLDR: We discover that both general-purpose and medical LLMs, across different model scales and diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction, shedding light on their potential deficiencies in clinical reasoning and decision-making.


1. Illinois Institute of Technology 2. Emory University 3. Mass General Brigham & Boston Children’s Hospital, Harvard Medical School 4. Weill Cornell Medicine, Cornell University 5. Imperial College London 6. Ohio State University
* Equal contribution
Figure: Framework of ClinicalBench.


Abstract

Large Language Models (LLMs) hold great promise to revolutionize current clinical systems due to their superior capabilities in medical text processing tasks and medical licensing exams. However, traditional machine learning (ML) models such as SVM and XGBoost remain predominantly used in clinical prediction tasks. This raises the question: Can LLMs outperform traditional ML models in clinical prediction? To address this, we introduce ClinicalBench, a new benchmark designed to comprehensively evaluate the clinical predictive modeling capabilities of both general-purpose and medical LLMs, comparing them to traditional ML models. ClinicalBench includes three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Our extensive empirical investigations reveal that both general-purpose and medical LLMs, regardless of model scale or diverse prompting and fine-tuning strategies, are still unable to surpass traditional ML models in clinical prediction tasks, highlighting potential deficiencies in clinical reasoning and decision-making. We urge practitioners to exercise caution when adopting LLMs in clinical applications. ClinicalBench aims to bridge the gap between LLM development for healthcare and real-world clinical practice.
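As a point of reference, below is a minimal sketch of the kind of traditional ML baseline the abstract names (an SVM pipeline in scikit-learn). The synthetic features and binary outcome label are illustrative placeholders, not ClinicalBench's actual tasks, data, or preprocessing:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Stand-ins for tabular clinical features and a binary outcome label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# SVM is one of the traditional baselines mentioned in the abstract;
# scaling features first is standard practice for SVMs.
clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X_train, y_train)
print("Macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))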



Our Contributions


  • We built a new benchmark, ClinicalBench, covering 14 general-purpose LLMs, 8 medical LLMs, 11 traditional ML models, three tasks, and two databases; it makes the first attempt to compare the clinical prediction capacities of LLMs and traditional ML models head-to-head.
  • We discover that both general-purpose and medical LLMs, across different model sizes and prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction; a sketch of such a prompting setup appears after this list.
  • Our findings reveal a potential deficiency of both general-purpose and medical LLMs in real-world clinical reasoning and decision-making, even though these models can achieve almost clinician-level performance on medical licensing exams and clinical case challenges. We call for caution when adopting LLMs in practical clinical applications. ClinicalBench can be leveraged to bridge the gap between the development of LLMs for healthcare and real-world clinical practice.
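A hedged sketch of how an LLM could be prompted for the same kind of binary clinical prediction; query_llm is a hypothetical stand-in for any chat-completion API, and the prompt template is illustrative rather than ClinicalBench's exact one:

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder: in practice, call an actual LLM endpoint here.
    return "No"

def predict_mortality(patient_note: str) -> int:
    # Format the record as a Yes/No question and parse the free-text answer.
    prompt = (
        "You are a clinical assistant. Based on the admission note below, "
        "will this patient die during the hospital stay? Answer Yes or No.\n\n"
        f"Note: {patient_note}"
    )
    answer = query_llm(prompt).strip().lower()
    return 1 if answer.startswith("yes") else 0

print(predict_mortality("72-year-old admitted with community-acquired pneumonia."))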

BibTeX

@article{chen2024clinicalbench,
      title   = {ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?},
      author  = {Canyu Chen and Jian Yu and Shan Chen and Che Liu and Zhongwei Wan and Danielle Bitterman and Fei Wang and Kai Shu},
      year    = {2024},
      journal = {arXiv preprint arXiv:2411.06469}
    }