Large language models and applications in healthcare
Risks and benefits of using LLMs like ChatGPT in the medical field
Large Language Models (LLMs) are increasingly being utilized in the field of medicine. Recently, there has been a lot of excitement and attention around LLMs such as OpenAI’s GPT-4 and Google’s Med-PaLM 2 demonstrating high proficiency on the United States Medical Licensing Examination (USMLE). In response, numerous potential use cases have been proposed for LLMs in medicine, such as serving as resourceful assistants, interpreting complex medical queries, and assisting in medical education.
Their capability to encode enormous amounts of medical knowledge hints at a future where LLMs can provide critical support in clinical decision-making and medical learning. While LLMs have the potential to transform the medical field, the hype (such as news headlines suggesting that medical chatbots may replace doctors) around them makes it difficult to assess their potential benefits and harms.
Regardless, such language models are increasingly being incorporated into healthcare systems. In 2023 alone, GPT-4 was integrated into healthcare systems and software as a digital assistant, including:
Epic Health Systems to help doctors communicate asynchronously with patients
eClinicalWorks to help patients converse with their health records using natural language questions
Nabla Copilot to help healthcare practitioners transcribe consultations in real-time and generate a summary
This article explores the rapidly evolving landscape of LLMs in the medical domain. Here's what we'll cover:
An overview of LLM approaches in the medical field: general models, fine-tuned models, and domain-specific models trained from scratch
What model evaluations are missing and why human expert evaluations are critical in medical applications
Benefits and downsides of using smaller, domain-specific models vs. larger, general models
The short shelf life of research, and what all of this means going forward
Overview of LLMs in the medical domain
This article focuses on large language models in the medical domain. The scope of the article is as follows:
Large Language Model (LLM): In the context of this article, I focus on generative large language models, or those that were trained to generate text (in contrast to models trained for classification or extraction)
Large = debatable what is meant by “large”, but Wikipedia refers to LLMs as neural networks with “typically billions of weights or more”
Language = large quantities of unlabeled text
Model = deep neural networks, typically a Transformer architecture
Medical domain: In the context of this article, I focus on text-based domains in the field of medical knowledge. This includes electronic health records, biomedical papers, and medical textbooks. While machine learning is used in a variety of other important and innovative ways in healthcare (such as in medical imaging, synthetic X-ray generation, and AI-powered drug discovery by predicting protein structures), this article focuses on the application of machine learning in text-based medical scenarios.
I group LLMs used in medical domains into three categories. At the end of the article, I provide more detailed descriptions of each category: how they’re trained, how they’re evaluated, and more examples of models per category.
1. Large, general models used with a few examples
Examples: OpenAI’s ChatGPT, GPT-4
Large, general models (such as ChatGPT) were trained on billions of webpages on the Internet. The training data most likely included some medical knowledge (such as medical textbooks, WebMD, biomedical journal articles, and subreddit pages for medical questions), but the model was not trained specifically for the medical domain. These models “encode” some medical knowledge and are able to answer medical-related questions by prompting with a few examples (also known as “few-shot” or “in-context” learning).
2. Large, general models with domain-specific fine-tuning
Examples: Google’s Med-PaLM and Med-PaLM 2
Large, general models (such as Google’s PaLM model) were trained on billions of webpages on the Internet, similar to ChatGPT. Afterwards, the models were additionally trained on a small amount of medical-related text. Google used 40 examples carefully curated by medical experts to train Med-PaLM (a process known as “instruction prompt tuning”). The resulting model is better aligned to the medical domain.
3. Slightly-less-large, domain-specific models trained from scratch
Examples: Microsoft’s BioGPT, Meta’s Galactica, NVIDIA’s GatorTron, ClinicalT5
Large, domain-specific models were trained from scratch on domain-specific texts (which could be biomedical journal articles or electronic health records). These models tend to be smaller than the first two groups above, but are more efficient, cheaper, and sometimes better on specific tasks than the more general models.
Medical benchmarks and tests only tell part of the story
Some of the models, including GPT-4 and Med-PaLM 2, were assessed using United States Medical Licensing Examination (USMLE) practice questions or benchmark datasets such as MedQA.1
Recent discussions by the USMLE organization revealed that the questions from MedQA and the USMLE practice questions were not “representative of the entire depth and breadth of USMLE exam content as experienced by examinees.”
For example, the models were tested on versions of the tests that did not include questions using pictures, heart sounds, and computer-based clinical skill simulations.
Researchers from Microsoft and OpenAI found that GPT-4 performed well on USMLE sample test questions, including those containing media elements such as graphs, photographs, and charts, even when these images were not shown to GPT-4.2 The authors claim that this is due to GPT-4 employing “logical reasoning and test-taking strategies”. However, this is concerning, given that real-life medical scenarios rarely have a clear-cut “correct” answer in the way that tests are designed to have. Further, the details contained in non-text media such as graphs or heart sounds may be crucial information for a medical professional’s decision-making.
Another potential issue with using medical benchmarks and tests is data contamination. In the context of machine learning and LLMs, data contamination refers to instances when the model's training data includes information from the test set or very similar information. This can make it difficult to accurately evaluate the model's performance, as it may have effectively “seen” the test data during training.
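To make this concrete, here is a minimal sketch (pure Python; the function names, the 8-gram size, and the 30% threshold are all illustrative choices of mine) of the kind of n-gram overlap check used to estimate contamination:

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_questions, training_corpus, n=8, threshold=0.3):
    """Fraction of test questions whose n-grams substantially overlap the
    training corpus -- a rough proxy for data contamination."""
    corpus_ngrams = ngrams(training_corpus, n)
    flagged = total = 0
    for question in test_questions:
        question_ngrams = ngrams(question, n)
        if not question_ngrams:
            continue
        total += 1
        overlap = len(question_ngrams & corpus_ngrams) / len(question_ngrams)
        if overlap >= threshold:
            flagged += 1
    return flagged / total if total else 0.0
```

Note that a check like this requires access to the training corpus; for closed models like GPT-4, whose training data is not public, outside researchers cannot run it, which is part of why contamination is so hard to assess.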
Models like GPT-4 are trained on Internet data, and given that these datasets exist online, it is possible that GPT-4 has already encountered some version of the tests during its training. The OpenAI and Microsoft researchers claim that such contamination is unlikely to exist.3 However, it is difficult to properly assess the data contamination of LLMs, and many examples online point to how GPT-4 is likely to have some level of data contamination (for example, how GPT-4 performs worse on coding problems not in its training data).
Human evaluations are essential for safe deployment in the medical field
Human evaluations, particularly from medical experts, are essential for properly assessing LLM outputs in the medical field. Involving professionals who can analyze the outputs of these models in realistic scenarios, rather than simply relying on standardized tests, can reveal critical gaps in the models' ability to provide safe and accurate medical care.
An ER doctor’s trial of ChatGPT in real-world scenarios, such as diagnosing his patients, highlighted this need. Even though ChatGPT passed the USMLE, its real-world performance was concerning: it correctly diagnosed only about half of the patients, indicating a high risk of error. While potentially useful as an assistant, it could also worsen patient outcomes.
In a separate experiment, GPT-3 and GPT-4 were tested on the Japanese national medical licensing exams. Despite the models’ high scores on the test, they occasionally suggested inappropriate actions, such as euthanasia, which is illegal in Japan. A comprehensive understanding of country-specific statistics and systems is crucial for medical practice. These examples underscore the importance of human evaluation for catching potentially serious missteps.
Addressing such concerns, researchers from Google subjected Med-PaLM to extensive human evaluations, marking an encouraging move in the right direction. However, similarly rigorous human evaluation has not been conducted on GPT-4 (for example, the authors testing GPT-4 on medical problems did not include human evaluations). For the safe application of LLMs in the medical field, whether in diagnostics, treatment, or any other area, the continual inclusion of expert human evaluations is crucial.
“Professional exams are not a valid way to compare human capabilities with bots”, write Arvind Narayanan and Sayash Kapoor on their blog, AI Snake Oil. Professional exams such as the USMLE overemphasize subject-matter knowledge (which language models optimize for) and underemphasize real-world skills (which are more difficult to measure using standardized tests). Especially in fields such as medicine, it is crucial not to delegate decision-making entirely to a model.
A conundrum: to use larger, general models or smaller, specific models?
“Using in-context learning with extremely large language models, like GPT-3, is not a sufficient replacement for finetuned specialized clinical models”
— from Do We Still Need Clinical Language Models?
In what scenarios is it better to use a large general-purpose model like GPT-3, which was trained on a lot of different Internet data in addition to medical data, and when is it better to use a smaller, domain-specific model trained specifically on biomedical or clinical texts?
Some researchers argue that small, specialized clinical models outperform in-context learning approaches on LLMs such as GPT-3. Smaller T5-based models outperformed GPT-3 on clinical tasks (such as answering questions regarding radiology reports), despite being 227x smaller. In other biomedical benchmark tasks, BioBERT outperformed GPT-3, despite being 514x smaller.
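For a sense of what the specialized approach involves, below is a minimal sketch of sequence-to-sequence fine-tuning with the Hugging Face transformers library. The "t5-small" checkpoint and the single invented question-answer pair are stand-ins for illustration, not the actual models or data from these studies:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stand-in checkpoint; the cited work fine-tunes clinical T5 variants.
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# One supervised example: clinical question in, answer out (invented).
inputs = tokenizer("question: What does an elevated troponin level suggest?",
                   return_tensors="pt")
labels = tokenizer("Possible myocardial injury.", return_tensors="pt").input_ids

# A single gradient step of standard sequence-to-sequence fine-tuning.
loss = model(**inputs, labels=labels).loss
loss.backward()
```

Repeated over thousands of labeled clinical examples, this updates the small model’s weights for the target task, which is precisely the step that in-context learning skips.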
This raises the question of when larger, general models or smaller, specific models are needed. Smaller models are more parameter-efficient than larger, domain-agnostic models. Additionally, smaller models tend to avoid many of the challenges of larger NLP models, such as high computational and storage requirements and a high carbon footprint.
Research on older versions of LLMs becomes quickly obsolete
The rapid pace of model development and research means that many studies examining the behaviors of models quickly become obsolete. Research before 2022 using GPT-3 or GPT-3.5 may no longer be applicable to the newer GPT-4.
For instance, the smaller BioBERT and T5 models tested on clinical tasks above outperformed GPT-3 (text-davinci-003). However, it is unclear whether these results would hold with the introduction of a newer model, such as GPT-4.
Furthermore, as GPT-4 continues to evolve, it undergoes continuous training and updates. Given its black-box nature, it's unclear whether a problem identified in one model version will manifest in another, or conversely, whether a test that passes on an older version will still pass on a newer iteration. Any updates or improvements made to GPT-4 (currently one of the more popular closed-source LLMs) can potentially affect all of its results. This includes not only benchmark datasets like MedQA and medical tests such as the USMLE but also human evaluations and real-world applications.
Such concerns raise the importance of open science and using open-source models (which GPT-4 is not), but that is beyond the scope of this article.
What does this mean going forward?
Despite these concerns, the potential benefits of LLMs in healthcare are huge. They can help doctors with time-consuming tasks and aid in expanding medical datasets. They can answer complex medical questions, aid in research, and could soon become invaluable tools for doctors.
LLMs can streamline administrative tasks that currently burden clinicians, freeing up more time for patient care. Models such as GPT-3 and LLaMA (Meta’s LLM) have been creatively utilized in various medical contexts. Researchers have used these LLMs to synthetically generate more training data to expand existing datasets and improve models for tasks such as medical question-answering and medical dialogue summarization.
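As a rough sketch of the synthetic-data idea (the `generate` callable, the prompt wording, and the JSON output format are my assumptions, not the pipelines from these papers, which typically also filter the generated data for quality):

```python
import json

def make_synthetic_dialogues(generate, seed_dialogues, n_per_seed=3):
    """Ask an LLM to paraphrase real doctor-patient dialogues into new
    synthetic training examples. `generate` is any text-in/text-out LLM
    call (e.g., a thin wrapper around an API client)."""
    synthetic = []
    for dialogue in seed_dialogues:
        prompt = (
            f"Rewrite the following doctor-patient dialogue as {n_per_seed} "
            "new, plausible dialogues on the same topic. "
            "Return a JSON list of strings.\n\n" + dialogue
        )
        synthetic.extend(json.loads(generate(prompt)))
    return synthetic
```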
The potential of LLMs extends beyond traditional healthcare systems. For instance, HealthGPT, an experimental iOS app developed by a Stanford student, demonstrates the interactive potential of GPT-3.5 and GPT-4 with Apple Health data.
These examples underscore the dynamic ways in which LLMs are transforming healthcare. Despite the hurdles, it's clear that the fusion of LLMs with medical practice holds vast potential, promising to enhance both patient care and healthcare efficiency.
Going forward, it is important to consider the following:
Regarding medical benchmark tests, it is important to examine the questions that LLMs like ChatGPT, Med-PaLM, and others answer incorrectly, as these models' inaccuracies can pose serious issues. We must identify which questions they fail on, and the potential reasons behind these mistakes. Moreover, we should discern whether recurring patterns or biases influence these incorrect responses.
It is imperative to include human experts (doctors, nurses, clinical researchers, and other medical experts) in model evaluation processes. Medical benchmark tests such as MedQA or the USMLE are not sufficient on their own for determining a model’s capability to perform well in a medical setting.
Even if an LLM is vetted by a medical professional for its clinical accuracy and knowledge, it should be used in a medical setting only with close human supervision.
Large Language Models hold significant potential to revolutionize healthcare, yet their deployment needs to be executed with caution. Their ability to streamline administrative tasks, expand medical datasets, and aid in research is promising. However, their effectiveness is reliant on the continual inclusion of human expert evaluations and supervision to ensure safe and accurate medical use. As we navigate this evolving landscape, a balanced approach, incorporating both human expertise and AI capabilities, will be key.
Note: There are many more models and much more research out there than I was able to capture in this article! The field of AI and medicine/healthcare is extremely large and growing quickly, so I tried my best to condense what I thought were the most important parts into this article.
A more detailed overview of LLMs in the medical domain
1. In-context learning: LLMs encode medical knowledge
Examples: GPT-3, GPT-4
“GPT-4 is a general purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks”
— from Capabilities of GPT-4 on Medical Challenge Problems
How: Train the largest language model possible by throwing all sorts of text at it (such as Common Crawl, which consists of billions of webpages scraped from the Internet). The training data most likely includes medical knowledge (e.g. medical textbooks, WebMD, biomedical journal articles, subreddit pages for medical questions).
The model is evaluated by prompting it to solve a new task (such as taking the USMLE) by feeding it a few examples of that task. This procedure is known as few-shot learning or in-context learning and does not require the model to be fine-tuned or trained on the specific task.
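Concretely, a few-shot prompt is nothing more than worked examples concatenated ahead of the new question. The helper function and placeholder vignettes below are illustrative, not taken from any of the cited evaluations:

```python
def build_few_shot_prompt(examples, new_question):
    """Concatenate a few worked examples before the new question; the
    model infers the task format from context alone, with no weight
    updates."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {new_question}\nAnswer:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    [("(abbreviated clinical vignette 1)", "C"),
     ("(abbreviated clinical vignette 2)", "A")],
    "(new USMLE-style vignette)",
)
# `prompt` is then sent to the model's completion endpoint as-is.
```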
Researchers from Microsoft and OpenAI demonstrated GPT-4’s ability to perform well in the medical field.4 With a little bit of prompt engineering (and without any additional training), GPT-4 was able to
Exceed the passing score on the official USMLE exam questions by over 20 points
Score highly on the MedQA benchmark
Conduct medical reasoning
Personalize explanations to medical students
Craft counterfactual scenarios around medical cases
In fact, GPT-4 was so good that it passed the Japanese medical licensing examinations and Korean general surgery board exams.
Researchers and practitioners have found that earlier versions of the model (GPT-3 and GPT-3.5) are also capable of reasoning through USMLE questions, answering patient questions, and generating synthetic medical dialogue data.
2. Fine-tuning LLMs for specific domains
Examples: Med-PaLM, Med-PaLM 2
How: Train the largest language model possible (in this example, this is Google’s PaLM or PaLM 2). Then, align it to the medical domain by adapting a few of the parameters using a method of fine-tuning called instruction prompt tuning. Essentially, it is an efficient approach for updating some of the model’s parameters using a few examples of carefully curated prompts created by clinical experts.
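Below is a minimal PyTorch sketch of the core idea: the base model’s weights stay frozen, and only a small matrix of continuous prompt vectors is trained. The class, the dimensions, and the Hugging Face-style inputs_embeds call are illustrative assumptions, not Med-PaLM’s actual implementation:

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prompt tuning in miniature: freeze the base LM and train only a
    small matrix of continuous 'soft prompt' vectors that get prepended
    to the input embeddings."""
    def __init__(self, base_lm, num_prompt_tokens=20, embed_dim=4096):
        super().__init__()
        self.base_lm = base_lm
        for param in self.base_lm.parameters():
            param.requires_grad = False        # base model stays frozen
        self.soft_prompt = nn.Parameter(
            torch.randn(num_prompt_tokens, embed_dim) * 0.01)

    def forward(self, input_embeds):
        # Prepend the learned prompt vectors to every sequence in the batch.
        batch_size = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return self.base_lm(
            inputs_embeds=torch.cat([prompt, input_embeds], dim=1))
```

Because only the soft prompt is updated, a handful of carefully curated examples can be enough to steer a very large frozen model.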
The model is evaluated similarly to GPT-4, using in-context learning.
Researchers from Google fine-tuned PaLM on only 40 examples (carefully curated by medical professionals) to create Med-PaLM, an LLM designed for the medical domain. The examples were questions sampled from a medical question answering dataset and labeled by a panel of five clinicians from the US and UK (they were asked to “provide exemplar answers” to the questions). Similarly, when PaLM 2 was released a few months later, it was fine-tuned to create Med-PaLM 2.
Google evaluated Med-PaLM and Med-PaLM 2 on a collection of seven medical question answering datasets, including MedQA (medical problems collected from professional medical board exams). The scores of Med-PaLM 2 vs. GPT-4 on MedQA were extremely close. The authors of Med-PaLM included a detailed human evaluation of model outputs by clinical experts.
Other fine-tuned models in this category:
John Snow Labs’ BioGPT-JSL, which fine-tuned a BioGPT language model with medical conversations for clinical question answering
PMC-LLaMA, which fine-tuned a LLaMA language model on biomedical papers for medical question answering
3. Training models from scratch
Examples: BioGPT, GatorTron, ClinicalT5
How: Train a large language model (but not the largest language model) from scratch on a domain-specific dataset, which could be anything from biomedical journals to electronic health records. The model is trained to be proficient in a particular domain without wasting parameters on parts of language that are not relevant to the medical field.
Then, the model is fine-tuned on specific downstream tasks, such as medical question answering. The model is evaluated directly on the final task.
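The snippet below sketches step one: initialize a small GPT-2-style model with random weights and take one step of the standard next-token-prediction objective. The configuration sizes and the reused GPT-2 tokenizer are illustrative stand-ins; models like BioGPT train their own domain-specific vocabulary on biomedical text:

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

# Illustrative small GPT-2-style configuration (BioGPT itself uses a
# GPT-2 architecture, but these exact sizes are stand-ins).
config = GPT2Config(n_layer=12, n_head=12, n_embd=768)
model = GPT2LMHeadModel(config)  # randomly initialized, no pretrained weights

tokenizer = AutoTokenizer.from_pretrained("gpt2")
batch = tokenizer("Aspirin irreversibly inhibits cyclooxygenase.",
                  return_tensors="pt")

# Standard next-token-prediction pretraining objective: the labels are
# the input tokens themselves, shifted internally by the model.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```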
These models trained from scratch tend to be smaller and more domain specific, trained specifically on journal articles or clinical notes.
Biomedical Texts
Models based on another Transformer architecture called BERT, such as Microsoft’s PubMedBERT (110M parameters) and NVIDIA’s BioMegatron (345M parameters), were trained on PubMed articles
Microsoft’s BioGPT was trained on 15M PubMed articles (374M parameters).
“Currently we cannot follow the GPT-3 setting due to its extremely large model with 175 billion parameters.”
Meta’s Galactica, trained on a combination of “48 million papers, textbooks and lecture notes, millions of compounds and proteins, scientific websites, encyclopedias, and more” (120B parameters). The public demo of Galactica was removed three days after its introduction
Clinical Texts
ClinicalBERT (110M parameters) was trained on the Medical Information Mart for Intensive Care III (MIMIC III)5 dataset.
ClinicalT5 (220M parameters for the small version, 770M for the large version) was trained on MIMIC III notes
NVIDIA’s GatorTron (345M parameters for the small version, 8.9B for the large version) was trained on a collection of de-identified clinical notes from University of Florida Health, PubMed articles, and Wikipedia
What my company, Truveta, is currently working on!
MedQA consists of multiple-choice questions based on the United States Medical Licensing Examination (USMLE), collected from several websites providing exam question banks.
Jin, D., Pan, E., Oufattole, N., Weng, W. H., Fang, H., & Szolovits, P. (2021). What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14), 6421.
GPT-4 achieved 87.77% accuracy on text-only questions compared to 79.59% on questions referencing visual media.
Nori, H., King, N., McKinney, S. M., Carignan, D., & Horvitz, E. (2023). Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.
“Given the findings obtained via the MELD procedure, and our sourcing of USMLE examination materials that are held behind an NBME paywall, it is unlikely that official content in our examinations were in GPT-4’s training data. We further note that, even if contamination is present, GPT-4’s performance on USMLE examinations may not be significantly boosted” (Nori et al., 2023).
Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., et al. (2022). Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data.