Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports (2024)

Yutong Zhang¹*, Yi Pan²,³*, Tianyang Zhong⁴*, Peixin Dong⁴†, Kangni Xie⁵†, Yuxiao Liu⁶,⁷†, Hanqi Jiang², Zhengliang Liu², Shijie Zhao⁴, Tuo Zhang⁴, Xi Jiang³, Dinggang Shen⁶,⁸,⁹, Tianming Liu², Xin Zhang¹‡

¹ Institute of Medical Research, Northwestern Polytechnical University, Xi'an 710072, China
² School of Computing, University of Georgia, GA, USA
³ School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
⁴ School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
⁵ School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710072, China
⁶ School of Biomedical Engineering, ShanghaiTech University, Shanghai 201210, China
⁷ Lingang Laboratory, Shanghai 200031, China
⁸ Shanghai United Imaging Intelligence Co., Ltd.
⁹ Shanghai Clinical Research and Trial Center

* Co-first authors. † Co-second authors. ‡ Corresponding author. E-mail: xzhang@nwpu.edu.cn

Abstract

Medical images and radiology reports are essential for physicians to diagnose medical conditions, emphasizing the need for quantitative analysis to support clinical decision-making. However, the vast diversity and cross-source heterogeneity inherent in these data pose significant challenges to the generalizability of current data-mining methods. Recently, multimodal large language models (MLLMs) have revolutionized numerous domains, significantly impacting the medical field. Notably, the Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models epitomize a paradigm shift in Artificial General Intelligence (AGI) for computer vision, showcasing their potential in the biomedical domain. In this study, we conducted an exhaustive evaluation of Gemini, GPT-4, and 4 other popular large models across 14 datasets, comprising 11 medical imaging datasets covering 5 imaging categories (dermatology, radiology, dentistry, ophthalmology, and endoscopy) and 3 radiology report datasets. The investigated tasks encompass disease classification, lesion segmentation, anatomical localization, disease diagnosis, report generation, and lesion detection. Our experimental results demonstrated that Gemini-series models excelled in report generation and lesion detection but faced challenges in disease classification and anatomical localization. Conversely, GPT-series models exhibited proficiency in lesion segmentation and anatomical localization but encountered difficulties in disease diagnosis and lesion detection. Additionally, both the Gemini series and the GPT series contain models that demonstrated commendable generation efficiency. While both model families hold promise in reducing physician workload, alleviating pressure on limited healthcare resources, and fostering collaboration between clinical practitioners and artificial intelligence technologies, substantial enhancements and comprehensive validation remain imperative before clinical deployment.

1 Introduction

Recent advancements in natural language processing (NLP) have shifted from monomodal (i.e., text-only) to multimodal large language models (MLLMs), marking a significant paradigm shift in artificial general intelligence (AGI) research. Beginning with text, language has long been a distinguishing feature of human intelligence. With the advancement of artificial intelligence (AI), particularly in NLP, machines are becoming increasingly capable of understanding and processing language. Over the past few years, pre-trained language models (PLMs) based on self-attention mechanisms and the Transformer framework [93] have emerged and rapidly gained popularity. PLMs can learn general language representations from large-scale data in an unsupervised manner, which facilitates various downstream NLP tasks without the need to retrain new models [27]. Notably, when the scale of training data and parameters exceeds a certain threshold, language models exhibit significant performance improvements and acquire capabilities absent in smaller models, such as in-context learning. We refer to such models as large language models (LLMs). LLMs like GPT-3 [8] and its derivatives (e.g., InstructGPT [64], also known as ChatGPT), the Llama series (i.e., Llama [88], Llama 2 [89], and Llama 3 [2]), and the PaLM series (i.e., PaLM [16] and PaLM 2 [5]) have laid the groundwork by demonstrating exceptional text interpretation and generation capabilities [120].

The advent of the era of LLMs has brought us closer to the dawn of AGI [48, 25]. Academia and industry are experiencing vibrant competition and diverse developments, from the early Google T5 model [71] to the highly acclaimed OpenAI GPT series today. The parameter scales of these models have long surpassed the billion mark, and their generative and learning capabilities are revealing emergent abilities [100] and are increasingly being applied across various sectors. LLMs have demonstrated exceptional proficiency in understanding and generating natural language, providing foundational solutions for specific domains such as law [15, 19, 30], education [81, 60, 29], and public healthcare [115, 52, 56]. However, each domain presents unique challenges, and directly using pre-trained LLMs may not yield ideal results. Fine-tuning these models, considering their inherent complexity, enables them to better adapt to downstream tasks and is a key approach to leveraging large models [122, 119, 107].

Despite their exceptional proficiency in zero/few-shot reasoning in most NLP tasks, LLMs face challenges in processing visual information as they can only understand discrete text. Meanwhile, large-scale visual foundation models have made significant advancements in perception, leading to the gradual integration of monomodal LLMs and visual models, ultimately giving rise to the emergence of MLLMs[95]. MLLMs are models based on LLMs that can receive and reason with multimodal information, extending beyond the traditional single "language modality" to include "image," "speech," and other "multimodal" data. Among these, Gemini [4] and GPT-4 [1] are notable examples. Gemini combines language and visual information processing, while GPT-4 enhances the understanding and generation of visual data, garnering widespread attention. From the perspective of developing AGI, MLLMs may represent a step forward compared to LLMs, as they align more closely with human ways of perceiving the world, are more user-friendly, and generally support a broader range of tasks [109].

However, the exploration of these models’ abilities to integrate and interpret visual data, particularly in highly specialized domains such as biomedicine, signifies a new frontier in AI application. Notable among these advancements are the GPT-4-series and Gemini-Vision-series models (namely GPT-4 and Gemini for the rest of this article), which epitomize the fusion of linguistic and visual information processing.

This research conducts a meticulous comparative analysis of GPT-4 and Gemini, focusing on their application in biomedical image analysis. We also design experiments on other popular models, including Yi, Claude, and Llama 3, to evaluate their textual and multimodal comprehension relative to GPT-4 and Gemini. GPT-4, an advanced extension of the monomodal ChatGPT model from OpenAI, and Gemini, a similarly advanced multimodal model from Google DeepMind, are designed to comprehend and analyze information across textual and visual dimensions. This study explores how effectively these multimodal models can handle the complexities of visual data within the biomedical domain, potentially broadening their applicability and effectiveness.

The evaluation methodology includes a series of rigorous tests to assess the models’ accuracy, efficiency, and adaptability in interpreting and leveraging visual information for biomedical purposes. By examining the performance of GPT-4 and Gemini in tasks such as medical image classification, anomaly detection, and data synthesis, this paper highlights each model’s strengths and limitations. Additionally, it offers insights into optimizing these MLLMs for specialized applications.

This investigation not only showcases the groundbreaking potential of integrating advanced AI models like GPT-4 and Gemini into biomedical analysis but also sets a benchmark for future research in the field. By comparing these models, the study provides valuable knowledge to the ongoing discourse on enhancing AI’s multimodal capabilities, especially in sectors where combining text and visual data is crucial. This research underscores the transformative impact these advancements could have on medical diagnostics, treatment planning, and the broader biomedical field, marking a significant step toward realizing fully integrated AGI systems in specialized domains. Overall, the main contributions of our work are summarized as follows:

  1. We provide a detailed comparative analysis of GPT-4 and Gemini models, specifically focusing on their application in biomedical image analysis, highlighting their strengths and limitations across multiple tasks such as disease classification, lesion segmentation, and report generation.

  2. Our study employs rigorous evaluation methodologies to assess the accuracy, efficiency, and adaptability of these models in interpreting and leveraging visual information, offering insights into their potential optimization for specialized biomedical applications.

  3. By integrating advanced AI models into biomedical analysis, our research underscores their transformative impact on medical diagnostics, treatment planning, and the broader biomedical field, setting a benchmark for future AGI system developments in specialized domains.

2 Related work

2.1 Large Language Models

With increasing GPU computing power and training data size, a series of Transformer-based [93] pre-trained LLMs have emerged. Pre-trained LLMs can be grouped into encoder-based [21, 37, 39, 49], decoder-based [22, 9, 117, 88], and encoder-decoder-based models [85, 18, 72, 39]. Encoder-based LLMs are better at analyzing and classifying text content, including semantic feature extraction and named entity recognition. The first encoder-based pre-trained LLM is the Bidirectional Encoder Representations from Transformers (BERT) [21]. BERT uses bidirectional language encoders with specially designed context-aware mask-prediction tasks on large-scale unlabeled text data. Following BERT [21], RoBERTa [49] further improves performance by updating the training procedure, for example by expanding the batch size, training on larger data, and eliminating BERT's next-sentence prediction task.
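To make the mask-prediction objective described above concrete, the following minimal sketch (assuming the Hugging Face transformers library; the model name and example sentence are illustrative) shows a BERT model filling in a masked token from bidirectional context.

```python
# Minimal sketch of BERT-style masked-token prediction using the Hugging Face
# transformers pipeline; the model name and example sentence are illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from the surrounding (bidirectional) context.
for candidate in fill_mask("The chest X-ray shows signs of [MASK]."):
    print(f"{candidate['token_str']:>12}  score={candidate['score']:.3f}")
```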

Although encoder-based LLMs can effectively extract sentence features, they do not perform well on zero-shot or few-shot tasks, which are vital for LLMs. The Generative Pre-trained Transformer (GPT), in contrast, uses an auto-regressive training task and acquires better generalization ability in both zero-shot and few-shot settings [96]. Decoder-based models use a multi-head self-attention mechanism similar to that of the encoder, but an attention mask prevents the model from attending to future positions, ensuring that the prediction for position i can depend only on the known outputs at positions less than i. Radford et al. found that scaling (model size, dataset size, or both) can greatly improve the capacity of decoder-based LLMs [70]. This has made the decoder framework widely used for scaling LLMs, with ChatGPT as the most prominent example. Although scaled LLMs are based on deep learning architectures and training algorithms similar to those of smaller LLMs, they exhibit new emergent capabilities that did not appear before. For example, GPT-3 [9] can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do so well. Another remarkable application of scaled LLMs is ChatGPT, which adapts LLMs for dialogue, exhibits impressive conversational ability with humans, and can solve diverse complex tasks through instruction tuning [63] and chain-of-thought (CoT) prompting [101]. With instruction tuning, LLMs can follow task instructions for new tasks without explicit examples, giving them an improved zero-shot ability that is vital for solving different tasks. The CoT strategy eases the solution of a difficult task by dividing it into multiple reasoning steps; LLMs can then solve such tasks by using a prompting mechanism that involves intermediate reasoning steps for deriving the final answer.
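The causal masking described above can be illustrated with a short, self-contained sketch (in PyTorch; the tensor sizes are toy values): each position attends only to itself and earlier positions.

```python
# Illustrative sketch of the causal (auto-regressive) attention mask used by
# decoder-based LLMs: position i may only attend to positions <= i.
import torch

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5                 # (T, T) pairwise scores
    T = scores.size(-1)
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))            # hide future positions
    return torch.softmax(scores, dim=-1) @ v

T, d = 5, 8                                                        # toy sequence length / dimension
q = k = v = torch.randn(T, d)
print(causal_attention(q, k, v).shape)                             # torch.Size([5, 8])
```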

There are also works [85, 18, 72, 39] that build encoder-decoder-based LLMs to take full advantage of both encoder-based and decoder-based LLMs. Encoder-decoder-based LLMs are mainly used for tasks that require precise mapping between input and output, such as machine translation and text summarization. In these tasks, it is very important to understand the precise content of the input and generate a specific output accordingly. Models trained with this architecture can generally only be applied to certain specific tasks. For example, an encoder-decoder LLM trained specifically for machine translation may not be suitable for direct use in text summarization or other types of tasks. As a result, encoder-decoder-based LLMs are mostly used in specific fields rather than in the general domain, unlike decoder-based LLMs.

To summarize, as models and datasets scale, zero-shot and few-shot learning become key capabilities of LLMs. Decoder-based LLMs offer greater flexibility than encoder-based and encoder-decoder-based LLMs. Models built on a decoder-only architecture can handle many different types of text generation tasks, such as question answering and translation, without special training or adjustment for each task, making them more general in practice.

2.2 Multimodal Large Language Models

In recent years, LLMs have achieved significant progress [120]. By scaling up model and data sizes, LLMs have demonstrated extraordinary emergent abilities, such as instruction following [66], in-context learning [9], and chain-of-thought reasoning [101]. Concurrently, Large Vision Models (LVMs) have also made substantial advancements [36, 79, 113, 62]. MLLMs, as a natural extension, leverage the complementary strengths of LLMs and LVMs [109].

MLLMs typically consist of a modality encoder, a pre-trained LLM, and a modality interface that bridges the gap between different modalities [109]. The modality encoder processes inputs from various modalities, such as images, videos, and audio, transforming them into representations that the LLM can comprehend. The pre-trained LLM is responsible for understanding and reasoning over these representations. The modality interface serves as a bridge, aligning and fusing information from different modalities into the LLM.
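As a schematic illustration of this three-part design (not a specific published model), the following PyTorch sketch wires a stand-in vision encoder, a linear modality interface, and a small decoder acting as the LLM; all module sizes and names are illustrative assumptions.

```python
# Schematic sketch of the three MLLM components described above: a modality
# encoder, a modality interface (projector), and a language model that consumes
# the projected visual tokens alongside text embeddings. All sizes are toy values.
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self, vision_dim=512, llm_dim=768, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(196 * 3, vision_dim)      # stand-in for a ViT encoder
        self.projector = nn.Linear(vision_dim, llm_dim)           # modality interface
        self.text_embed = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerDecoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerDecoder(layer, num_layers=2)     # stand-in for a pre-trained LLM
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, image_feats, text_ids):
        vis_tok = self.projector(self.vision_encoder(image_feats)).unsqueeze(1)  # (B, 1, D)
        txt_tok = self.text_embed(text_ids)                                      # (B, T, D)
        seq = torch.cat([vis_tok, txt_tok], dim=1)                # prepend the visual token(s)
        hidden = self.llm(seq, memory=seq)
        return self.lm_head(hidden)

model = ToyMLLM()
logits = model(torch.randn(2, 196 * 3), torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # torch.Size([2, 17, 32000])
```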

The training process of MLLMs generally involves three stages: pre-training to align modalities and learn multimodal knowledge, instruction tuning to enable generalization to new tasks, and alignment tuning to adapt the model to specific human preferences [109]. During the pre-training stage, researchers utilize large-scale image-text paired datasets, such as LAION [77] and CC [78, 11], to train the model to learn the alignment between different modalities and acquire rich world knowledge. In the instruction tuning stage, the model is fine-tuned on instruction datasets, such as LLaVA-Instruct [43] and ALLaVA [13], to enhance its generalization ability to new tasks. During the alignment tuning stage, the model is trained using human feedback data, such as LLaVA-RLHF [83] and RLHF-V [110], to better adapt to human preferences and generate more accurate outputs with fewer hallucinations.

In addition to these open-source MLLMs, several commercial companies have introduced powerful closed-source models, such as OpenAI’s GPT-4 [61], Anthropic’s Claude-3 [6], and Microsoft’s KOSMOS-1 [67]. These models have demonstrated remarkable capabilities in handling multimodal tasks, such as generating stories based on images and performing mathematical reasoning without OCR. Their emergence has greatly promoted the development of MLLMs and provided new ideas and directions for researchers in the field.

To evaluate and compare the performance of different MLLMs, researchers have developed various benchmarks and methods. For closed-set problems (i.e., limited answer options), benchmarks such as MME [24], MMBench [47] and MM-VET [111] provide comprehensive and fine-grained quantitative comparisons. For open-set problems (i.e., flexible and diverse answer options), evaluation methods such as human scoring [125], GPT scoring [108] and case studies [102] offer qualitative analyses of MLLMs’ generative capabilities from different perspectives.

2.3 Medical MLLMs

Quality healthcare services are the cornerstone of social welfare. With increasing demand for high-quality healthcare services, the scarcity of medical resources has become a pressing issue that underscores the importance of intelligent healthcare. The creation of foundation models has garnered significant attention in medical AI system development [116, 55, 104]. Compared with monomodal medical models, including medical LLMs [55] and LVMs [80], medical MLLMs that fuse various modalities can adaptively interpret and address medical problems across modalities, showcasing extensive applications and immense potential in the healthcare domain. By integrating language, images, audio, and other modalities of information, these models, which cover medical modalities such as medical images, electronic medical records, and clinical findings, can offer more comprehensive and precise diagnostic, therapeutic, and patient management solutions [124]. As large models continue to evolve, notable medical MLLMs such as Med-Gemini [76] have emerged.

This multimodal processing ability makes medical MLLMs more practical in clinical settings, as they can give more explainable diagnostic results based on different modalities and allow more flexible interaction for both patients and doctors. They can also mine richer medical domain knowledge from these acquired modalities. Technically, medical MLLMs can be divided into two kinds of models: multimodality-alignment models and multimodality-generation models.

Multimodality-alignment Models

The multimodality-alignment models are based on the pioneering work CLIP [69]. CLIP is an image-text matching model that utilizes contrastive learning to generate fused representations for images and texts. Building on this, a series of works has attempted to align different modalities in the medical domain using this contrastive training formulation [116, 42, 99]. Moreover, further works build multimodal models on top of these well-aligned medical CLIP models for explainable diagnosis [87, 31, 103, 58, 121], segmentation [45, 68, 3], structured report generation [92, 34], and more. Although these multimodal models perform well in different downstream tasks, their lack of interaction and medical knowledge prevents them from being used flexibly in clinical settings.
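The contrastive training formulation referred to above can be sketched as the symmetric image-text objective below (a minimal sketch in PyTorch; the batch size, embedding dimension, and temperature are illustrative): matched image-report pairs are pulled together while mismatched pairs are pushed apart.

```python
# Hedged sketch of a CLIP-style contrastive objective: the diagonal of the
# image-text similarity matrix holds the matched pairs, and a symmetric
# cross-entropy pulls them together while pushing mismatched pairs apart.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # diagonal entries are the true pairs
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

print(float(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))))
```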

Multimodality-generation Models

On the other hand, generative multimodal foundation models can provide unstructured text reports and make full use of the rich domain knowledge in their LLM component. These models are mostly based on general multimodal models [40, 44] and fine-tuned specifically on the medical domain. Compared with multimodality-alignment models, they offer richer interaction as well as few-shot and zero-shot learning abilities. They can automatically draft radiology reports that describe both abnormalities and relevant normal findings while taking the patient's history into account. They can also provide further assistance to clinicians by pairing text reports with interactive visualizations, for example by highlighting the region described by each phrase.

This distinction between multimodality-alignment models and multimodality-generation models provides a foundation for understanding their practical applications across various healthcare domains. The aforementioned medical MLLMs demonstrate significant potential in the following healthcare domains:

  1. Image Diagnosis and Imaging Analysis: MLLMs integrate textual and imaging data, excelling in medical imaging diagnostics. For instance, MLLMs can expedite the identification of conditions such as cancer and pneumonia by learning from vast medical imaging datasets and corresponding diagnostic reports. These models not only automate the analysis of CT and MRI scans but also integrate imaging analyses with patient medical histories and symptom information to provide more accurate diagnostic insights.

  2. Medical Literature and Record Analysis: MLLMs can process and comprehend extensive medical literature, research papers, and electronic health records. In medical research, these models swiftly sift through and analyze the latest research findings, aiding healthcare providers and researchers in understanding cutting-edge treatment methods and diagnostic technologies. In clinical applications, MLLMs automatically extract and analyze critical information from patient records, supporting clinical decision-making.

  3. Medical Smart Devices: MLLMs contribute to the development of smart medical devices and robotic systems. For example, by integrating multimodal data, MLLMs enhance the precision and safety of robotic surgeries in complex procedures. These intelligent devices can analyze real-time images and data during surgery, providing precise assistance and guidance to reduce surgical risks.

  4. Drug Development Assistance: In drug development, MLLMs predict the efficacy and side effects of new drugs by analyzing extensive biomedical data, optimizing the drug design process. These models combine data from genetics, protein structures, drug compounds, and other modalities to enhance the efficiency and success rate of new drug development.

  5. Remote Healthcare and Diagnostics: MLLMs hold significant promise in remote healthcare. By integrating video, audio, and textual data, these models support remote diagnosis and treatment, offering high-quality healthcare services to remote areas with limited medical resources. MLLMs can analyze real-time communications between doctors and patients, integrating imaging and medical record data to provide accurate diagnostic recommendations.

However, most medical MLLMs are still trained on specific diseases or medical domains, which limits their generality and prevents them from serving as universal multimodal models for different diseases or tasks. In addition, the limited scale of their training or fine-tuning datasets means they cannot be thoroughly validated. They also face general challenges such as limited computing resources and privacy risks. Therefore, while recognizing the development potential of MLLMs in healthcare, a cautious and measured approach is necessary.

2.4 Fine-tuning Methods in MLLMs

MLLMs integrate and process multiple forms of data, such as text and images, to perform complex tasks. Fine-tuning these models is crucial as it allows them to adapt to specific applications, enhancing their accuracy and efficiency. With the exponential growth in model parameters—from millions to billions—fine-tuning has become essential to leverage the full potential of pre-trained models for various downstream tasks.

Fine-tuning enables models to refine their understanding and improve performance in specific domains without requiring extensive retraining. This process is especially important in applications like visual question answering, image captioning, and multimodal translation, where precise alignment between different data modalities is required. Recent advancements in fine-tuning techniques have focused on making this process more efficient and scalable, ensuring that even large models can be fine-tuned with limited computational resources.

The importance of fine-tuning in MLLMs is underscored by the need to address issues such as catastrophic forgetting, where models lose their ability to retain previously learned information when adapting to new tasks. Additionally, fine-tuning helps in achieving better cross-modal alignment, where the integration of visual and textual data leads to a more coherent and accurate understanding of the inputs.

Parameter-Efficient Fine-Tuning Techniques

Fine-tuning MLLMs can be challenging due to their large parameter size. Techniques such as Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) have been found effective. These methods adjust a subset of the model’s parameters, reducing computational requirements while maintaining performance. For instance, Lu et al. [54] scaled up models like LLaVA to 33B and 65B parameters, showing that parameter-efficient methods could achieve results comparable to full-model fine-tuning, especially when combined with high-resolution images and mixed multimodal data.
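A minimal sketch of the LoRA idea mentioned above (in PyTorch; the rank, scaling, and layer sizes are illustrative): the pre-trained weight is frozen and only two small low-rank matrices are trained, so the effective weight becomes W + (alpha / r) · BA.

```python
# Minimal LoRA sketch: freeze the pre-trained linear layer and train only the
# low-rank update, so W_eff = W + (alpha / r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                  # freeze pre-trained weights
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 trainable parameters instead of 768*768 + 768
```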

Additionally, Wang et al. [97] proposed AdaLink, a non-intrusive technique that leaves the internal architecture unchanged and adapts only model-external parameters. This method has been effective in both text-only and multimodal tasks, providing a competitive edge without the complexities of altering internal architectures.

Addressing Catastrophic Forgetting

One significant challenge in fine-tuning MLLMs is catastrophic forgetting, where the model loses knowledge of previously learned tasks. Zhai et al. [112] proposed EMT (Evaluating Multimodality), a technique that helps mitigate this by treating MLLMs as image classifiers during fine-tuning. This approach has shown that early-stage fine-tuning on image datasets can improve performance across other datasets by enhancing the alignment of text and visual features. For example, continued fine-tuning of models like LLaVA on image datasets has shown improvements in text-image alignment, though prolonged fine-tuning can lead to hallucinations and reduced generalizability.

Moreover, Xu et al. [105] proposed Child-Tuning, an approach that updates only a subset of model parameters to improve generalization and efficiency. This method has been shown to outperform traditional fine-tuning techniques on various tasks, including those in the GLUE benchmark.

Fine-Grained Cross-Modal Alignment

To achieve better cross-modal alignment, Chen et al. [12] introduced Position-enhanced Visual Instruction Tuning (PVIT), which integrated a region-level vision encoder with the language model. This technique ensured a more detailed comprehension of images by the MLLM and promoted efficient fine-grained alignment between vision and language modules. This method used multiple data generation strategies to construct a comprehensive image-region-language instruction dataset, leading to improved performance on multimodal tasks. For example, PVIT has demonstrated significant improvements in tasks requiring detailed visual comprehension, such as object detection and image segmentation.

Furthermore, techniques like LongLoRA, proposed by Chen et al. [14], have been developed to extend the context sizes of pre-trained large language models efficiently. LongLoRA combined improved LoRA with shifted sparse attention to enable context extension with significant computational savings, proving effective for tasks requiring long-context understanding.

Innovative Fine-Tuning Approaches

Several innovative fine-tuning approaches have been proposed to enhance MLLMs. For example, Yang et al. [106] adopted Prompt Tuning as a lightweight and effective fine-tuning method. This approach involved fine-tuning prompts instead of the entire model, allowing for efficient adaptation to various tasks. Prompt Tuning has been shown to achieve comparable performance to full-model fine-tuning while offering improved robustness against adversarial attacks.
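An illustrative sketch of the prompt-tuning idea described above (in PyTorch; the number of soft tokens and embedding size are assumptions): a small set of learnable "soft prompt" vectors is prepended to the frozen model's input embeddings, and only those vectors are updated.

```python
# Illustrative prompt-tuning sketch: learnable soft-prompt vectors are prepended
# to the token embeddings of a frozen model; only the prompt vectors are trained.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds):                       # (B, T, D) frozen token embeddings
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)    # (B, n_tokens + T, D)

soft = SoftPrompt(n_tokens=20, embed_dim=768)
print(soft(torch.randn(4, 32, 768)).shape)  # torch.Size([4, 52, 768])
```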

Another innovative approach, SCITUNE, proposed by Horawalavithana et al. [28], aligned LLMs with scientific multimodal instructions. By training models like LLaMA-SciTune with human-generated scientific instruction datasets, SCITUNE improved performance on science-focused visual and language understanding tasks.

2.5 Large Language Model Reasoning

Large language models have achieved remarkable success in a wide range of natural language processing tasks [116, 55, 104, 87, 31, 103, 58, 121], including language translation, sentiment analysis, and text classification. However, these models are typically designed to perform specific tasks, rather than engage in more general reasoning and inference. In contrast, human language understanding involves the ability to reason about complex relationships between entities, events, and concepts.

One of the key breakthroughs in large language models and reasoning is the development of cognitive architectures. These architectures are designed to mimic the human brain’s ability to process and integrate information from multiple sources, enabling models to reason and draw conclusions in a more human-like way. For example, researchers [86] at Google have developed a cognitive architecture called "Reasoning Networks" that uses a combination of neural networks and symbolic reasoning to solve complex problems.

Recent breakthroughs [118, 98] have demonstrated the potential of large language model reasoning. 1) Multi-hop reasoning allows models to reason about complex relationships between entities and concepts, enabling applications such as question answering and text classification. 2) Reasoning about cause-and-effect relationships allows models to identify them accurately in text, enabling applications such as event extraction and text summarization. 3) Another significant advancement in large language models and reasoning is the use of graph-based models, which represent language as a network of entities and relationships. These models can be trained using a variety of techniques, including reinforcement learning and adversarial training [114], which are specifically designed to test the model's robustness and ability to reason in the face of uncertainty.

In conclusion, recent advancements in large language models and reasoning have propelled the field towards more nuanced and sophisticated understanding of natural language. By leveraging cognitive architectures inspired by human cognition, researchers have made strides in enabling models to engage in multi-hop reasoning and infer cause-and-effect relationships. Additionally, the adoption of graph-based models has provided a promising avenue for representing language as interconnected entities and relationships, further enhancing the model’s ability to reason across complex scenarios. Moving forward, continued research in explainability, exploration of diverse reasoning paradigms, and robustness testing will be crucial in unlocking the full potential of large language models to tackle real-world challenges and emulate human-like reasoning capabilities.

2.6 Evaluation of MLLMs

Evaluating MLLMs is a critical aspect of understanding their capabilities and limitations. The evaluation process encompasses a range of benchmarks, metrics, and frameworks designed to assess various aspects of these models. However, the complexity and diversity of tasks that MLLMs can perform pose significant challenges to developing comprehensive and effective evaluation methodologies.

In terms of current methods for evaluating MLLMs, one of the primary benchmarks for evaluating MLLMs is the General Language Understanding Evaluation (GLUE) benchmark, which includes a suite of tasks such as sentiment analysis, textual entailment, and question answering. The SuperGLUE benchmark extends this by including more challenging tasks. For multimodal models, benchmarks such as Visual Question Answering (VQA) and the COCO dataset, which assesses image captioning, are commonly used. These benchmarks provide a standardized way to compare model performance across different tasks and modalities [10].

The evaluation of MLLMs employs various metrics to measure performance. Common metrics include accuracy, F1 score, precision, and recall for classification tasks. For generation tasks, metrics such as BLEU, ROUGE, and METEOR are used to assess the quality of text generation. In multimodal tasks, metrics like Mean Reciprocal Rank (MRR) and Intersection over Union (IoU) are used to evaluate model performance on tasks like image captioning and object detection. These metrics help quantify the performance of MLLMs on specific tasks, providing a basis for comparison [120].
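As a small, concrete example of two of the metrics listed above (an illustrative sketch computed directly from their definitions, not a specific benchmark implementation), the snippet below computes Intersection over Union for binary masks and the F1 score for binary labels.

```python
# Sketch of two metrics mentioned above, computed from their definitions:
# Intersection over Union (IoU) for binary masks and F1 for binary labels.
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union else 1.0          # empty masks count as a perfect match

def f1(pred: np.ndarray, gt: np.ndarray) -> float:
    tp = int(np.sum((pred == 1) & (gt == 1)))
    fp = int(np.sum((pred == 1) & (gt == 0)))
    fn = int(np.sum((pred == 0) & (gt == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(iou(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]])))  # 0.5
print(f1(np.array([1, 1, 0, 0]), np.array([1, 0, 0, 1])))           # 0.5
```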

Several frameworks have been developed to facilitate the evaluation of MLLMs. The Hugging Face Transformers library, for instance, includes tools for benchmarking models on a variety of tasks using pre-defined datasets. Another notable framework is EVAL, which focuses on the automatic evaluation of language models’ capabilities in following natural language instructions. These frameworks streamline the evaluation process and ensure consistency in how different models are assessed [94].

However, despite the availability of benchmarks and metrics, evaluating MLLMs presents several challenges. One major issue is the lack of standardized evaluation methods for emerging tasks. For instance, the ability of models to handle complex, multi-turn dialogues or generate contextually relevant responses in diverse scenarios is difficult to quantify with existing metrics. Additionally, the phenomenon of catastrophic forgetting, where a model loses knowledge of previously learned tasks when fine-tuned for new ones, complicates the evaluation of MLLMs’ long-term capabilities [112]. Another challenge is the evaluation of ethical considerations and biases. LLMs can inadvertently generate harmful or biased content, making it essential to assess their outputs for ethical implications. Current evaluation frameworks often fall short in systematically addressing these issues, highlighting the need for more sophisticated evaluation tools that can detect and mitigate biases and ethical risks [26].

Future research in the evaluation of MLLMs should focus on developing more comprehensive and robust benchmarks that encompass a wider range of tasks and modalities. There is also a need for better evaluation metrics that can capture the nuances of model performance in real-world applications. Additionally, incorporating human-in-the-loop evaluations can provide more accurate assessments of model performance, particularly for tasks that require nuanced understanding and interpretation. The development of evaluation platforms that integrate multiple dimensions of assessment, including performance, safety, and ethical considerations, will be crucial. These platforms should facilitate continuous evaluation as models evolve, ensuring that they remain reliable and effective in diverse applications.

3 Methodology

3.1 Datasets

3.1.1 Medical imaging tasks

In this study, we employ a diverse array of eleven distinct medical image datasets to facilitate our investigation. To provide a comprehensive comparison of multimodal large language models, we meticulously selected medical image datasets spanning five different fields. The datasets identified and incorporated in our study include iChallenge GON, MICCAI2023 Tooth Segmentation 2D, ChestXRay2017 [35], COVID-QU Ex Dataset [84, 73, 20, 17], CholecSeg8k [91], CVC ClinicDB [7], Kvasir SEG [32], m2caiSeg [57], Skin Cancer ISIC [75], Skin Cancer MNIST: HAM10000 [90], and Skin Cancer Malignant vs. Benign.

Dataset Summary:

  1. iChallenge GON comprises a total of 1200 color fundus photographs, all stored in JPEG format. Within this dataset, 400 images are designated for glaucoma classification, while the remaining 800 images are allocated for tasks such as optic disc detection and segmentation, along with central fovea localization. In our testing task, we utilized this dataset to comprehensively investigate the capacity of large language models in glaucoma diagnosis and optic disc localization, achieved through the design of tailored prompts.

  2. MICCAI2023 Tooth Segmentation 2D, sourced from the MICCAI2023 Challenge, comprises 3000 labeled panoramic images of teeth. Its primary objective is to facilitate researchers in accurately segmenting tooth regions utilizing deep learning methodologies. In our testing scenario, we imposed heightened requirements on the large language model. Specifically, we partitioned the oral cavity into four distinct regions and tasked the large language model with providing the count of teeth within each region and identifying the presence of dental lesions, all under zero-shot learning conditions.

  3. ChestXRay2017, jointly developed by the University of California San Diego and the Guangzhou Women and Children's Medical Center, comprises thousands of validated OCT and chest X-ray images, described and analyzed in "Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning". The dataset is well-suited for binary classification tasks. Utilizing this dataset, our investigation centers on the capacity of large language models to comprehend chest radiographic imaging for pneumonia diagnosis through the development of tailored prompts.

  4. COVID-QU Ex Dataset, assembled by researchers from Qatar University, comprises 33,920 chest X-ray (CXR) images. Among these, 11,956 cases pertain to COVID-19, while 11,263 represent non-COVID-19 infections (viral or bacterial pneumonia). Additionally, the dataset includes 10,701 ground truth lung segmentation masks, making it the largest lung mask dataset to date. In our experimental evaluation, we employed this dataset to assess the efficacy of the large language model in pneumonia diagnosis based on CXR images. Moreover, we investigated the model's capability to differentiate between typical pneumonia and novel coronavirus pneumonia.

  5. CholecSeg8k serves as a valuable resource for semantic segmentation tasks within endoscopic modalities. Derived from the Cholec80 dataset, it comprises 8080 meticulously annotated frames from 17 videos. These images are pixel-level annotated across 13 categories commonly encountered in laparoscopic cholecystectomy surgery. In our experimental evaluation, our focus was on assessing the image recognition and comprehension capabilities of the large language model through the formulation of inquiries concerning the image content.

  6. CVC ClinicDB serves as the official dataset for the training phase of the MICCAI 2015 Colonoscopy Video Automatic Polyp Detection Challenge. This database comprises 612 static images extracted from colonoscopy videos, sourced from 29 distinct sequences. In our evaluation, we utilized endoscopic images of intestinal polyps from this dataset to assess the discriminative capability of the large language model for detecting intestinal polyps.

  7. Kvasir SEG comprises 1000 polyp images along with their corresponding ground truth annotations, derived from the Kvasir Dataset v2. The resolution of images in the Kvasir-SEG dataset ranges from 332x487 to 1920x1072 pixels. The images and their corresponding masks are stored in two separate directories, each utilizing the same filenames for easy pairing. The image files are encoded using JPEG compression. In our testing, we evaluated the model's capability in lesion detection by inputting images with lesions into the large language model.

  8. m2caiSeg is a dataset for segmenting endoscopic images during surgery. Originating from the first and second videos of the MICCAI 2016 Surgical Tool Detection dataset, it encompasses a total of 307 images, each meticulously annotated at the pixel level. The dataset features images of diverse organs (e.g., liver, gallbladder, upper wall, intestines), surgical instruments (e.g., clips, bipolar forceps, hooks, scissors, trimmers), and bodily fluids (e.g., bile, blood). Additionally, it includes specialized labels for unknown and black regions to address areas obscured by certain instruments.

  9. Skin Cancer ISIC comprises 2357 images of both malignant and benign oncological conditions, curated by The International Skin Imaging Collaboration (ISIC). The images are categorized based on ISIC's classification, with all subsets containing an equal number of images, except for those of melanomas and moles, which are slightly more prevalent. In our evaluation, we assessed the classification and recognition capabilities of the large language model for various skin diseases. Additionally, we tested the model's ability to comprehend image content based on these classifications.

  10. Skin Cancer MNIST: HAM10000 comprises 10,015 dermoscopic images collected from diverse populations through various acquisition methods. It includes representative cases from all major diagnostic categories of pigmented lesions: actinic keratoses and intraepithelial carcinoma/Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv), and vascular lesions (vasc). In our evaluation, we assessed the large language model's capabilities in multi-class classification and lesion localization for various skin diseases.

  11. Skin Cancer Malignant vs. Benign contains 1800 images of skin cancer, both malignant and benign, organized into two separate folders. Each image has a resolution of 224x244 pixels. In our experiment, we assessed the large language model's capability to distinguish between benign and malignant skin cancer by designing specific prompts and utilizing images from the dataset.

3.1.2 Medical report generation task

In this investigation, we utilized three distinct datasets of radiology chest X-ray reports: MIMIC-CXR (publicly available) [33], OpenI (publicly accessible) [53], and SXY (privately obtained) [123]. Our analysis focused on the inspection findings and conclusion segments across these datasets. By formulating tailored prompts, we assessed the efficacy of a large language model in generating radiological text reports, with a particular emphasis on crafting conclusions derived from the examination findings.

Dataset Summary:

  1. MIMIC-CXR represents a substantial publicly accessible repository comprising chest radiographs in DICOM format along with accompanying free-text radiology reports. This extensive dataset encompasses 377,110 images corresponding to 227,835 radiographic studies conducted at the Beth Israel Deaconess Medical Center in Boston, Massachusetts.

  2. OpenI, provided by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), constitutes a publicly available repository of medical images sourced from scholarly literature. Predominantly comprising X-rays, CT scans, and MRI images, the dataset is accompanied by relevant metadata. Designed to facilitate advancements in medical image analysis and information retrieval, OpenI serves as a resource for researchers to train and evaluate various algorithms and techniques in the field.

  3. SXY, provided courtesy of Xiangya Second Hospital, affiliated with Central South University, encompasses radiology reports spanning from 2012 to 2023 across five systems. This comprehensive dataset includes essential information, detailed descriptions, and diagnostic impressions. These data serve as the foundation for model development and internal validation processes. Specifically, we leverage the chest X-ray reports for testing purposes within our study framework.

3.2 Model Selection

In this research, we mainly focus on evaluating the multimodal performance of the Gemini and GPT families in the biomedical domain. To provide a comprehensive comparison within these families, we consider models of different generations as well as usage-specific variants. For the Gemini family, we select Gemini-1.0-Pro-Vision, Gemini-1.5-Pro, and Gemini-1.5-Flash; for the GPT family, we adopt GPT-3.5-Turbo, GPT-4-Turbo, and GPT-4o into the model pool. Beyond exploring these two families in depth, we also aim to gauge their effectiveness against other cutting-edge, state-of-the-art LLMs, so that the assessment reflects the current landscape rather than being limited to the Gemini and GPT families. Accordingly, we add Yi-Large, Yi-Large-Turbo, Claude-3-Opus, and Llama 3 to construct the final model pool.

Descriptions of these models are listed as follows:

1) Gemini-Pro:

Google’s Gemini-Pro model is a state-of-the-art multimodal AI platform designed to excel across a wide range of tasks with high accuracy. Launched in February 2024, Gemini-Pro handles complex queries in various domains, including STEM and humanities. It features enhanced capabilities in Python code generation, challenging math problems, and multi-step reasoning tasks. Additionally, it demonstrates impressive performance in language translation and automatic speech recognition. By May 2024, its performance had further improved, reflecting Google’s commitment to advancing AI technology. In the following experiment session, we include Gemini-1.0-Pro-Vision and Gemini-1.5-Pro in the model pool.

2) Gemini-Flash:

Gemini-Flash, introduced as a more streamlined version of the Gemini AI platform, is optimized for speed and efficiency. While it may not match the accuracy of Gemini-Pro, it delivers results more rapidly, making it an excellent choice for applications that require swift responses. Currently available as a public preview for developers through Google’s AI Studio, Gemini-Flash is designed to support the development of fast-paced applications and chatbots. It shares the same one million token limit as Gemini-Pro, ensuring that it can process substantial amounts of data, albeit with a slightly lower accuracy in benchmark tests. In the following experiment session, we include Gemini-1.5-Flash in the model pool.

3) GPT-4o:

OpenAI’s latest flagship model, GPT-4o, marks a significant advancement in human-computer interaction by processing and generating text, audio, and visual content in real time. This "omni" model excels in handling diverse inputs and outputs, including speech and images, with response times averaging 320 milliseconds, closely mirroring human conversational pace. Additionally, GPT-4o’s multilingual capabilities have been greatly enhanced, offering improved performance in understanding non-English text, vision, and audio. Despite these advancements, GPT-4o remains more cost-effective and faster in API usage compared to its predecessors.

4) GPT-4-Turbo:

The predecessor to GPT-4o, GPT-4-Turbo, is an enhanced version of the GPT-4 model line. It features a 128k context window, enabling it to process substantial amounts of text—up to 300 pages in a single prompt. Updated with knowledge up to April 2023, GPT-4-Turbo is more affordable, offering reduced costs for both input and output tokens, with a maximum output token limit of 4096. This model is accessible to any OpenAI API account holder with existing GPT-4 access and can be specified by using gpt-4-turbo as the model name in the API.

5) GPT-3.5-Turbo:

GPT-3.5-Turbo is a powerful iteration in OpenAI’s language model series, designed to offer a balance between performance and efficiency. It provides robust text generation and comprehension capabilities with a focus on cost-effectiveness and speed. The model can handle a wide range of language tasks, including summarization, translation, and question-answering, making it versatile for various applications. Despite being less advanced than the GPT-4 series, GPT-3.5-Turbo remains a reliable and accessible option for many users, maintaining strong performance while being more affordable for large-scale deployments.

6) Yi:

Yi is an advanced open-source large language model developed by 01.AI. Available in two versions, Yi-34B and Yi-6B, it supports bilingual capabilities (English and Chinese) and is designed for both academic research and commercial use with appropriate licensing. Yi-34B, with its 34 billion parameters, excels in numerous benchmarks such as MMLU, CMMLU, and C-Eval, outperforming many larger models like Llama-2 70B. It offers an impressive context window of 200K, enabling it to handle extensive text inputs effectively. In the following experiment session, we include Yi-Large and Yi-Large-Turbo in the model pool.

7) Claude-3-Opus:

Claude-3-Opus, developed by Anthropic, excels in handling complex tasks and content creation. It supports both text and image inputs, making it versatile for multimodal applications. With a context window of 200K, it can manage extensive inputs efficiently, and its output quality is tailored for high-level fluency and understanding. This model balances speed and intelligence, making it suitable for tasks requiring nuanced comprehension and detailed responses. We select Claude-3-Opus, part of Anthropic’s Claude family, which includes several models designed to cater to varying needs of performance and cost-effectiveness, in the final model pool.

8) Llama 3:

Llama 3, developed by Meta, represents a significant advancement in open-source large language models. It includes models with 8 billion and 70 billion parameters, designed for a wide range of applications. Llama 3 improves upon its predecessors with enhanced pre-training data, a more efficient tokenizer, and advanced instruction fine-tuning techniques. It supports extensive context windows and demonstrates state-of-the-art performance in benchmarks such as reasoning, coding, and content creation. Meta aims to foster innovation by making Llama 3 widely available and emphasizing responsible use and deployment.

3.3 Experiment Setting

To more rigorously evaluate the proficiency of various large language models (LLMs) in handling medical images and reports under zero-shot conditions, we segmented the testing experiment into two distinct phases: testing of medical image data and testing of medical report generation, as illustrated in Fig. 1(a.).

In order to ascertain the fairness and reliability of the experiment, we carried out experimental trials across diverse models using identical parameter configurations. Our testing methodology involves employing a standardized set of prompts and parameters to evaluate the performance of the LLM.

When conducting medical image testing, we rigorously observe the prescribed usage protocols for each model. As shown in Fig. 1(b.), we perform medical image testing for models such as GPT-4-Turbo and Claude-3-Opus through the web interface. However, owing to the stringent content restrictions that the web platform imposes on Gemini-series models, they cannot be compared on an equal footing with other models in the medical imaging domain through that interface. Consequently, we opted to use the Google AI Studio platform for conducting medical image testing on Gemini-series models. To comprehensively assess the interpretive capabilities of various models within the medical imaging domain, we meticulously crafted distinct prompts tailored to specific datasets and tasks. Uniform prompts and input images were employed across all tests for different models to ensure the integrity and fairness of the results.

During the testing of medical report generation, given the extensive volume of tests, Python was employed to invoke the open API interfaces of various models on the Colab platform. Throughout the testing phase, consistency between the prompts utilized and the model inputs was maintained to produce a series of results, subsequently evaluated for efficacy.
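As an illustration of this setup, the following minimal sketch (assuming the OpenAI Python SDK; the model name, system message, and prompt wording are illustrative and not the exact prompts used in this study) shows how a report conclusion can be generated in batch from the findings section.

```python
# Hedged sketch of batch report generation through a chat-completion API
# (OpenAI Python SDK v1.x); the prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_conclusion(findings: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[
            {"role": "system", "content": "You are a radiologist writing report conclusions."},
            {"role": "user", "content": f"Findings: {findings}\nWrite the conclusion section."},
        ],
    )
    return response.choices[0].message.content

# Example usage over a list of findings sections:
# conclusions = [generate_conclusion(f) for f in findings_list]
```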

[Figure 1]

3.4 Evaluating Indicator

In the task of medical image question answering, our research rigorously juxtaposed responses from various large-scale models when presented with identical input images and queries. For diverse medical images, we employed ground truth supplied in the dataset, such as semantic segmentation maps and optic disc segmentation maps, as benchmark answers. This enabled a comprehensive horizontal comparison of responses generated by different language models, facilitating the synthesis of the respective strengths and weaknesses of each model.

In the context of generating radiology reports, our approach employs the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [41] metric to assess the level of correspondence between radiology reports generated by the large language models and the reference answers authored by medical professionals. This study incorporates three variants: Rouge-1 (R-1), Rouge-2 (R-2), and Rouge-L (R-L), as shown in Eq. (1).

ROUGE-N = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} \text{Count}_{\text{match}}(gram_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} \text{Count}(gram_n)}    (1)
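For concreteness, the following short sketch computes the ROUGE-N recall of Eq. (1) for a single candidate/reference pair (a minimal illustration with invented example sentences; production evaluations would typically rely on an established ROUGE implementation).

```python
# Direct sketch of ROUGE-N recall as in Eq. (1): clipped n-gram matches between
# candidate and reference, divided by the total n-gram count of the reference.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())   # clipped n-gram matches
    total = sum(ref.values())              # total n-grams in the reference
    return overlap / total if total else 0.0

reference = "no acute cardiopulmonary abnormality"
candidate = "no acute abnormality identified"
print(rouge_n(candidate, reference, n=1))  # 0.75 (3 of 4 reference unigrams matched)
```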

4 Experiments and Observation

4.1 Medical Image Test Results

In Section 4.1.1, we evaluated the performance of six advanced multimodal large language models, including GPT-4-Turbo, across five distinct categories of medical imaging question answering. For the chest X-ray dataset, our primary focus was on the models’ ability to diagnose pneumonia or other diseases from chest X-ray images. As depicted in Fig. 2, the GPT-series models excelled in this task, accurately determining patient health status without requiring additional prompt information. The Gemini-series models followed in performance, while Claude-3-Opus performed the worst, with its answers offering negligible reference value. It is important to note that none of the six models we assessed could further determine whether the pneumonia was COVID-19, which does not imply a lack of model performance, as distinguishing the type of pneumonia based solely on X-ray images is inherently impossible.

In Section 4.1.2, leveraging the ophthalmic imaging dataset, our analysis centers on the model’s ability to diagnose glaucoma and accurately identify the macular fovea’s position using fundus photographs. As depicted in Fig. 3, in Case 1, all models except Claude-3-Opus and GPT-4o incorrectly diagnosed the absence of glaucoma. However, regarding macular fovea localization, GPT-4-Turbo was the only model to offer a vague location, while all other models failed to provide accurate localization. In Case 2, only GPT-4o successfully detected glaucoma, and GPT-4-Turbo again provided a vague description of the macular fovea position, whereas the other models did not accomplish the task.

In Section 4.1.3, utilizing the endoscopic imaging dataset, our focus is on the model’s capability to describe lesion conditions in detail within complex scenes and accurately determine the lesion’s location. As illustrated in Fig. 4, all models successfully provided detailed descriptions of the lesions, with the GPT-series models offering the most comprehensive information. Notably, only Gemini-1.0-Pro-Vision and Claude-3-Opus were unable to determine the lesion locations, whereas the remaining models accurately identified the lesion locations.

In Section 4.1.4, we investigate the models’ capability to classify skin diseases using a skin disease dataset, without the aid of supplementary prompts. As demonstrated in Fig. 5, none of the models accurately identified the type of skin disease afflicting the patients. This deficiency may stem from the sensitive nature of the skin disease images or from insufficient exposure of the models to this category of diseases during training.

In Section 4.1.5, employing the dental X-ray dataset, our focus is on the model’s ability to assess dental health and count the number of existing teeth. As shown in Fig. 6, due to the lack of accurate reference answers, our evaluation was based solely on the models’ responses. The responses from the Gemini-series models were relatively simple, concentrating only on the number of teeth. In contrast, the responses from Claude-3-Opus and the GPT-series models were more detailed, addressing tooth integrity, surrounding bone structures, the presence of implants, and the presence of wisdom teeth, while also providing relevant recommendations.

4.1.1 Chest Radiography

[Figure 2: Model responses for pneumonia diagnosis on the chest X-ray dataset.]

4.1.2 Ophthalmological Imaging

[Figure 3: Model responses for glaucoma diagnosis and macular fovea localization on fundus photographs.]

4.1.3 Endoscopic Imaging

[Figure 4: Model responses for lesion description and localization on the endoscopic imaging dataset.]

4.1.4 Dermatological Imaging

[Figure 5: Model responses for skin disease classification on the dermatological imaging dataset.]

4.1.5 Dental Imaging

[Figure 6: Model responses for dental health assessment and tooth counting on the dental X-ray dataset.]

4.2 Medical Report Generation Task Results

As illustrated in Table 1, in the zero-shot setting of the MIMIC-CXR dataset, Gemini-1.0-Pro-Vision exhibited strong performance, attaining an R-1 score of 0.2814, an R-2 score of 0.1334, and an R-L score of 0.2259. These metrics notably surpass those achieved by other models operating within similar parameters.

The assessment conducted on the OpenI dataset shows that the GPT-4o model delivers the strongest zero-shot performance, achieving an R-1 score of 0.1713, an R-2 score of 0.0622, and an R-L score of 0.1466.

In the zero-shot scenario on the internal dataset (SXY in Table 1), the GPT-4o model demonstrates robust performance, achieving an R-1 score of 0.2805, an R-2 score of 0.0746, and an R-L score of 0.2635. These results notably outperform those of other models operating under comparable conditions.

In essence, the comparison of Rouge metrics for medical reports generated by various large language models, using identical prompt words and zero-shot techniques, serves as an effective means to gauge the performance discrepancies among these models operating under equivalent conditions. Such evaluation holds considerable importance in guiding the selection of specific task-oriented large language models for future research endeavors and practical applications.

Table 1: ROUGE scores (R-1, R-2, R-L) of each model for zero-shot radiology report generation on the MIMIC-CXR, OpenI, and SXY (internal) datasets.

Model                   MIMIC-CXR                 OpenI                     SXY
                        R-1     R-2     R-L       R-1     R-2     R-L       R-1     R-2     R-L
Gemini-1.0-Pro-Vision   0.2814  0.1334  0.2259    0.1654  0.0663  0.1425    0.0103  0.0000  0.0087
Gemini-1.5-Pro          0.1759  0.0624  0.1265    0.1364  0.0505  0.1160    0.0344  0.0124  0.0336
Gemini-1.5-Flash        0.1973  0.0852  0.1488    0.1018  0.0322  0.0811    0.0269  0.0020  0.0250
GPT-3.5-Turbo           0.2406  0.1152  0.1914    0.1529  0.0554  0.1291    0.1010  0.0305  0.0957
GPT-4-Turbo             0.1235  0.0478  0.0925    0.0800  0.0260  0.0647    0.2298  0.0837  0.2173
GPT-4o                  0.2275  0.0997  0.1752    0.1713  0.0622  0.1466    0.2805  0.0746  0.2635
Yi-Large                0.0850  0.0346  0.0659    0.0501  0.0153  0.0411    0.2321  0.0968  0.2200
Yi-Large-Turbo          0.1699  0.0728  0.1303    0.0986  0.0329  0.0816    0.1272  0.0518  0.1185
Claude-3-Opus           0.1434  0.0608  0.1080    0.0840  0.0276  0.0704    0.0325  0.0070  0.0299
Llama 3                 0.1884  0.0791  0.1429    0.1174  0.0371  0.0971    0.0786  0.0173  0.0740
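As a convenience for readers who want to re-rank the models, the snippet below transcribes the R-L column of Table 1 and reports the top entries per dataset; the dictionary layout and ranking code are illustrative only and are not part of the original evaluation pipeline.

```python
# R-L (ROUGE-L) scores transcribed from Table 1, keyed by dataset and model.
rouge_l = {
    "MIMIC-CXR": {"Gemini-1.0-Pro-Vision": 0.2259, "Gemini-1.5-Pro": 0.1265, "Gemini-1.5-Flash": 0.1488,
                  "GPT-3.5-Turbo": 0.1914, "GPT-4-Turbo": 0.0925, "GPT-4o": 0.1752, "Yi-Large": 0.0659,
                  "Yi-Large-Turbo": 0.1303, "Claude-3-Opus": 0.1080, "Llama 3": 0.1429},
    "OpenI":     {"Gemini-1.0-Pro-Vision": 0.1425, "Gemini-1.5-Pro": 0.1160, "Gemini-1.5-Flash": 0.0811,
                  "GPT-3.5-Turbo": 0.1291, "GPT-4-Turbo": 0.0647, "GPT-4o": 0.1466, "Yi-Large": 0.0411,
                  "Yi-Large-Turbo": 0.0816, "Claude-3-Opus": 0.0704, "Llama 3": 0.0971},
    "SXY":       {"Gemini-1.0-Pro-Vision": 0.0087, "Gemini-1.5-Pro": 0.0336, "Gemini-1.5-Flash": 0.0250,
                  "GPT-3.5-Turbo": 0.0957, "GPT-4-Turbo": 0.2173, "GPT-4o": 0.2635, "Yi-Large": 0.2200,
                  "Yi-Large-Turbo": 0.1185, "Claude-3-Opus": 0.0299, "Llama 3": 0.0740},
}

# Print the top-3 models by R-L score for each dataset.
for dataset, scores in rouge_l.items():
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    print(dataset, "->", ", ".join(f"{m} ({s:.4f})" for m, s in ranked[:3]))
```

Running this reproduces the observations above: Gemini-1.0-Pro-Vision leads on MIMIC-CXR, while GPT-4o leads on both OpenI and SXY.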

4.3 Model Generation Time

Given the importance of timeliness in practical applications, we conducted comparative tests of each model’s generation speed on the online platform provided by that model, under the same network environment. Specifically, we recorded the total number of characters generated by the model and the time taken for this generation, and then calculated the time required to generate each character, as shown in Eq. (2). As reported in Fig. 7, across the five categories of medical image question answering tasks, GPT-4o exhibited the fastest generation speed, except for the skin image tests, where Gemini-1.0-Pro-Vision was fastest with GPT-4o closely following. Gemini-1.5-Pro exhibited the slowest generation speed, with its average time to produce the same number of characters being 9.16 times longer than that of GPT-4o.

\[
\text{Character Per Time (ms)} = \frac{\text{Total generation time (ms)}}{\text{Number of characters generated}} \qquad (2)
\]
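A minimal sketch of how Eq. (2) could be measured programmatically is given below. The generate wrapper is hypothetical (the measurements in this study were taken on each model’s online platform), and the timing call is included only to illustrate the elapsed-time calculation.

```python
import time

def generate(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around a model's generation API; not specified in the paper."""
    raise NotImplementedError

def character_per_time_ms(model_name: str, prompt: str) -> float:
    """Milliseconds per generated character, per Eq. (2)."""
    start = time.perf_counter()
    output = generate(model_name, prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms / max(len(output), 1)  # guard against empty outputs
```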
[Figure 7: Character generation time of each model across the five medical image question answering tasks.]

5 Discussion and Conclusion

In this comprehensive study, we rigorously evaluated the performance of 10 prominent large models in medical image understanding and radiology report generation, including globally leading models such as Gemini-1.5-Pro, GPT-4o, Claude-3-Opus, and Yi-Large-Turbo.

Our assessment benchmarks these models in explaining medical images, summarizing their advantages and disadvantages, and exploring their potential in medical applications. The findings indicate that while current state-of-the-art MLLMs cannot yet be directly applied to the medical field, their robust reasoning abilities and impressive response speed suggest significant potential for improving model generalization in this domain.

Additionally, we benchmarked these models in generating radiological reports to understand their varying capabilities, strengths, and weaknesses. Our results affirm the performance of numerous domestic and international MLLMs, highlighting their untapped potential in healthcare, especially in radiology. These insights indicate a promising development trajectory, with multilingual and diverse MLLMs poised to enhance global healthcare systems.

Looking ahead, our large-scale research provides a foundation for further exploration, suggesting the potential to extend these MLLMs to different medical specialties and develop multimodal medical MLLMs for comprehensive patient health understanding. However, ethical considerations such as privacy protection, model fairness, and interpretability, along with legal and regulatory frameworks, are essential for safe and ethical MLLM deployment in healthcare. In summary, despite the promise of reducing doctors’ workloads and alleviating medical resource constraints, significant enhancements and comprehensive validation are urgently needed before clinical deployment of these MLLMs.

References

  • [1]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • [2]AI@Meta.Llama 3 model card.2024.
  • [3]Deepa Anand, Vanika Singhal, DatteshD Shanbhag, Shriram KS, Uday Patil, Chitresh Bhushan, Kavitha Manickam, Dawei Gui, Rakesh Mullick, Avinash Gopal, etal.One-shot localization and segmentation of medical images with foundation models.arXiv preprint arXiv:2310.18642, 2023.
  • [4]Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, AndrewM Dai, Anja Hauth, Katie Millican, etal.Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 1, 2023.
  • [5]Rohan Anil, AndrewM Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, etal.Palm 2 technical report.arXiv preprint arXiv:2305.10403, 2023.
  • [6]Anthropic.Meet claude: The ai assistant from anthropic.Anthropic Blog, 2023.
  • [7]Jorge Bernal, FJavier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño.Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians.Computerized medical imaging and graphics, 43:99–111, 2015.
  • [8]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.Language models are few-shot learners.In H.Larochelle, M.Ranzato, R.Hadsell, M.F. Balcan, and H.Lin, editors, Advances in Neural Information Processing Systems, volume33, pages 1877–1901. Curran Associates, Inc., 2020.
  • [9]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, etal.Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020.
  • [10]Yupeng Chang, XuWang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, etal.A survey on evaluation of large language models.ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.
  • [11]Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut.Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts.In CVPR, 2021.
  • [12]Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li, Maosong Sun, and Yang Liu.Position-enhanced visual instruction tuning for multimodal large language models.arXiv preprint arXiv:2308.13437, 2023.
  • [13]GaoleH Chen, Shizhe Chen, Renrui Zhang, Jiawei Chen, Xiaoyu Wu, Zhou Zhang, Zecheng Chen, Jinfeng Li, Xing Wan, and Bin Wang.Allava: Harnessing gpt4v-synthesized data for a lite vision-language model.arXiv preprint arXiv:2402.11684, 2024.
  • [14]Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia.Longlora: Efficient fine-tuning of long-context large language models.arXiv preprint arXiv:2309.12307, 2023.
  • [15]Inyoung Cheong, King Xia, K.J.Kevin Feng, QuanZe Chen, and AmyX. Zhang.(a)i am not a lawyer, but…: Engaging legal experts towards responsible llm policies for legal advice.In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, page 2454–2469, New York, NY, USA, 2024. Association for Computing Machinery.
  • [16]Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, HyungWon Chung, Charles Sutton, Sebastian Gehrmann, etal.Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023.
  • [17]Muhammad E.H. Chowdhury, Tawsifur Rahman, Amith Khandakar, Rashid Mazhar, MuhammadAbdul Kadir, ZaidBin Mahbub, KhandakarReajul Islam, MuhammadSalman Khan, Atif Iqbal, NasserAl Emadi, Mamun BinIbne Reaz, and MohammadTariqul Islam.Can ai help in screening viral and covid-19 pneumonia?IEEE Access, 8:132665–132676, 2020.
  • [18]HyungWon Chung, LeHou, Shayne Longpre, Barret Zoph, YiTay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, etal.Scaling instruction-finetuned language models.arXiv preprint arXiv:2210.11416, 2022.
  • [19]Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and LiYuan.Chatlaw: Open-source legal large language model with integrated external knowledge bases.arXiv preprint arXiv:2306.16092, 2023.
  • [20]Aysen Degerli, Mete Ahishali, Mehmet Yamac, Serkan Kiranyaz, MuhammadEH Chowdhury, Khalid Hameed, Tahir Hamid, Rashid Mazhar, and Moncef Gabbouj.Covid-19 infection map generation and detection from chest x-ray images.Health information science and systems, 9(1):15, 2021.
  • [21]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018.
  • [22]Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang.Glm: General language model pretraining with autoregressive blank infilling.arXiv preprint arXiv:2103.10360, 2021.
  • [23]Luciano Floridi and Massimo Chiriatti.Gpt-3: Its nature, scope, limits, and consequences.Minds and Machines, 30:681–694, 2020.
  • [24]Chaoyou Fu, Pengbo Chen, YiShen, Yixiang Qin, Min Zhang, Xiaoqi Lin, Zhenyu Qiu, Weihang Lin, Zhenling Qiu, Wenjing Lin, etal.Mme: A comprehensive evaluation benchmark for multimodal large language models.In arXiv preprint arXiv:2306.13394, 2023.
  • [25]Yingqiang Ge, Wenyue Hua, Kai Mei, jianchao ji, Juntao Tan, Shuyuan Xu, Zelong Li, and Yongfeng Zhang.Openagi: When llm meets domain experts.In A.Oh, T.Naumann, A.Globerson, K.Saenko, M.Hardt, and S.Levine, editors, Advances in Neural Information Processing Systems, volume36, pages 5539–5568. Curran Associates, Inc., 2023.
  • [26]Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong, etal.Evaluating large language models: A comprehensive survey.arXiv preprint arXiv:2310.19736, 2023.
  • [27]XuHan, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, AoZhang, Liang Zhang, etal.Pre-trained models: Past, present and future.AI Open, 2:225–250, 2021.
  • [28]Sameera Horawalavithana, Sai Munikoti, Ian Stewart, and Henry Kvinge.Scitune: Aligning large language models with scientific multimodal instructions.arXiv preprint arXiv:2307.01139, 2023.
  • [29]Muhammad Irfan and LIAM MURRAY.Micro- Credential: A guide to prompt writing and engineering in higher education: A tool for Artificial Intelligence in LLM.5 2023.
  • [30]Ahmed Izzidien, Holli Sargeant, and Felix Steffek.Llm vs. lawyers: Identifying a subset of summary judgments in a large uk case law dataset.arXiv preprint arXiv:2403.04791, 2024.
  • [31]Jongseong Jang, Daeun Kyung, SeungHwan Kim, Honglak Lee, Kyunghoon Bae, and Edward Choi.Significantly improving zero-shot x-ray pathology classification via fine-tuning pre-trained image-text encoders.arXiv preprint arXiv:2212.07050, 2022.
  • [32]Debesh Jha, PiaH Smedsrud, MichaelA Riegler, Pål Halvorsen, Thomas deLange, Dag Johansen, and HåvardD Johansen.Kvasir-seg: A segmented polyp dataset.In International Conference on Multimedia Modeling, pages 451–462. Springer, 2020.
  • [33]Alistair Johnson, Tom Pollard, Roger Mark, Seth Berkowitz, and Steven Horng.Mimic-cxr database.PhysioNet10, 13026:C2JT1Q.
  • [34]Matthias Keicher, Kamilia Zaripova, Tobias Czempiel, Kristina Mach, Ashkan Khakzar, and Nassir Navab.Flexr: Few-shot classification with language embeddings for structured reporting of chest x-rays.In Medical Imaging with Deep Learning, pages 1493–1508. PMLR, 2024.
  • [35]Daniel Kermany, Kang Zhang, Michael Goldbaum, etal.Labeled optical coherence tomography (oct) and chest x-ray images for classification.Mendeley data, 2(2):651, 2018.
  • [36]Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Saeid Whitehead, AlexanderC Berg, Wan-Yen Lo, etal.Segment anything.In arXiv preprint arXiv:2304.02643, 2023.
  • [37]Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut.Albert: A lite bert for self-supervised learning of language representations.arXiv preprint arXiv:1909.11942, 2019.
  • [38]Gyeong-Geon Lee, Lehong Shi, Ehsan Latif, Yizhu Gao, Arne Bewersdorf, Matthew Nyaaba, Shuchen Guo, Zihao Wu, Zhengliang Liu, Hui Wang, etal.Multimodality of ai for education: Towards artificial general intelligence.arXiv preprint arXiv:2312.06037, 2023.
  • [39]Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer.Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.arXiv preprint arXiv:1910.13461, 2019.
  • [40]Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.In International conference on machine learning, pages 19730–19742. PMLR, 2023.
  • [41] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
  • [42]Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, YaZhang, Yanfeng Wang, and Weidi Xie.Pmc-clip: Contrastive language-image pre-training using biomedical documents.In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 525–536. Springer, 2023.
  • [43]Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee.Visual instruction tuning.In arXiv preprint arXiv:2304.08485, 2023.
  • [44]Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee.Visual instruction tuning.Advances in neural information processing systems, 36, 2024.
  • [45]Jie Liu, Yixiao Zhang, Jie-Neng Chen, Junfei Xiao, Yongyi Lu, Bennett ALandman, Yixuan Yuan, Alan Yuille, Yucheng Tang, and Zongwei Zhou.Clip-driven universal model for organ segmentation and tumor detection.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21152–21164, 2023.
  • [46]Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig.Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM Computing Surveys, 55(9):1–35, 2023.
  • [47]Yang Liu, Huiwen Duan, Yuyang Zhang, Bohan Li, Songlin Zhang, Wei Zhao, YeYuan, Jiajun Wang, Cheng He, Ziwei Liu, etal.Mmbench: Is your multi-modal model an all-around player?In arXiv preprint arXiv:2307.06281, 2023.
  • [48]Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, Zihao Wu, Lin Zhao, Dajiang Zhu, Xiang Li, Ning Qiang, Dingang Shen, Tianming Liu, and Bao Ge.Summary of chatgpt-related research and perspective towards the future of large language models.Meta-Radiology, 1(2):100017, 2023.
  • [49]Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019.
  • [50]ZeLiu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.Swin transformer: Hierarchical vision transformer using shifted windows.In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • [51]Zhengliang Liu, LuZhang, Zihao Wu, Xiaowei Yu, Chao Cao, Haixing Dai, Ninghao Liu, Jun Liu, Wei Liu, Quanzheng Li, etal.Surviving chatgpt in healthcare.Frontiers in Radiology, 3:1224682, 2024.
  • [52]Zhengliang Liu, Tianyang Zhong, Yiwei Li, Yutong Zhang, YiPan, Zihao Zhao, Peixin Dong, Chao Cao, Yuxiao Liu, Peng Shu, etal.Evaluating large language models for radiology natural language processing.arXiv preprint arXiv:2307.13693, 2023.
  • [53]Zhengliang Liu, Tianyang Zhong, Yiwei Li, Yutong Zhang, YiPan, Zihao Zhao, Peixin Dong, Chao Cao, Yuxiao Liu, Peng Shu, Yaonai Wei, Zihao Wu, Chong Ma, Jiaqi Wang, Sheng Wang, Mengyue Zhou, Zuowei Jiang, Chunlin Li, Jason Holmes, Shaochen Xu, LuZhang, Haixing Dai, Kai Zhang, Lin Zhao, Yuanhao Chen, XuLiu, Peilong Wang, Pingkun Yan, Jun Liu, Bao Ge, Lichao Sun, Dajiang Zhu, Xiang Li, Wei Liu, Xiaoyan Cai, Xintao Hu, XiJiang, Shu Zhang, Xin Zhang, Tuo Zhang, Shijie Zhao, Quanzheng Li, Hongtu Zhu, Dinggang Shen, and Tianming Liu.Evaluating large language models for radiology natural language processing, 2023.
  • [54]Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, and Yelong Shen.An empirical study of scaling instruct-tuned large multimodal models.arXiv preprint arXiv:2309.09958, 2023.
  • [55]Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu.Biogpt: generative pre-trained transformer for biomedical text generation and mining.Briefings in bioinformatics, 23(6):bbac409, 2022.
  • [56]Chong Ma, Zihao Wu, Jiaqi Wang, Shaochen Xu, Yaonai Wei, Zhengliang Liu, Fang Zeng, XiJiang, Lei Guo, Xiaoyan Cai, Shu Zhang, Tuo Zhang, Dajiang Zhu, Dinggang Shen, Tianming Liu, and Xiang Li.An iterative optimizing framework for radiology report summarization with chatgpt.IEEE Transactions on Artificial Intelligence, pages 1–12, 2024.
  • [57]Salman Maqbool, Aqsa Riaz, Hasan Sajid, and Osman Hasan.m2caiseg: Semantic segmentation of laparoscopic images using convolutional neural networks.arXiv preprint arXiv:2008.10134, 2020.
  • [58]Aakash Mishra, Rajat Mittal, Christy Jestin, Kostas Tingos, and Pranav Rajpurkar.Improving zero-shot detection of low prevalence chest pathologies using domain pre-trained language models.arXiv preprint arXiv:2306.08000, 2023.
  • [59]Michael Moor, Oishi Banerjee, Zahra ShakeriHossein Abad, HarlanM Krumholz, Jure Leskovec, EricJ Topol, and Pranav Rajpurkar.Foundation models for generalist medical artificial intelligence.Nature, 616(7956):259–265, 2023.
  • [60]Steven Moore, Richard Tong, Anjali Singh, Zitao Liu, Xiangen Hu, YuLu, Joleen Liang, Chen Cao, Hassan Khosravi, Paul Denny, Chris Brooks, and John Stamper.Empowering education with llms - the next-gen interface and content generation.In Ning Wang, Genaro Rebolledo-Mendez, Vania Dimitrova, Noboru Matsuda, and OlgaC. Santos, editors, Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky, pages 32–37, Cham, 2023. Springer Nature Switzerland.
  • [61]OpenAI.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • [62]Maxime Oquab, Thomas Darcet, Timothee Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, etal.Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023.
  • [63]Long Ouyang, Jeffrey Wu, XuJiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, etal.Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022.
  • [64]Long Ouyang, Jeffrey Wu, XuJiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, PaulF Christiano, Jan Leike, and Ryan Lowe.Training language models to follow instructions with human feedback.In S.Koyejo, S.Mohamed, A.Agarwal, D.Belgrave, K.Cho, and A.Oh, editors, Advances in Neural Information Processing Systems, volume35, pages 27730–27744. Curran Associates, Inc., 2022.
  • [65]Chantal Pellegrini, Matthias Keicher, Ege Özsoy, Petra Jiraskova, Rickmer Braren, and Nassir Navab.Xplainer: From x-ray observations to explainable zero-shot diagnosis.In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 420–429. Springer, 2023.
  • [66]Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao.Instruction tuning with gpt-4.arXiv preprint arXiv:2304.03277, 2023.
  • [67]Baolin Peng, Chunyuan Li, Zirui Lin, Wei Zhang, Yiping Jiang, Lianhui Zhou, Yiwen Sheng, Xian Wang, Yong Gao, Daxin Jiang, Pengcheng He, etal.Kosmos-1: Simulating human-like compositionality and concept learning.arXiv preprint arXiv:2306.14828, 2023.
  • [68]Kanchan Poudel, Manish Dhakal, Prasiddha Bhandari, Rabin Adhikari, Safal Thapaliya, and Bishesh Khanal.Exploring transfer learning in medical image segmentation using vision-language models.arXiv preprint arXiv:2308.07706, 2023.
  • [69]Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal.Learning transferable visual models from natural language supervision.In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [70]Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, etal.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019.
  • [71]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu.Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020.
  • [72]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ Liu.Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020.
  • [73]Tawsifur Rahman, Amith Khandakar, Yazan Qiblawey, Anas Tahir, Serkan Kiranyaz, SaadBin Abul Kashem, MohammadTariqul Islam, Somaya Al Maadeed, SusuM. Zughaier, MuhammadSalman Khan, and MuhammadE.H. Chowdhury.Exploring the effect of image enhancement techniques on covid-19 detection using chest x-ray images.Computers in Biology and Medicine, 132:104319, 2021.
  • [74]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.High-resolution image synthesis with latent diffusion models.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [75]V.Rotemberg, N.Kurtansky, B.Betz-Stablein, L.Caffery, E.Chousakos, N.Codella, M.Combalia, S.Dusza, P.Guitera, D.Gutman, A.Halpern, B.Helba, H.Kittler, K.Kose, S.Langer, K.Lioprys, J.Malvehy, S.Musthaq, J.Nanda, O.Reiter, G.Shih, A.Stratigos, P.Tschandl, J.Weber, and P.Soyer.A patient-centric dataset of images and metadata for identifying melanomas using clinical context.Sci Data, 8(34):1–15, 2021.
  • [76]Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, etal.Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024.
  • [77]Christoph Schuhmann, Richard Beaumont, Romain Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Timothy Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, etal.Laion-5b: An open large-scale dataset for training next generation image-text models.In NeurIPS, 2022.
  • [78]Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut.Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.In ACL, 2018.
  • [79]YuShen, Chaoyou Fu, Pengbo Chen, Min Zhang, Keyi Li, Xuejing Sun, YiWu, Shu Lin, and Rongrong Ji.Aligning and prompting everything all at once for universal visual perception.In CVPR, 2024.
  • [80]Peilun Shi, Jianing Qiu, Sai MuDalike Abaxi, Hao Wei, Frank P-W Lo, and WuYuan.Generalist vision foundation models for medical imaging: A case study of segment anything model on zero-shot medical segmentation.Diagnostics, 13(11):1947, 2023.
  • [81]Peng Shu, Huaqin Zhao, Hanqi Jiang, Yiwei Li, Shaochen Xu, YiPan, Zihao Wu, Zhengliang Liu, Guoyu Lu, LeGuan, etal.Llms for coding and robotics education.arXiv preprint arXiv:2402.06116, 2024.
  • [82]K.Singhal, T.Tu, J.Gottweis, etal.Towards expert-level medical question answering with large language models.arXiv preprint arXiv:2305.09617, 2023.
  • [83]Zhiying Sun, Sheng Shen, Siqi Cao, Haotian Liu, Chunyuan Li, Yujun Shen, Chuang Gan, Li-Yen Gui, Yen-Xiang Wang, Yinan Yang, etal.Aligning large multimodal models with factually augmented rlhf.In arXiv preprint arXiv:2309.14525, 2023.
  • [84]AnasM. Tahir, MuhammadE.H. Chowdhury, Amith Khandakar, Tawsifur Rahman, Yazan Qiblawey, Uzair Khurshid, Serkan Kiranyaz, Nabil Ibtehaz, M.Sohel Rahman, Somaya Al-Maadeed, Sakib Mahmud, Maymouna Ezeddin, Khaled Hameed, and Tahir Hamid.Covid-19 infection localization and severity grading from chest x-ray images.Computers in Biology and Medicine, 139:105002, 2021.
  • [85]YiTay, Mostafa Dehghani, VinhQ Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, HyungWon Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, etal.Ul2: Unifying language learning paradigms.arXiv preprint arXiv:2205.05131, 2022.
  • [86]ChristoKurisummoottil Thomas, Christina Chaccour, Walid Saad, Mérouane Debbah, and ChoongSeon Hong.Causal reasoning: Charting a revolutionary course for next-generation ai-native wireless networks.IEEE Vehicular Technology Magazine, 2024.
  • [87]Ekin Tiu, Ellie Talius, Pujan Patel, CurtisP Langlotz, AndrewY Ng, and Pranav Rajpurkar.Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning.Nature Biomedical Engineering, 6(12):1399–1406, 2022.
  • [88]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, etal.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023.
  • [89]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, etal.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023.
  • [90]Philipp Tschandl.The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions, 2018.
  • [91]AndruP Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel DeMathelin, and Nicolas Padoy.Endonet: a deep architecture for recognition tasks on laparoscopic videos.IEEE transactions on medical imaging, 36(1):86–97, 2016.
  • [92]Tom van Sonsbeek and Marcel Worring.X-tra: Improving chest x-ray tasks with cross-modal retrieval augmentation.In International Conference on Information Processing in Medical Imaging, pages 471–482. Springer, 2023.
  • [93]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN Gomez, Łukasz Kaiser, and Illia Polosukhin.Attention is all you need.Advances in neural information processing systems, 30, 2017.
  • [94]Chenglong Wang, Hang Zhou, Kaiyan Chang, Tongran Liu, Chunliang Zhang, Quan Du, Tong Xiao, and Jingbo Zhu.Learning evaluation models from large language models for sequence generation.arXiv preprint arXiv:2308.04386, 2023.
  • [95]Jiaqi Wang, Zhengliang Liu, Lin Zhao, Zihao Wu, Chong Ma, Sigang Yu, Haixing Dai, Qiushi Yang, Yiheng Liu, Songyao Zhang, Enze Shi, YiPan, Tuo Zhang, Dajiang Zhu, Xiang Li, XiJiang, Bao Ge, Yixuan Yuan, Dinggang Shen, Tianming Liu, and Shu Zhang.Review of large vision models and visual prompt engineering.Meta-Radiology, 1(3):100047, 2023.
  • [96]Thomas Wang, Adam Roberts, Daniel Hesslow, Teven LeScao, HyungWon Chung, IzBeltagy, Julien Launay, and Colin Raffel.What language model architecture and pretraining objective works best for zero-shot generalization?In International Conference on Machine Learning, pages 22964–22984. PMLR, 2022.
  • [97]Yaqing Wang, Jialin Wu, Tanmaya Dabral, Jiageng Zhang, Geoff Brown, Chun-Ta Lu, Frederick Liu, YiLiang, BoPang, Michael Bendersky, etal.Non-intrusive adaptation: Input-centric parameter-efficient fine-tuning for versatile multimodal modeling.arXiv preprint arXiv:2310.12100, 2023.
  • [98]Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang.Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning.arXiv preprint arXiv:2401.06805, 2024.
  • [99]Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun.Medclip: Contrastive learning from unpaired medical images and text.arXiv preprint arXiv:2210.10163, 2022.
  • [100]Jason Wei, YiTay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, etal.Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022.
  • [101]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, EdChi, QuocV Le, Denny Zhou, etal.Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022.
  • [102]Lirong Wen, Xuedong Yang, Dongyi Fu, Xinlong Wang, Peixian Cai, Xin Li, Tao Ma, Yayun Li, Lebin Xu, Dapeng Shang, etal.On the road with gpt-4v (ision): Early explorations of visual-language model on autonomous driving.In arXiv preprint arXiv:2311.05332, 2023.
  • [103]Anton Wiehe, Florian Schneider, Sebastian Blank, Xintong Wang, Hans-Peter Zorn, and Chris Biemann.Language over labels: Contrastive language supervision exceeds purely label-supervised classification performance on chest x-rays.In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 76–83, 2022.
  • [104]Honglin Xiong, Sheng Wang, Yitao Zhu, Zihao Zhao, Yuxiao Liu, Qian Wang, and Dinggang Shen.Doctorglm: Fine-tuning your chinese doctor is not a herculean task.arXiv preprint arXiv:2304.01097, 2023.
  • [105]Runxin Xu, Fuli Luo, Zhiyuan Zhang, Chuanqi Tan, Baobao Chang, Songfang Huang, and Fei Huang.Raise a child in large language model: Towards effective and generalizable fine-tuning.arXiv preprint arXiv:2109.05687, 2021.
  • [106]Hao Yang, Junyang Lin, AnYang, Peng Wang, Chang Zhou, and Hongxia Yang.Prompt tuning for generative multimodal pretrained models.arXiv preprint arXiv:2208.02532, 2022.
  • [107]Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu.Harnessing the power of llms in practice: A survey on chatgpt and beyond.2023.
  • [108]Zekun Yang, Linxi Li, Kevin Lin, Jing Wang, Chung-Cheng Lin, Zicheng Liu, and Lin Wang.The dawn of lmms: Preliminary explorations with gpt-4v (ision).In arXiv preprint arXiv:2309.17421, 2023.
  • [109]Shukang Yin, Chaoyou Fu, Sirui Zhao, KeLi, Xing Sun, Tong Xu, and Enhong Chen.A survey on multimodal large language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [110]Tao Yu, Yue Yao, Hang Zhang, Tianxiang He, Yizhuo Han, GeCui, Jiarong Hu, Zhenfeng Liu, Hua-Tong Zheng, Maosong Sun, etal.Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback.In arXiv preprint arXiv:2312.00849, 2023.
  • [111]Wenhai Yu, Zhe Yang, Linxi Li, Jing Wang, Kevin Lin, Zicheng Liu, Xin Wang, and Lin Wang.Mm-vet: Evaluating large multimodal models for integrated capabilities.In arXiv preprint arXiv:2308.02490, 2023.
  • [112]Yuexiang Zhai, Shengbang Tong, Xiao Li, MuCai, Qing Qu, YongJae Lee, and YiMa.Investigating the catastrophic forgetting in multimodal large language models.arXiv preprint arXiv:2309.10313, 2023.
  • [113]Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, LionelM Ni, and Heung-Yeung Shum.Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605, 2022.
  • [114]Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, YuQiao, and Kaipeng Zhang.Avibench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions.arXiv preprint arXiv:2403.09346, 2024.
  • [115]Lian Zhang, Zhengliang Liu, LuZhang, Zihao Wu, Xiaowei Yu, Jason Holmes, Hongying Feng, Haixing Dai, Xiang Li, Quanzheng Li, etal.Generalizable and promptable artificial intelligence model to augment clinical delineation in radiation oncology.Medical physics, 51(3):2187–2199, 2024.
  • [116]Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, MuWei, Naveen Valluri, etal.Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs.arXiv preprint arXiv:2303.00915, 2023.
  • [117]Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, XiVictoria Lin, etal.Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068, 2022.
  • [118]Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du.Explainability for large language models: A survey.ACM Transactions on Intelligent Systems and Technology, 15(2):1–38, 2024.
  • [119]Huan Zhao, Qian Ling, YiPan, Tianyang Zhong, Jin-Yu Hu, Junjie Yao, Fengqian Xiao, Zhenxiang Xiao, Yutong Zhang, San-Hua Xu, etal.Ophtha-llama2: A large language model for ophthalmology.arXiv preprint arXiv:2312.04906, 2023.
  • [120]WayneXin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, etal.A survey of large language models.arXiv preprint arXiv:2303.18223, 2023.
  • [121]Zihao Zhao, Sheng Wang, Jinchen Gu, Yitao Zhu, Lanzhuju Mei, Zixu Zhuang, Zhiming Cui, Qian Wang, and Dinggang Shen.Chatcad+: Towards a universal and reliable interactive cad using llms.arXiv preprint arXiv:2305.15964, 2023.
  • [122]Tianyang Zhong, Wei Zhao, Yutong Zhang, YiPan, Peixin Dong, Zuowei Jiang, Xiaoyan Kui, Youlan Shang, LiYang, Yaonai Wei, etal.Chatradio-valuer: a chat large language model for generalizable radiology report generation based on multi-institution and multi-system data.arXiv preprint arXiv:2310.05242, 2023.
  • [123]Tianyang Zhong, Wei Zhao, Yutong Zhang, YiPan, Peixin Dong, Zuowei Jiang, Xiaoyan Kui, Youlan Shang, LiYang, Yaonai Wei, Longtao Yang, Hao Chen, Huan Zhao, Yuxiao Liu, Ning Zhu, Yiwei Li, Yisong Wang, Jiaqi Yao, Jiaqi Wang, Ying Zeng, Lei He, Chao Zheng, Zhixue Zhang, Ming Li, Zhengliang Liu, Haixing Dai, Zihao Wu, LuZhang, Shu Zhang, Xiaoyan Cai, Xintao Hu, Shijie Zhao, XiJiang, Xin Zhang, Xiang Li, Dajiang Zhu, Lei Guo, Dinggang Shen, Junwei Han, Tianming Liu, Jun Liu, and Tuo Zhang.Chatradio-valuer: A chat large language model for generalizable radiology report generation based on multi-institution and multi-system data, 2023.
  • [124]Tongxue Zhou, SuRuan, and Stéphane Canu.A review: Deep learning for medical image segmentation using multi-modality fusion.Array, 3:100004, 2019.
  • [125]Deyao Zhu, Jun Chen, Xiaonan Shen, Xiang Li, and Mohamed Elhoseiny.Minigpt-4: Enhancing vision-language understanding with advanced large language models.In arXiv preprint arXiv:2304.10592, 2023.

Appendix A Appendix

A.1 Chest Radiography

[Figures 8–10: Additional chest radiography cases.]

A.2 Ophthalmological Imaging

[Figures 11–14: Additional ophthalmological imaging cases.]

A.3 Endoscopic Imaging

[Figures 15–20: Additional endoscopic imaging cases.]

A.4 Dermatological Imaging

[Figures 21–23: Additional dermatological imaging cases.]

A.5 Dental Imaging

[Figures 24–26: Additional dental imaging cases.]