AI Research
Vision-language foundation model for 3D medical imaging

Challenges and future directions
Expanding high-quality, real-world datasets of image-report pairs
A key challenge in advancing AI applications for 3D medical imaging, particularly for radiology report generation, is the lack of large, annotated datasets that encompass a wide range of pathologies across diverse patient populations. Comprehensive datasets are essential for training VLFMs that can accurately generate radiology reports from 3D images. Efforts to expand these datasets should focus on including a greater variety of 3D imaging types and ensuring detailed annotations that correlate imaging findings with clinical reports.
A significant limitation of current methodologies in radiology report generation from 3D medical images is their heavy reliance on metadata extracted from DICOM files. These metadata fields, which typically provide only basic information about image modality and the body parts imaged, are inherently restrictive. This approach yields low-quality ground truth that often lacks the depth and context necessary for nuanced, clinically relevant reports: the metadata fail to capture complex diagnostic information, critical nuances in pathology, and other subtleties essential for a comprehensive radiology report. Consequently, models trained on such datasets may develop a superficial understanding of the images, leading to generic and potentially inaccurate report generation.
This underscores the need for developing datasets that go beyond mere metadata to include rich, contextual annotations that directly relate specific imaging findings to detailed clinical insights. Collaborative initiatives with hospitals and research institutions to anonymize and share 3D imaging data could be vital in achieving this goal. The true potential of such collaborations can be realized through the establishment of large-scale, multi-modality imaging and report data consortiums. By pooling resources and datasets from diverse geographic and demographic sources, these consortia can create a more comprehensive and varied dataset that reflects a broader spectrum of pathologies, treatment outcomes, and patient populations.
This approach would not only enhance the volume and variety of data but also improve the robustness and generalizability of the AI models trained on them. Additionally, multi-site collaboration facilitates the standardization of data collection, annotation, and processing protocols, further enriching the quality of the data. Such an enriched dataset can serve as a cornerstone for developing more precise and contextually aware AI tools, ultimately leading to improved accuracy in medical imaging report generation and better patient care outcomes.
Domain-specific insights in medical imaging beyond general computer vision
To significantly improve the performance of VLFMs in generating accurate radiology reports from 3D medical images, it is crucial to focus on both the development and refinement of model architectures tailored to the inherent complexities of 3D medical scans9. The intricate spatial relationships and detailed anatomical structures present in these images necessitate the use of enhanced 3D convolutional layers, specifically designed to better capture the spatial hierarchies essential for accurately interpreting medical images.
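To make the spatial-hierarchy point concrete, the following is an illustrative sketch (not taken from any of the reviewed models) of how stacking 3D convolutional layers shrinks the feature volume while growing the receptive field, i.e., the amount of anatomical context each output unit can see. The layer stack and input volume size below are assumptions chosen for the example.

```python
# Illustrative sketch: how stacked 3D convolutions grow the receptive field
# over a CT/MRI volume, which is why deeper 3D stacks capture larger
# anatomical context from volumetric scans.

def conv3d_output_shape(shape, kernel=3, stride=1, padding=1):
    """Output (D, H, W) of a single 3D convolution layer."""
    return tuple((s + 2 * padding - kernel) // stride + 1 for s in shape)

def receptive_field(layers):
    """Receptive field (in voxels) after a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# A hypothetical 4-layer 3D encoder: two stride-1 layers, two stride-2 downsamples.
stack = [(3, 1), (3, 1), (3, 2), (3, 2)]
shape = (64, 256, 256)  # depth, height, width of an input volume
for k, s in stack:
    shape = conv3d_output_shape(shape, kernel=k, stride=s, padding=1)
print(shape)                   # -> (16, 64, 64): spatial size after the stack
print(receptive_field(stack))  # -> 11 voxels of context per output unit
```

The same bookkeeping explains why purely 2D slice-wise encoders miss through-plane context: their receptive field never grows along the depth axis.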
Moreover, the integration of advanced language processing modules is indispensable. These modules must not only understand the clinical language but also articulate medical findings with high precision, effectively incorporating medical terminologies and nuanced patient data. Such capabilities require a deep fusion of visual and textual understanding within the model architecture, ensuring that the generated reports are both medically accurate and contextually relevant.
Further augmenting the efficacy of these models, advanced training techniques like multi-task learning play a pivotal role. By enabling the model to simultaneously learn to identify specific medical conditions from 3D images and generate descriptive, clinically relevant text, multi-task learning enhances the model’s ability to handle multiple tasks that mirror the workflow of human radiologists. This approach ensures a more holistic learning process, fostering models that are not only technically proficient but also practically applicable in clinical settings.
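The multi-task setup described above can be sketched as a single objective combining a condition-classification loss with a report-generation loss. This is a minimal stdlib illustration of the idea only; the loss forms and the equal weighting are assumptions, not the formulation of any specific reviewed model.

```python
# Hedged sketch of multi-task learning: one shared model is trained on a
# weighted sum of a classification loss (identify conditions in the 3D image)
# and a generation loss (produce the report text).

import math

def cross_entropy(probs, label):
    """Classification loss for one example (probs over condition classes)."""
    return -math.log(probs[label])

def token_nll(token_probs):
    """Generation loss: mean negative log-likelihood of the gold report tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def multitask_loss(cls_probs, cls_label, gen_token_probs, w_cls=0.5, w_gen=0.5):
    """Weighted sum of the two task losses optimized jointly."""
    return w_cls * cross_entropy(cls_probs, cls_label) + w_gen * token_nll(gen_token_probs)

# Toy example: a 3-class finding head and a 4-token report fragment.
loss = multitask_loss([0.1, 0.7, 0.2], 1, [0.9, 0.8, 0.95, 0.7])
print(round(loss, 3))  # -> 0.27
```

In practice the weights are tuned (or learned) so neither task dominates the shared encoder's gradients.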
In addition to these architectural and training enhancements, the application of anatomical guidance tools such as TotalSegmentator can revolutionize model training32. By allowing precise segmentation of specific organs or regions within 3D scans, these tools embed anatomical guidance into the image-text pair alignments. This guidance significantly aids the model in distinguishing between different anatomical features and their corresponding clinical descriptions, thereby refining the accuracy and relevance of the generated reports. Collectively, these strategies form a robust approach to overcoming current limitations and setting new benchmarks in the AI-driven generation of radiology reports from complex 3D medical imaging data.
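One way to picture anatomical guidance in image-text alignment: per-organ segmentation masks let each report sentence be routed to the sub-volume it describes, so contrastive pairs are organ-specific rather than whole-scan. The keyword-routing heuristic and organ list below are simplifying assumptions for illustration, not the alignment method of any cited system.

```python
# Illustrative sketch: route report sentences to segmented organs so that
# each text span can be paired with the matching masked sub-volume during
# image-text alignment.

ORGAN_KEYWORDS = {
    "liver":  ["liver", "hepatic"],
    "lung":   ["lung", "pulmonary"],
    "kidney": ["kidney", "renal"],
}

def align_sentences_to_organs(report_sentences):
    """Map each organ to the report sentences that mention it."""
    alignment = {organ: [] for organ in ORGAN_KEYWORDS}
    for sent in report_sentences:
        lowered = sent.lower()
        for organ, keywords in ORGAN_KEYWORDS.items():
            if any(k in lowered for k in keywords):
                alignment[organ].append(sent)
    return alignment

report = [
    "Hepatic lesion measuring 2 cm in segment IV.",
    "No pulmonary nodules identified.",
    "Kidneys are unremarkable.",
]
pairs = align_sentences_to_organs(report)
print(pairs["liver"])  # sentences to contrast with the liver mask sub-volume
```

Real systems would replace the keyword lookup with learned sentence-to-region attention, but the data structure — organ-scoped image-text pairs — is the same.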
Advanced metrics for assessing VLFMs in medical imaging report accuracy and clinical utility
Current metrics for evaluating VLFMs in medical imaging often fall short. These metrics, adapted from traditional NLP models, primarily measure textual similarity rather than clinical accuracy. This limitation can result in high scores for reports that appear textually accurate but miss critical diagnostic findings, impressions, and recommendations.
Although there are some efforts to include radiologists’ subjective evaluations33, these studies are limited to 2D chest X-rays and do not address the complexities of 3D imaging, which is more relevant to actual diagnostics and treatment. Evaluating VLFMs in 3D imaging requires more sophisticated metrics.
Studies29,34,35,36 have highlighted the flaws in current metrics. BLEU struggles with identifying false findings, while BERTScore has higher errors in locating findings compared to CheXbert. These issues underscore the need for improved evaluation methods. The proposed RadCliQ29 metric combines existing metrics with linear regression to create a more balanced evaluation model. However, RadCliQ’s reliance on text overlap-based metrics and its testing on 2D datasets reveal limitations when applied to 3D imaging.
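The RadCliQ idea — combining several metric scores into one composite via linear regression — can be sketched in a few lines. The weights below are made up for the example and are not RadCliQ's published coefficients; only the structure (a learned linear combination over per-metric scores) reflects the cited approach.

```python
# Toy illustration of a RadCliQ-style composite: individual metric scores for
# a generated report are combined with linear-regression weights into one
# quality estimate. Weights here are hypothetical placeholders.

def composite_score(metric_scores, weights, bias=0.0):
    """Weighted linear combination of per-metric scores."""
    assert metric_scores.keys() == weights.keys()
    return bias + sum(weights[m] * metric_scores[m] for m in metric_scores)

scores  = {"bleu": 0.31, "bertscore": 0.78, "chexbert_f1": 0.64}
weights = {"bleu": 0.20, "bertscore": 0.30, "chexbert_f1": 0.50}  # hypothetical
print(round(composite_score(scores, weights), 3))  # -> 0.616
```

The limitation noted above is visible in the structure itself: if the component metrics reward text overlap, the composite inherits that bias regardless of how the weights are fit.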
Future research should focus on developing metrics that accurately assess the clinical relevance of the generated reports, including diagnostic accuracy and terminology appropriateness. Advanced NLP techniques could also compare generated reports with a database of clinician-validated reports. By improving these metrics, researchers can establish more effective benchmarks for VLFMs in 3D medical imaging, ensuring the generated reports are not only accurate but also valuable in clinical practice, thereby enhancing patient care and diagnostic processes.
Controversy on simulated/synthesized 3D data for model training
The use of simulated or augmented 3D data in training VLFMs for medical applications presents significant challenges alongside its benefits. While it fills gaps in data availability, especially for rare or complex diagnostic scenarios, there is substantial debate about its reliability. Critics33,37 argue that because simulated data do not originate from actual clinical experiences, they may not accurately reflect the complexities and variability necessary for training truly effective medical AI models.
Despite these concerns, advancements in generative AI offer promising solutions to enhance the reliability and quality control of synthesized data38,39,40. High-quality, AI-generated 3D datasets can now mimic real-world data with greater fidelity, reducing the gap between simulated and actual clinical scenarios. This evolution in data generation technology allows for an augmentation of existing 3D datasets, enhancing the breadth and depth of training environments for VLFMs without compromising the integrity of the models. By utilizing sophisticated generative techniques, including 3D-GAN41, 3D diffusion probabilistic model42, Neural Radiance Fields43 and so on, developers can create augmented 3D datasets that are not only diverse but also closely aligned with real clinical conditions. These 3D datasets provide a controlled environment for testing and validating VLFMs, ensuring that the models are exposed to a wide range of medical scenarios, including rare and complex conditions. This approach not only enhances the model’s diagnostic capabilities but also ensures that the VLFMs are robust and reliable, thereby potentially improving healthcare outcomes through more accurate and comprehensive medical imaging reports.
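As a minimal, self-contained illustration of one generative family cited above, the forward-noising step of a diffusion probabilistic model mixes a clean sample x0 with Gaussian noise at timestep t via the closed form x_t = sqrt(abar_t)·x0 + sqrt(1 − abar_t)·eps; a learned network is then trained to reverse this process. The linear noise schedule and toy voxel values below are illustrative assumptions.

```python
# Sketch of the diffusion forward process (the fixed, non-learned half of a
# diffusion probabilistic model) applied to a tiny list of voxel intensities.

import math, random

def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for a linear noise schedule."""
    abar = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        abar *= 1.0 - beta
    return abar

def noise_step(x0, t, rng):
    """Sample x_t for a flat list of normalized voxel intensities x0."""
    abar = alpha_bar(t)
    a, b = math.sqrt(abar), math.sqrt(1.0 - abar)
    return [a * v + b * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
x0 = [0.2, 0.5, 0.9]              # a tiny "volume" of normalized intensities
x_noisy = noise_step(x0, t=500, rng=rng)
print(len(x_noisy))               # same shape as the input, now noised
```

Sampling synthetic volumes amounts to learning the reverse of this step; quality control then reduces to checking how faithfully the reversed trajectories land back on clinically plausible anatomy.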
Addressing computational and clinical challenges in 3D transformer-based models
The integration of 3D transformer-based vision-language models in medical imaging represents a significant advancement, holding considerable promise for improved diagnostic accuracy and clinical reporting. However, the translation of these models from research prototypes to clinically deployable solutions faces computational and clinical challenges that must be thoroughly addressed.
A primary obstacle to the widespread clinical adoption of traditional 3D transformers is their computational complexity. Specifically, these models suffer from quadratic memory and computational requirements associated with self-attention mechanisms, especially when processing high-resolution volumetric medical data44,45. Due to these constraints, practical implementations often necessitate significant compromises in input data resolution, thereby reducing the ability to capture fine-grained anatomical details essential for clinical diagnostics, such as subtle abnormalities, delicate vascular structures, nodules, and small lesions46,47,48. Consequently, these limitations directly impact the clinical applicability and diagnostic utility of transformer-based vision-language systems.
To overcome these computational challenges, recent developments in transformer architectures have introduced promising solutions. Hierarchical 3D swin transformers, for example, employ multi-scale hierarchical structures and locality-sensitive computations to substantially reduce memory usage without sacrificing model accuracy49. Additionally, sparse attention strategies, such as axial attention50 and windowed attention mechanisms51, have emerged as effective methods to selectively prioritize computations, significantly decreasing resource requirements while maintaining robust performance. Furthermore, hybrid CNN-transformer models exploit the complementary strengths of convolutional neural networks and transformers, combining efficient local feature extraction from CNNs with the global contextual understanding facilitated by transformers, thereby achieving a practical balance between computational efficiency and clinical efficacy52,53. Recent advances, such as the hierarchical attention approach proposed by Zhou et al.54, exemplify these improvements by considerably reducing memory usage and computational demands. This strategy enables the analysis of higher-resolution volumetric data, resulting in improved performance in clinically relevant segmentation tasks, including pulmonary vessels and airways segmentation54. Such innovative solutions demonstrate significant progress toward resolving existing barriers, yet further advancements remain essential.
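The quadratic-versus-windowed trade-off discussed above is easy to quantify with back-of-the-envelope arithmetic: full self-attention over N tokens materializes an N×N score matrix, while windowed attention stores only W×W scores per local window. The volume and patch sizes below are illustrative assumptions, not figures from the cited architectures.

```python
# Back-of-the-envelope sketch of why windowed attention helps with
# high-resolution volumetric data: attention-score storage is quadratic in
# token count for full attention but linear for windowed attention.

def num_tokens(vol, patch):
    """Token count after patchifying a (D, H, W) volume."""
    d, h, w = (v // p for v, p in zip(vol, patch))
    return d * h * w

def full_attention_scores(n):
    return n * n                        # quadratic in token count

def windowed_attention_scores(n, window):
    n_windows = n // window
    return n_windows * window * window  # linear in token count

n = num_tokens((128, 256, 256), (4, 16, 16))  # 32 * 16 * 16 = 8192 tokens
print(full_attention_scores(n))               # 67,108,864 score entries
print(windowed_attention_scores(n, 128))      # 1,048,576 score entries
```

A 64x reduction in score storage at this resolution is why practical 3D models can afford finer patches — and thus finer anatomical detail — once attention is localized.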
To address the remaining computational and clinical challenges comprehensively, future research should prioritize several key directions. Enhanced hierarchical attention methods capable of dynamically allocating computational resources based on clinical significance represent a promising area of exploration. Additionally, the development of transformer architectures specifically tailored for medical imaging, incorporating domain knowledge to optimize performance and efficiency, remains a critical research priority. Finally, investigating memory-efficient training paradigms, including model distillation, quantization, and efficient training strategies, will be crucial to improving practical feasibility. By explicitly recognizing and systematically addressing these computational and clinical limitations, this review aims to provide valuable insights and actionable guidance for researchers committed to developing practical, efficient, and clinically impactful 3D transformer-based vision-language models in radiology.
Lessons learned from 2D VLFMs and a roadmap for future research on 3D VLFMs
The development and deployment of VLFMs in clinical environments represent promising yet challenging objectives. Although 2D VLFMs have exhibited encouraging results in controlled research environments, their real-world applicability and acceptance in clinical workflows have thus far remained limited. Understanding these limitations offers valuable insights that can guide the development and eventual clinical translation of more complex 3D VLFMs.
Several critical insights have been identified from examining the limitations inherent to existing 2D VLFMs. First, interpretability and clinical trust present significant challenges. Despite high quantitative performance demonstrated in research studies, clinicians frequently express reservations regarding the interpretability and transparency of these models’ decisions, underscoring the necessity of incorporating explainability methods, uncertainty quantification, and clear visual justifications into model predictions. Second, the issue of domain generalization and robustness is a crucial barrier. Models trained on datasets from specific institutions or limited imaging protocols often struggle to generalize effectively to diverse clinical environments. This limitation highlights the importance of robust training methodologies, domain adaptation techniques, and comprehensive validation on heterogeneous datasets. Third, computational efficiency and practical feasibility remain pivotal for real-world deployment. Despite being simpler than 3D models, current 2D VLFMs still encounter challenges related to computational demands, limiting their integration into clinical workflows, especially in resource-constrained settings. Lastly, regulatory approval and ethical considerations significantly impact clinical adoption. Factors such as patient privacy, data security, and algorithmic biases must be comprehensively addressed to facilitate the real-world integration of these models into healthcare practice. These lessons and their corresponding implications for clinical adoption are summarized in Table 3.
Based on these insights from 2D VLFMs, we propose the following practical roadmap for future research and clinical integration of 3D VLFMs, which is shown in Fig. 4. Explicitly integrating these considerations into the research agenda will significantly enhance the likelihood of translating 3D VLFMs into clinically valuable tools, ultimately improving patient care through more precise and insightful medical image interpretation.
In the short term (1–2 years), efforts should prioritize establishing technical feasibility and computational efficiency. This involves reducing computational complexity by developing memory-efficient transformer architectures, such as hierarchical (e.g., Swin), axial, or sparse attention mechanisms, specifically tailored for medical 3D imaging. Additionally, exploring hybrid modeling strategies that integrate transformer models with convolutional neural networks will help balance interpretability, accuracy, and computational demands. Initial validations can be conducted on standard public medical imaging datasets, focusing on benchmarking tasks such as fine-grained abnormality detection and segmentation (e.g., lung nodules, brain tumors).
The mid-term goals (2–4 years) should focus on enhancing clinical relevance and interpretability. This phase includes integrating transformer models with advanced explainability techniques, including attention visualization and feature attribution methods. Furthermore, close collaboration with clinical experts is essential to define realistic clinical scenarios such as preliminary report generation, screening assistance, and triage prioritization. Robustness and generalizability must be thoroughly validated through multi-institutional studies encompassing diverse populations and imaging modalities, alongside standardized reporting criteria for consistent evaluation.
In the long term (4–7 years), extensive prospective clinical validation studies are essential, involving large-scale, multicenter clinical trials to rigorously demonstrate the clinical effectiveness, safety, and value of these models. Concurrently, proactive engagement with regulatory bodies (e.g., FDA, CE marking) is crucial to ensure compliance with ethical standards, transparency, fairness, bias mitigation, and reproducibility.
Finally, successful integration into clinical workflows requires the development of interoperability standards, facilitating seamless integration with existing healthcare systems such as PACS, RIS, and EMR. Post-deployment, continuous monitoring and evaluation must be maintained to ensure sustained clinical benefit and system reliability.
Overall, the primary aimed scenarios for VLFMs focus on realistic, practical applications rather than full automation of medical diagnosis. Specifically, VLFMs are well-suited for supportive roles such as preliminary anomaly detection and triage assistance to streamline clinical workflows, as well as assisting radiologists in generating structured clinical reports by automating routine descriptive tasks. Additionally, patient-focused applications involve providing simplified, patient-friendly explanations of imaging results, enhancing patient understanding and health literacy. Emphasizing interpretability, robust validation, clinician collaboration, and integration into existing clinical workflows will ensure these models have meaningful clinical impact and broad acceptance.
If I Could Only Buy 1 Artificial Intelligence (AI) Chip Stock Over The Next 10 Years, This Would Be It (Hint: It’s Not Nvidia)

While Nvidia continues to capture headlines, a critical enabler of the artificial intelligence (AI) infrastructure boom may be better positioned for long-term gains.
When investors debate the future of the artificial intelligence (AI) trade, the conversation generally finds its way back to the usual suspects: Nvidia, Advanced Micro Devices, and cloud hyperscalers like Microsoft, Amazon, and Alphabet.
Each of these companies is racing to design GPUs or develop custom accelerators in-house. But behind this hardware, there’s a company that benefits no matter which chip brand comes out ahead: Taiwan Semiconductor Manufacturing (TSM).
Let’s unpack why Taiwan Semi is my top AI chip stock over the next 10 years, and assess whether now is an opportune time to scoop up some shares.
Agnostic to the winner, leveraged to the trend
As the world’s leading semiconductor foundry, TSMC manufactures chips for nearly every major AI developer — from Nvidia and AMD to Amazon’s custom silicon initiatives, dubbed Trainium and Inferentia.
Unlike many of its peers in the chip space that rely on new product cycles to spur demand, Taiwan Semi’s business model is fundamentally agnostic. Whether demand is allocated toward GPUs, accelerators, or specialized cloud silicon, all roads lead back to TSMC’s fabrication capabilities.
With nearly 70% market share in the global foundry space, Taiwan Semi’s dominance is hard to ignore. Such a commanding lead over the competition provides the company with unmatched structural demand visibility — a trend that appears to be accelerating as AI infrastructure spend remains on the rise.
Scaling with more sophisticated AI applications
At the moment, AI development is still concentrated on training and refining large language models (LLMs) and embedding them into downstream software applications.
The next wave of AI will expand into far more diverse and demanding use cases, such as autonomous systems, robotics, and quantum computing, all of which remain in their infancy. At scale, these workloads will place greater demands on silicon than today’s chips can support.
Meeting these demands doesn’t simply require additional investments in chips. Rather, it requires chips engineered for new levels of efficiency, performance, and power management. This is where TSMC’s competitive advantages begin to compound.
With each successive generation of process technology, the company has a unique opportunity to widen the performance gap between itself and rivals like Samsung or Intel.
Since Taiwan Semi already has such a large footprint in the foundry landscape, next-generation design complexities give the company a chance to further lock in deeper, stickier customer relationships.
TSMC’s valuation and the case for expansion
Taiwan Semi may trade at a forward price-to-earnings (P/E) ratio of 24, but dismissing the stock as “expensive” overlooks the company’s extraordinary positioning in the AI realm. To me, the company’s valuation reflects a robust growth outlook, improving earnings prospects, and a declining risk premium.
Unlike many of its semiconductor peers, which are vulnerable to cyclicality headwinds, TSMC has become an indispensable utility for many of the world’s largest AI developers, evolving into one of the backbones of the ongoing infrastructure boom.
The scale of investment behind current AI infrastructure is jaw-dropping. Hyperscalers are investing staggering sums to expand and modernize data centers, and at the heart of each new buildout is an unrelenting demand for more chips. Moreover, each of these companies is exploring more advanced use cases that will, at some point, require next-generation processing capabilities.
These dynamics position Taiwan Semi at the crossroads of immediate growth and enduring long-term expansion, as AI infrastructure matures from a driver of growth today into a multidecade secular theme.
TSMC’s manufacturing dominance ensures that its services will continue to witness robust demand for years to come. For this reason, I think Taiwan Semi is positioned to experience further valuation expansion over the next decade as the infrastructure chapter of the AI story continues to unfold.
While there are many great opportunities in the chip space, TSMC stands alone. I see it as perhaps the most unique, durable semiconductor stock to own amid a volatile technology landscape over the next several years.
Adam Spatacco has positions in Alphabet, Amazon, Microsoft, and Nvidia. The Motley Fool has positions in and recommends Advanced Micro Devices, Alphabet, Amazon, Intel, Microsoft, Nvidia, and Taiwan Semiconductor Manufacturing. The Motley Fool recommends the following options: long January 2026 $395 calls on Microsoft, short August 2025 $24 calls on Intel, short January 2026 $405 calls on Microsoft, and short November 2025 $21 puts on Intel. The Motley Fool has a disclosure policy.
Researchers train AI to diagnose heart failure in rural patients using low-tech electrocardiograms

Concerned about the ability of artificial intelligence models trained on data from urban demographics to make the right medical diagnoses for rural populations, West Virginia University computer scientists have developed several AI models that can identify signs of heart failure in patients from Appalachia.
Prashnna Gyawali, assistant professor in the Lane Department of Computer Science and Electrical Engineering at the WVU Benjamin M. Statler College of Engineering and Mineral Resources, said heart failure—a chronic, persistent condition in which the heart cannot pump enough blood to meet the body’s need for oxygen—is one of the most pressing national and global health issues, and one that hits rural regions of the U.S. especially hard.
Despite the outsized impact of heart failure on rural populations, AI models are currently being trained to diagnose the disease using data representing patients from urban and suburban areas like Stanford, California, Gyawali said.
“Imagine Jane Doe, a 62-year-old woman living in a rural Appalachian community,” he suggested. “She has limited access to specialty care, relies on a small local clinic, and her lifestyle, diet and health history reflect the realities of her environment: high physical labor, minimal preventive care, and increased exposure to environmental risk factors like coal dust or poor air quality. Jane begins to experience fatigue and shortness of breath—symptoms that could point to heart failure.
“An AI system, trained primarily on data from urban hospitals in more affluent, coastal areas, evaluates Jane’s lab results. But because the system was not trained on patients who share Jane’s socioeconomic and environmental context, it fails to recognize her condition as urgent or abnormal,” Gyawali said. “This is why this work matters. By training AI models on data from West Virginia patients, we aim to ensure people like Jane receive accurate diagnoses, no matter where they live or how their lives differ from national averages.”
The researchers identified the AI models that were most accurate at diagnosing heart failure in an anonymized sample of more than 55,000 patients who received medical care in West Virginia. They also pinpointed the exact parameters for providing the AI models with data that most enhanced diagnostic accuracy. The findings appear in Scientific Reports, a Nature portfolio journal.
Doctoral student Alina Devkota emphasized they trained the AI models to work from patients’ electrocardiogram results, rather than the echocardiogram readings typical for patient data from urban areas.
Electrocardiograms rely on round electrodes stuck to the patient’s torso to record electrical signals from the heart. According to Devkota, they don’t require specialized equipment or training to operate, but they still provide valuable insights into heart function.
“One of the criteria to diagnose heart failure is by measuring the ‘ejection fraction,’ or how much blood is pumped out of the heart with every beat, and the gold standard for doing that is with echocardiography, which uses sound waves to create images of the heart and the blood flowing through its valves,” she said.
“But echocardiography is expensive, time-consuming and often unavailable to patients in the very same rural Appalachian states that have the highest prevalence of heart failure across the nation. West Virginia, for example, ranks first in the U.S. for the prevalence of heart attack and coronary heart disease, but many West Virginians don’t have local access to high-tech echocardiograms. They do have access to inexpensive electrocardiograms, so we tested whether AI models could use electrocardiogram readings to predict a patient’s ejection fraction.”
Devkota, Gyawali and their colleagues trained several AI models on patient records from 28 hospitals across West Virginia. The AI models used either “deep learning,” which relies on multilayered neural networks, or “non-deep learning,” which relies on simpler algorithms, to analyze the patient records and draw conclusions.
The researchers found the deep-learning models, particularly one called ResNet, did best at correctly predicting a patient’s ejection fraction based on data from 12-lead electrocardiograms, with the results suggesting that a larger dataset for training would yield even better results. They also found that providing the AI models with specific “leads,” or combinations of data from different electrode pairs, affected how accurate the models’ ejection fraction predictions were.
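To make the "leads" idea concrete, a 12-lead ECG is commonly stored as 12 channels of samples, and experiments like those described above compare different lead subsets as model input. The sketch below is illustrative only — it is not the WVU team's model, and the per-lead mean-amplitude feature is a stand-in for the learned representation a ResNet would extract.

```python
# Illustrative sketch: selecting lead subsets from a 12-lead ECG recording,
# the kind of input-configuration choice the study compared. Signal values
# below are made-up toy data.

LEAD_NAMES = ["I", "II", "III", "aVR", "aVL", "aVF",
              "V1", "V2", "V3", "V4", "V5", "V6"]

def select_leads(ecg, lead_subset):
    """Keep only the requested leads from a {lead_name: samples} recording."""
    return {lead: ecg[lead] for lead in lead_subset}

def mean_abs_amplitude(samples):
    """A toy per-lead feature standing in for a learned representation."""
    return sum(abs(s) for s in samples) / len(samples)

# Toy 2-sample traces for a hypothetical recording.
ecg = {name: [0.1 * i, -0.2 * i] for i, name in enumerate(LEAD_NAMES, start=1)}
subset = select_leads(ecg, ["I", "II"])
features = {lead: mean_abs_amplitude(sig) for lead, sig in subset.items()}
print(features)
```

Varying `lead_subset` while holding the downstream model fixed is one simple way to replicate the kind of lead-combination comparison the article describes.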
Gyawali said while AI models are not yet being used in clinical practice due to reliability concerns, training an AI to successfully estimate ejection fraction from electrocardiogram signals could soon give clinicians an edge in protecting patients’ cardiac health.
“Heart failure affects more than six million Americans today, and factors like our aging population mean the risk is growing rapidly—approximately 1 in 4 people alive today will experience heart failure during their lifetimes. The prevalence is even higher in rural Appalachia, so it’s critical the people here do not continue to be overlooked.”
Additional WVU contributors to the research included Rukesh Prajapati, graduate research assistant; Amr El-Wakeel, assistant professor; Donald Adjeroh, professor and chair for computer science; and Brijesh Patel, assistant professor in the WVU Health Sciences School of Medicine.
More information:
AI analysis for ejection fraction estimation from 12-lead ECG, Scientific Reports (2025). DOI: 10.1038/s41598-025-97113-0
Citation: Researchers train AI to diagnose heart failure in rural patients using low-tech electrocardiograms (2025, August 31), retrieved 31 August 2025 from https://medicalxpress.com/news/2025-08-ai-heart-failure-rural-patients.html