

Vision-language foundation model for 3D medical imaging


Challenges and future directions

Enhancement of high-quality, real-world datasets of image-report pairs

A key challenge in advancing AI applications for 3D medical imaging, particularly for radiology report generation, is the lack of large, annotated datasets that encompass a wide range of pathologies across diverse patient populations. Comprehensive datasets are essential for training VLFMs that can accurately generate radiology reports from 3D images. Efforts to expand these datasets should focus on including a greater variety of 3D imaging types and ensuring detailed annotations that correlate imaging findings with clinical reports.

A significant limitation of current methodologies in radiology report generation from 3D medical images is their heavy reliance on metadata extracted from DICOM files. These metadata fields typically provide only basic information, such as the imaging modality and the body parts imaged, and are inherently restrictive. Relying on them yields low-quality ground truth that lacks the depth and context necessary for nuanced, clinically relevant reports: such fields capture neither complex diagnostic information nor the critical subtleties of pathology essential to a comprehensive radiology report. Consequently, models trained on such datasets may develop only a superficial understanding of the images, leading to generic and potentially inaccurate report generation.
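To make the restriction concrete, the minimal sketch below (assuming pydicom is installed; the file path is illustrative) reads the handful of header fields such pipelines typically rely on. None of them carry findings, impressions, or clinical context.

```python
# Minimal sketch: the few DICOM header fields often used as weak labels.
# Assumes pydicom is installed; "scan.dcm" is an illustrative path.
import pydicom

ds = pydicom.dcmread("scan.dcm", stop_before_pixels=True)

# Standard tags describing *what* was scanned, not what the scan shows.
weak_labels = {
    "Modality": ds.get("Modality", ""),                  # e.g., "CT"
    "BodyPartExamined": ds.get("BodyPartExamined", ""),  # e.g., "CHEST"
    "StudyDescription": ds.get("StudyDescription", ""),  # often terse free text
}
print(weak_labels)  # nothing here approaches a radiology report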

This underscores the need to develop datasets that go beyond mere metadata to include rich, contextual annotations that directly relate specific imaging findings to detailed clinical insights. Collaborative initiatives with hospitals and research institutions to anonymize and share 3D imaging data could be vital to achieving this goal. The true potential of such collaborations can be realized through the establishment of large-scale, multi-modality imaging and report data consortia. By pooling resources and datasets from diverse geographic and demographic sources, these consortia can create a more comprehensive and varied dataset that reflects a broader spectrum of pathologies, treatment outcomes, and patient populations.

This approach would not only enhance the volume and variety of data but also improve the robustness and generalizability of the AI models trained on them. Additionally, multi-site collaboration facilitates the standardization of data collection, annotation, and processing protocols, further enriching the quality of the data. Such an enriched dataset can serve as a cornerstone for developing more precise and contextually aware AI tools, ultimately leading to improved accuracy in medical imaging report generation and better patient care outcomes.

Domain-specific insights in medical imaging beyond general computer vision

To significantly improve the performance of VLFMs in generating accurate radiology reports from 3D medical images, it is crucial to focus on both the development and refinement of model architectures tailored to the inherent complexities of 3D medical scans9. The intricate spatial relationships and detailed anatomical structures present in these images necessitate the use of enhanced 3D convolutional layers, specifically designed to better capture the spatial hierarchies essential for accurately interpreting medical images.
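As a minimal illustration (a PyTorch sketch; channel widths and depths are illustrative, not a published architecture), a strided 3D convolutional stem builds exactly this kind of spatial hierarchy by progressively downsampling the volume while widening its feature channels:

```python
# Minimal sketch of a 3D convolutional stem for volumetric scans.
# Channel widths are illustrative, not from any specific model.
import torch
import torch.nn as nn

class Conv3dStem(nn.Module):
    def __init__(self, in_ch: int = 1, widths=(32, 64, 128)):
        super().__init__()
        layers, prev = [], in_ch
        for w in widths:
            layers += [
                nn.Conv3d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.InstanceNorm3d(w),
                nn.GELU(),
            ]
            prev = w
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (B, C, D, H, W)
        return self.net(x)

vol = torch.randn(1, 1, 64, 128, 128)   # toy CT volume
print(Conv3dStem()(vol).shape)           # (1, 128, 8, 16, 16)
```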

Moreover, the integration of advanced language processing modules is indispensable. These modules must not only understand the clinical language but also articulate medical findings with high precision, effectively incorporating medical terminologies and nuanced patient data. Such capabilities require a deep fusion of visual and textual understanding within the model architecture, ensuring that the generated reports are both medically accurate and contextually relevant.
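One common way to realize such fusion, shown below as a hedged sketch with illustrative dimensions, is cross-attention in which report tokens query the 3D image patch embeddings:

```python
# Minimal sketch of vision-language fusion via cross-attention:
# report tokens (queries) attend over 3D patch embeddings (keys/values).
# All dimensions are illustrative.
import torch
import torch.nn as nn

fusion = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
text_tokens = torch.randn(1, 128, 256)   # decoder states for report tokens
image_tokens = torch.randn(1, 512, 256)  # flattened 3D patch embeddings
fused, _ = fusion(query=text_tokens, key=image_tokens, value=image_tokens)
print(fused.shape)  # (1, 128, 256): text tokens enriched with image context
```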

Further augmenting the efficacy of these models, advanced training techniques like multi-task learning play a pivotal role. By enabling the model to simultaneously learn to identify specific medical conditions from 3D images and generate descriptive, clinically relevant text, multi-task learning enhances the model’s ability to handle multiple tasks that mirror the workflow of human radiologists. This approach ensures a more holistic learning process, fostering models that are not only technically proficient but also practically applicable in clinical settings.
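Concretely, such an objective is often just a weighted sum of per-task losses over a shared encoder. The sketch below is a minimal, hypothetical illustration (the tensor shapes, ignore_index convention, and weight lam are assumptions, not a published recipe):

```python
# Illustrative multi-task loss: (a) multi-label findings classification
# plus (b) autoregressive report generation. Shapes and the weight `lam`
# are hypothetical placeholders.
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_targets, token_logits, token_targets,
                   lam: float = 1.0):
    # (a) e.g., nodule / effusion / ... present or absent
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    # (b) standard token-level cross-entropy for the generated report
    l_gen = F.cross_entropy(token_logits.flatten(0, 1),
                            token_targets.flatten(), ignore_index=-100)
    return l_cls + lam * l_gen
```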

In addition to these architectural and training enhancements, the application of anatomical guidance tools such as TotalSegmentator can revolutionize model training32. By allowing precise segmentation of specific organs or regions within the 3D scans, these tools help create anatomical guidance in the image-text pair alignments. This guidance significantly aids the model in distinguishing between different anatomical features and their corresponding clinical descriptions, thereby refining the accuracy and relevance of the generated reports. Collectively, these strategies form a robust approach to overcoming current limitations and setting new benchmarks in the AI-driven generation of radiology reports from complex 3D medical imaging data.
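In practice, such guidance can be bootstrapped by running TotalSegmentator to obtain per-organ masks and restricting image crops to the organ a report sentence describes. The sketch below is hedged: the file names are illustrative, the sentence-to-organ routing is assumed to happen upstream, and the Python API's exact signature may vary across versions (consult the tool's documentation).

```python
# Hedged sketch: organ masks from TotalSegmentator as anatomical anchors
# for image-text alignment. Paths are illustrative; API details may vary
# by version.
import nibabel as nib
from totalsegmentator.python_api import totalsegmentator

totalsegmentator("ct.nii.gz", "seg_out")  # writes one NIfTI mask per structure

ct = nib.load("ct.nii.gz").get_fdata()
liver_mask = nib.load("seg_out/liver.nii.gz").get_fdata()

# Pair this masked sub-volume with liver-related report sentences
# (the sentence routing itself is assumed to happen upstream).
liver_roi = ct * (liver_mask > 0)
```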

Advanced metrics for assessing VLFMs in medical imaging report accuracy and clinical utility

Current metrics for evaluating VLFMs in medical imaging often fall short. These metrics, adapted from traditional NLP models, primarily measure textual similarity rather than clinical accuracy. This limitation can result in high scores for reports that appear textually accurate but miss critical diagnostic findings, impressions, and recommendations.

Although there are some efforts to include radiologists’ subjective evaluations33, these studies are limited to 2D chest X-rays and do not address the complexities of 3D imaging, which is more relevant to actual diagnostics and treatment. Evaluating VLFMs in 3D imaging requires more sophisticated metrics.

Studies29,34,35,36 have highlighted the flaws in current metrics. BLEU struggles with identifying false findings, while BERTScore has higher errors in locating findings compared to CheXbert. These issues underscore the need for improved evaluation methods. The proposed RadCliQ29 metric combines existing metrics with linear regression to create a more balanced evaluation model. However, RadCliQ’s reliance on text overlap-based metrics and its testing on 2D datasets reveal limitations when applied to 3D imaging.
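The core idea behind RadCliQ, regressing from several automatic metric scores to radiologist-assessed error counts, can be sketched in a few lines. The numbers below are toy placeholders purely for shape, not RadCliQ's published weights or data:

```python
# Illustrative RadCliQ-style composite: fit a linear model from existing
# metric scores to radiologist error counts. All numbers are toy
# placeholders, not the published model or data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows: reports; columns: e.g., BLEU, BERTScore, CheXbert label agreement.
X = np.array([[0.31, 0.85, 0.70],
              [0.12, 0.78, 0.55],
              [0.45, 0.91, 0.88]])
y = np.array([2.0, 4.0, 1.0])  # radiologist-annotated errors per report

combiner = LinearRegression().fit(X, y)
print(combiner.predict(X))  # lower predicted error = better report
```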

Future research should focus on developing metrics that accurately assess the clinical relevance of the generated reports, including diagnostic accuracy and terminology appropriateness. Advanced NLP techniques could also compare generated reports with a database of clinician-validated reports. By improving these metrics, researchers can establish more effective benchmarks for VLFMs in 3D medical imaging, ensuring the generated reports are not only accurate but also valuable in clinical practice, thereby enhancing patient care and diagnostic processes.

Controversy over simulated/synthesized 3D data for model training

The use of simulated or augmented 3D data in training VLFMs for medical applications presents significant challenges alongside its benefits. While it fills gaps in data availability, especially for rare or complex diagnostic scenarios, there is substantial debate about its reliability. Critics37,33 argue that because simulated data do not originate from actual clinical experiences, they may not accurately reflect the complexities and variability necessary for training truly effective medical AI models.

Despite these concerns, advancements in generative AI offer promising solutions to enhance the reliability and quality control of synthesized data38,39,40. High-quality, AI-generated 3D datasets can now mimic real-world data with greater fidelity, narrowing the gap between simulated and actual clinical scenarios. This evolution in data generation technology allows existing 3D datasets to be augmented, enhancing the breadth and depth of training environments for VLFMs without compromising the integrity of the models. By utilizing sophisticated generative techniques, including 3D-GANs41, 3D diffusion probabilistic models42, and Neural Radiance Fields43, developers can create augmented 3D datasets that are not only diverse but also closely aligned with real clinical conditions. These 3D datasets provide a controlled environment for testing and validating VLFMs, ensuring that the models are exposed to a wide range of medical scenarios, including rare and complex conditions. This approach not only enhances the model’s diagnostic capabilities but also ensures that the VLFMs are robust and reliable, thereby potentially improving healthcare outcomes through more accurate and comprehensive medical imaging reports.
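To give a flavor of the generative side, the toy sketch below is a GAN-style decoder only: it upsamples a latent code into a small synthetic volume. Real 3D-GANs and diffusion models are far larger and include a critic or a noise schedule; everything here is illustrative.

```python
# Toy sketch of a 3D generator head (GAN-style): latent code -> volume.
# Purely illustrative; not a production 3D-GAN or diffusion model.
import torch
import torch.nn as nn

gen = nn.Sequential(
    nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),  # 4^3 -> 8^3
    nn.BatchNorm3d(64), nn.ReLU(),
    nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),   # 8^3 -> 16^3
    nn.BatchNorm3d(32), nn.ReLU(),
    nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),    # 16^3 -> 32^3
    nn.Tanh(),
)
z = torch.randn(1, 128, 4, 4, 4)  # latent code
print(gen(z).shape)               # (1, 1, 32, 32, 32) synthetic mini-volume
```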

Addressing computational and clinical challenges in 3D transformer-based models

The integration of 3D transformer-based vision-language models into medical imaging represents a significant advancement, holding considerable promise for improved diagnostic accuracy and clinical reporting. However, the translation of these models from research prototypes to clinically deployable solutions faces computational and clinical challenges that must be thoroughly addressed.

A primary obstacle to the widespread clinical adoption of traditional 3D transformers is their computational complexity. Specifically, these models suffer from quadratic memory and computational requirements associated with self-attention mechanisms, especially when processing high-resolution volumetric medical data44,45. Due to these constraints, practical implementations often necessitate significant compromises in input data resolution, thereby reducing the ability to capture fine-grained anatomical details essential for clinical diagnostics, such as subtle abnormalities, delicate vascular structures, nodules, and small lesions46,47,48. Consequently, these limitations directly impact the clinical applicability and diagnostic utility of transformer-based vision-language systems.
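A back-of-envelope calculation makes the quadratic blow-up tangible (the resolution, patch size, and precision below are illustrative):

```python
# Why full self-attention on volumes is costly: token count grows with
# volume, and attention memory grows with its square. Numbers illustrative.
D = H = W = 256                           # voxels per axis
p = 16                                    # cubic patch edge
tokens = (D // p) * (H // p) * (W // p)   # 16^3 = 4096 tokens
attn = tokens ** 2                        # ~16.8M entries per head per layer
print(tokens, attn, f"{attn * 2 / 1e6:.0f} MB at fp16 per head per layer")
# Halving the patch edge to 8 gives 32768 tokens -> ~2.1 GB per head/layer.
```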

To overcome these computational challenges, recent developments in transformer architectures have introduced promising solutions. Hierarchical 3D Swin transformers, for example, employ multi-scale hierarchical structures and locality-sensitive computations to substantially reduce memory usage without sacrificing model accuracy49. Additionally, sparse attention strategies, such as axial attention50 and windowed attention mechanisms51, have emerged as effective methods to selectively prioritize computations, significantly decreasing resource requirements while maintaining robust performance. Furthermore, hybrid CNN-transformer models exploit the complementary strengths of convolutional neural networks and transformers, combining efficient local feature extraction from CNNs with the global contextual understanding facilitated by transformers, thereby achieving a practical balance between computational efficiency and clinical efficacy52,53. Recent advances, such as the hierarchical attention approach proposed by Zhou et al.54, exemplify these improvements by considerably reducing memory usage and computational demands. This strategy enables the analysis of higher-resolution volumetric data, resulting in improved performance in clinically relevant segmentation tasks, including pulmonary vessel and airway segmentation54. Such innovative solutions demonstrate significant progress toward resolving existing barriers, yet further advancements remain essential.
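The essence of the windowed variant fits in a few lines: attention is restricted to non-overlapping local windows, so cost scales with the window volume rather than the full token count squared. The sketch below is a simplified single-head illustration, without the projections, window shifting, or relative position bias of a real Swin-style block:

```python
# Minimal sketch of 3D windowed attention: attend only within local
# w^3 windows, so cost scales with tokens * w^3 instead of tokens^2.
import torch
import torch.nn.functional as F

def windowed_attention_3d(x, w: int = 4):
    # x: (B, D, H, W, C) with D, H, W divisible by w
    B, D, H, W, C = x.shape
    # Partition into non-overlapping w^3 windows: (B*num_windows, w^3, C)
    x = x.view(B, D // w, w, H // w, w, W // w, w, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, w ** 3, C)
    # Plain attention restricted to each window (one head, no projections)
    return F.scaled_dot_product_attention(x, x, x)

x = torch.randn(1, 8, 8, 8, 32)
print(windowed_attention_3d(x).shape)  # (8, 64, 32): 8 windows of 64 tokens
```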

To address ongoing computational and clinical challenges comprehensively, future research should prioritize several key directions. Enhanced hierarchical attention methods capable of dynamically allocating computational resources based on clinical significance represent a promising area of exploration. Additionally, the development of transformer architectures specifically tailored for medical imaging, incorporating domain knowledge to optimize performance and efficiency, remains a critical research priority. Finally, investigating memory-efficient training paradigms, including model distillation, quantization, and efficient training strategies, will be crucial to improving practical feasibility. By explicitly recognizing and systematically addressing these computational and clinical limitations, this review aims to provide valuable insights and actionable guidance for researchers committed to developing practical, efficient, and clinically impactful 3D transformer-based vision-language models in radiology.
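Of these, post-training quantization is the most immediately accessible; for example, PyTorch's dynamic quantization converts a trained model's linear layers to int8 in one call. The sketch below uses a dummy stand-in model; a real deployment would benchmark the accuracy impact before and after.

```python
# Minimal sketch: dynamic int8 quantization of linear layers with
# PyTorch's built-in utility. The model here is a dummy stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 128))
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # Linear layers replaced by dynamically quantized versions
```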

Lessons learned from 2D VLFMs and a roadmap for future research on 3D VLFMs

The development and deployment of VLFMs in clinical environments represent promising yet challenging objectives. Although 2D VLFMs have exhibited encouraging results in controlled research environments, their real-world applicability and acceptance in clinical workflows have thus far remained limited. Understanding these limitations offers valuable insights that can guide the development and eventual clinical translation of more complex 3D VLFMs.

Several critical insights have been identified from examining the limitations inherent to existing 2D VLFMs. First, interpretability and clinical trust present significant challenges. Despite high quantitative performance demonstrated in research studies, clinicians frequently express reservations regarding the interpretability and transparency of these models’ decisions, underscoring the necessity of incorporating explainability methods, uncertainty quantification, and clear visual justifications into model predictions. Second, the issue of domain generalization and robustness is a crucial barrier. Models trained on datasets from specific institutions or limited imaging protocols often struggle to generalize effectively to diverse clinical environments. This limitation highlights the importance of robust training methodologies, domain adaptation techniques, and comprehensive validation on heterogeneous datasets. Third, computational efficiency and practical feasibility remain pivotal for real-world deployment. Despite being simpler than 3D models, current 2D VLFMs still encounter challenges related to computational demands, limiting their integration into clinical workflows, especially in resource-constrained settings. Lastly, regulatory approval and ethical considerations significantly impact clinical adoption. Factors such as patient privacy, data security, and algorithmic biases must be comprehensively addressed to facilitate the real-world integration of these models into healthcare practice. These lessons and their corresponding implications for clinical adoption are summarized in Table 3.

Table 3 Summary of limitations of 2D vision-language foundation models (VLFMs), example studies, and clinical impact

Based on these insights from 2D VLFMs, we propose the following practical roadmap for future research and clinical integration of 3D VLFMs, which is shown in Fig. 4. Explicitly integrating these considerations into the research agenda will significantly enhance the likelihood of translating 3D VLFMs into clinically valuable tools, ultimately improving patient care through more precise and insightful medical image interpretation.

Fig. 4: Future Roadmap for Clinical Integration of 3D VLFMs.

The practical roadmap for future research and clinical integration of 3D VLFMs, including short-term, medium-term and long-term goals.

In the short term (1–2 years), efforts should prioritize establishing technical feasibility and computational efficiency. This involves reducing computational complexity by developing memory-efficient transformer architectures, such as hierarchical (e.g., Swin), axial, or sparse attention mechanisms, specifically tailored for medical 3D imaging. Additionally, exploring hybrid modeling strategies that integrate transformer models with convolutional neural networks will help balance interpretability, accuracy, and computational demands. Initial validations can be conducted using standard public medical imaging datasets, focusing on benchmarking tasks such as fine-grained abnormality detection and segmentation (e.g., lung nodules, brain tumors).

The medium-term goals (2–4 years) should focus on enhancing clinical relevance and interpretability. This phase includes integrating transformer models with advanced explainability techniques, including attention visualization and feature attribution methods. Furthermore, close collaboration with clinical experts is essential to clearly define realistic clinical scenarios such as preliminary report generation, screening assistance, and triage prioritization. The robustness and generalizability of models must be thoroughly validated through multi-institutional studies that encompass diverse populations and imaging modalities, alongside establishing standardized reporting criteria for consistent evaluation.

In the long term (4–7 years), extensive prospective clinical validation studies are essential, involving large-scale, multicenter clinical trials to rigorously demonstrate the clinical effectiveness, safety, and value of these models. Concurrently, proactive engagement with regulatory bodies (e.g., FDA, CE marking) is crucial to ensure compliance with ethical standards, transparency, fairness, bias mitigation, and reproducibility. Finally, successful integration into clinical workflows requires the development of interoperability standards, facilitating seamless integration with existing healthcare systems such as PACS, RIS, and EMR. Post-deployment, continuous monitoring and evaluation must be maintained to ensure sustained clinical benefit and system reliability.

Overall, the primary target scenarios for VLFMs are realistic, practical applications rather than full automation of medical diagnosis. Specifically, VLFMs are well suited to supportive roles such as preliminary anomaly detection and triage assistance to streamline clinical workflows, as well as assisting radiologists in generating structured clinical reports by automating routine descriptive tasks. Additionally, patient-focused applications involve providing simplified, patient-friendly explanations of imaging results, enhancing patient understanding and health literacy. Emphasizing interpretability, robust validation, clinician collaboration, and integration into existing clinical workflows will ensure these models have meaningful clinical impact and broad acceptance.





Inside Austin’s Gauntlet AI, the Elite Bootcamp Forging “AI First” Builders


In the brave new world of artificial intelligence, talent is the new gold, and companies are in a frantic race to find it. While universities work to churn out computer science graduates, a new kind of school has emerged in Austin to meet the insatiable demand: Gauntlet AI.

Gauntlet AI bills itself as an elite training program. It’s a high-stakes, high-reward process designed to forge “AI-first” engineers and builders in a matter of weeks.

“We’re closer to Navy SEAL bootcamp training than a school,” said Ash Tilawat, Head of Product and Learning. “We take the smartest people in the world. We bring them into the same place for 1,000 hours over ten weeks and we make them go all in with building with AI.”

Austen Allred, the co-founder and CEO of Gauntlet AI, says when they claim to be looking for the smartest engineers in the world, it’s no exaggeration. The selection process is intensely rigorous.

“We accept around 2 percent of the applicants,” Allred explained. “We accept 98th percentile and above of raw intelligence, 95th percentile of coding ability, and then you start on The Gauntlet.”


The price of admission isn’t paid in dollars—there are no tuition fees. Instead, the cost is a student’s absolute, undivided attention.

“It is pretty grueling, but it’s invigorating and I love doing this,” said Nataly Smith, one of the “Gauntlet Challengers.”

Smith, whose passions lie in biotech and space, recently channeled her love for bioscience to complete one of the program’s challenges. Her team was tasked with building a project called “Geno.”

“It’s a tool where a person can upload their genomic data and get a statistical analysis of how likely they are to have different kinds of cancers,” Smith described.

Incredibly, her team built the AI-powered tool in just one week.

The ultimate prize waiting at the end of the grueling 10-week gauntlet is a guaranteed job offer with a starting salary of at least $200,000 a year. And hiring partners are already lining up to recruit challengers like Nataly.

“We very intentionally chose to partner with everything from seed-stage startups all the way to publicly traded companies,” said Brett Johnson, Gauntlet’s COO. “So Carvana is a hiring partner. Here in Austin, we have folks like Function Health. We have the Trilogy organization; we have Capital Factory just around the corner. We’re big into the Austin tech community and looking to double down on that.”

In a world desperate for skilled engineers, Gauntlet AI isn’t just training people; it’s manufacturing the very talent pipeline it believes will power the next wave of technological innovation.






Endangered languages AI tools developed by UH researchers



University of Hawaiʻi at Mānoa researchers have made a significant advance in studying how artificial intelligence (AI) understands endangered languages. This research could help communities document and maintain their languages, support language learning and make technology more accessible to speakers of minority languages.

The paper by Kaiying Lin, a PhD graduate in linguistics from UH Mānoa, and Department of Information and Computer Sciences Assistant Professor Haopeng Zhang, introduces the first benchmark for evaluating large language models (AI systems that process and generate text) on low-resource Austronesian languages. The study focuses on three Formosan (Indigenous peoples and languages of Taiwan) languages spoken in Taiwan—Atayal, Amis and Paiwan—that are at risk of disappearing.

Using a new benchmark called FORMOSANBENCH, Lin and Zhang tested AI systems on tasks such as machine translation, automatic speech recognition and text summarization. The findings revealed a large gap between AI performance on widely spoken languages such as English and on these smaller, endangered languages. Even when AI models were given examples or fine-tuned with extra data, they struggled to perform well.

“These results show that current AI systems are not yet capable of supporting low-resource languages,” Lin said.

Zhang added, “By highlighting these gaps, we hope to guide future development toward more inclusive technology that can help preserve endangered languages.”

The research team has made all datasets and code publicly available to encourage further work in this area. A preprint of the study is available online, and the study has been accepted to the 2025 Conference on Empirical Methods in Natural Language Processing, an internationally recognized premier AI conference, to be held in Suzhou, China.

The Department of Information and Computer Sciences is housed in UH Mānoa’s College of Natural Sciences, and the Department of Linguistics is housed in UH Mānoa’s College of Arts, Languages & Letters.






OpenAI reorganizes research team behind ChatGPT’s personality



OpenAI is reorganizing its Model Behavior team, a small but influential group of researchers who shape how the company’s AI models interact with people, TechCrunch has learned.

In an August memo to staff seen by TechCrunch, OpenAI’s chief research officer Mark Chen said the Model Behavior team — which consists of roughly 14 researchers — would be joining the Post Training team, a larger research group responsible for improving the company’s AI models after their initial pre-training.

As part of the changes, the Model Behavior team will now report to OpenAI’s Post Training lead Max Schwarzer. An OpenAI spokesperson confirmed these changes to TechCrunch.

The Model Behavior team’s founding leader, Joanne Jang, is also moving on to start a new project at the company. In an interview with TechCrunch, Jang says she’s building out a new research team called OAI Labs, which will be responsible for “inventing and prototyping new interfaces for how people collaborate with AI.”

The Model Behavior team has become one of OpenAI’s key research groups, responsible for shaping the personality of the company’s AI models and for reducing sycophancy — which occurs when AI models simply agree with and reinforce user beliefs, even unhealthy ones, rather than offering balanced responses. The team has also worked on navigating political bias in model responses and helped OpenAI define its stance on AI consciousness.

In the memo to staff, Chen said that now is the time to bring the work of OpenAI’s Model Behavior team closer to core model development. By doing so, the company is signaling that the “personality” of its AI is now considered a critical factor in how the technology evolves.

In recent months, OpenAI has faced increased scrutiny over the behavior of its AI models. Users strongly objected to personality changes made to GPT-5, which the company said exhibited lower rates of sycophancy but seemed colder to some users. This led OpenAI to restore access to some of its legacy models, such as GPT-4o, and to release an update to make the newer GPT-5 responses feel “warmer and friendlier” without increasing sycophancy.


OpenAI and all AI model developers have to walk a fine line to make their AI chatbots friendly to talk to but not sycophantic. In August, the parents of a 16-year-old boy sued OpenAI over ChatGPT’s alleged role in their son’s suicide. The boy, Adam Raine, confided some of his suicidal thoughts and plans to ChatGPT (specifically a version powered by GPT-4o), according to court documents, in the months leading up to his death. The lawsuit alleges that GPT-4o failed to push back on his suicidal ideations.

The Model Behavior team has worked on every OpenAI model since GPT-4, including GPT-4o, GPT-4.5, and GPT-5. Before starting the unit, Jang previously worked on projects such as Dall-E 2, OpenAI’s early image-generation tool.

Jang announced in a post on X last week that she’s leaving the team to “begin something new at OpenAI.” The former head of Model Behavior has been with OpenAI for nearly four years.

Jang told TechCrunch she will serve as the general manager of OAI Labs, which will report to Chen for now. However, it’s early days, and it’s not clear yet what those novel interfaces will be, she said.

“I’m really excited to explore patterns that move us beyond the chat paradigm, which is currently associated more with companionship, or even agents, where there’s an emphasis on autonomy,” said Jang. “I’ve been thinking of [AI systems] as instruments for thinking, making, playing, doing, learning, and connecting.”

When asked whether OAI Labs will collaborate on these novel interfaces with former Apple design chief Jony Ive — who’s now working with OpenAI on a family of AI hardware devices — Jang said she’s open to lots of ideas. However, she said she’ll likely start with research areas she’s more familiar with.

This story was updated to include a link to Jang’s post announcing her new position, which was released after this story was published. We also clarified the models that OpenAI’s Model Behavior team worked on.






