
Democratizing cost-effective, agentic artificial intelligence for multilingual medical summarization through knowledge distillation

Data source

To overcome the lack of accessible, vetted medical conversation datasets in Arabic, we adopted a synthetic data generation approach common to many studies, including an investigation by Al-Mutairi et al., which showed that synthetic data created by LLMs can suffice as a data source for training and validating AI models6,11,12.

A total of 4,000 Arabic medical conversation transcripts and corresponding ground-truth medical summaries were generated using GPT-4o. The primary criteria for generating each simple clinical vignette featured in the interactive transcripts are summarized in Supplementary Table S1, including medical conditions, patient demographics, symptom characteristics, clinical details, family history, and environmental and social factors. By incorporating these expanded variables, the synthetic dataset samples real-world clinical diversity, enabling the model to learn from a range of scenarios and produce accurate, contextually relevant summaries.
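As a rough illustration, the generation step could be scripted as in the sketch below. This is a minimal sketch assuming the standard OpenAI Python client; the vignette fields, prompt wording, and delimiter are illustrative stand-ins for the criteria in Supplementary Table S1, not the authors' exact pipeline.

```python
# Minimal sketch of vignette-driven synthetic transcript generation.
# Assumes the standard OpenAI Python client; prompt fields are illustrative.
import json
import random

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical vignette variables mirroring Supplementary Table S1
CONDITIONS = ["type 2 diabetes", "asthma", "hypertension"]

def generate_sample() -> dict:
    vignette = {
        "condition": random.choice(CONDITIONS),
        "age": random.randint(18, 85),
        "sex": random.choice(["male", "female"]),
    }
    prompt = (
        "Write a realistic patient-physician conversation in Arabic for the "
        f"following vignette: {json.dumps(vignette, ensure_ascii=False)}. "
        "Then write a concise ground-truth medical summary in Arabic, "
        "separated by the line '---SUMMARY---'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    transcript, _, summary = response.choices[0].message.content.partition(
        "---SUMMARY---"
    )
    return {"transcript": transcript.strip(), "summary": summary.strip()}

dataset = [generate_sample() for _ in range(4000)]
```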

A random subset of the synthetic dataset (~20%) was translated into English and then checked for quality assurance by a medical student experienced in both patient management and clinical research. This established confidence in the accuracy of the dataset and introduced accountable variability, allowing robust learning, mimicry of human learning, and resistance to spurious correlations13,14.

This study did not involve human participants or identifiable personal data and was therefore not subject to ethics review under 45 CFR 46.102. The dataset was fully synthetic, generated by an AI model (GPT-4o) without any use of real patient information or records, consistent with applicable legislation and other peer-reviewed studies. Under U.S. regulations, research involving no human subjects or identifiable health information requires neither IRB approval nor informed consent, in accordance with 45 CFR 164. Likewise, under EU law, the synthetic data did not constitute personal data per GDPR Recital 26, Article 4, and Article 10 of the EU AI Act, and thus fell outside the scope of data protection and consent requirements. No protected health information (PHI) or real human subjects as defined by HIPAA were used in this work to train, fine-tune, or validate any models. In line with regulatory guidance, our use of in silico synthetic medical data posed no privacy risk and warranted no ethics board oversight.

Development of AraSum SLM agent through knowledge distillation framework

AraSum was created by leveraging a knowledge distillation framework (Supplementary Figure S2) to transform large, multilingual language models into a compact student model optimized for the nuanced task of summarizing patient information in Arabic. This approach retains the performance characteristics of the teacher models while reducing computational complexity, making it suitable for deployment in resource-constrained environments15,16,17,18,19,20,21,22. After standard pre-processing for whitespace and null entries, a tokenizer capable of handling Arabic text was used to enable accurate semantic and grammatical processing by the model23.

Our study employed a multi-teacher distillation approach to enhance Arabic medical text summarization by leveraging two complementary multilingual Transformer models. Teacher Model A, facebook/mbart-large-50-many-to-many-mmt, is a multilingual encoder-decoder Transformer with 12 encoder and 12 decoder layers, pre-trained on a 50-language corpus including Arabic and chosen for its strong multilingual summarization performance. Teacher Model B, google/mt5-large, is a sequence-to-sequence Transformer with approximately 1.2 billion parameters, pre-trained on the multilingual mC4 corpus, which handles morphologically rich languages such as Arabic effectively. During training, logits from both teachers were generated independently on the synthetic Arabic medical transcripts and combined via weighted averaging based on validation performance, creating a unified teacher signal.
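A minimal sketch of the weighted-averaging step is shown below, assuming PyTorch and teacher logits already aligned to a shared vocabulary (mBART-50 and mT5 use different tokenizers, so a real pipeline would need a vocabulary-alignment step first); the 0.6/0.4 weights are hypothetical placeholders for the validation-derived weights.

```python
import torch
import torch.nn.functional as F

# Hypothetical validation-derived weights; the paper does not report the values.
W_A, W_B = 0.6, 0.4

def combine_teacher_logits(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Weighted average of two teachers' logits into a unified teacher signal.

    Assumes both tensors have shape (batch, seq_len, vocab) over a shared
    vocabulary; aligning mBART-50 and mT5 vocabularies is left out of scope.
    """
    return W_A * logits_a + W_B * logits_b

def teacher_distribution(logits_a: torch.Tensor, logits_b: torch.Tensor,
                         temperature: float = 2.0) -> torch.Tensor:
    """Softened unified distribution used as the distillation target."""
    combined = combine_teacher_logits(logits_a, logits_b)
    return F.softmax(combined / temperature, dim=-1)
```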

The student model, AraSum, specifically targets Arabic medical summarization tasks with a Transformer architecture comprising 12 encoder and 8 decoder layers. AraSum was partially initialized using down-projected weights from teacher models to retain essential lexical and contextual knowledge, thereby accelerating convergence and improving overall performance24. Additionally, a specialized SentencePiece tokenizer enriched with medical vocabulary and Arabic diacritics was developed, enhancing AraSum’s linguistic precision.
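Tokenizer training of this kind could look roughly as follows, using the sentencepiece package; the corpus file name, vocabulary size, and medical term list are assumptions for illustration.

```python
import sentencepiece as spm

# Hypothetical single-word Arabic medical terms forced in as whole tokens
medical_terms = ["السكري", "الربو", "صداع"]  # diabetes, asthma, headache

spm.SentencePieceTrainer.train(
    input="arabic_medical_corpus.txt",   # hypothetical training corpus
    model_prefix="arasum_sp",
    vocab_size=32000,                    # assumed vocabulary size
    character_coverage=1.0,              # full coverage keeps Arabic diacritics intact
    user_defined_symbols=medical_terms,  # enrich with medical vocabulary
)

sp = spm.SentencePieceProcessor(model_file="arasum_sp.model")
```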

Our training approach leveraged a multi-teacher distillation framework employing both Kullback-Leibler (KL) divergence and cross-entropy loss functions to guide AraSum in mimicking the ensemble behavior of the teacher models. AraSum was trained on the synthetic Arabic medical transcripts with a global batch size of 64 sequences per update, achieved by processing mini-batches of 16 sequences over 4 gradient accumulation steps. Training was conducted with a peak learning rate of 1 × 10⁻⁴ and a warm-up phase of 500 steps. The model typically underwent 8–10 epochs of training, with early stopping triggered if the validation loss plateaued over two consecutive epochs. Regularization included a dropout rate of 0.1 in both the attention and feed-forward sublayers, complemented by a weight decay of 0.01 to improve generalization. Training was executed in parallel on multiple NVIDIA A100 GPUs25 using PyTorch’s Distributed Data Parallel (DDP) framework. Each training cycle typically lasted between 5 and 8 hours, depending on dataset complexity and the total number of epochs.
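A condensed sketch of one such training step follows, assuming PyTorch; the distillation temperature and the KL/cross-entropy mixing weight are assumed values not reported in the paper, and `student`, `loader`, and the unified teacher signal are presumed to come from the setup described above.

```python
import torch
import torch.nn.functional as F

TEMPERATURE = 2.0  # assumed softening temperature
ALPHA = 0.5        # assumed KL vs. cross-entropy mixing weight
ACCUM_STEPS = 4    # 16-sequence mini-batches x 4 steps = global batch of 64

def distillation_loss(student_logits, teacher_logits, labels):
    """KL divergence against the unified teacher plus cross-entropy on labels."""
    kl = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * TEMPERATURE**2
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # skip padding positions
    )
    return ALPHA * kl + (1 - ALPHA) * ce

# `student`, `loader`, and `unified_teacher_logits` come from the setup above.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4, weight_decay=0.01)
optimizer.zero_grad()
for step, batch in enumerate(loader):
    loss = distillation_loss(
        student(**batch).logits, unified_teacher_logits(batch), batch["labels"]
    ) / ACCUM_STEPS  # scale so accumulated gradients match a batch of 64
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```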

For validation, performance was systematically evaluated using metrics such as ROUGE and BLEU scores. Checkpoints were monitored continuously, and the best-performing model was selected based on the highest F1 score on a held-out validation set. Our current implementation centers on a single integrated AraSum agent: a unified system internally orchestrates the various tasks associated with medical note generation and summarization, rather than relying on multiple discrete agent entities.

Evaluation of model accuracy using established quantitative metrics

The synthetic dataset was split at a 90:10 ratio into training and validation sets. AraSum and JAIS-30B were both tasked, using zero-shot prompting, with creating summaries of each clinical conversation transcript. To ensure clarity and grammatical accuracy in the clinical summaries, the model and tokenizer were configured to preserve and generate Arabic diacritics (Tashkeel/Harakat) as needed. The following quantitative metrics were calculated from the summaries generated on the validation set.
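In code, the split and a zero-shot prompt might look like the following sketch; the random seed and the exact Arabic prompt wording are assumptions.

```python
import random

random.seed(42)                  # assumed seed; not reported in the paper
random.shuffle(dataset)          # `dataset`: the 4,000 synthetic samples
cut = int(len(dataset) * 0.9)    # 90:10 train/validation split
train_set, val_set = dataset[:cut], dataset[cut:]

# Hypothetical zero-shot prompt applied identically to AraSum and JAIS-30B
PROMPT = "لخّص المحادثة الطبية التالية في ملاحظة سريرية موجزة:\n{transcript}"
```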

The “Evaluate” library26 was used to calculate all of the following scores for each sample. Clinical content recall was defined as the proportion of relevant clinical information from the ground-truth summary accurately captured in the AI-generated summary. Salient clinical items were extracted from each conversation into an inventory, and recall was calculated by dividing the number of correctly included items from the inventory by the total number of relevant items. Clinical content precision was defined as the proportion of information that was both accurate and relevant when compared to the ground truth; it was calculated by dividing the number of correctly included items in the summary by the total number of items, including any additional or incorrect items. This metric reflects the accuracy and relevance of the content without rewarding extraneous or inaccurate details. The F1 score was used as a balanced metric combining clinical content precision and recall into a single measure of the AI-generated summaries’ performance27: it is the harmonic mean of precision and recall, ensuring that both the accuracy of the captured information (precision) and its completeness (recall) are considered. Finally, ROUGE28, BLEU29, and BERTScore F130 were also calculated following their original methodologies.
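The automatic metrics can be reproduced roughly as follows with the Hugging Face “Evaluate” library; the field names on the validation samples and the item-inventory helper are illustrative.

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

preds = [s["generated"] for s in val_set]  # hypothetical field names
refs = [s["summary"] for s in val_set]

rouge_scores = rouge.compute(predictions=preds, references=refs)
bleu_scores = bleu.compute(predictions=preds, references=[[r] for r in refs])
bert_scores = bertscore.compute(predictions=preds, references=refs, lang="ar")

def clinical_prf(generated_items: set, truth_items: set):
    """Clinical content precision, recall, and F1 from salient-item inventories."""
    correct = len(generated_items & truth_items)
    precision = correct / len(generated_items) if generated_items else 0.0
    recall = correct / len(truth_items) if truth_items else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```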

Measuring clinical utility with Arabic-fluent evaluators and modified performance inventories

Three random transcripts of synthetic patient-physician conversations were extracted from the dataset, along with their respective ground truths and AraSum- and JAIS-generated summaries (Supplementary Figure S3). Eight Arabic-speaking evaluators were consulted for this study, including four healthcare professionals: two medical students (one of whom was a former Arabic medical interpreter), a radiologist, and a clinical professor. Consent to collect and publish the evaluators’ feedback data was obtained. Each evaluator was provided the transcripts and the JAIS- and AraSum-generated summaries, blinded to the source. Summary quality was evaluated using a modified version of the Physician Documentation Quality Instrument 9 (PDQI-9). The original PDQI-9 employs a 5-point Likert scale across nine attributes to assess note quality; Tierney et al. modified it into a ten-item inventory to better fit the metrics relevant to ambient AI documentation, and it is widely used to evaluate AI-generated clinical notes31,32. Three additional language-specific attributes (syntactic proficiency, domain-specific linguistic precision, and cultural competence) were added to the inventory for the Arabic evaluators to assess the model’s ability to generate language as native Arabic speakers do. The attributes evaluated in this modified PDQI-9 are detailed in Table 1.

Statistical analysis

For comparing ROUGE and BLEU scores, the distribution of each scored metric was first tested for normality using the Shapiro-Wilk test; a paired non-parametric comparison between the two models was then conducted for each metric using the Wilcoxon signed-rank test in GraphPad Prism version 10.4.1 for macOS (GraphPad Software, Boston, MA, www.graphpad.com). Cohen’s d for paired data was calculated by dividing the mean of the paired differences by the standard deviation of the paired differences. The rank-biserial correlation was computed as the difference between the proportion of pairs in which AraSum outperformed JAIS and the proportion in which JAIS outperformed AraSum, divided by the total number of paired comparisons. Both analyses were conducted in Python (ver. 3.9). For comparing attribute scores from the modified PDQI-9 inventory, normality was assumed via the central limit theorem and the law of large numbers, and results were compared using a paired Student’s t-test in Microsoft Excel. All graphs were generated using GraphPad Prism or the Seaborn and Matplotlib libraries in Python.
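The per-metric comparison reduces to a few lines of Python; `arasum_scores` and `jais_scores` below stand in for the paired per-sample values of any one metric.

```python
import numpy as np
from scipy.stats import shapiro, wilcoxon

# Paired per-sample scores for one metric (e.g., ROUGE-1); names are illustrative.
arasum = np.asarray(arasum_scores)
jais = np.asarray(jais_scores)
diff = arasum - jais

print(shapiro(diff))           # Shapiro-Wilk normality test on paired differences
print(wilcoxon(arasum, jais))  # Wilcoxon signed-rank test (paired, non-parametric)

cohens_d = diff.mean() / diff.std(ddof=1)    # paired Cohen's d
wins = (diff > 0).sum()
losses = (diff < 0).sum()
rank_biserial = (wins - losses) / len(diff)  # rank-biserial correlation
```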



Agencies and industry announce efforts to further Presidential AI Challenge

First Lady Melania Trump and multiple cabinet leaders on Thursday unveiled the next steps in the White House’s Presidential AI Challenge — a program mandated in an April executive order and launched Aug. 26 — and outlined how the Trump administration plans to keep the U.S. at the forefront of AI innovation and education.

The remarks were made at the second White House Task Force on Artificial Intelligence Education meeting and were accompanied by pledges from government agencies and the private sector to advance AI education, as mandated by the order.

“We are here today to talk about our future in the most real sense imaginable: how America’s children can be prepared to build our country tomorrow with the cutting edge tools of today,” White House Office of Science and Technology Policy Director Michael Kratsios said during the meeting. “We are proud and grateful to announce new steps in fulfilling the mission of this task force and the president’s vision for this AI challenge.”

Those upcoming steps include the release of toolkits, webinars, classroom guides and more, as well as agency action items intended to help cultivate a strong American foundation in AI education within academia and the workforce. These include sector-specific, applied AI training materials and ways to incorporate AI in American classrooms.

“Our goal is to empower states and schools to begin exploring AI integration in a way that works best for their communities,” Education Secretary Linda McMahon said during the meeting. “Ed is fully aligned with the Presidential AI Challenge, and is encouraging students and educators to explore AI technologies with curiosity and with creativity. It’s not one of those things to be afraid of. Let’s embrace it.”

Secretary of Agriculture Brooke Rollins spotlighted the expansive partnerships between the agency and external entities to bring AI systems into agrarian workflows.

“Far too often for those living and working in our rural parts of our country, that often those are left behind and do not always have the same access to the most recent technological innovations that our urban counterparts across the country do,” Rollins said. “We cannot let that happen with AI.”

USDA will focus on bringing AI systems into agricultural workflows and education, particularly for predictive analyses based on existing agriculture knowledge and data. Sensor systems, robotics and automation are all areas that are slated to modernize the agricultural industry, with help from private sector partners like Microsoft and academia, including Iowa State University and Texas State University. 

Secretary of Labor Lori Chavez-DeRemer said her agency is expanding AI access and literacy through several vehicles — notably via apprenticeship opportunities, part of Labor and Commerce’s joint Talent Strategy that was released earlier in August. 

“On-the-job training programs will help fill the mortgage paying jobs that AI will create, while also enhancing the unique skills required to succeed in various industries,” Chavez-DeRemer said. “Expanding these opportunities is a key component of our strategy to reach the president’s goal of 1 million new, active apprentices across the United States.”

Chavez-DeRemer also previewed pending partnerships to help disseminate AI education and training materials across the country, along with future best practices for effective AI literacy training. 

Several private sector companies were also in attendance to explain their commitments towards supporting the initiative, noting that developing and expanding AI education is necessary to keep up with the demands of the growing AI-centric labor market. Alphabet CEO Sundar Pichai and IBM CEO Arvind Krishna announced their companies’ individual billion- and million-dollar commitments, respectively, to bolster AI education within academia and the existing workforce.

“This is all in the service of helping the next generation to solve problems, fuel innovation and build an incredible future,” Pichai said. “These are all goals we all share. We are incredibly thankful for the partnership and the leadership from the first lady, the president and the administration, and for showing us the way.”

The updates to the Presidential AI Challenge reflect the Trump administration’s no-holds-barred approach to both incorporating AI and machine learning into the government and ensuring the U.S. will lead in new AI technologies at the global level.





UWF receives $100,000 grant from Air Force to advance AI and robotics research

PENSACOLA, Fla. — The University of West Florida was just awarded a major grant to help develop cutting-edge artificial intelligence technology.

The US Air Force Research Laboratory awarded $100,000 to UWF’s Intelligent Systems and Robotics doctorate program.

The grant supports research in Artificial Intelligence and robotics while training PhD students.

The funding was awarded to explore how these systems can support military operations, but also how they can be applied to issues we could face locally, such as disasters.

Unlike generative AI in apps like ChatGPT, this research focuses on “reinforcement learning.”

“It’s action-driven. It’s designed to produce strategies versus content and text or visual content,” said Dr. Kristen “Brent” Venable with UWF.

Dr. Venable is leading the research.

Her team is designing simulations that teach autonomous systems like robots and drones how to adapt to the environment around them without human help — enabling the drones to make a decision on their own.

“So if we deployed them and let them go autonomously, sometimes far away, they should be able to decide whether to communicate, whether to go in a certain direction,” she said.

The initial goal of the grant is to help the US military leverage machine learning.

But Dr. Venable says the technology has potential to help systems like local emergency management during a disaster.

“You can see how this could be applied for disaster response,” she said. “Think about having some drones that have to fly over a zone and find people to be rescued or assets that need to be restored.”

Dr. Venable says UWF is poised to deliver on its promises to advance the technology.

The doctorate program was created with Pensacola’s Institute for Human and Machine Cognition, giving students access to world-class AI and robotics research.

Over the last five years, the program has expanded to more than 30 students.

“We are very well positioned because the way we are, in some sense, lean and mean is attractive to funding agencies,” Dr. Venable said. “Because we can deliver results while training the next generation.”

The local investment by the Air Force comes as artificial intelligence takes center stage nationally.

On Thursday, First Lady Melania Trump announced a presidential AI challenge for students and educators.

President Trump has also signed an executive order to expand AI education.

Dr. Venable says she’s confident the administration’s push for research will benefit the university’s efforts, as the one-year grant will only go so far.

“I think the administration is correctly identifying as a key factor in having the US lead on the research,” she said. “It’s a good seedling to start the conversation for one year.”

The research conducted at UWF and the IHMC is helping put the area on the map as an intelligence hub.

Dr. Venable says they’re actively discussing how to apply for more grants to help with this ongoing research.



NSF Seeks to Advance AI Research Via New Operations Center
