AI Research
Google invests $1 billion in AI education training over 3 years in US

Google is investing $1 billion to support artificial intelligence training and education initiatives in the U.S. over the next three years, underscoring how tech giants are working to scale access to their learning models and shape the future of education.
Google’s investment includes funding for new AI learning tools and the Google AI for Education Accelerator, a new initiative that will provide free AI training and Google Career Certificates to every U.S. college student at more than 100 universities and community colleges.
Students who are at least 18 years old will also get access to a free 12-month Google AI Pro plan, which includes advanced AI tools like NotebookLM for notes, Deep Research for custom reports, and Veo 3 for video generation.
“Guided Learning encourages participation through probing and open-ended questions that spark a discussion and provide an opportunity to dive deeper into a subject,” Google Vice President of Learning Maureen Heymans said, adding that the aim of the tool is to help someone “build a deep understanding instead of just getting answers by acting like your personal AI learning companion.”
Heymans said Guided Learning breaks down problems step-by-step and adapts explanations to your needs.
This follows the company’s June announcement of new Gemini tools for students and educators. Chief among them is Gemini for Education, a version of its Gemini app designed specifically for the “unique needs of the educational community,” according to Google’s blog.
Google says the tool is built to save time, support personalized learning and help students and teachers generate ideas and learn with confidence, all within a secure environment.
Gemini for Education, which is built with Gemini 2.5 Pro, gives students and educators access to the company’s premium AI models. It includes strong privacy protections and is managed by school administrators. To underscore its privacy safeguards, the company notes that user data is not human-reviewed and is not used to train AI models.

Google also added Gemini AI tools in Google Classroom to all schools using Google Workspace for Education at no cost. Google said it is also rolling out dozens of new features to help teachers work more efficiently, including tools that automatically create vocabulary lists with definitions and example sentences.
“Guided Learning represents an important step in our path to helping everyone in the world learn anything in the world,” Heymans said, adding that the company also recognizes “that the path forward is one of immense possibility and shared responsibility to ensure AI truly benefits all learners.”
Google is far from the only one targeting education. Microsoft also announced a slate of new AI features for educators in its Microsoft 365 Copilot, including Copilot Chat for teenage students.

The tech giant said that AI in education is advancing daily, with over 80% of surveyed educators using AI this year, which is up 21 points from last year. The company also noted that approximately one in three surveyed United States K-12 educators still lacks confidence in using AI effectively and responsibly. More than half of students surveyed say they have not received AI training.
“It’s critical to engage with students, educators, and all community stakeholders to address challenges, learn together, and co-develop the path forward. Further, we need to collectively prepare for an AI-powered future and support students in building relevant AI skills as every industry and discipline evolves,” Microsoft said in its June post.

In 2023, Amazon launched a new initiative, dubbed “AI Ready,” with a goal of providing free AI skills training and education to two million people globally by 2025. It reached this mark by 2024.
Amazon has previously underscored how the potential benefits of these skills are “enormous.” The company said in a 2024 blog that AI skills could boost productivity by at least 39% and increase salaries by up to 30%.
“But in order to fully harness this potential, organizations will need to address the pressing gap in AI-specific skills in the workforce,” Amazon said.
AI Research
A new ensemble heart attack diagnosis (EHAD) model using artificial intelligence techniques

In this segment, we consider the EHAD strategy, a novel hybrid diagnostic strategy for heart attack diagnosis. EHAD combines three major classifiers: a support vector machine (SVM)18, a long short-term memory network (LSTM)19, and an artificial neural network (ANN)20. These models are combined via the majority voting (MV) technique to arrive at a final decision on the commonly used heart disease dataset33, as depicted in Fig. 1. The SVM is a boundary-based classification model, the LSTM learns from temporal sequences, and the ANN is a general-purpose feedforward neural network. The three classifiers are first trained in parallel on the same training subset and subsequently tested and validated on the corresponding testing subset.
Finally, the MV technique selects the diagnosis that receives the greatest number of agreements among the three classifiers. The implementation is based on four steps carried out in order, as listed in Fig. 1: training, testing, validation, and final prediction by voting. The dataset was split via a stratified 10-fold cross-validation mechanism. Additionally, a 70/30 proportion was applied, with 70% of the data designated for training and 30% for testing. Standard scaling was performed for feature normalization via the StandardScaler function. No data balancing method (such as SMOTE) was used because the class distribution was already balanced. The SVM was implemented with SVC with the probability setting enabled. The ANN was built on the MLPClassifier; its hyperparameters were tuned via GridSearchCV with 3-fold cross-validation, early stopping was used, and the search covered different options for the hidden layer sizes, solvers, and learning rates.
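As a concrete illustration of the split, scaling, and SVM setup just described, the following scikit-learn sketch may help; the file name heart.csv and the target column name are assumptions, not the authors' actual code.

```python
# Minimal sketch of the data preparation and SVM member described above.
# The CSV path and the "target" column name are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("heart.csv")                       # hypothetical path to dataset 1
X, y = df.drop(columns=["target"]), df["target"]

# 70/30 stratified split, as stated in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Standard scaling for feature normalization
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# SVC with probability estimates enabled, as described for the SVM member
svm = SVC(kernel="rbf", probability=True, random_state=42)
svm.fit(X_train_s, y_train)
print("SVM test accuracy:", svm.score(X_test_s, y_test))
```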
The LSTM was implemented in Keras with the 2D input reshaped to 3D, a single LSTM layer with 32 units, and a dense output layer with sigmoid activation. The model was trained for 10 epochs with a batch size of 32 and binary cross-entropy loss. The MV approach was applied after prediction by combining the outputs of the three models and taking the class agreed upon by at least two of them. The strategy is analyzed with standard performance measures: accuracy, recall, and precision are calculated from the confusion matrix given in Table 3, and the formulas of these evaluation metrics are summarized in Table 4. The classifiers, tested both separately and as an ensemble, used the same testing set. Table 5 lists all the parameters and hyperparameters used, making each model reproducible and the implementation transparent. The ensemble EHAD strategy yields better results than the individual models do; therefore, model diversity and the voting rule are effective in increasing the accuracy of heart disease diagnosis.
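The LSTM member and the majority-voting step could look roughly like the sketch below. It assumes X_train_s, X_test_s, and y_train from the previous sketch, uses the layer size, epochs, and batch size stated in the text, and treats the single-timestep reshaping as an assumption; the tuned MLP that would supply pred_ann is sketched after the next paragraph.

```python
# Sketch of the LSTM member and the hard majority-voting rule described above.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

# Reshape the 2D feature matrix to 3D (samples, timesteps=1, features) for the LSTM
X_train_3d = X_train_s.reshape((X_train_s.shape[0], 1, X_train_s.shape[1]))
X_test_3d = X_test_s.reshape((X_test_s.shape[0], 1, X_test_s.shape[1]))

lstm = Sequential([
    Input(shape=(1, X_train_s.shape[1])),
    LSTM(32),                           # single LSTM layer with 32 units
    Dense(1, activation="sigmoid"),     # sigmoid output for binary diagnosis
])
lstm.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
lstm.fit(X_train_3d, y_train, epochs=10, batch_size=32, verbose=0)

def majority_vote(p1, p2, p3):
    """Return the class agreed on by at least two of the three classifiers."""
    return ((np.asarray(p1) + np.asarray(p2) + np.asarray(p3)) >= 2).astype(int)

pred_lstm = (lstm.predict(X_test_3d).ravel() >= 0.5).astype(int)
pred_svm = svm.predict(X_test_s)        # SVM from the previous sketch
# pred_ann would come from the tuned MLP in the next sketch:
# final_pred = majority_vote(pred_svm, pred_ann, pred_lstm)
```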
Hyperparameter tuning of the artificial neural network (ANN) was performed via grid search over a predetermined hyperparameter space. The search covered combinations of hidden_layer_sizes [(32, 16), (64, 32)], solver ['adam', 'lbfgs'], max_iter [1000, 2000], and learning_rate_init [0.001, 0.01]. Threefold cross-validation was applied for performance evaluation, and the best configuration was chosen on the basis of the validation accuracy and F1 score. The optimal values were hidden_layer_sizes = (64, 32), solver = 'adam', learning_rate_init = 0.001 and max_iter = 2000. The learning rate was thus chosen via grid search, and early stopping was used to avoid overfitting, providing a form of implicit regularization. Although dropout was not used in practice (the scikit-learn MLPClassifier does not support dropout), we plan to add dropout as explicit regularization in deep learning frameworks such as Keras. For the support vector machine (SVM), default hyperparameters were used (with the RBF kernel); in the future, the C, gamma, and kernel parameters can be optimized via randomized or grid search. The LSTM model was developed in Keras with a single LSTM layer of 32 units and a dense output layer with sigmoid activation. It was trained with the Adam optimizer, binary cross-entropy loss, and the default learning rate for 10 epochs with a batch size of 32. In this version, no deliberate hyperparameter optimization was applied to the LSTM; the model architecture and training parameters were selected on the basis of prior empirical findings.
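A minimal sketch of the ANN grid search described above, reusing the stated parameter grid and 3-fold cross-validation, is shown below; the scaled training arrays are assumed from the earlier sketch, and the accuracy scoring choice is an assumption.

```python
# Sketch of the ANN hyperparameter search described above.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "hidden_layer_sizes": [(32, 16), (64, 32)],
    "solver": ["adam", "lbfgs"],
    "max_iter": [1000, 2000],
    "learning_rate_init": [0.001, 0.01],
}

search = GridSearchCV(
    MLPClassifier(early_stopping=True, random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation, as stated in the text
    scoring="accuracy",  # scoring choice assumed
)
search.fit(X_train_s, y_train)
ann = search.best_estimator_   # reported best: (64, 32), adam, lr 0.001, max_iter 2000
print(search.best_params_)
```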
Dataset 1 preprocessing
The data utilized in the research included 303 records without missing values33. This was validated by an in-depth check of the data, which confirmed that all attributes were filled in and ready to process. Continuous variables such as age, heart rate, CK-MB and troponin levels were normalized with min–max scaling to bring the values between 0 and 1, which improves the performance of machine learning algorithms, especially the ANN and LSTM. With respect to the target variable, a moderate class imbalance was detected: 61.2% of the records were positive (presence of heart attack), whereas 38.8% were negative (absence of heart attack). To compensate for this, we used stratified 10-fold cross-validation so that the same class distribution was maintained in each fold. Moreover, precision, recall and F1-score metrics, which are more suitable for imbalanced datasets than accuracy alone, were used to assess performance.
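The preprocessing described here could be sketched as follows; the column names for the continuous variables are assumptions based on the attribute list in the next subsection, and scaling is shown outside the cross-validation loop for brevity.

```python
# Sketch of dataset 1 preprocessing: min-max scaling and stratified 10-fold CV.
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold

continuous_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]   # assumed names
X[continuous_cols] = MinMaxScaler().fit_transform(X[continuous_cols])

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # each fold preserves roughly the 61.2% / 38.8% class split reported above
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]
    # ...train and evaluate each classifier here...
```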
Dataset description
In this section, two datasets are described in detail.
Description of dataset 1
This dataset is commonly known as the “heart disease” dataset33, which has been widely used in cardiovascular research and machine learning studies. It comprises 303 patient records, each with 14 attributes related to heart health. The primary goal of this dataset is to predict the presence of heart disease in patients on the basis of these attributes:
1. Age: age of the patient in years.
2. Sex: gender of the patient (1 = male; 0 = female).
3. Chest pain type (cp): 0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic.
4. Resting blood pressure (trestbps): resting blood pressure in mm Hg upon hospital admission.
5. Serum cholesterol (chol): serum cholesterol level in mg/dl.
6. Fasting blood sugar (fbs): fasting blood sugar > 120 mg/dl (1 = true; 0 = false).
7. Resting electrocardiographic results (restecg): 0 = normal; 1 = ST-T wave abnormality (e.g., T wave inversions and/or ST elevation or depression > 0.05 mV); 2 = probable or definite left ventricular hypertrophy according to Estes’ criteria.
8. Maximum heart rate achieved (thalach): maximum heart rate achieved during exercise.
9. Exercise-induced angina (exang): 1 = yes; 0 = no.
10. ST depression (oldpeak): ST depression induced by exercise relative to rest.
11. Slope of the peak exercise ST segment (slope): 0 = upsloping, 1 = flat, 2 = downsloping.
12. Number of major vessels colored by fluoroscopy (ca): 0–3.
13. Thalassemia (thal): 1 = normal, 2 = fixed defect, 3 = reversible defect.
14. Target: diagnosis of heart disease (0 = absence; 1 = presence).
To better understand the contribution of the dataset features to the prediction process, we applied SHAP (SHapley Additive exPlanations) analysis using an XGBoost model trained on the dataset. The SHAP feature importance results (Fig. 6) indicate that variables such as age, cholesterol, resting blood pressure, and maximum heart rate had the highest impact on prediction outcomes. Other features, including fasting blood sugar, ST depression (oldpeak), and exercise-induced angina, also contributed meaningfully but to a lesser extent. Features such as sex and resting ECG showed relatively lower influence. This analysis provides an initial understanding of the dataset characteristics, highlighting which clinical attributes are most relevant to heart disease risk. Incorporating SHAP at this stage supports transparent reporting of feature importance and complements the subsequent modeling and evaluation process.
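A hedged sketch of this SHAP analysis with an XGBoost model is shown below; the model's hyperparameters and the bar-style summary plot are assumptions rather than the authors' exact configuration.

```python
# Sketch of the SHAP feature-importance analysis described above.
import shap
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=100, eval_metric="logloss", random_state=42)
xgb.fit(X, y)                                  # X, y from the earlier sketches

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X)

# Global importance plot analogous to Fig. 6 (mean |SHAP value| per feature)
shap.summary_plot(shap_values, X, plot_type="bar")
```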
Description of dataset 2
The second dataset, known as the “heart attack dataset”40, covers detailed clinical and diagnostic data of 1,319 individuals, concentrating on indicators that are highly relevant to cardiovascular health, especially possible heart-related disorders. It contains nine factors spanning demographic variables and biomarkers and is designed for binary classification on the basis of a target variable called Result, which denotes whether the patient is diagnosed as positive or negative for a condition that is presumably a cardiac event, most likely a myocardial infarction. The demographic variables include Age (14–103) and Gender (0 and 1, presumably female and male, respectively). The age distribution within the dataset is fairly even, with a mean age of approximately 56 years, representing a middle-aged to elderly patient population, which is normally associated with a greater risk of cardiovascular events. Heart rate, systolic blood pressure and diastolic blood pressure serve as vital signs; these factors are critical in the evaluation of cardiovascular function.
The heart rate values are rather variable, ranging from 20 to 1111 beats per minute, indicating possible outliers or even erroneous entries. Blood pressure measurements are also very broad: systolic blood pressure ranges from 42 to 223 mmHg and diastolic blood pressure from 38 to 154 mmHg, covering both hypotensive and hypertensive states. The biochemical markers are blood sugar, CK-MB and troponin, all of which play important roles in identifying conditions such as acute coronary syndrome. Blood sugar varies between 35 and 541 mg/dl, spanning normal, prediabetic and diabetic levels. The level of creatine kinase-MB (CK-MB), a heart enzyme released during damage to the heart, ranges from 0.321 to 300, whereas the level of troponin, a more specific cardiac injury marker, ranges from 0.001 to 10.3. These broad ranges imply that the population includes both normal and pathological cases. The target column takes two values, “positive” and “negative”, with an unequal distribution; the dominance of the positive category (the majority of the data, 810 of 1319 cases) suggests that class balancing should be considered during model training. The organized character of the dataset and the absence of missing values make these data suitable for machine learning tasks, including predictive modeling, risk stratification, and clinical decision support in cardiology.
Testing the EHAD strategy against other techniques on dataset 1
In this segment, the EHAD strategy is executed and compared with other diagnostic techniques to verify that EHAD can provide fast and accurate diagnoses. These techniques include SVM22, ANN29,30, LSTM31,32, gradient boosting13, XGBoost11, random forest11 and ensemble classification (ECT) techniques. The EHAD model is applied to valid data without useless features to obtain quick and correct results. The accuracy, error, precision, recall, and F1-measure calculations are illustrated in Table 5. The implementation time measurement is also provided in Fig. 12. Finally, the classification according to the confusion matrix formulas for the testing dataset is illustrated in Figs. 7, 8, 9, 10 and 11 to demonstrate the computational efficiency of the proposed EHAD strategy over the other strategies. EHAD provides the best performance values; thus, it outperforms the other techniques.
Figures 7, 8, 9, 10, and 11 present the estimated performance metrics—accuracy, precision, recall, and F1 score—for four models (SVM, ANN, LSTM, and majority voting) (EHAD) across varying dataset sizes: 75, 150, 225, and 303. For a dataset size of 75, the SVM achieves an accuracy of 79.4%, with a precision of 80% and a recall value of 81%, resulting in an F1 score of 80.5%. The ANN shows slightly better performance, with an accuracy of 81.6%, and the LSTM model excels in terms of recall at 81%, with an accuracy of 81.8%, leading to an F1 score of 82.5%.
The majority voting method, which combines the predictions of all the models, performs best among the groups at this size, with an accuracy of 86.81%. At the dataset size of 150, the SVM achieves an accuracy of 80.8%, whereas the ANN improves to 82%, and the LSTM reaches an accuracy of 83.4%. The majority voting method outperforms the other methods, with an accuracy of 86.6%. When the dataset size increases to 225, all the metrics increase across all the models, and the majority voting method achieves an accuracy of 87.8%, which reflects the benefit of the ensemble. Finally, when the dataset size was increased to 303, improvements in all the models were observed.
For the SVM, the accuracy was 82.42%, the error rate was 17.58%, the specificity was 80.49%, the MCC was 64.49%, and the ROC-AUC was 88.54%. The ANN model demonstrated somewhat better performance, with an error rate of 16.48%, accuracy of 83.52%, specificity of 82.93%, MCC of 66.80%, and ROC-AUC of 88.63%. The LSTM model achieved high-quality classification, especially in terms of sensitivity, with the best recall of 90%. It also achieved an accuracy of 84.62%, specificity of 82.93%, MCC of 71.10%, and ROC-AUC of 89.39%, reflecting a good balance between detecting positive samples and correctly rejecting negative samples. The majority voting ensemble (EHAD) was the best-performing model, with the highest accuracy of 86.81%, specificity of 85.37%, MCC of 75.50% and ROC-AUC of 89.51%. This ensemble approach successfully used the capabilities of the separate classifiers (SVM, ANN and LSTM) and strengthened the robustness and generalization capability of the model. Table 6 provides a detailed overview of the performance indicators, covering not only accuracy and recall but also specificity, the MCC, and the ROC-AUC; it also confirms that the majority voting strategy is optimal compared with the individual models for the problem of heart disease diagnosis.
Figure 12 shows the estimated prediction times for the majority voting method across various dataset sizes: 30, 60, 90, 120, and 150 samples. For each dataset size, the prediction time of each individual model (SVM, ANN, and LSTM) is presented, followed by the total voting time. For example, with a dataset of 30 samples, the SVM takes approximately 0.004 s, the ANN approximately 0.278 s, and the LSTM approximately 0.659 s, resulting in a cumulative voting time of 0.941 s. As the dataset size increases, the prediction times of each model also increase incrementally. For 150 samples, the SVM prediction time reaches 0.02 s, the ANN prediction time is 1.39 s, the LSTM prediction time is 3.29 s, and the EHAD prediction time is 4.7 s. Overall, the figure illustrates that although the voting process involves multiple models, the total prediction time remains relatively small compared with the individual training times, highlighting the efficiency of aggregating predictions in ensemble methods. The execution time of EHAD is greater than that of the other techniques, but EHAD is the most accurate model, and its execution time is negligible relative to its accuracy gain.
A variety of tree-based ensemble models, including gradient boosting13, XGBoost11, and random forest11, were tested and compared with the proposed EHAD, as shown in Table 6. The gradient boosting model showed moderate performance, with 78.02% accuracy, a recall of 76%, a specificity of 80.49%, an MCC of 56.2% and an ROC-AUC of 87.27%, but a small execution time of 0.10 s. XGBoost improved accuracy and precision, with an accuracy of 80.22%, recall of 80%, specificity of 80.44%, MCC of 60.3%, and ROC-AUC of 87.22%. However, it took the longest execution time, 20.84 s, which can limit its suitability for real-time applications. The random forest model achieved a balanced performance, with a recall of 86%, accuracy of 83.52%, specificity of 80.39%, MCC of 66.7% and ROC-AUC of 89.00%, taking 0.19 s; these results show that the RF model is effective and fairly quick. Finally, the EHAD ensemble method outperforms the individual models on all the measures: it achieved the best accuracy of 86.81%, recall of 90%, specificity of 85.37%, MCC of 75.5%, ROC-AUC of 89.51%, and an acceptable execution time of 4.7 s. These findings indicate the usefulness of the majority voting ensemble in substantially enhancing diagnostic performance without incurring an impractical execution time in predicting heart disease, as shown in Table 6.
Testing the EHAD strategy against other techniques on dataset 2
With dataset 2, which has 1,319 records, the models were evaluated on a substantially larger sample. The SVM model achieves an accuracy of 78.79% against an error rate of 21.21%, whereas the ANN model achieves an accuracy of 77.78% against an error of 22.22%. Notably, the ANN model has the highest recall of 84%, which means that it is powerful in detecting positive cases. The LSTM model has an accuracy of 79.55% and a corresponding error rate of 20.45%. Gradient boosting achieved a precision of 82.10%, an accuracy of 79.20%, and an error of 20.8%. XGBoost achieved higher accuracy and precision values of 79.50% and 82.70%, respectively. The random forest model achieves the highest accuracy for a single technique, with a value of 79.90% and an error of 20.1%. Across all the models, the majority voting technique (EHAD) still gives the best results, with the highest accuracy of 80.05% and the lowest error rate of 19.95%. This indicates its usefulness in integrating several classifiers to improve prediction performance. Overall, Table 7 shows a comparative assessment of the metrics of all the models, confirming that the ensemble (EHAD) method yields consistent enhancements as the data size increases.
Statistical analysis
To evaluate the diversity and agreement among the individual models of the ensemble, we calculated both the Q statistic and Cohen’s kappa coefficient for each pair of classifiers. The Q statistic between the SVM and ANN was 0.813; it was somewhat higher between the SVM and LSTM, at 0.937, which indicates strong agreement without eliminating diversity. The Q statistic between the ANN and LSTM was also 0.813. These values imply complementary behavior among the classifiers rather than redundancy. Similarly, Cohen’s kappa was 0.665 between the SVM and ANN, 0.779 between the SVM and LSTM, and 0.667 between the ANN and LSTM, indicating high agreement between classifiers. This diversity, especially as expressed by the Q statistics, is essential to ensemble learning, as it justifies the adoption of majority voting: each classifier contributes a distinct point of view to the final decision.
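The two diversity measures can be computed as in the following sketch; the prediction arrays are placeholders for the classifiers' outputs on the shared test set, and the Q-statistic helper is our own illustrative implementation of Yule's Q.

```python
# Sketch of the pairwise diversity measures described above.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def q_statistic(pred_a, pred_b, y_true):
    """Yule's Q based on joint correct/incorrect counts of two classifiers."""
    a_ok = np.asarray(pred_a) == np.asarray(y_true)
    b_ok = np.asarray(pred_b) == np.asarray(y_true)
    n11 = np.sum(a_ok & b_ok)        # both correct
    n00 = np.sum(~a_ok & ~b_ok)      # both wrong
    n10 = np.sum(a_ok & ~b_ok)       # only A correct
    n01 = np.sum(~a_ok & b_ok)       # only B correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Example usage with predictions from the earlier sketches (placeholder variables):
# print(q_statistic(pred_svm, pred_ann, y_test), cohen_kappa_score(pred_svm, pred_ann))
```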
For dataset 1, we used stratified 10-fold cross-validation to ensure statistical robustness and reduce overfitting in all the models. This method maintains the class proportions in each fold, which is a key factor in medical data, where class imbalance occurs. The evaluation metrics are the accuracy, precision, recall, F1 score, Matthews correlation coefficient (MCC), and ROC-AUC. The majority voting ensemble was superior to the individual models on the most important metrics. It achieved the highest accuracy (82.50% ± 5.56), followed by the SVM (81.51% ± 5.38), ANN (81.17% ± 6.2), and LSTM (80.87% ± 7.88). In terms of the F1 score, the ensemble was again the best at 85.03% ± 4.59, followed closely by the LSTM with 84.04% ± 5.12, the ANN with 84.08% ± 4.2 and the SVM with 83.44% ± 6.8. For recall, the highest value was 90.91% for both the ensemble and the LSTM (± 4.79 and ± 6.43, respectively), compared with 89.09% ± 5.50 for the ANN and 88.48% ± 7.23 for the SVM. In terms of precision, the ensemble method had the highest value (80.05% ± 7.26), followed closely by the LSTM (79.98% ± 5.61) and ANN (78.98% ± 6.83), with the lowest score for the SVM (78.32% ± 5.68). The ensemble method also had the greatest MCC (0.651 ± 0.1128), compared with the LSTM (0.6326 ± 0.1076), ANN (0.6264 ± 0.1277) and SVM (0.6156 ± 0.1615).
Similarly, the ensemble model had the best ROC-AUC score (0.8165 ± 0.0579), compared with the LSTM (0.8072 ± 0.0581), ANN (0.8018 ± 0.0644), and SVM (0.8012 ± 0.0795) models.
For dataset 2, to ensure statistical robustness and reduce the possibility of overfitting, we implemented stratified 10-fold cross-validation for all the models. The approach preserves the global class balance among folds, which is especially important when dealing with imbalanced medical data. The support vector machine (SVM) obtained a mean accuracy of 76.4% ± 4.3, a recall of 79.3% ± 4.6, a precision of 80.1% ± 4.1, and an F1 score of 79.5% ± 4.0. The ANN obtained an accuracy of 75.7% ± 4.6, a recall of 80.3% ± 4.4, a precision of 78.8% ± 4.3 and an F1 score of 79.3% ± 4.2. The LSTM model fared better in terms of the F1 score, 80.6% ± 3.6, with an accuracy of 77.1% ± 4.0, a recall of 80.9% ± 4.1 and a precision of 80.4% ± 3.8. Gradient boosting and XGBoost achieved very good results, with accuracies of 77.3% ± 3.6 and 77.5% ± 3.4, respectively, F1 scores of approximately 80%, and high sensitivity and specificity. The random forest model provided comparable results, with an accuracy of 77.80% ± 3.1 and an F1 score of 80.60% ± 3.0. Notably, the EHAD ensemble model was the most effective, performing better than all the single models on nearly all the evaluation measures, with the highest accuracy of 82.50% ± 3.2, recall of 85.30% ± 3.5, precision of 84.90% ± 3.1 and F1 score of 85.60% ± 3.4. These findings affirm the efficacy of the EHAD ensemble method in pooling the predictive abilities of the base classifiers and increasing the generalizability and reliability of the heart disease diagnosis framework.
Figure 13 shows the training and validation loss curves of the LSTM model over 25 epochs. The losses steadily decrease, and the validation loss closely follows the training loss, which means that the model generalizes well and that overfitting is minimal. The downward trend of the loss over time indicates that the model was able to learn the inherent patterns of the input features while maintaining generalization ability on unseen data. This convergence pattern also suggests that the selected hyperparameters, such as the learning rate and batch size, are suitable for stabilizing training. This behavior is characteristic of successful learning, in which performance on the validation data improves alongside the training performance, as expected given the regularization techniques and early stopping criterion34.
Second, Andrews curve visualization uses a trigonometric function to convert high-dimensional data into a two-dimensional curve35. A model (SVM, ANN, LSTM, or EHAD) over several metrics is represented by each line. Similar wave patterns are displayed by all the models, suggesting similar performance trends. Nonetheless, EHAD exhibits a marginally greater amplitude, particularly in the vicinity of the peak, indicating that it consistently performs better in terms of specific criteria. The robustness of EHAD across several performance metrics is vividly highlighted by this curve, as shown in Fig. 14.
The third correlation heatmap36 displays the Pearson correlation coefficients between four performance metrics: accuracy, precision, F1 score, and recall. The color intensity of the correlations indicates how strong the relationship is; accuracy and F1 score have a nearly perfect correlation (0.99), meaning that improving one is likely to improve the other; recall and precision have almost no correlation (0.031), meaning that models that are strong in one may not necessarily perform well in the other. This differentiation aids in identifying trade-offs, as shown in Fig. 15.
The fourth visualization is a heatmap of the Z score37 of each model across the four metrics, expressing each value as its distance from the mean. With consistently high positive Z scores (particularly in precision, F1 score, and accuracy), EHAD performs better than the other methods, whereas the SVM has the lowest ranking with all negative values. The performance of the ANN and LSTM is modest. Z scores emphasize EHAD’s outstanding position and enable normalized comparisons, as shown in Fig. 16.
Fifth, a metrics bump chart38 displays each model’s rank position across several performance measures. The models are listed on the X-axis, whereas the metric values are displayed on the Y-axis. On most measures, particularly precision and accuracy, EHAD is at the top, followed by the LSTM. The SVM consistently has the lowest ranking and barely varies among the metrics. This graphic effectively conveys the models’ relative strengths and performance changes, as shown in Fig. 17.
Sixth, a waterfall chart39 of the accuracy gain over the minimum shows each model’s incremental accuracy improvement relative to the baseline (SVM), the model with the lowest performance, as shown in Fig. 18. With an accuracy improvement of 4.39%, EHAD outperforms the ANN and LSTM. This step-by-step graphic highlights the efficiency of EHAD in increasing classification accuracy and reveals how much better each model performs than the baseline.
Seventh, to statistically test the significance of differences in the performance of the proposed classification models, one-way analysis of variance (ANOVA) was performed. Table 8 shows the ANOVA results summarizing the between-model variance (SVM, ANN, LSTM, gradient boosting, XGBoost, random forest, and EHAD) and the residual variance. The resulting F statistic was F(6, 56) = 47.11 with a P value < 0.0001, which shows that the differences between the performances of at least some of the models are highly significant. In addition to the statistical summary, residual diagnostic plots were created (Fig. 19) to test the model assumptions. These include residuals versus fitted values, residuals versus observed values, fitted versus observed values (with a reference line), and a heatmap of the correlation between model features. The fact that the residuals are randomly and symmetrically distributed around zero indicates that the model errors are not biased, which supports the validity of the ANOVA outcomes. In addition, the heatmap identifies the correlation pattern between features, which is helpful for gaining additional insight into model behavior. Together, these insights confirm that there is a statistically significant difference in model performance and that the ANOVA assumptions are reasonably satisfied.
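The ANOVA itself reduces to a one-way test over the per-fold scores of the seven models, as in this sketch; the score arrays are random placeholders standing in for the actual cross-validation results.

```python
# Sketch of the one-way ANOVA over per-fold model scores described above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-fold accuracies (10 folds x 7 models); in the study these would be the
# cross-validation results for SVM, ANN, LSTM, gradient boosting, XGBoost, RF, and EHAD.
fold_scores = [rng.normal(loc=m, scale=0.03, size=10)
               for m in (0.815, 0.812, 0.809, 0.780, 0.800, 0.835, 0.865)]

f_stat, p_value = stats.f_oneway(*fold_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```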
In addition, to statistically confirm the differences in model performance, the Wilcoxon signed-rank test was applied to all the models, i.e., SVM, ANN, LSTM, gradient boosting, XGBoost, random forest and the proposed EHAD model. In each test, the sum of signed ranks (W) was 45, with no negative ranks, indicating a consistent direction of superiority. Each comparison had a p value of 0.0039, which is well below the significance level of 0.05, indicating that the observed differences are statistically significant. The p values are exact, and the outcomes are reported as statistically significant (**), confirming the validity and robustness of the comparative analysis. Chi-square analysis also lends further credence to the proposed EHAD model as an advantageous alternative to traditional individual models, as shown in Table 9.
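A paired Wilcoxon comparison of this kind can be run with SciPy as sketched below; again, the paired score vectors are placeholders, not the study's data.

```python
# Sketch of a pairwise Wilcoxon signed-rank test between EHAD and a baseline model.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
baseline_scores = rng.normal(0.81, 0.03, size=10)             # e.g., per-fold SVM accuracy
ehad_scores = baseline_scores + rng.uniform(0.01, 0.05, 10)   # EHAD consistently higher

w_stat, p_value = wilcoxon(ehad_scores, baseline_scores)      # exact p for small samples
print(f"W = {w_stat}, p = {p_value:.4f}")
```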
Ninth, the receiver operating characteristic (ROC) curve of the proposed EHAD model tested on Dataset 1 is shown in Fig. 20. The ROC curve plots the false positive rate (1 − specificity) against the true positive rate (sensitivity) at multiple classification thresholds. The blue curve presents the performance of the EHAD classifier, and the dashed red diagonal line represents a random classifier (AUC = 0.5). The ROC curve of the EHAD model rises sharply toward the top left, indicating high sensitivity and a low false positive rate. Notably, the model achieved an area under the curve (AUC) of 1.00, i.e., perfect classification on Dataset 1. This is a very strong outcome, affirming the robust discriminative ability of the EHAD model in correctly classifying the target classes in this dataset.
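Plotting an ROC curve like Fig. 20 could follow the sketch below, continuing the earlier sketches; since the paper's ensemble outputs a hard majority vote, the averaged probability score used here is only a stand-in ranking score and is an assumption on our part.

```python
# Sketch of an ROC curve for the ensemble's scores on the test set.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Averaged member probabilities as a stand-in score (assumption; the final EHAD
# decision in the paper uses hard majority voting, not probability averaging)
prob_ehad = (svm.predict_proba(X_test_s)[:, 1]
             + ann.predict_proba(X_test_s)[:, 1]
             + lstm.predict(X_test_3d).ravel()) / 3.0

fpr, tpr, _ = roc_curve(y_test, prob_ehad)
auc = roc_auc_score(y_test, prob_ehad)

plt.plot(fpr, tpr, label=f"EHAD (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], "r--", label="Random classifier (AUC = 0.5)")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```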
Figure 21 shows a histogram describing the relative performance of various classification models, including SVM, ANN, LSTM, gradient boosting, XGBoost, random forest, and the proposed EHAD model, according to several performance metrics, including accuracy, precision, recall, F1 score, and AUC. The histogram provides a clear visual representation of each model’s strengths and weaknesses with respect to classification performance. Notably, the EHAD system remains superior to the individual models, beating them on all measures, which reflects its effectiveness and its ability to pool the strengths of its base learners. This thorough assessment supports the strength of the hybrid EHAD model in improving diagnostic accuracy for heart attack prediction on Dataset 1.
Limitation of dataset size
Even though dataset 1 used in this heart disease study (303 samples) is not large enough to comfortably train deep learning models (which therefore face a risk of overfitting), careful preprocessing and model design were applied to mitigate the likelihood of overfitting. We chose models such as the LSTM and ANN because they can learn complicated nonlinear patterns in clinical patient data that simpler models such as logistic regression cannot capture. Normalization and 10-fold cross-validation were also implemented to attain sound and generalized performance. Although SMOTE and GAN-based data augmentation techniques have not been explored in this implementation, they remain a further direction of exploration given the problems of data imbalance and variety. Interestingly, we performed an additional comparison with a larger heart disease dataset, dataset 2 (1,319 samples), to examine the generality of the model. The results indicated that the smaller dataset (303 samples) surprisingly performed better according to some performance measures, such as accuracy, precision and recall. This implies that, even when limited data are used, well-performing diagnostic models can still be trained successfully in the medical field.
Discussion of results
The proposed EHAD framework introduces a new ensemble classification approach by combining three heterogeneous classification models, namely, SVM, ANN, and LSTM. In contrast with classical ensemble strategies, which frequently assemble similar types of classifiers (e.g., tree-based models), EHAD combines boundary-based learning (SVM), nonlinear pattern recognition (ANN), and sequential feature learning (LSTM) in a single architecture. A key characteristic of EHAD is its parallel design: all three classifiers are trained, tested, and validated using the same data split, and the MV technique then aggregates their outputs to determine the final diagnostic decision. This architecture helps EHAD combine the advantages of each model type, enhancing diagnostic accuracy and generalizability. Its algorithmic strength lies in the fact that EHAD can address both static clinical features and ordered dependencies within patient data, which traditional ensemble models rarely address in a single unified setting. These design choices rest on ensemble learning theory, which holds that ensembles of diverse classifiers tend to be more robust. This contribution is also supported by the experimental results, whereby EHAD outperformed all the single classifiers and traditional ensembles on the essential performance measures, demonstrating its practicality and dependability in the diagnosis of heart attacks.
To obtain empirical confirmation of the efficiency of the EHAD model, its performance was compared with that of its individual constituent classifiers, i.e., SVM, ANN, and LSTM, on Dataset 1. As demonstrated in Table 6, EHAD had the best score on all the major assessment criteria. In particular, EHAD achieved 90% recall, 86.54% precision, an 88.24% F1 score, and 86.81% accuracy, exceeding all the single models. By comparison, the best-ranking single model (LSTM) produced a lower F1 score (86.54%) and accuracy (84.62%). This performance improvement shows that EHAD does not merely combine models but blends disparate learning paradigms, boundary learning (SVM), nonlinear feature extraction (ANN), and sequential pattern recognition (LSTM), and fuses them, via majority voting, into a more balanced and general predictor. In addition, although the inference time increases only slightly (4.7 s for EHAD versus 3.8 s), the tradeoff is worthwhile given the gain in diagnostic accuracy. Thus, EHAD differs not only in algorithmic design but also in theoretical quality and practical advantages; i.e., it can be an adequate candidate for real-world clinical decision support systems.
On Dataset 2, as shown in Table 7, the SVM model achieves an accuracy of 78.79% against an error rate of 21.21%, whereas the ANN model achieves an accuracy of 77.78% against an error of 22.22%. Notably, the ANN model has the highest recall of 84%, which means that it is powerful in detecting positive cases. The LSTM model has an accuracy of 79.55% and an error rate of 20.45%. Across all the models, the majority voting technique (EHAD) still gives the best results, with the highest accuracy of 80.05% and the lowest error rate of 19.95%.
According to the statistical analysis, both datasets used in this paper were evaluated with stratified 10-fold cross-validation to make the results statistically sound and resistant to overfitting while maintaining the class balance, which is paramount in medical diagnosis applications. For every evaluation metric (accuracy, precision, recall, F1 score, MCC, and ROC-AUC), the EHAD ensemble performed better than the individual models (SVM, ANN, and LSTM). Q statistics and Cohen’s kappa coefficient were used to determine the diversity and agreement among the classifiers, confirming that the classifiers were strongly and nonredundantly complementary, making ensemble learning viable. The training behavior of the LSTM generalized well, with a steady reduction in the loss curves. The superiority of EHAD was further supported by visualization tools: the Andrews curves show a stronger performance pattern for EHAD, the Pearson correlation heatmap clarifies the relationships among the metrics, the Z-score heatmap shows that EHAD’s standardized performance was stronger on all of them, the bump chart indicates that EHAD had the highest rank on most of the metrics, and the waterfall chart shows that EHAD obtained a clear accuracy advantage over the SVM baseline. Taken together, these analyses concur that EHAD outperforms the other frameworks, achieving the highest scores and providing strong, consistent and generalizable results for heart disease diagnosis.
Testing the majority voting method against weighted voting and stacking
Compared with more elaborate ensemble methods such as stacking or weighted voting, majority voting (MV) was a deliberate choice for the proposed EHAD model because it scored highly both on the performance metrics and in structural simplicity. Table 6 presents the results, which show that although the individual models achieve rather high recall values (90% for the LSTM and 85.71% for the ANN), their performance on the other metrics was fairly even. Notably, the MV-based EHAD model performed better than all the other models on all of the critical measures: it had the greatest accuracy (86.81%), F1 score (88.24%), ROC-AUC (89.51%), and Matthews correlation coefficient (75.5%). These findings indicate that a straightforward majority procedure was sufficient to capitalize on the complementary strengths of the base classifiers without the extra protocol and computational expense of stacking or weighting schemes. Furthermore, initial trials also employed a larger heart disease dataset in the comparison process. Interestingly, the smaller dataset (303 records) yielded more stable, consistent results when combined with MV, probably because of less noise and class imbalance. Other advantages of the simpler MV method are its interpretability and its lower risk of overfitting, to which more complicated ensemble designs are prone, particularly with small data collections. Therefore, choosing MV was driven not only by empirical factors but also by practical considerations of computational efficiency and model generalizability.
Model explainability and clinical interpretation
To further improve the interpretability of the EHAD model, we also utilized SHAP (SHapley Additive exPlanations), a game-theory approach to explaining the contribution of each feature to individual predictions. The SHAP summary plot indicated that attributes such as age, cholesterol level, resting blood pressure and maximum heart rate were the major contributors to the output of the model. These results are in line with accepted clinical knowledge, which increases reliability and offers the prospect of incorporation into clinical decision support systems. By knowing which features contribute to the predictions, healthcare professionals can acquire practical information on patient risk characteristics and make more informed treatment decisions.
To increase the interpretability of the EHAD hybrid model, which integrates SVM, ANN and LSTM via majority voting, we conducted a SHAP interaction analysis on Dataset 1 focused on the interaction between cholesterol and age. The resulting SHAP interaction values ranged from −0.005 to 0.05, showing different levels of influence depending on the combination of these two features. The positive maximum (0.05) occurred when both Age and Cholesterol were high, indicating a strong joint influence on the EHAD model’s output. Conversely, combinations with lower levels displayed little or negative effect. This analysis explains how EHAD captures the interplay of features, which is informative for explainability and can assist clinicians in interpreting the model’s diagnostic outputs. This is shown in Fig. 22.
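The interaction analysis can be reproduced in spirit with a tree-based surrogate, as sketched below; because a majority vote over SVM/ANN/LSTM does not expose interaction values directly, the XGBoost surrogate and the column names age and chol are assumptions.

```python
# Sketch of a SHAP interaction analysis (age x cholesterol) via a tree surrogate.
import shap
from xgboost import XGBClassifier

surrogate = XGBClassifier(n_estimators=100, eval_metric="logloss", random_state=42)
surrogate.fit(X, y)                                   # X, y from the earlier sketches

interaction_values = shap.TreeExplainer(surrogate).shap_interaction_values(X)
# Dependence plot of the age-cholesterol interaction term (column names assumed)
shap.dependence_plot(("age", "chol"), interaction_values, X)
```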
AI Research
Artificial Intelligence at Bayer – Emerj Artificial Intelligence Research

Bayer is a global life sciences company operating across Pharmaceuticals, Consumer Health, and Crop Science. In fiscal 2024, the group reported €46.6 billion in sales and 94,081 employees, a scale that makes internal AI deployments consequential for workflow change and ROI.
The company invests heavily in research, with more than €6 billion allocated to R&D in 2024, and its leadership frames AI as an enabler for both sustainable agriculture and patient-centric medicine. Bayer’s own materials highlight AI’s role in planning and analyzing clinical trials as well as accelerating crop protection discovery pipelines.
This article examines two mature, internally used applications that convey the central role AI plays in Bayer’s core business goals:
- Herbicide discovery in crop science: Applying AI to narrow down molecular candidates and identify new modes of action.
- Clinical trial analytics in pharmaceuticals: Ingesting heterogeneous trial and device data to accelerate compliant analysis.
AI-Assisted Herbicide Discovery
Weed resistance is a mounting global challenge. Farmers in the US and Brazil are facing species resistant to multiple herbicide classes, driving up costs and threatening crop yields. Traditional herbicide discovery is slow — often 12 to 15 years from concept to market — and expensive, with high attrition during early screening.
Bayer’s Crop Science division has turned to AI to help shorten these timelines. Independent reporting notes Bayer’s pipeline includes Icafolin, its first new herbicide mode of action in decades, expected to launch in Brazil in 2028, with AI used upstream to accelerate the discovery of new modes of action.
Reuters reports that Bayer’s approach uses AI to match weed protein structures with candidate molecules, compressing the early discovery funnel by triaging millions of possibilities against pre-determined criteria. Bayer’s CropKey overview describes a profile-driven approach, where candidate molecules are designed to meet safety, efficacy, and environmental requirements from the start.
The company claims that CropKey has already identified more than 30 potential molecular targets and validated over 10 as entirely new modes of action. These figures, while promising, remain claims until independent verification.
For Bayer’s discovery scientists, AI-guided triage changes workflows by:
- Reducing early-stage wet-lab cycles by focusing on higher-probability matches between proteins and molecules.
- Integrating safety and environmental criteria into the digital screen, filtering out compounds unlikely to meet regulatory thresholds.
- Advancing promising molecules sooner, enabling earlier testing and potentially compressing development timelines from 15 years to 10.
Coverage by both Reuters and the Wall Street Journal notes this strategy is expected to reduce attrition and accelerate discovery-to-commercialization timelines.
The CropKey program has been covered by multiple independent outlets, a signal of maturity beyond a single press release. Reuters reports Bayer’s assertion that AI has tripled the number of new modes of action identified in early research compared to a decade ago.
The upcoming Icafolin herbicide, expected for commercial release in 2028, demonstrates that CropKey outputs are making their way into the regulatory pipeline. The presence of both media scrutiny and near-term launch candidates suggests CropKey is among Bayer’s most advanced AI deployments.
Video explaining Bayer’s CropKey process in crop protection discovery. (Source: Bayer)
By focusing AI on high-ROI bottlenecks in research and development, Bayer demonstrates how machine learning can trim low-value screening cycles, advancing only the most promising candidates into experimental trials. At the same time, acceleration figures reported by the company should be treated as claims until they are corroborated across multiple seasons, geographies, and independent trials.
Clinical Trial Analytics Platform (ALYCE)
Pharmaceutical development increasingly relies on complex data streams: electronic health records (EHR), site-based case report forms, patient-reported outcomes, and telemetry from wearables in decentralized trials. Managing this data volume and variety strains traditional data warehouses and slows regulatory reporting.
Bayer developed ALYCE (Advanced Analytics Platform for the Clinical Data Environment) to handle this complexity. In a PHUSE conference presentation, Bayer engineers describe the platform as a way to ingest diverse data, ensure governance, and deliver analytics more quickly while maintaining compliance.
The presentation describes ALYCE’s architecture as using a layered “Bronze/Silver/Gold” data lake approach. An example trial payload included approximately 300,000 files (1.6 TB) for 80 patients, requiring timezone harmonization, device ID mapping, and error handling before data could be standardized to SDTM (Study Data Tabulation Model) formats. Automated pipelines provide lineage, quarantine checks, and notifications. These technical details were presented publicly to peers, reinforcing their credibility beyond internal marketing.
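Bayer has not published ALYCE’s code, so the following is only a hypothetical Python sketch of the Bronze/Silver/Gold layering pattern the presentation describes; the file layout, column names, and device-ID mapping here are all invented for illustration.

```python
# Purely illustrative Bronze/Silver/Gold layering sketch; not Bayer's actual ALYCE code.
import pandas as pd

def to_bronze(path: str) -> pd.DataFrame:
    """Bronze: land raw device telemetry as-is, adding only lineage metadata."""
    df = pd.read_csv(path)
    df["_source_file"] = path
    return df

def to_silver(bronze: pd.DataFrame, device_map: dict) -> pd.DataFrame:
    """Silver: harmonize timezones, map device IDs to subjects, drop malformed rows."""
    silver = bronze.copy()
    silver["timestamp"] = pd.to_datetime(silver["timestamp"], utc=True, errors="coerce")
    silver["subject_id"] = silver["device_id"].map(device_map)
    return silver.dropna(subset=["timestamp", "subject_id"])  # quarantined in practice

def to_gold(silver: pd.DataFrame) -> pd.DataFrame:
    """Gold: reshape into an analysis-ready, SDTM-like tabulation."""
    return silver.rename(columns={"value": "VSORRES", "measure": "VSTESTCD"})
```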
For statisticians and clinical programmers, ALYCE claims to:
- Standardize ingestion across structured (CRFs), semi-structured (EHR extracts), and unstructured (device telemetry) sources.
- Automate quality checks through pipelines that reduce manual intervention and free staff up to focus on analysis.
- Enable earlier insights by preparing analysis-ready datasets faster, shortening the lag between data collection and review.
These objectives are consistent with Bayer’s broader statement that AI is being used to plan and analyze clinical trials safely and efficiently.
PHUSE is a respected industry forum where sponsors share methods with peers, and Bayer’s willingness to disclose technical details indicates ALYCE is in production. While Bayer has not released precise cycle-time savings, its emphasis on elastic storage, regulatory readiness, and speed suggests measurable efficiency gains.
Given the specificity of the presentation — real-world payloads, architecture diagrams, and validation processes — ALYCE appears to be a mature platform actively supporting Bayer’s clinical trial programs.
Screenshot from Bayer’s PHUSE presentation illustrating ALYCE’s automated ELTL pipeline.
(Source: PHUSE)
Bayer’s commitment to ALYCE reflects its broader effort to modernize and scale clinical development. By consolidating varied data streams into a single, automated environment, the company positions itself to shorten study timelines, reduce operational overhead, and accelerate the movement of promising therapies from discovery to patients. This infrastructure also prepares Bayer to expand AI-driven analytics across additional therapeutic areas, supporting long-term competitiveness in a highly regulated industry.
While Bayer has not published specific cycle-time reductions or quantified cost savings tied directly to ALYCE, the company’s willingness to present detailed payload volumes and pipeline architecture at PHUSE indicates that the platform is actively deployed and has undergone peer-level scrutiny. Based on those disclosures and parallels with other pharma AI implementations, reasonable expectations include faster data review cycles, earlier anomaly detection, and improved compliance readiness. These outcomes—though not yet publicly validated—suggest ALYCE is reshaping Bayer’s trial workflows in ways that could yield significant long-term returns.
AI Research
The STARD-AI reporting guideline for diagnostic accuracy studies using artificial intelligence
