This study complies with all relevant ethical regulations and has been approved by the institutional review boards of Johns Hopkins Medicine and Atrium Health.
Patient population and datasets
JHH-HCM registry (internal)
A retrospective analysis was performed on patient data from the JHH-HCM registry spanning 2005–2015. Enrollment in the registry was based on the first visit to the Johns Hopkins HCM Center of Excellence, where patients meeting the diagnostic criteria for HCM were included. These criteria focused on the presence of unexplained left ventricular hypertrophy (maximal wall thickness ≥15 mm) in the absence of uncontrolled hypertension, valvular heart disease or HCM phenocopies such as amyloidosis and storage disorders. Patients were followed for a mean duration of 2.86 years (median 1.92 years; 25th–75th percentile = 0.94–4.28 years). The current study focused on the subset of patients with HCM who were enrolled between 2005 and 2015 and had adequate LGE-CMR images, totaling 553 patients (Extended Data Fig. 3).
SHVI-HCM registry (external)
A retrospective analysis was performed on patient data from the Atrium Health SHVI-HCM registry spanning 2015–2023. This registry includes patients who presented to the SHVI HCM Center of Excellence with a preexisting HCM diagnosis or were subsequently diagnosed based on cardiac imaging, personal and family history, and/or genetic testing in accordance with current guideline definitions. Patients within this longitudinal database are still being followed, as the endpoint for registry inclusion is the transfer of care to an outside facility or death. For the purposes of this study, the SHVI-HCM registry was interrogated for patients who had undergone CMR imaging and ICD placement, and enrollment was delineated by the patient’s first visit with the SHVI.
Data collection and primary endpoint
Clinical data, including demographics, symptoms, comorbidities, medical history and stress test results, were ascertained during the initial clinic visit and at each follow-up visit. Rest and stress echocardiography and CMR imaging were performed as routine components of clinical evaluation for all patients referred to the HCM centers. For the internal JHH-HCM registry, echocardiography and CMR imaging were conducted before the first clinic visit, typically with an interval of about 3 months between the imaging assessment and the first clinic visit. For the SHVI-HCM registry, patients typically underwent echocardiography and CMR imaging after the first clinic visit. The full list of covariates used in MAARS can be found in Extended Data Tables 1 and 2. The data were extracted through a manual search of patients’ EHRs. EchoPAC software (GE Healthcare) was used to quantitatively analyze the echocardiograms and compute related covariates. Of note, the internal and external cohorts have distinct patient populations with different demographic characteristics and different levels of risk factors (Table 1).
The CMR images in the JHH-HCM registry were acquired using 1.5-T magnetic resonance imaging (MRI) devices (Aera, Siemens; Avanto, Siemens; Signa, GE; Intera, Philips). In the SHVI-HCM registry, most CMR images were acquired using 1.5-T MRI devices (Aera, Siemens; Sola, Siemens), and a small proportion of CMR images were acquired using 3-T MRI devices (Vida, Siemens). LGE images were obtained 10–15 min after intravenous administration of 0.2 mmol kg−1 gadopentetate dimeglumine. An inversion scout sequence was used to select the optimal inversion time for nulling normal myocardial signal. All images used were 2D parallel short-axis left ventricular stacks. Typical spatial resolutions were in the range of 1.4–2.9 × 1.4–2.9 × 7–8 mm, with 1.6- to 2-mm gaps.
The primary endpoint for the JHH-HCM registry was SCDA, defined as sustained ventricular tachycardia (ventricular rate ≥130 beats per min lasting ≥30 s) or ventricular fibrillation resulting in defibrillator shocks or antitachycardia pacing. Arrhythmic events were ascertained by reviewing electrocardiogram, Holter monitor and ICD interrogation data. The primary endpoint for the SHVI-HCM registry was SCDA, defined as device shock, appropriate ICD intervention or out-of-hospital cardiac arrest.
More details regarding patient inclusion, assessment, follow-up, echocardiography and CMR acquisition can be found in previous work23,51.
Data preparation
The multimodal inputs to MAARS included LGE-CMR scans and clinical covariates from EHRs and CIRs (Extended Data Tables 1 and 2). The labels were the outcomes (SCDA or non-SCDA). The preprocessing steps for LGE-CMR scans (described below) aimed to exclude nonrelevant background information and to standardize the CMR image volume for consistent analysis across all patients. We first obtained the left ventricular region of interest using our previously developed and validated deep learning algorithm52. Once each patient’s LGE-CMR 2D slices were processed using this algorithm, all pixels outside the left ventricle were zeroed out, and the pixels within the left ventricle were normalized by the median blood pool pixel intensity in each slice. Finally, the processed slices were stacked and interpolated to a regular 96 × 96 × 20 grid with voxel dimensions of 4.0 × 4.0 × 6.3 mm.
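As a rough illustration, the per-slice masking, blood-pool normalization and grid interpolation described above can be sketched as follows (the function and mask names are ours, and the linear interpolation order is an assumption, not the authors' implementation):

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_lge_stack(slices, lv_masks, blood_pool_masks,
                         target_shape=(96, 96, 20)):
    """Sketch of the LGE-CMR preprocessing described above.

    slices: list of 2D short-axis slice arrays
    lv_masks, blood_pool_masks: binary masks from a segmentation algorithm
    """
    processed = []
    for img, lv, bp in zip(slices, lv_masks, blood_pool_masks):
        out = np.where(lv > 0, img, 0.0)           # zero out pixels outside the LV
        bp_median = np.median(img[bp > 0])         # median blood-pool intensity
        processed.append(out / bp_median)          # per-slice normalization
    volume = np.stack(processed, axis=-1)          # stack slices into a 3D volume
    factors = [t / s for t, s in zip(target_shape, volume.shape)]
    return zoom(volume, factors, order=1)          # resample to a 96 x 96 x 20 grid
```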
The EHR and CIR data were structured as tabular data. Only input features with <40% missing values in the original data were included in the analysis; missing values were then imputed using multivariate imputation by chained equations (MICE)53. MICE is a fully conditional specification approach that iteratively models each feature with missing values as a function of all other features. To address the feature mismatch between the internal and external cohorts, a MICE imputer fitted on the internal dataset was used to impute the missing values in both datasets. After imputation, the EHR and CIR data were standardized using the z-score method, that is, subtracting the mean and dividing by the s.d. of each feature.
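A minimal sketch of this pipeline using scikit-learn's MICE-style IterativeImputer (the imputer settings here are illustrative defaults, not the study's exact configuration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler

def fit_tabular_pipeline(internal_X):
    # Fit a MICE-style imputer and a z-score scaler on the internal cohort only
    imputer = IterativeImputer(max_iter=10, random_state=0).fit(internal_X)
    scaler = StandardScaler().fit(imputer.transform(internal_X))
    return imputer, scaler

def apply_tabular_pipeline(X, imputer, scaler):
    # Reusing the internal-cohort imputer for the external cohort addresses
    # the feature-mismatch issue described above
    return scaler.transform(imputer.transform(X))
```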
Transformer-based multimodal neural network
Modality-specific branch networks
Three unimodal branch networks are included in MAARS, each learning from a specific input modality: a 3D vision transformer (3D-ViT)29 for LGE-CMR images, a feedforward neural network (FNN) for EHR data and a second FNN for CIR data.
In the LGE-CMR branch, the image vector embeddings $\zeta_{\mathrm{CMR}}^{\,0}$ are obtained by dividing the original 3D image $X$ into $n$ flattened nonoverlapping 3D image patches $x_{i}$ and applying the operation
$$\zeta_{\mathrm{CMR}}^{\,0}=\left[z_{\mathrm{cls}},Ex_{1},Ex_{2},\ldots,Ex_{n}\right]+p$$
(1)
where $E$ is a linear projection, $z_{\mathrm{cls}}$ is a classification token (CLS token) and $p$ is a learnable positional embedding that retains positional information.
The image vector embeddings $\zeta_{\mathrm{CMR}}^{\,0}$ are then processed by a sequence of $L_{\mathrm{ViT}}$ transformer encoder blocks. Each transformer encoder block, $\zeta_{\mathrm{CMR}}^{\,l+1}=\mathrm{Transformer}\left(\zeta_{\mathrm{CMR}}^{\,l};\theta_{\mathrm{ViT}}^{l}\right)$, consists of two submodules: (1) a multihead self-attention (MSA) module and (2) a two-layer fully connected FNN.
$$\nu^{l}=\mathrm{MSA}\left(\mathrm{LN}\left(\zeta_{\mathrm{CMR}}^{\,l}\right)\right)+\zeta_{\mathrm{CMR}}^{\,l}$$
(2)
$$\zeta_{\mathrm{CMR}}^{\,l+1}=\mathrm{FNN}\left(\mathrm{LN}\left(\nu^{l}\right)\right)+\nu^{l}$$
(3)
where LN is the layer normalization operation. In the final transformer encoder block, the encoded CMR knowledge, ξCMR, is defined as
$$\zeta_{\mathrm{CMR}}^{\,L_{\mathrm{ViT}}}=\left[z_{\mathrm{cls}}^{\,L_{\mathrm{ViT}}},z_{1}^{\,L_{\mathrm{ViT}}},z_{2}^{\,L_{\mathrm{ViT}}},\ldots,z_{n}^{\,L_{\mathrm{ViT}}}\right]=\mathrm{Transformer}\left(\zeta_{\mathrm{CMR}}^{\,L_{\mathrm{ViT}}-1};\theta_{\mathrm{ViT}}^{L_{\mathrm{ViT}}-1}\right)$$
(4)
$$\xi_{\mathrm{CMR}}=\mathrm{LN}\left(z_{\mathrm{cls}}^{\,L_{\mathrm{ViT}}}\cdot W\right)$$
(5)
where W is a learnable matrix.
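Equations (2) and (3) describe a standard pre-norm transformer encoder block; a compact NumPy sketch for a single example is shown below (ReLU stands in for the activation, which the text does not specify, and all weight shapes are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN over the feature dimension (without learnable scale/shift, for brevity)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(x, Wq, Wk, Wv, Wo, heads):
    # Multihead self-attention over a sequence of n tokens of dimension d
    n, d = x.shape
    dh = d // heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for h in range(heads):
        s = slice(h * dh, (h + 1) * dh)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh))
        out[:, s] = att @ v[:, s]
    return out @ Wo

def encoder_block(z, params, heads=8):
    # Equation (2): pre-norm MSA with a residual connection
    nu = msa(layer_norm(z), *params["msa"], heads) + z
    # Equation (3): pre-norm two-layer FNN with a residual connection
    W1, b1, W2, b2 = params["fnn"]
    return np.maximum(layer_norm(nu) @ W1 + b1, 0) @ W2 + b2 + nu
```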
In the EHR and CIR branches, the processed EHR and CIR data are converted to vectors $\zeta_{\mathrm{EHR}}$ and $\zeta_{\mathrm{CIR}}$ and fed into two FNNs, whose outputs $\xi_{\mathrm{EHR}}$ and $\xi_{\mathrm{CIR}}$ represent the encoded EHR and CIR knowledge.
$$\xi_{\mathrm{EHR}}=\mathrm{FNN}\left(\zeta_{\mathrm{EHR}};\theta_{\mathrm{EHR}}\right)$$
(6)
$$\xi_{\mathrm{CIR}}=\mathrm{FNN}\left(\zeta_{\mathrm{CIR}};\theta_{\mathrm{CIR}}\right)$$
(7)
Multimodal fusion
Following knowledge encoding from the LGE-CMR, EHR and CIR subnetworks, we used a multimodal bottleneck transformer (MBT) consisting of multiple blocks to fuse the knowledge across modalities. MBT has demonstrated state-of-the-art performance in multimodal fusion tasks and has a light computational cost30. In each MBT block, the unimodal knowledge vectors concatenated with a shared fusion vector, $\xi_{\mathrm{fsn}}$, are fed into modality-specific transformers:
$$\left[\xi_{*}^{\,l+1},\hat{\xi}_{\mathrm{fsn},*}^{\,l+1}\right]=\mathrm{Transformer}\left(\left[\xi_{*}^{\,l},\xi_{\mathrm{fsn}}^{\,l}\right];\theta_{\mathrm{MBT},*}^{l}\right)$$
(8)
where $*\in\{\mathrm{CMR},\mathrm{EHR},\mathrm{CIR}\}$ indexes the modality. The fusion vector in layer $l+1$ is updated as follows:
$$\xi_{\mathrm{fsn}}^{\,l+1}=\mathrm{Avg}\left(\hat{\xi}_{\mathrm{fsn},*}^{\,l+1}\right)$$
(9)
The last MBT block outputs a predicted SCDA risk score p using the following equation:
$$p=\mathrm{sigmoid}\left(\left[\xi_{\mathrm{CMR}}^{\,L_{\mathrm{MBT}}},\xi_{\mathrm{EHR}}^{\,L_{\mathrm{MBT}}},\xi_{\mathrm{CIR}}^{\,L_{\mathrm{MBT}}}\right]\cdot W+b\right)$$
(10)
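One MBT block (equations (8) and (9)) can be sketched as follows; the per-modality transformer layers are represented by placeholder callables, since only the bottleneck fusion logic is illustrated here:

```python
import numpy as np

def mbt_fusion_step(xi, xi_fsn, transformers):
    """One MBT block, sketched with NumPy.

    xi: dict mapping modality name -> knowledge vector at layer l
    xi_fsn: shared fusion (bottleneck) vector at layer l
    transformers: dict of per-modality callables standing in for the
    modality-specific transformer layers (placeholders, not the real model)
    """
    new_xi, fsn_candidates = {}, []
    for m, vec in xi.items():
        # Equation (8): each modality processes [xi_m, xi_fsn] with its own transformer
        out = transformers[m](np.concatenate([vec, xi_fsn]))
        new_xi[m] = out[:len(vec)]
        fsn_candidates.append(out[len(vec):])
    # Equation (9): the fusion vector is the average over modalities
    return new_xi, np.mean(fsn_candidates, axis=0)
```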
Model training and implementation details
For patient i, their SCDA outcome yi is 1 if they experienced an SCDA event during the follow-up, and 0 otherwise. We adopted the balanced focal loss as the loss function54:
$$L=-\sum_{i}\alpha_{i}\,\left(y_{i}-p_{i}\right)^{\gamma}\log p_{i}$$
(11)
where αi is a class-dependent scaling factor, and γ is the focusing parameter that controls how strongly the model down-weights well-classified examples and prioritizes hard ones; γ was set to 2 in this study.
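For reference, a NumPy sketch of the balanced focal loss in the standard p_t formulation of Lin et al. (equation (11) is written in terms of the positive class; the class weight used below is an illustrative assumption, not the study's value):

```python
import numpy as np

def balanced_focal_loss(y, p, alpha_pos=0.75, gamma=2.0, eps=1e-7):
    """Balanced focal loss in the p_t formulation of the focal loss paper."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha = np.where(y == 1, alpha_pos, 1 - alpha_pos)  # class-dependent scaling
    # (1 - p_t)^gamma down-weights easy examples; gamma = 2 in this study
    return -np.sum(alpha * (1 - p_t) ** gamma * np.log(p_t))
```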
The LGE-CMR, EHR and CIR branch networks were first trained independently, and then MAARS was trained end-to-end with all the branch networks and the multimodal fusion module. All models were trained with a batch size of 64 and a maximum of 150 epochs with early stopping based on loss. The Adam optimizer was used, with β1 = 0.9, β2 = 0.999, and the learning rate was initially set at 1 × 10−3 for the LGE-CMR branch network, 1 × 10−2 for the EHR and CIR branch networks, and 3 × 10−2 for the multimodal fusion, and was adaptively adjusted during the training process. For the LGE-CMR branch network, the ViT has $L_{\mathrm{ViT}}=8$ transformer encoder blocks, eight heads for each attention module and dimension $d=512$. The EHR branch network used an FNN with two hidden layers and a latent dimension of 16. The CIR branch network used an FNN with one hidden layer and a latent dimension of 16. The encoded unimodal knowledge vectors have dimensions $\xi_{\mathrm{CMR}}\in\mathbb{R}^{32}$, $\xi_{\mathrm{EHR}}\in\mathbb{R}^{16}$ and $\xi_{\mathrm{CIR}}\in\mathbb{R}^{16}$. We set $L_{\mathrm{MBT}}=3$ and the bottleneck fusion vector dimension to 8.
Assessing model performance and clinical validation
Performance metrics
The values of metrics derived from the confusion matrix (BA, sensitivity and specificity) were computed at optimal probability decision thresholds selected to maximize Youden’s J statistic. When comparing the AI model’s performance to that of the clinical tools, we also adjusted the decision threshold by matching the sensitivities of the clinical tools to evaluate their specificities. All metrics were in the range of 0 to 1, with the baseline levels obtained by random chance being as follows: AUROC = 0.5, BA = 0.5, AUPRC = 0.03 and Bs = 0.25.
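Selecting the decision threshold that maximizes Youden's J can be computed directly from the ROC curve, for example:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, scores):
    # Youden's J = sensitivity + specificity - 1 = TPR - FPR;
    # pick the probability threshold where J is maximal
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]
```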
Internal and external validation
The internal model performance was assessed with fivefold cross-validation of the JHH-HCM cohort at the patient level, stratified by outcome. The training and test sets were split at the patient level; that is, all LGE-CMR scans from a given patient were present in either the training set or the validation set, never both. After the five training folds, the model’s performance metrics were calculated on the aggregated predictions from all validation folds.
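An outcome-stratified, patient-level fold assignment of this kind can be obtained with scikit-learn (one row per unique patient, so no patient's scans can straddle folds; function and variable names are ours):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def patient_level_folds(patient_ids, outcomes, n_splits=5, seed=0):
    """Assign unique patients to outcome-stratified validation folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    ids = np.asarray(patient_ids)
    folds = []
    for _, val_idx in skf.split(ids, outcomes):
        folds.append(set(ids[val_idx]))   # patients in this validation fold
    return folds
```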
For the external performance evaluation, we trained the model using the entire JHH-HCM dataset (with 90% as the training set and 10% as the development set) and tested the model’s performance on the SHVI-HCM cohort. Of note, the model for external validation inherited the same hyperparameters as the internal model.
Model interpretability
We interpreted the MAARS network weights and predictions using attribution- and attention-based methods.
Shapley value
The EHR and CIR branch networks were interpreted using the Shapley value, which quantifies the incremental attribution of every input feature to the final prediction. The Shapley value32 is based on cooperative game theory and explains a prediction as a coalitional game played by the feature values. The Shapley value satisfies a collection of desirable axioms, including efficiency, symmetry, dummy and additivity. In this study, the Shapley values were estimated using a permutation formulation implemented in SHAP55.
Attention rollout
For the LGE-CMR branch network, we used a technique called attention rollout to quantify attention flow from the input to the final layer of the ViT. Formally, $A^{l}$ denotes the attention matrix at transformer encoder block $l$, averaged over all attention heads. The residual connection at each block is modeled by adding the identity matrix $I$ to the attention matrix. Therefore, the attention rollout is recursively computed by
$$A_{\mathrm{Rollout}}^{l}=\left(A^{l}+I\right)\cdot A_{\mathrm{Rollout}}^{l-1}$$
(12)
We explained the predictions of the LGE-CMR branch network using the attention rollout at the end of the ViT, after flowing through $L_{\mathrm{ViT}}$ transformer blocks: $A_{\mathrm{Rollout}}^{L_{\mathrm{ViT}}}$.
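Equation (12) can be implemented in a few lines. Following the original attention-rollout formulation of Abnar and Zuidema, the rows are renormalized after adding the identity (a detail equation (12) leaves implicit):

```python
import numpy as np

def attention_rollout(attn_per_block):
    """Attention rollout across transformer blocks.

    attn_per_block: list of per-block attention matrices A^l, each already
    averaged over heads ((n+1) x (n+1), including the CLS token).
    """
    n = attn_per_block[0].shape[0]
    rollout = np.eye(n)                               # A_Rollout^0 = I
    for A in attn_per_block:
        A_res = A + np.eye(n)                         # model the residual connection
        A_res = A_res / A_res.sum(-1, keepdims=True)  # renormalize rows to sum to 1
        rollout = A_res @ rollout                     # recursive update, equation (12)
    return rollout
```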
Statistical analysis
The P values of clinical covariates between the internal and external cohorts were based on a two-sample Welch’s t-test for numerical variables and the Mann–Whitney U test for categorical variables before data imputation. Kolmogorov–Smirnov tests for the risk score distributions were based on the aggregated predictions on all internal validation folds. The means and CIs of model performance metrics in the internal fivefold cross-validation were estimated using 200 bootstrapping samples of the aggregated predictions on all validation folds. The performance metrics in the external validation were calculated using model predictions on 200 bootstrapping resampled datasets of the SHVI-HCM cohort. The computations were based on the bias-corrected and accelerated bootstrap method. Pearson’s r for clinical covariates in the network interpretations was based on aggregated interpretations from all internal validation folds.
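SciPy implements the bias-corrected and accelerated bootstrap directly; a sketch of a metric confidence interval computed with 200 resamples (the metric callable and seed below are illustrative, not the study's code):

```python
import numpy as np
from scipy.stats import bootstrap

def metric_ci(y_true, y_score, metric, n_resamples=200, seed=0):
    """BCa bootstrap CI for a performance metric, mirroring the
    200-resample procedure described above."""
    res = bootstrap((y_true, y_score), metric, paired=True, vectorized=False,
                    n_resamples=n_resamples, method="BCa",
                    random_state=np.random.default_rng(seed))
    return res.confidence_interval
```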
Computational hardware and software
MAARS was built in Python 3.9 using packages including PyTorch 2.0, NumPy 1.23.5, Pandas 1.5.3, SciPy 1.10, scikit-learn 1.2.0, scikit-image 0.19.3, pydicom 2.3, SimpleITK 2.2.1 and SHAP 0.41. Data preprocessing, model training and result analysis were performed on a machine with an AMD Ryzen Threadripper 1920X 12-core CPU and NVIDIA TITAN RTX GPUs, and on the Rockfish cluster at Johns Hopkins University using NVIDIA A100 GPU nodes, with NVIDIA software CUDA 11.7 and cuDNN 8.5. For a reference of the computational requirements of MAARS inference, on a machine with an AMD Ryzen 2700X 8-core CPU and an NVIDIA GeForce RTX 2060 GPU, the average processing time for inference is 0.034 s per patient using GPU or 0.086 s per patient using solely CPU.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.