Ethics statement
The derivation cohort included pre-hospital data from the City of Pittsburgh Bureau of Emergency Medical Services and in-hospital data from three tertiary care hospitals within the University of Pittsburgh Medical Center (UPMC) healthcare system: UPMC Presbyterian Hospital, UPMC Shadyside Hospital and UPMC Mercy Hospital (Pittsburgh, Pennsylvania). All consecutive eligible patients were recruited under a waiver of informed consent. This observational trial was approved by the institutional review board of the University of Pittsburgh and was registered at https://www.clinicaltrials.gov/ (identifier NCT04237688). The analyses described in this paper were pre-specified in the trial protocol, which was funded by the National Institutes of Health. The first external validation cohort included data from Orange County Emergency Medical Services (Chapel Hill, North Carolina). This study enrolled eligible patients with active informed consent and was approved by the institutional review board of the University of North Carolina at Chapel Hill. The second external validation cohort included data from Mecklenburg County Emergency Medical Services and Atrium Health (Charlotte, North Carolina). Data were collected through a healthcare registry, and all consecutive eligible patients were enrolled under a waiver of informed consent. This study was also approved by the institutional review board of the University of North Carolina at Chapel Hill. These two external datasets were collected by the same local investigative team and were similar in terms of age, sex and disease prevalence; thus, we combined them into a single cohort for external validation purposes.
Study design and data collection
This was a prospective, observational cohort study. The methods for each study cohort have been described in detail elsewhere57,58. All study cohorts enrolled adult patients with an emergency call for non-traumatic chest pain or anginal equivalent symptoms (arm, shoulder or jaw pain, shortness of breath, diaphoresis or syncope). Eligible patients were transported by ambulance and had at least one recorded pre-hospital 12-lead ECG. There were no selective exclusion criteria based on sex, race, comorbidities or acuity of illness. For this pre-specified analysis, we included only non-duplicate ECGs from unique patient encounters, and we removed patients whose pre-hospital ECGs showed ventricular tachycardia or ventricular fibrillation (because these patients are managed by ACLS algorithms). We also removed patients with confirmed pre-hospital STEMI, which included a machine-generated ***ACUTE MI*** warning, EMS documentation of STEMI or a medical consult for potential catheterization laboratory activation.
Independent reviewers extracted data elements from hospital systems for all patients meeting eligibility criteria. If a pre-hospital ECG had no patient identifiers, we used a probabilistic matching approach to link each encounter with the correct hospital record. This previously validated data linkage protocol was based on the ECG-stamped birth date, sex and date/time logs, as well as on EMS dispatch logs and receiving hospital records. All probabilistic matches were manually reviewed by research specialists for accuracy. The match success rate ranged from 98.6% to 99.8%.
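A minimal sketch of the probabilistic scoring idea is given below. The field names, weights and acceptance threshold are hypothetical illustrations, not the study's actual matching rules, which also incorporated EMS dispatch logs, receiving hospital records and manual review.

```python
def match_score(ecg_row, hosp_row, max_gap_hours=6):
    """Score agreement between one pre-hospital ECG and one candidate hospital record.
    Weights and time window are illustrative only."""
    score = 0.0
    if ecg_row["birth_date"] == hosp_row["birth_date"]:
        score += 0.5
    if ecg_row["sex"] == hosp_row["sex"]:
        score += 0.2
    gap_h = abs((hosp_row["arrival_time"] - ecg_row["acquisition_time"]).total_seconds()) / 3600.0
    if gap_h <= max_gap_hours:
        score += 0.3 * (1.0 - gap_h / max_gap_hours)  # closer in time -> higher score
    return score

def link_encounter(ecg_row, hospital_df, threshold=0.8):
    """Return the best-scoring hospital record above threshold (hospital_df is a pandas
    DataFrame of candidate records); otherwise return None and flag for manual review."""
    scores = hospital_df.apply(lambda row: match_score(ecg_row, row), axis=1)
    best_idx = scores.idxmax()
    return hospital_df.loc[best_idx] if scores[best_idx] >= threshold else None
```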
Clinical outcomes
Adjudications were made by independent reviewers at each local site after reviewing all available medical records within 30 d of the index encounter. Reviewers were blinded to all ECG analyses and model predictions. OMI was defined as coronary angiographic evidence of an acute culprit lesion in at least one of the three main coronary arteries (left anterior descending (LAD), left circumflex (LCX) and right coronary artery (RCA)) or their primary branches with TIMI flow grade of 0–1. TIMI flow grade of 2 with severe coronary narrowing (>70%) and peak troponin of 5.0–10.0 ng ml−1 was also considered indicative of OMI17,21. These adjudications were made by two independent reviewers; the kappa coefficient between the two reviewers was 0.771 (that is, substantial agreement). All disagreements were resolved by a third reviewer.
ACS was defined per the Fourth Universal Definition of Myocardial Infarction as the presence of symptoms of ischemia (that is, diffuse discomfort in the chest, upper extremity, jaw or epigastric area for more than 20 min) and at least one of the following criteria: (1) subsequent development of labile, ischemic ECG changes (for example, ST changes and T wave inversion) during hospitalization; (2) elevation of cardiac troponin (that is, >99th percentile) during the hospital stay with a rise and/or fall on serial testing; (3) coronary angiography demonstrating greater than 70% stenosis, with or without treatment; and/or (4) functional cardiac evaluation (stress testing) demonstrating ECG, echocardiographic or radionuclide evidence of focal cardiac ischemia5. Patients with type 2 myocardial infarction or pre-existing subacute coronary occlusion were labeled as negative for ACS and OMI. This group included around 10% of patients with positive troponin but no rise and/or fall in concentration on serial testing (that is, chronic leak) or with troponin leak attributed to conditions other than coronary occlusion, such as pericarditis. On a randomly selected subset of patients (n = 1,209), the kappa coefficient for ACS adjudication ranged from 0.846 to 0.916 (that is, almost perfect agreement).
ECG methods
Pre-hospital ECGs were obtained in the field by paramedics as part of routine care. ECGs were acquired using either HeartStart MRx (Philips Healthcare) or LIFEPAK 15 (Physio-Control) monitor–defibrillator devices. All digital 12-lead ECGs were acquired at a sampling rate of 500 samples per second (bandwidth, 0.05–150 Hz) and transmitted to the respective EMS agency and receiving hospital. Digital ECG files were exported in .xml format and stored on a secondary server at each local site. ECG images were de-identified and manually annotated by independent reviewers or research specialists; ECGs with poor quality or missing leads were removed from the study. Next, digital .xml files were transmitted to the Philips Advanced Algorithm Research Center (Cambridge, Massachusetts) for offline analysis.
ECG featurization was described in detail elsewhere18. In brief, ECG signal pre-processing and feature extraction were performed using manufacturer-specific software (Philips DXL diagnostic 12/16 lead ECG analysis program). ECG signals were first pre-processed to remove noise, artifacts and baseline wander. Ectopic beats were removed, and representative median beats were calculated for each lead. Median beats refer to the representative average (or median) of the sequential beats in a given ECG lead after temporal alignment of R peaks. Next, we used the root mean square (RMS) signal to identify global waveform fiducials, including the onset, offset and peak of the P wave, QRS complex and T wave. Lead-specific fiducials were then identified to further segment individual waveforms into Q, R, R′, S, S′ and J point.
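As a minimal illustration (not the manufacturer's proprietary algorithm), median-beat construction and the RMS signal could be computed as sketched below, assuming beats have already been detected, aligned on the R peak and screened for ectopy; the QRS-bound helper is a toy example of how the RMS signal can be used, with an illustrative threshold.

```python
import numpy as np

def median_beat(aligned_beats):
    """aligned_beats: array of shape (n_beats, n_samples) for one lead, with R peaks
    aligned to the same sample index and ectopic beats already removed.
    Returns the sample-wise median (representative) beat."""
    return np.median(aligned_beats, axis=0)

def rms_signal(median_beats_by_lead):
    """median_beats_by_lead: array of shape (n_leads, n_samples).
    The root mean square across leads emphasizes global waveform boundaries
    (onsets, offsets and peaks of the P wave, QRS complex and T wave)."""
    return np.sqrt(np.mean(median_beats_by_lead ** 2, axis=0))

def crude_qrs_bounds(rms, frac=0.15):
    """Crude global QRS onset/offset: first and last samples where the RMS signal
    exceeds a fraction of its maximum (illustrative threshold only)."""
    above = np.where(rms >= frac * rms.max())[0]
    return above[0], above[-1]
```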
We then computed a total of 554 ECG features based on (1) the amplitude, duration, area, slope and/or concavity of global and lead-specific waveforms; (2) the QRS and T axes and angles in the frontal, horizontal, spatial, x–y, x–z and y–z planes, including directions at peak, inflection point and initial/terminal loops; (3) eigenvalues of the principal components of orthogonal ECG leads (I, II and V1–V6), including PCA ratios for individual ECG waveform segments; and (4) T loop morphology descriptors. Features with zero variance were removed to prevent representation bias.
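For illustration only, two representative features (a lead-specific ST amplitude and the frontal-plane QRS axis) could be computed from a median beat as sketched below, assuming known fiducial indices and a 500-Hz sampling rate; the full 554-feature set was computed by the proprietary Philips DXL software.

```python
import numpy as np

FS = 500  # sampling rate in samples per second

def st_amplitude(median_beat, j_idx, pr_idx, ms_after_j=80):
    """ST-segment amplitude measured 80 ms after the J point, referenced to the
    isoelectric PR-segment baseline (same units as the input signal)."""
    st_idx = j_idx + int(ms_after_j * FS / 1000)
    return median_beat[st_idx] - median_beat[pr_idx]

def frontal_qrs_axis(net_qrs_lead_i, net_qrs_lead_avf):
    """Approximate frontal-plane QRS axis (degrees) from the net QRS amplitudes
    (or areas) in leads I and aVF."""
    return np.degrees(np.arctan2(net_qrs_lead_avf, net_qrs_lead_i))
```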
We previously identified an optimal, parsimonious list of the most important ECG features that are mechanistically linked to cardiac ischemia, as described in detail elsewhere18. In brief, to prevent omitted feature bias, we used a hybrid approach that combines domain knowledge with a data-driven strategy. First, clinical scientists identified 24 classical features that are known to correlate with cardiac ischemia (that is, lead-specific ST and T wave amplitudes). Next, starting with the comprehensive list of 554 candidate features, we used data-driven algorithms (for example, recursive feature elimination and LASSO) to identify 198 supplemental features potentially related to ischemia. LASSO selects features with non-zero coefficients after L1 norm regularization, and recursive feature elimination uses repeated regression iterations to identify the features that have a significant impact on model predictions. We then examined the feature pairs in this expanded list of 222 features and removed features with very high collinearity that contained redundant information (for example, we kept only QTc when both QT and QTc were selected by the model). Finally, we used feature importance ranking to identify the most parsimonious subset of features that are complementary and can boost classification performance. This hybrid approach eventually yielded a subset of 73 features that can serve as plausible markers of ischemia18.
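The data-driven arm of this approach could be sketched as follows with scikit-learn, assuming a feature DataFrame X and a binary outcome vector y; the regularization strength, number of features retained by RFE and collinearity cutoff are illustrative assumptions, and the published 73-feature list also reflects domain knowledge and manual review.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# LASSO-style selection: keep features with non-zero coefficients after L1 regularization
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=5000).fit(X, y)
lasso_keep = set(X.columns[np.abs(lasso.coef_).ravel() > 0])

# Recursive feature elimination with a plain logistic regression base estimator
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=100).fit(X, y)
rfe_keep = set(X.columns[rfe.support_])

candidates = sorted(lasso_keep | rfe_keep)

# Drop one feature from each highly collinear pair (|r| > 0.95 is an illustrative cutoff)
corr = X[candidates].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = {col for col in upper.columns if (upper[col] > 0.95).any()}
selected = [f for f in candidates if f not in redundant]
```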
Machine learning methods
We followed best practices recommended by the 'ROBUST-ML' and 'ECG-AI stress test' checklists to design and benchmark our machine learning algorithms51,59. To prevent measurement bias, ECG features were manually reviewed to identify erroneous calculations. Physiologically plausible outliers were retained but winsorized (truncated) at ±3 s.d. On average, each feature had a 0.34% missingness rate (range, 0.1–1.6%); given this low rate, we imputed missing values with the mean, median or mode of that feature after consultation with clinical experts. ECG metrics were then z-score normalized and used as input features in machine learning models. The derivation and validation datasets were cleaned independently to prevent data leakage. Both cohorts were recruited over the same time window, minimizing the risk of temporal bias. To prevent potential mismatch with the intended use, input features for model development included only ECG data plus the machine-stamped age; no other clinical data were used for model building.
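A minimal sketch of these cleaning steps is shown below, assuming a pandas DataFrame X of ECG features; variable names are illustrative, and the derivation and external validation cohorts were cleaned separately as described above.

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Winsorize each feature at ±3 s.d. around its mean
mu, sd = X.mean(), X.std()
X_clipped = X.clip(lower=mu - 3 * sd, upper=mu + 3 * sd, axis=1)

# Impute sparse missing values (median shown; mean/mode were used for selected features)
# and z-score normalize before model fitting
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_ready = scaler.fit_transform(imputer.fit_transform(X_clipped))
```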
We randomly split the derivation cohort into an 80% training set and a 20% internal testing set. On the training set, we fit 10 machine learning classifiers: regularized logistic regression, linear discriminant analysis, support vector machine (SVM), Gaussian naive Bayes, RF, gradient boosting machine, extreme gradient boosting, stochastic gradient descent logistic regression, k-nearest neighbors and artificial neural networks. Each classifier was optimized using 10-fold cross-validation to fine-tune hyperparameters. After selecting optimal hyperparameters, models were retrained on the entire training subset to derive final weights and create a locked model for evaluation on the hold-out test set. We calibrated our classifiers to produce a probabilistic output that can be interpreted as a confidence level (probability risk score). Trained models were compared using the AUROC, with the Wilcoxon signed-rank test for pairwise comparisons. ROC-optimized cutoffs were chosen using the Youden index, and confusion matrix classifications were compared using McNemar's test.
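As an illustration of this pipeline for one of the 10 classifiers (the RF model), a sketch is shown below, assuming preprocessed arrays X_train/y_train (80% split) and X_test/y_test (20% split); the hyperparameter grid and isotonic calibration are assumptions, not the exact settings used in the study.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning with 10-fold cross-validation (grid values are illustrative)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [200, 500], "max_depth": [None, 10, 20]},
    cv=10, scoring="roc_auc", n_jobs=-1,
).fit(X_train, y_train)

# Calibrate the tuned model so its output can be read as a probability risk score
model = CalibratedClassifierCV(grid.best_estimator_, method="isotonic", cv=10).fit(X_train, y_train)

# Evaluate the locked model on the hold-out test set
scores = model.predict_proba(X_test)[:, 1]
print("Test AUROC:", roc_auc_score(y_test, scores))

# ROC-optimized cutoff via the Youden index (sensitivity + specificity - 1)
fpr, tpr, thresholds = roc_curve(y_test, scores)
youden_cutoff = thresholds[np.argmax(tpr - fpr)]
```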
The RF classifier achieved high accuracy on the training set (low bias) with a relatively small drop in performance on the test set (low variance), indicating an acceptable bias–variance tradeoff and low risk of overfitting (Extended Data Fig. 8). Although the SVM model had lower variance on the test set, there were no significant differences between the SVM and RF models in AUROC (DeLong's test) or in their binary classifications (McNemar's test). Moreover, there were no differences between the RF and SVM models in terms of Kolmogorov–Smirnov goodness of fit (0.716 versus 0.715) or the Gini purity index (0.82 versus 0.85). Given its scalability and intuitive architecture, we chose the probability output of the RF model to build our derived OMI score. We generated density plots of these probability scores for the positive and negative classes and selected classification thresholds for low-risk, intermediate-risk and high-risk groups based on a pre-specified NPV > 0.99 and true-positive rate > 0.50. Finally, we used the locked RF classifier to generate probability scores and risk classes on the completely unseen external validation cohort. The code to generate probability scores is included with the supplementary materials of this manuscript.
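A minimal sketch of how such thresholds could be selected from the calibrated probability scores is given below, assuming NumPy arrays y_true (adjudicated OMI labels) and scores (RF probability output); the search strategy here is illustrative rather than the study's exact procedure.

```python
import numpy as np

def low_risk_cutoff(y_true, scores, npv_target=0.99):
    """Largest cutoff t such that patients with score < t have NPV > npv_target."""
    best = 0.0
    for t in np.unique(scores):
        below = scores < t
        if below.any() and np.mean(y_true[below] == 0) > npv_target:
            best = t
    return best

def high_risk_cutoff(y_true, scores, tpr_target=0.50):
    """Cutoff such that patients with score >= cutoff capture > tpr_target of true positives."""
    return np.quantile(scores[y_true == 1], 1 - tpr_target)

low_cut = low_risk_cutoff(y_true, scores)
high_cut = high_risk_cutoff(y_true, scores)
risk_class = np.select([scores < low_cut, scores >= high_cut],
                       ["low", "high"], default="intermediate")
```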
Reference standard
To reduce the risk of evaluation bias, we benchmarked our machine learning models against multiple reference standards used during routine care in clinical practice. First, we used a commercial, FDA-approved ECG interpretation software (Philips DXL diagnostic algorithm) to denote the likelihood of ischemic myocardial injury. This likelihood (yes/no) was based on a composite of the following: (1) diagnostic codes for '>>>Acute MI<<<', including descriptive statements that denote 'acute', 'recent', 'age indeterminate', 'possible' or 'probable'; and (2) diagnostic codes for '>>>Acute Ischemia<<<', including descriptive statements that denote 'possible', 'probable' or 'consider'. Diagnostic statements that denoted 'old' [infarct], 'nonspecific' [ST depression] or 'secondary to' [LVH or high heart rate] were excluded from this composite reference standard.
We also used practicing clinicians' overread of ECGs to denote the likelihood of ischemic myocardial injury on a given ECG (yes/no) when no STEMI pattern is present, which is congruent with how emergency department physicians evaluate these patients in clinical practice. Independent physician reviewers annotated each 12-lead ECG image as per the Fourth Universal Definition of Myocardial Infarction criteria5, including two contiguous leads with ST elevation (≥0.2 mV for V2–V3 in men ≥40 years of age and ≥0.25 mV for V2–V3 in men <40 years of age; ≥0.15 mV for V2–V3 in women; or ≥0.1 mV in other leads) or ST depression (new horizontal or downsloping depression ≥0.05 mV), with or without T wave inversion (>0.1 mV in leads with a prominent R wave or R/S ratio > 1). Reviewers were also prompted to use their clinical judgment to identify highly suspicious ischemic changes (for example, reciprocal changes and hyperacute T waves) and to account for potential confounders (for example, BBBs and early repolarization). On a randomly selected subset of patients in the derivation cohort (n = 1,646), the kappa coefficient between the two emergency physicians who interpreted the ECGs was 0.568 (that is, moderate agreement). A third reviewer adjudicated discrepancies on this randomly selected subset. Similarly, on a randomly selected subset of patients in the external validation cohort (n = 375), the kappa coefficient between the two board-certified cardiologists who interpreted the ECGs was 0.690 (that is, substantial agreement).
Finally, given that clinicians largely depend on risk scores to triage patients in the absence of STEMI, which greatly affects how patients with OMI are diagnosed and treated in clinical practice, we compared our derived OMI risk score against the HEART score. This score is commonly used in US hospitals and has been well validated for triaging patients in the emergency department60. The HEART score is based on the patient's history at presentation, ECG interpretation, age, risk factors and initial troponin values (range, 0–10), and it places patients in low-risk (0–3), intermediate-risk (4–6) and high-risk (7–10) groups. Given that troponin results are not usually available at first medical contact, we used a modified HEAR score that drops the troponin component and has also been previously validated for use by paramedics before hospital arrival36. This comparison focused on establishing the incremental gain of using the derived OMI score over routine care at initial triage; specifically, we compared how the risk classes assigned by the derived OMI score agreed with or differed from those assigned by the HEART score.
Statistical analysis
Descriptive statistics were reported as mean ± s.d. or n (%). Missing data were assessed for randomness and handled during ECG feature selection (see ‘Machine learning methods’ subsection above). Normality of distribution was assessed before hypothesis testing where deemed necessary. ECG features were z-score normalized as part of standard input architectures for machine learning models. Comparisons between cohorts were performed using the chi-square test (for discrete variables) and independent samples t-test or the Mann–Whitney U-test (for continuous variables). The level of significance was set at an alpha of 0.05 for two-tailed hypothesis testing where applicable.
All diagnostic accuracy values were reported as per Standards for Reporting Diagnostic Accuracy Studies (STARD) recommendations. We reported classification performance using the AUROC, sensitivity (recall), specificity, PPV (precision) and NPV, along with 95% CIs where applicable. For 10-fold cross-validation, we compared the multiple classifiers using the Wilcoxon signed-rank test (for AUROC curves) and McNemar's test (for confusion matrices). We derived low-risk, intermediate-risk and high-risk categories for the final classifier using kernel density estimates of the probability scores between classes. The adequacy of these risk classes was evaluated using the log-rank chi-square test of cumulative risk for clinically important outcomes over the length of stay during the index admission.
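A minimal sketch of how these accuracy metrics and bootstrap 95% CIs could be computed is shown below, assuming NumPy arrays y_true (adjudicated outcome) and y_pred (binary classification at the chosen cutoff).

```python
import numpy as np

def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV from binary labels and predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp), "npv": tn / (tn + fn)}

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Percentile 95% CI for one metric from n_boot resamples with replacement."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        estimates.append(diagnostic_metrics(y_true[idx], y_pred[idx])[metric])
    return np.percentile(estimates, [2.5, 97.5])
```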
For assessing the incremental gain in classification performance, we compared the AUROC of the final model against the reference standards using DeLong's test. For ease of comparison, the confidence bounds for the AUROC of the reference standards (commercial system and practicing clinicians) were generated using 1,000 bootstrap samples. To place the incremental gain in the broader context of the clinical workflow, we also computed the NRI of our model against the HEAR score during the initial assessment at first medical contact, given that risk scores are an integral part of the clinical workflow for patients with suspected ACS who do not meet STEMI criteria. As per STARD recommendations, the NRI evaluates the net gain in correct up-triage and down-triage when the risk class assignments of an 'old' test (the HEAR score) are reclassified by a 'new' test (the derived OMI score).
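A minimal sketch of the categorical NRI calculation is shown below, assuming per-patient risk classes from the HEAR score and the derived OMI score and a NumPy outcome array (class labels and variable names are illustrative).

```python
import numpy as np

ORDER = {"low": 0, "intermediate": 1, "high": 2}

def categorical_nri(old_class, new_class, y):
    """Categorical NRI comparing risk classes from an 'old' test (for example, HEAR score)
    and a 'new' test (for example, derived OMI score); y is a NumPy array with
    1 = OMI and 0 = no OMI."""
    old = np.array([ORDER[c] for c in old_class])
    new = np.array([ORDER[c] for c in new_class])
    up, down = new > old, new < old
    events, nonevents = y == 1, y == 0
    nri_events = up[events].mean() - down[events].mean()            # net correct up-triage of events
    nri_nonevents = down[nonevents].mean() - up[nonevents].mean()   # net correct down-triage of non-events
    return nri_events + nri_nonevents, nri_events, nri_nonevents
```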
We used logistic regression to identify the independent predictive value of the OMI risk classes. Variables that were significant in univariable analysis were entered into multivariable models built with stepwise backward selection using Wald chi-square criteria. We reported ORs with 95% CIs for all significant predictors. All analyses were completed using Python version 3.8.5 and SPSS version 24.
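A minimal sketch of backward selection on Wald test p-values is shown below in Python (statsmodels) for illustration; the predictor names are hypothetical, and the study's multivariable models may have been fit in SPSS.

```python
import numpy as np
import statsmodels.api as sm

def backward_select(df, outcome, predictors, p_remove=0.05):
    """Backward stepwise logistic regression: iteratively drop the predictor with the
    largest Wald test p-value until all remaining predictors have p <= p_remove."""
    current = list(predictors)
    fit = None
    while current:
        X = sm.add_constant(df[current])
        fit = sm.Logit(df[outcome], X).fit(disp=0)
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= p_remove:
            break
        current.remove(worst)
    return current, fit

# Hypothetical predictors significant in univariable analysis
selected, final_fit = backward_select(df, "omi", ["omi_risk_intermediate", "omi_risk_high", "age", "male_sex"])
odds_ratios = np.exp(final_fit.params.drop("const"))   # ORs for retained predictors
conf_int = np.exp(final_fit.conf_int().drop("const"))  # 95% CIs for the ORs
```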
Reporting Summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
