Electrocardiogram-based artificial intelligence for the diagnosis of heart failure: a systematic review and meta-analysis


Journal of Geriatric Cardiology, 2022, Issue 12

Xin-Mu LI, Xin-Yi GAO, Gary Tse, Shen-Da HONG, Kang-Yin CHEN, Guang-Ping LI, Tong LIU

1. Tianjin Key Laboratory of Ionic-Molecular Function of Cardiovascular Disease, Department of Cardiology, Tianjin Institute of Cardiology, Second Hospital of Tianjin Medical University, Tianjin, China; 2. Kent and Medway Medical School, Canterbury, United Kingdom; 3. National Institute of Health Data Science at Peking University, Peking University, Beijing, China; 4. Institute of Medical Technology, Peking University Health Science Center, Beijing, China

ABSTRACT BACKGROUND The electrocardiogram (ECG) is an inexpensive and easily accessible investigation for the diagnosis of cardiovascular diseases including heart failure (HF). The application of artificial intelligence (AI) has contributed to clinical practice in terms of aiding diagnosis, prognosis, risk stratification and guiding clinical management. The aim of this study is to systematically review and perform a meta-analysis of published studies on the application of AI for HF detection based on the ECG. METHODS We searched Embase, PubMed and Web of Science databases to identify literature using AI for HF detection based on ECG data. The quality of included studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) criteria. Random-effects models were used for calculating the effect estimates and hierarchical receiver operating characteristic curves were plotted. Subgroup analysis was performed. Heterogeneity and the risk of bias were also assessed. RESULTS A total of 11 studies including 104,737 subjects were included. The area under the curve for HF diagnosis was 0.986, with a corresponding pooled sensitivity of 0.95 (95% CI: 0.86-0.98), specificity of 0.98 (95% CI: 0.95-0.99) and diagnostic odds ratio of 831.51 (95% CI: 127.85-5407.74). In the patient selection domain of QUADAS-2, eight studies were designated as high risk. CONCLUSIONS According to the available evidence, the incorporation of AI can aid the diagnosis of HF. However, there is heterogeneity among machine learning algorithms and improvements are required in terms of quality and study design.

Heart failure (HF) is a complex clinical syndrome caused by structural or functional cardiac abnormalities of various etiologies, resulting in ventricular dysfunction and reduced cardiac output that fails to meet the needs of peripheral tissues.[1,2] There are currently 64.3 million people with HF worldwide; the prevalence in the general adult population of developed countries is 1%-2% and increases with age.[3] As a serious manifestation or late stage of various heart diseases, HF is significantly associated with higher mortality and rehospitalization, places an increasing burden on healthcare systems, and has become a public health issue.[4,5]

Early diagnosis and intervention are important for the management of HF, leading to better prognosis and lower HF-related expenditure. However, the symptoms and signs obtained from history and physical examination, although indispensable, are usually nonspecific. Being noninvasive, readily available and inexpensive, the electrocardiogram (ECG) is an essential investigation in cardiovascular medicine. Given that pathophysiological changes of the heart alter cardiac electrical activity, the rational use of ECG data may provide opportunities for early detection of HF.

In the past decades, the rapid development and widespread application of artificial intelligence (AI) in medicine has ushered in a new era of disease diagnosis and management. AI can process data efficiently for specific goals and flexible tasks, and can thereby provide clinicians with reliable, evidence-based advice for clinical decision-making. AI mainly includes machine learning (ML) and deep learning (DL); DL consists of multiple processing layers and is capable of processing more complex data.[6] The application of AI has contributed to clinical practice in terms of aiding diagnosis,[7] early detection,[8] predicting prognosis,[9,10] risk stratification,[11] and guiding clinical management.[12] One specific area of interest is the combination of AI with the ECG for the detection of HF, which may overcome some difficulties in clinical diagnosis.[13] The aim of this systematic review and meta-analysis was to evaluate the diagnostic performance of ECG-based AI in people with HF.

METHODS

This systematic review and meta-analysis was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) recommendations.[14]

Search Strategy and Study Selection

Embase, PubMed and Web of Science databases were systematically searched, from inception up to January 17, 2022, to identify original literature that evaluated the diagnostic accuracy of AI algorithms using ECG data for people with HF. The following search terms were used: ((heart failure) OR (left ventricle dysfunction)) AND ((ECG) OR (electrocardiogram)) AND ((Machine Learning) OR (Artificial Intelligence) OR (Neural Networks) OR (Support Vector Machine) OR (Naive Bayes)). The search fields were title and abstract. The references of all relevant articles were reviewed to identify any other appropriate articles. The search was limited to publications written in English.

Studies that met the following criteria were included: (1) involved people with HF or left ventricular dysfunction [left ventricular ejection fraction (LVEF) ≤ 35%]; (2) used an AI algorithm applied to ECG data as the index test; and (3) reported diagnostic accuracy measures, or data from which they could be extracted, i.e., true positives (TP, clinically diagnosed HF predicted as HF by AI), false positives (FP, clinically diagnosed non-HF predicted as HF by AI), true negatives (TN, clinically diagnosed non-HF predicted as non-HF by AI), and false negatives (FN, clinically diagnosed HF missed by AI or predicted as non-HF).

Studies satisfying the following criteria were excluded: (1) letters, editorials, conference abstracts, systematic reviews, meta-analyses, consensus statements and guidelines; (2) articles not related to the research topic; and (3) articles without sufficient data to construct a 2 × 2 confusion matrix (i.e., TP, FP, TN, and FN).
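As an illustration of how the 2 × 2 confusion matrix yields the accuracy measures pooled in this review, the following minimal Python sketch computes sensitivity, specificity and the diagnostic odds ratio from TP, FP, TN and FN. The class name, the example counts and the 0.5 continuity correction for zero cells are assumptions made for illustration; the review does not state how zero cells were handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """2 x 2 table for one study: AI prediction vs. clinical diagnosis."""
    tp: int  # clinically HF, predicted as HF
    fp: int  # clinically non-HF, predicted as HF
    tn: int  # clinically non-HF, predicted as non-HF
    fn: int  # clinically HF, predicted as non-HF

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def diagnostic_odds_ratio(self, correction: float = 0.5) -> float:
        cells = [self.tp, self.fp, self.tn, self.fn]
        # Assumption: add 0.5 to every cell when any cell is zero,
        # so that the ratio stays finite.
        if 0 in cells:
            cells = [c + correction for c in cells]
        tp, fp, tn, fn = cells
        return (tp * tn) / (fp * fn)


# Hypothetical counts, for illustration only (not from any included study).
example = ConfusionMatrix(tp=90, fp=5, tn=95, fn=10)
print(example.sensitivity(), example.specificity(),
      example.diagnostic_odds_ratio())
```

A study that reports only accuracy or AUC cannot be converted into these four counts, which is why such articles fall under exclusion criterion (3).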

Two review authors independently performed the literature search and title/abstract screening, and articles not related to the research topic were excluded. The remaining articles were screened for eligibility by analyzing the full text. Disagreements were resolved by an independent cardiologist. The reasons for exclusion were recorded. Results of the literature search are shown in the PRISMA flowchart (Figure 1).

Figure 1 Flow diagram of the study selection process.

Data Extraction and Study Quality Assessment

The following data were extracted from the included studies: authors, year of publication, TP, FP, TN, FN, type of analyzed ECG data [heart rate variability (HRV) or ECG signals or raw ECG], type of ML model (DL or classical ML), type of dataset, definition of HF (congestive HF or left ventricular systolic dysfunction), sizes of ECG samples, number of patients and country of origin. Herein, DL studies were defined as those applying deep neural networks (e.g., convolutional neural networks) as their main algorithm; all other studies were classified as classical ML.[15]

Risk of bias and quality were assessed following QUADAS-2.[16] Two categories, risk of bias and concerns regarding applicability, were assessed in the three domains of patient selection, index test and reference standard, with risk of bias additionally assessed in the flow and timing domain. For the assessment of risk of bias, we set the following criteria for each of the four domains: (1) when the answer to every question is “yes”, the overall bias risk of the domain is “low”; (2) when the answer to any question is “no”, there is a bias risk, and the overall bias risk of the domain is “high”; (3) the domain is rated “unclear” when the data reported are insufficient to make a judgment; (4) when any domain is high risk, the overall bias risk of the study is “high”; and (5) only when the bias risk of one domain is unclear, the overall bias risk of the study is “unclear”. Following the recommendation of the QUADAS-2 tool, the clinical applicability of each study was scored by evaluating whether it matched the concerns of our review, and rated as “low”, “high” or “unclear”. Two reviewers independently performed the data extraction and quality assessment. Disagreements were resolved through discussion and independent assessment by another researcher to reach consensus. The final study quality was classified as low risk of bias, high risk of bias, or unclear.
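The risk-of-bias rules described above can be expressed as a small decision procedure. The Python sketch below is one possible reading, not the authors' implementation: the function names are hypothetical, and rule (5) is interpreted as "no domain is high and at least one domain is unclear", which is an assumption.

```python
def domain_risk(answers):
    """Rate one QUADAS-2 domain from its signalling-question answers.

    answers: list of "yes" / "no" / "unclear" strings.
    """
    if any(a == "no" for a in answers):
        return "high"      # rule (2): any "no" makes the domain high risk
    if all(a == "yes" for a in answers):
        return "low"       # rule (1): every answer is "yes"
    return "unclear"       # rule (3): insufficient information


def overall_risk(domain_risks):
    """Combine the four domain ratings into one study-level rating."""
    if "high" in domain_risks:
        return "high"      # rule (4): any high-risk domain
    if "unclear" in domain_risks:
        return "unclear"   # rule (5), as interpreted in this sketch
    return "low"


# Hypothetical study: a case-control design triggers a "no" in patient selection.
domains = [
    domain_risk(["yes", "no", "yes"]),   # patient selection
    domain_risk(["yes", "yes"]),         # index test
    domain_risk(["yes", "yes"]),         # reference standard
    domain_risk(["yes", "yes", "yes"]),  # flow and timing
]
print(overall_risk(domains))
```

Under these rules a single "no" in any domain is enough to rate the whole study as high risk, which makes the overall rating sensitive to the patient selection domain.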

Statistical Analysis

The pooled estimates of sensitivity, specificity and diagnostic odds ratio (DOR) were calculated using an inverse-variance model. A random-effects model was used to calculate the overall estimates with 95% confidence intervals (CI), and forest plots were provided. The hierarchical summary receiver operating characteristic (SROC) model was used to draw the SROC curve and the area under the curve (AUC) was calculated.[17] Individual studies were represented by triangles, and the summary estimate by a dot surrounded by a 95% CI.
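To make the inverse-variance random-effects pooling concrete, here is a minimal univariate Python sketch for sensitivity on the logit scale. It assumes the DerSimonian-Laird estimator for the between-study variance, a 0.5 continuity correction, and a normal-approximation 95% CI; the review does not specify these choices, and the analysis reported here was done in R with a hierarchical model, so this is an illustration of the idea rather than a reproduction.

```python
import math


def pool_random_effects(effects, variances):
    """Inverse-variance random-effects pooling (DerSimonian-Laird, assumed)."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures dispersion around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance tau2.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, se, tau2


def pool_sensitivity(counts, z=1.96):
    """counts: list of (TP, FN) pairs, one per study."""
    effects, variances = [], []
    for tp, fn in counts:
        if tp == 0 or fn == 0:           # assumed continuity correction
            tp, fn = tp + 0.5, fn + 0.5
        effects.append(math.log(tp / fn))    # logit of sensitivity
        variances.append(1.0 / tp + 1.0 / fn)
    pooled, se, _ = pool_random_effects(effects, variances)
    inv_logit = lambda x: 1.0 / (1.0 + math.exp(-x))
    return inv_logit(pooled), inv_logit(pooled - z * se), inv_logit(pooled + z * se)


# Hypothetical (TP, FN) counts, for illustration only.
print(pool_sensitivity([(90, 10), (45, 5), (180, 20)]))
```

Specificity can be pooled the same way from (TN, FP) pairs, and the DOR from its logarithm with variance 1/TP + 1/FP + 1/TN + 1/FN.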

The heterogeneity and threshold effect of the included studies were determined. The heterogeneity was evaluated using the Higgins inconsistency index (I²). We considered I² cutoff points of 25%, 50%, and 75% as indicative of low heterogeneity, moderate heterogeneity, and high heterogeneity, respectively.[18]
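The inconsistency index can be derived from Cochran's Q as I² = max(0, (Q − df)/Q) × 100%, where df is the number of studies minus one. The Python sketch below shows the calculation; how values that fall between the stated cutoffs are binned is an assumption of this sketch (values below 50% are labelled low, which matches the later description of I² = 0 as low heterogeneity).

```python
def cochran_q(effects, variances):
    """Cochran's Q: weighted squared deviations from the fixed-effect mean."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(q, n_studies):
    """Higgins I-squared in percent, truncated at zero."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def heterogeneity_level(i2):
    # Assumed binning around the 50% and 75% cutoffs named in the text.
    if i2 >= 75.0:
        return "high"
    if i2 >= 50.0:
        return "moderate"
    return "low"


# Hypothetical effect sizes (e.g., logit sensitivities) and their variances.
effects, variances = [0.0, 4.0], [1.0, 1.0]
i2 = i_squared(cochran_q(effects, variances), len(effects))
print(i2, heterogeneity_level(i2))
```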

A strong negative Spearman correlation coefficient between sensitivity and specificity indicates a threshold effect. Sources of heterogeneity were explored by subgroup meta-analyses. Variables that might affect diagnostic accuracy were selected as follows: (1) type of ECG data; (2) type of ML model; (3) definition of HF; (4) sizes of ECG samples; and (5) country of origin. Subgroup analyses were performed if at least two studies provided the data of interest. Publication bias was assessed by Deeks’ funnel plot.[19]
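The threshold-effect check can be sketched as a Spearman rank correlation between per-study sensitivity and specificity. The Python example below implements it without external libraries, using average ranks for ties; the helper names and the example values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-study values: sensitivity rises as specificity falls,
# the pattern expected when studies differ mainly in decision threshold.
sens = [0.70, 0.80, 0.90, 0.95]
spec = [0.99, 0.95, 0.90, 0.80]
print(spearman(sens, spec))
```

A coefficient near −1, as in this constructed example, would suggest that differences between studies mainly reflect different decision thresholds rather than different discriminative ability.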

Deeks’ funnel plot was performed in Stata 16.0 (Stata Corp, College Station, Texas, USA). Other statistical analyses were performed using R statistical software 4.1.1 (http://www.r-project.org).
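For readers unfamiliar with Deeks’ funnel plot asymmetry test, its core is a regression of the log DOR on the inverse square root of the effective sample size (ESS), weighted by ESS, where ESS = 4·n₁·n₂/(n₁ + n₂) for n₁ diseased and n₂ non-diseased subjects. The Python sketch below computes the slope, its standard error and the t statistic only; it is an illustrative sketch, not the Stata routine used in this review, and the 0.5 continuity correction and function names are assumptions.

```python
import math


def effective_sample_size(n_diseased, n_non_diseased):
    return 4.0 * n_diseased * n_non_diseased / (n_diseased + n_non_diseased)


def deeks_regression(studies):
    """studies: list of (TP, FP, TN, FN). Returns (slope, se, t)."""
    if len(studies) < 3:
        raise ValueError("at least three studies are needed")
    xs, ys, ws = [], [], []
    for tp, fp, tn, fn in studies:
        if 0 in (tp, fp, tn, fn):        # assumed continuity correction
            tp, fp, tn, fn = tp + 0.5, fp + 0.5, tn + 0.5, fn + 0.5
        ess = effective_sample_size(tp + fn, fp + tn)
        xs.append(1.0 / math.sqrt(ess))
        ys.append(math.log((tp * tn) / (fp * fn)))   # log DOR
        ws.append(ess)
    # Weighted least squares of log DOR on 1/sqrt(ESS), weights = ESS.
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum(w * (y - intercept - slope * x) ** 2
              for w, x, y in zip(ws, xs, ys))
    se = math.sqrt(rss / (len(studies) - 2) / sxx)
    t = slope / se if se > 0 else float("nan")
    return slope, se, t


# Hypothetical studies in which the smallest study reports the largest DOR.
slope, se, t = deeks_regression(
    [(45, 5, 45, 5), (180, 20, 180, 20), (19, 1, 19, 1)]
)
print(slope, se, t)
```

The t statistic would be compared with a t distribution on k − 2 degrees of freedom for k studies; a slope that differs significantly from zero suggests small-study effects such as publication bias.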

RESULTS

Study Search

The search strategy yielded a total of 479 studies, and three additional articles were identified after searching the bibliographies of these articles. After removing 131 duplicates, the screening of the remaining 351 titles and abstracts yielded 66 potentially eligible articles for full-text review. Overall, 49 articles were then retained for the systematic review. We excluded 38 studies due to insufficient data to perform meta-analytic approaches, leading to 11 articles included in the final analysis. Reasons for exclusion were recorded. The disposition of studies excluded after the full-text review is shown in Figure 1.

Study Characteristics

A total of 104,737 participants were included in this meta-analysis, with individual studies ranging from 40 to 97,829 participants. Eight studies[20-27] were conducted using public ECG datasets and three studies[28-30] used patient-level datasets. Eight studies were conducted in American centers,[20-24,28-30] and three studies were conducted in both American and European centers.[25-27] DL was used in seven studies,[23,25-30] and classical ML was used in four studies.[20-22,24] The types of data included in the studies were raw ECG, ECG signals and HRV. In terms of the sample sizes used in modeling, eight studies[21,23,25-30] used more than 1000 ECG samples and the remaining three[20,22,24] used fewer. Study characteristics are shown in detail in Table 1. Although some studies used several different algorithms, all algorithms were counted to show the frequency of each algorithm in HF detection. In general, neural networks and support vector machines were dominant, accounting for 45% and 24%, respectively, followed by random forest (supplemental material, Figure 1S).

Table 1 Characteristics of the included studies.

The Spearman correlation coefficient between sensitivity and specificity showed no significant threshold effect in the detection of HF.

Diagnostic Performance of HF

Across the 11 included studies, the pooled sensitivity, specificity and DOR were 0.95 (95% CI: 0.86-0.98), 0.98 (95% CI: 0.95-0.99) and 831.51 (95% CI: 127.85-5407.74), respectively. In individual studies, sensitivity, specificity and DOR ranged from 0.27 to 0.99, 0.86 to 1.00 and 13.7 to 88902.8, respectively. Heterogeneity was present for sensitivity, specificity and DOR. The pooled AUC for the diagnosis of HF was 0.986. The results are illustrated in three forest plots (Figures 2-4) and an SROC curve (Figure 5).

To investigate possible sources of heterogeneity, subgroup analyses were performed according to the categorical variables. In the sample-size subgroup analysis, heterogeneity for pooled sensitivity, specificity and DOR was low (I² = 0) among studies with fewer than 1000 ECG samples.[20,22,24] However, there was no statistically significant difference between groups, so it may be inappropriate to identify sample size as a source of heterogeneity. In the subgroup analysis by type of analyzed ECG data, heterogeneity for pooled sensitivity, specificity and DOR became moderate, with a statistically significant difference between groups (P < 0.01), identifying type of ECG data as a source of heterogeneity. The forest plots for the subgroup analyses can be found in supplemental material, Figure 2S. No significant publication bias was found (P = 0.19), according to visual inspection of Deeks’ funnel plot (supplemental material, Figure 3S) and Deeks’ regression test.

Figure 2 Forest plot of sensitivity.

Figure 3 Forest plot of specificity.

Figure 4 Forest plot of diagnostic odds ratio.

Quality of Evidence and Risk of Bias

The quality of the 11 included studies was assessed according to the QUADAS-2 criteria. Detailed results of risk of bias are shown in Table 2. Eight studies were designated as high risk of bias,[20-27] two studies had unclear risk of bias,[28,30] and the remaining study was classified as low risk of bias.[29] The eight studies were rated as high risk because they drew healthy subjects and HF patients from separate databases and therefore did not avoid a case-control design. The subjects of the remaining three studies were all from registered studies, and two of them were assessed as unclear because they did not describe the inclusion process in detail. In addition, low clinical applicability was identified in three studies in the domain of patient selection due to the presence of comorbidities among the subjects.[28-30] Unclear clinical applicability was identified in eight studies in the same domain, as it was not clarified whether the data obtained from the databases were consecutive or random.[20-27] The risk of bias and concerns regarding applicability are shown in Figure 6 and supplemental material, Figure 4S, respectively.

Figure 6 Risk of bias according to the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) criteria.

Table 2 Quality assessment according to the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) criteria.

Figure 5 SROC curve of artificial intelligence based on electrocardiogram for heart failure diagnosis. Each triangle represents an individual study, with triangle size proportional to study size. The diamond represents summary sensitivity and specificity. The ellipse represents the 95% confidence region. SROC: summary receiver operating characteristics.

DISCUSSION

This meta-analysis demonstrated the potential of AI-ECG as a decision-making tool to assist in identifying HF. We showed that the pooled sensitivity, specificity and DOR were 0.95, 0.98 and 831.51, respectively, with an AUC of 0.986. However, heterogeneity was unavoidable owing to several challenges, which probably limits clinical applicability.

With the rapid development of AI in clinical medicine, excellent performance of ECG-based algorithms has been increasingly reported.[13] Several studies were excluded because they lacked a confusion matrix, which may bias the meta-analysis. Current work tends to apply DL to large amounts of patient-level data,[28,30,31] which is consistent with clinical practice, and the field is undergoing a transition from classical ML to DL. DL has been shown to outperform other ML methods.[31-34] In the field of HF detection, classical ML algorithms were used by a total of eight studies before 2019, while four studies used DL. By the end of 2021, the number of studies using DL had increased to 13, and DL was used four times more frequently than traditional ML. The combination of more complex network architectures with bigger training data gives models a chance to uncover previously unknown clues in detection. In addition, the demand for data augmentation may contribute to the development of clinical datasets.

Attia, et al.[28] identified LVEF ≤ 35% using a combination of 12-lead ECG and echocardiogram data derived from more than 52,000 patients in a convolutional neural network model, with an accuracy of 85.7% and AUC of 0.93. Among patients without ventricular dysfunction, those with a positive AI screen had a fourfold risk of developing future ventricular dysfunction compared with those with a negative screen. In external validation with 4277 patients, the performance of the model for the detection of left ventricular systolic dysfunction remained robust, as it did in patients requiring admission to the cardiac intensive care unit.[35] Furthermore, AI-ECG networks have demonstrated performance stability and robustness across multiple ethnicities, supporting wide applicability.[36] To identify patients in the early, milder phases of left ventricular dysfunction, defined by a combination of diastolic echocardiogram parameters and global longitudinal strain, Potter, et al.[37] developed an ML algorithm from “energy waveform” ECG. The sensitivity and specificity of random forest as a screening tool for at-risk asymptomatic individuals were 88% and 70%, respectively, and its discriminative ability was superior to clinical scores, biomarkers and automated ECG analysis.

Despite the presence of normal LVEF, early stages of left ventricular diastolic dysfunction can progress to HF.[38] For HF patients with reduced LVEF, changes in diastolic function precede or develop concomitantly with the onset of systolic dysfunction.[39] This evidence motivates researchers to take left ventricular diastolic dysfunction into account in HF management. In 2018, an innovative study showed an AUC of 0.91 for the prediction of abnormal myocardial relaxation from signal-processed surface ECG.[40] Subsequently, the team incorporated ML into a novel ECG-based model to predict the quantitative values of the left ventricular relaxation velocities (e’) measured by echocardiography using signal-processed surface ECG, traditional ECG, and clinical features.[41] The analysis revealed that the estimated e’ can discriminate abnormal myocardial relaxation and diastolic dysfunction, with AUCs of 0.84 and 0.80 in the external test sets, respectively. ECG signals, as a type of biosignal, deserve special attention, because ML algorithms for biosignal analysis can be based on multidimensional features in the time domain, frequency domain, or time-frequency domain.[42] In biosignal pre-processing, improper filtering can lead to waveform deformation, changes in time-domain features, and a narrowed frequency range.[43] Therefore, the pre-processing of ECG signals plays a key role in improving the performance of ML algorithms.

Recently, a DL model was developed to simultaneously identify right and left ventricular dysfunction from the ECG, and was the first to estimate the LVEF value as a continuous variable.[44] Information was extracted from 700,000 ECGs from approximately 150,000 unique patients for the evaluation of LVEF and a composite right ventricular outcome. For LVEF classification in external validation, the AUCs for detection of LVEF ≤ 40%, 40% < LVEF ≤ 50%, and LVEF > 50% were 0.94, 0.73, and 0.87, respectively. For prediction of the composite right ventricular outcome, an AUC of 0.84 was achieved. Moreover, the first convolutional neural network combined with ECG yielded a sensitivity of 99% at a specificity of 60% for the detection of heart failure with preserved ejection fraction according to the European Society of Cardiology criteria.[45] These studies provide deeper insight into the capabilities of AI in cardiology.

The main drawback of our systematic review and meta-analysis is the heterogeneity across the published studies, which leaves uncertainty about the interaction between AI and humans in clinical practice. Although subgroup analyses were performed based on key characteristics, several points remain to be considered. Numerous algorithms in our analysis were trained and validated on relatively small and similar datasets, such as the accessible Beth Israel Deaconess Medical Center congestive HF database, which consists of only 15 congestive HF patients.[21-23,25-27] All of the published models were developed in America. The current trend has the potential to promote inequality in healthcare, as no models were developed or validated in low-income countries. Because previous studies did not rigorously characterize the presence or absence of comorbidities in patients with HF, comorbidities are difficult to characterize in the current study. To ensure generalizability, models need to be developed using datasets from a wide range of ethnicities and countries. Moreover, most of these studies were conducted in experimental settings, and further validation is required in real-world settings. Consequently, the pooled results may not be consistent with clinical practice and could be biased. Interpreting the clinical application is difficult; hence, multidisciplinary cooperation is needed. Notably, in the ECG AI-Guided Screening for Low Ejection Fraction study, which represented a real-life setting, the ECGs of 22,641 adult patients without prior HF were obtained to detect a new diagnosis of an ejection fraction of ≤ 50%.[46] The AI screening tool increased the diagnosis of low ejection fraction from 1.6% to 2.1%, which shows that integrating the technique into primary care was meaningful and improved the diagnosis of low ejection fraction.

LIMITATIONS

Several limitations must be noted. Firstly, technical details are usually not disclosed in full, leading to arbitrariness in methodology and inevitable statistical heterogeneity, which may affect the accuracy of the conclusions. Despite the lack of uniform standards, 12 critical questions have been suggested for cardiovascular health professionals to appraise the development and testing of AI prediction models.[47] Secondly, researchers tend to publish diagnostic tests with the best diagnostic accuracies, and similar datasets were used to develop models, resulting in overfitting. It is uncertain whether some algorithms with unsatisfactory results were omitted. However, there was no obvious asymmetry on the Deeks’ plot to suggest significant publication bias. Thirdly, the data analyzed in the included studies were heterogeneous, including ECG signals, HRV and raw ECGs. Raw ECGs may be optimal for the clinical setting, but the limited availability of large amounts of clinical data restricts further application. Last but not least, most studies were conducted in relatively small numbers of patients, thereby limiting the accuracy of the ML algorithms.

CONCLUSIONS

The meta-analysis showed that AI-ECG is a valuable method for predicting HF or reduced LVEF. However, because of potential bias and heterogeneity, further exploration with large datasets is urgently needed, and improvements in study quality and design are required to confirm the role of AI in clinical practice.

ACKNOWLEDGMENTS

This study was supported by the National Natural Science Foundation of China (No. 81970270 & No. 82170327), the Tianjin Natural Science Foundation (20JCZDJC00340 & 20JCZXJC00130), and the Tianjin Key Medical Discipline (Specialty) Construction Project (TJYXZDXK-029A). All authors had no conflicts of interest to disclose.
