Jin Bian, Jun-Yu Long, Xu Yang, Xiao-Bo Yang, Yi-Yao Xu, Xin Lu, Xin-Ting Sang, Hai-Tao Zhao
Abstract
Key Words: Gastric cancer; Deoxyribonucleic acid methylation; Molecular subtypes; Prognosis; Risk score; The Cancer Genome Atlas
Gastric cancer (GC) ranks as the third leading cause of cancer-related deaths and is the fifth most commonly diagnosed cancer worldwide[1,2]. While curative resection, adjuvant or neoadjuvant chemotherapy, and targeted therapies such as trastuzumab or ramucirumab may be curative treatment options for a select population of GC patients, high postoperative recurrence and metastasis make long-term survival dismal[3,4]. Studies have indicated that patients with metastasis had a survival of only 4 to 12 mo when treated with only best supportive care or chemotherapy[5]. Since GC is a genetically and epigenetically heterogeneous disease, identifying robust biomarkers is critical for early detection and survival prognosis. Conventional biomarkers, including carcinoembryonic antigen, carbohydrate antigen 19-9, carbohydrate antigen 72-4, and human epidermal growth factor receptor 2, have been widely used in clinical practice. Novel biomarkers, such as fibroblast growth factor receptor 2, vascular endothelial growth factor, E-cadherin, and microsatellite instability, have also been explored and shown to be valuable[6,7]. However, because of insufficient specificity and sensitivity, few novel biomarkers have been put into routine clinical practice. Therefore, more efficient biomarkers based on genetic and epigenetic alterations need to be explored. Deoxyribonucleic acid (DNA) methylation is a major epigenetic event that regulates gene transcription and maintains genome stability[8,9]. Oncogene hypomethylation and tumor suppressor gene hypermethylation are common methylation aberrations that have been shown to play important roles in cancer development, including the tumorigenesis of GC[10,11]. Detecting DNA methylation patterns and understanding the roles of these methylation events might help elucidate the underlying molecular mechanisms and pathogenesis of GC.
Although there are abundant studies on the relationship between dysregulated DNA methylation and the prognosis of GC patients[12-14], individualized prognostic models based on a DNA methylation signature are lacking. In this study, we explored molecular subgroups of GC by integrating methylation and mRNA expression profile data, and generated a prognostic model comprising two DNA methylation sites. Our study may deepen our understanding and improve individualized therapies for GC.
A total of 407 RNA-sequencing profiles (375 GC samples and 32 nontumor samples) and the corresponding clinical information were downloaded from The Cancer Genome Atlas (TCGA) (https://portal.gdc.cancer.gov/, up to October 1, 2019, Supplementary Tables 1 and 2). We obtained DNA methylation profiles for 397 patients, generated on the Illumina Infinium Human Methylation 450 platform, from the University of California Santa Cruz Cancer Browser (https://xena.ucsc.edu/). Methylation levels were quantified using beta values ranging from 0 to 1 (unmethylated to totally methylated). Samples with a follow-up time of less than 30 d or lacking clinical survival information were excluded. Probes for which CpG data were missing in more than 70% of the samples were removed. The K-nearest neighbors imputation procedure was used to impute missing values in the remaining probes. The ComBat algorithm in the sva R package[15] was used to remove batch effects by integrating all DNA methylation array data and incorporating batch and patient clinical information. Unstable methylation sites (CpGs on sex chromosomes and at single nucleotide polymorphisms) were removed from the dataset. CpGs in promoter regions were selected and studied because DNA methylation levels in promoter regions are associated with gene expression. Promoter regions were defined as the regions from 2 kb upstream to 0.5 kb downstream of the transcription start sites of genes. We selected samples for which both RNA-sequencing data and DNA methylation data were available. In total, 366 samples and 21121 methylation sites were included in subsequent analyses. The 366 samples were then randomly assigned to the training set (n = 183) and the test set (n = 183). DNA methylation-based subgroup analysis was performed in the training set and a risk score model was built, which was subsequently validated in the test set. The study flow chart is shown in Figure 1.
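The two probe-level filters described above (the 70% missingness threshold and the promoter window around the transcription start site) can be illustrated with a minimal Python sketch. The function names, the use of `None` for a missing beta value, the inclusive window boundaries, and the strand handling are assumptions made for illustration; the original analysis was carried out in R and its exact implementation is not given.

```python
from typing import Optional, Sequence


def missing_fraction(betas: Sequence[Optional[float]]) -> float:
    """Fraction of samples with no beta value (coded here as None) for one probe."""
    if not betas:
        raise ValueError("probe has no samples")
    return sum(b is None for b in betas) / len(betas)


def keep_probe(betas: Sequence[Optional[float]], max_missing: float = 0.70) -> bool:
    """Keep a probe unless more than 70% of its values are missing."""
    return missing_fraction(betas) <= max_missing


def in_promoter(cpg_pos: int, tss: int, strand: str,
                upstream: int = 2000, downstream: int = 500) -> bool:
    """True if the CpG lies from 2 kb upstream to 0.5 kb downstream of the TSS.

    Upstream and downstream are taken relative to the direction of
    transcription, so the window is mirrored for minus-strand genes
    (an assumption; the text does not state how strand was handled).
    """
    if strand == "+":
        offset = cpg_pos - tss
    elif strand == "-":
        offset = tss - cpg_pos
    else:
        raise ValueError("strand must be '+' or '-'")
    return -upstream <= offset <= downstream
```

For example, a CpG 1500 bp upstream of a plus-strand TSS passes `in_promoter`, whereas one 600 bp downstream does not.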
To determine GC molecular subtypes, we first selected CpG sites that were significantly associated with prognosis as classification features. Univariate and multivariate analyses were conducted using the Cox proportional hazards regression model. Univariate Cox proportional hazards regression models were constructed for the methylation level of each CpG site and for age, sex, T category, N category, M category, and TNM stage, with survival time as the outcome. The significant CpG sites obtained from the univariate models were then analyzed using multivariate Cox proportional hazards regression models. N category, TNM stage, age, and sex, which were significant in the univariate survival analysis, were used as covariates in the multivariate analysis. CpG sites that were significant in both univariate and multivariate Cox regression analyses were selected as characteristic CpG sites. Univariate and multivariate analyses were performed with a P value of 0.05 as the cutoff.
Unsupervised consensus clustering using the ConsensusClusterPlus package in R[16] was performed to identify GC subgroups based on the characteristic CpG sites that were significant in both univariate and multivariate Cox regression analyses. To achieve higher intracluster similarity and lower intercluster similarity, we chose the k-means clustering algorithm with the Euclidean distance and a subsampling ratio of 0.8 for 100 iterations. The value of k at which the relative change in the area under the cumulative distribution function began to fall was chosen as the optimal cluster number. The pheatmap package in R was used to generate the heatmap corresponding to the consensus clustering.
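The relative change in the area under the cumulative distribution function (CDF) used above to choose k can be sketched as follows. This is only an illustration in Python, not the ConsensusClusterPlus code: the function name and the input format are hypothetical, and reporting the area itself for the smallest k is an assumed convention.

```python
from typing import Dict


def delta_area(area_by_k: Dict[int, float]) -> Dict[int, float]:
    """Relative change in the area under the consensus CDF for each k versus k-1.

    area_by_k maps a cluster number k to the area under its consensus CDF.
    For the smallest k there is no predecessor, so the area itself is
    returned (assumed convention).
    """
    ks = sorted(area_by_k)
    deltas: Dict[int, float] = {}
    for i, k in enumerate(ks):
        if i == 0:
            deltas[k] = area_by_k[k]
        else:
            previous = area_by_k[ks[i - 1]]
            deltas[k] = (area_by_k[k] - previous) / previous
    return deltas
```

With made-up areas of 0.50, 0.60, and 0.63 for k = 2, 3, and 4, the relative changes are about 0.20 for k = 3 and 0.05 for k = 4, so the gain from adding a fourth cluster is much smaller than the gain from adding a third.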
Differential analysis was conducted on the screened methylation profiles of each subtype to identify the specific methylation sites. A total of 1061 methylation sites were analyzed across the subtypes. Every methylation site in each molecular subtype was compared with that in the other subtypes, and all methylation sites were analyzed using the Wilcoxon rank-sum test (false discovery rate < 0.05 and |log2 (fold change [FC])| > 1). The differential frequency of every CpG site in each subtype was then determined for the final screening of the CpG sites. A methylation site was defined as a specific methylation site if it satisfied the differential condition in only one subtype. The obtained specific methylation sites were subsequently subjected to genome annotation to identify their corresponding genes.
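The rule for calling a site subtype-specific (differential in exactly one subtype) can be made concrete with a short Python sketch. It assumes that the false discovery rate for each subtype-versus-rest comparison has already been computed and that the fold change is the ratio of mean beta values, which the text does not specify; the function names and the input layout are hypothetical.

```python
import math
from typing import Dict, Tuple

# For each site and cluster: (mean beta in cluster, mean beta in other clusters, FDR)
SiteStats = Dict[str, Dict[str, Tuple[float, float, float]]]


def is_differential(mean_in: float, mean_rest: float, fdr: float,
                    fdr_cut: float = 0.05, lfc_cut: float = 1.0) -> bool:
    """Apply the thresholds FDR < 0.05 and |log2(FC)| > 1 (means must be > 0)."""
    log2_fc = math.log2(mean_in / mean_rest)
    return fdr < fdr_cut and abs(log2_fc) > lfc_cut


def specific_sites(stats: SiteStats) -> Dict[str, str]:
    """Map each site that is differential in exactly one cluster to that cluster."""
    result: Dict[str, str] = {}
    for site, per_cluster in stats.items():
        hits = [cluster for cluster, (m_in, m_rest, fdr) in per_cluster.items()
                if is_differential(m_in, m_rest, fdr)]
        if len(hits) == 1:
            result[site] = hits[0]
    return result
```

A site that passes the thresholds in two subtypes, or in none, is left out of the result.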
Figure 1 Flow chart of the study. GC: Gastric cancer; LASSO: Least absolute shrinkage and selection operator; GO: Gene ontology; KEGG: Kyoto encyclopedia of genes and genomes.
The overall survival (OS) for each DNA methylation subtype among GC patients was evaluated by Kaplan–Meier (K-M) analysis. The significance of differences among the clusters was assessed by the log-rank test. Associations between both the clinical and biological characteristics and DNA methylation clustering were analyzed using the chi-square test. Survival analyses were performed using the survival package in R. The statistical significance levels were all two-sided at P < 0.05, and the hazard ratio (HR) and 95% confidence interval (CI) were also calculated.
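For readers unfamiliar with the K-M estimator, the product-limit calculation behind each survival curve can be sketched in a few lines of Python. This is a didactic sketch only (the study used the survival package in R); the function name and the 0/1 event coding are assumptions.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Product-limit estimate of survival.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    Returns (time, survival probability) at each distinct time with a death.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        n_tied = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            n_tied += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        n_at_risk -= n_tied
    return curve
```

Applying the function to each cluster separately gives the curves that the log-rank test then compares.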
Corresponding genes in the promoter regions of these specific methylation sites were subjected to gene ontology (GO) and Kyoto encyclopedia of genes and genomes pathway enrichment analyses with the help of the clusterProfiler package in R[17]. Enriched functional annotations with an adjusted P value < 0.05 were considered significant.
Least absolute shrinkage and selection operator (LASSO) and multivariate Cox regression analyses were utilized to evaluate relationships between the specific methylation sites in each subtype and prognosis and to generate a prognostic prediction model in the training set. Using coefficients from multivariate Cox regression analysis as the weights, a prognostic prediction model was constructed as a linear combination of the methylation levels of the independent specific CpG sites. The formula is as follows: Risk score = -1.483954476 × cg17398595 - 2.34637809416689 × cg20496643. Based on the risk score prediction model, GC patients were classified into low and high-risk groups with the optimal risk score as the cutoff value. X-tile[18] software was employed to determine the optimal cutoff value, which was defined as the risk score that generated the largest value of χ2 in the Mantel-Cox test. K-M and log-rank methods were used to evaluate the survival differences between high and low-risk patients. Time-dependent receiver operating characteristic curves were employed to measure the predictive performance, and the prognostic model was validated in the test set.
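The risk score formula above can be applied to a single patient as in the following Python sketch. The coefficients are those reported in the formula; the function names are hypothetical, and the cutoff is left as a parameter because the X-tile value is not reported in the text. Treating a score strictly above the cutoff as high risk is also an assumption.

```python
from typing import Mapping

# Coefficients taken from the risk score formula in the text.
COEFFICIENTS = {
    "cg17398595": -1.483954476,
    "cg20496643": -2.34637809416689,
}


def risk_score(betas: Mapping[str, float]) -> float:
    """Linear combination of the beta values of the two CpG sites."""
    for site in COEFFICIENTS:
        if not 0.0 <= betas[site] <= 1.0:
            raise ValueError(f"beta value for {site} must lie in [0, 1]")
    return sum(coef * betas[site] for site, coef in COEFFICIENTS.items())


def risk_group(score: float, cutoff: float) -> str:
    """Assign 'high' if the score exceeds the cutoff, else 'low'.

    The cutoff is a hypothetical input here; the study derived it with X-tile.
    """
    return "high" if score > cutoff else "low"
```

Because both coefficients are negative, lower methylation at either site gives a higher risk score.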
As described in the Materials and Methods, 21121 methylation sites were identified, of which 1507 CpG sites were identified as potential DNA methylation biomarkers for OS in GC patients using univariate Cox regression analysis (Supplementary Table 3). Univariate Cox proportional-hazards regression analysis revealed that N category (regional lymph nodes), TNM stage, age, and sex were significantly associated with OS (respective log-rank P values: 0.021503, 0.015607, 0.005479, and 0.033011). Subsequently, 1061 independent prognosis-associated CpG sites were obtained using multivariate Cox regression analysis of the 1507 methylation sites, with N category, TNM stage, age, and sex as covariates (Supplementary Table 4). These 1061 sites were significant in both univariate and multivariate analyses, and were selected as potential prognostic methylation sites.
Unsupervised clustering of the 1061 significant methylation sites was conducted to identify the molecular subtypes for subgroup classification in the training set. We then calculated the average cluster consensus and the coefficient of variation among clusters for each category number. The value of k at which the relative change in the area under the cumulative distribution function began to fall was chosen as the cluster number. After comprehensive consideration, k = 3 was selected to obtain three molecular subtypes for further analysis (Figure 2A). A heatmap of the 1061 DNA methylation sites in three clusters was then constructed, with the T category, N category, M category, TNM stage, age, and DNA methylation subgroup as the annotations (Figure 2B). As shown in Figure 2B, although the abundance of most CpG sites was relatively low in each sample, there were obvious differences in the DNA methylation status among the three clusters. As shown in the boxplot, cluster 1 had the highest methylation level, while cluster 3 had the lowest methylation level (Supplementary Figure 1). K-M survival analysis showed significant differences in prognosis among the three clusters defined by DNA methylation unsupervised clustering (P = 0.005, Figure 3A). Cluster 1 had the best prognosis, while cluster 3 had the worst prognosis, indicating an association of lower methylation level with poorer survival for GC patients. To explore the clinical features of different methylation subtypes, we analyzed the distribution of T category, N category, M category, TNM stage, and age for the three clusters (Figure 3B-F). Compared to clusters 1 and 2, cluster 3 was prone to lymphatic invasion and metastasis and associated with a more advanced stage, which suggested an important role of neoadjuvant therapy for these patients. Notably, cluster 2 was associated with the lowest rate of T1 and a strong association with N3-4, which may indicate a need for a more radical surgical approach in clinical practice.
There were no differences observed in the grade or age among the three subtypes of GC patients.
We performed genome annotations for the 1061 CpG sites described above and identified 1394 corresponding genes. The expression levels of these corresponding genes were visualized in a heatmap (Figure 4A). GO analyses were conducted to elucidate the functional characteristics of these promoter genes (P < 0.05, Figure 4B, Supplementary Table 5). GO functions of these genes were significantly enriched in protein synthesis and energy metabolism categories, such as “acetyl-CoA biosynthetic process from pyruvate”, “large ribosomal subunit”, and “structural constituent of ribosome”. The differences in the 1061 methylation sites in each subtype of GC were further analyzed using the Wilcoxon rank-sum test (false discovery rate < 0.05 and |log2 (fold change [FC])| > 1), and the heatmap is presented in Figure 5A (Supplementary Table 6). We subsequently identified 41 subtype-specific CpG sites that were specifically hypermethylated or hypomethylated in only one subgroup (Supplementary Table 7). These 41 specific methylation sites were subsequently subjected to gene annotation, identifying 52 corresponding genes. To illustrate the expression of the genes corresponding to these specific methylation sites in the subgroups, the expression values of 167 samples in the training set for 46 of the 52 genes were obtained (Figure 5B). Distinct expression levels of these genes in specific subgroups were observed, indicating that the expression profiles of the genes corresponding to the specific methylation sites were consistent with the DNA methylation level. To gain a further understanding of the biological effects of the genes corresponding to these specific methylation sites, Kyoto encyclopedia of genes and genomes analysis was performed with a threshold of P < 0.05 (Figure 5C and D, Supplementary Table 8). As shown in Figure 5C, the top five signaling pathways are the PI3K-Akt signaling pathway, non-small cell lung cancer, adipocytokine signaling pathway, PPAR signaling pathway, and Ras signaling pathway.
Crosstalk analysis showed close relationships among the 13 pathways. Most of these signaling pathways are reported to be involved in carcinogenesis and tumor growth and progression, indicating that the genes corresponding to the specific methylation sites are critical in the molecular mechanisms of GC development.
Figure 2 Cluster analysis for deoxyribonucleic acid methylation classification and the corresponding heatmap. A: Delta area curve obtained from unsupervised clustering using 1061 deoxyribonucleic acid methylation sites, which indicates the relative change in the area under the cumulative distribution function (CDF) curve for each category number k compared with k-1; B: Heatmap corresponding to the 1061 deoxyribonucleic acid methylation sites in three clusters.
Figure 3 Survival curves of deoxyribonucleic acid methylation subtypes and comparison of TNM stage, grade, and age between clusters.
Figure 4 Gene annotations of 1061 methylated sites. A: Cluster analysis heatmap for annotated genes associated with the 1061 CpG sites; B: Gene ontology enrichment analysis of the annotated genes.
LASSO regression analysis is a penalized regression method that uses an L1 penalty to shrink regression coefficients toward zero, thereby eliminating a number of variables based on the principle that fewer predictors are selected when the penalty is larger[19]. Thus, seed methylation sites with nonzero coefficients were regarded as potential prognostic predictors. Based on 1000 iterations of Cox-LASSO regression analysis with 10-fold cross-validation using the glmnet package in R, the seed methylation sites were shrunk into multiple-site sets. The 41 selected DNA methylation sites were analyzed by 1000 iterations of Cox-LASSO regression to reduce their number. Applying LASSO regression analysis, in which the selected DNA methylation sites were required to appear 500 times out of 1000 repetitions, five methylation sites were selected as prognostic CpGs (Figure 6A and B). Then, using the regression coefficients from a multivariate Cox proportional hazards model, we established a model including two methylation sites selected by the Akaike Information Criterion in a stepwise algorithm. According to the optimal cutoff value, the patients were stratified into high and low-risk groups. High-risk patients showed significantly worse OS (HR = 2.24, 95%CI: 1.28-3.92, P < 0.001) than low-risk patients (Figure 7A). Figure 7B-D displays methylation levels of CpG sites and risk score distributions. Methylation levels for the two methylation sites significantly decreased as risk scores increased. Receiver operating characteristic analysis was performed to determine the specificity and sensitivity of the prognostic model. The time-dependent area under the curve for the 3-year OS rate for GC patients with the prognostic model was 0.610 (Supplementary Figure 2A). The predictive ability and stability of the prognostic model were further tested using 183 GC samples with OS time and survival status in the test set.
The patients in the test set were classified into high and low-risk groups using the same formula and cutoff obtained from the training set. Consistent with the results in the training set, patients in the high-risk group in the test set had a significantly shorter median OS than those in the low-risk group (HR = 2.12, 95%CI: 1.19-3.78, P = 0.002) (Figure 7E). Figure 7F-H shows the distribution of risk scores and CpG site methylation levels. The time-dependent area under the curve of the 3-year OS rate with the prognostic model for GC patients was 0.696 (Supplementary Figure 2B).
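The selection-frequency rule used with the repeated Cox-LASSO runs (a site had to appear 500 times out of 1000 repetitions) amounts to a simple count over runs, sketched here in Python. The LASSO fitting itself is not reproduced; each run is represented only by the set of sites with nonzero coefficients. The function name is hypothetical, and reading the rule as "at least" the required count is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Set


def stable_sites(selections: Iterable[Set[str]], min_count: int) -> List[str]:
    """Return the sites selected in at least min_count runs, sorted by name.

    selections : one set per LASSO run, holding the sites whose
                 coefficients were nonzero in that run
    min_count  : required number of appearances (500 of 1000 in the text)
    """
    counts = Counter(site for selected in selections for site in selected)
    return sorted(site for site, n in counts.items() if n >= min_count)
```

In the setting described above, the call would be `stable_sites(runs, min_count=500)` with `runs` holding the 1000 per-run selections.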
GC is one of the most common malignancies, causing one of the highest public health burdens[1,20]. Studies have shown that GC carcinogenesis is a multistep and multifactorial process caused by genetic changes and epigenetic alterations[2,21]. GC is characterized by accumulated genomic modifications, including somatic mutations and genomic amplifications and deletions[22]. However, evidence has shown that both genomic aberrations and Helicobacter pylori-induced precursors are associated with multiple epigenetic changes, such as hypermethylation of tumor suppressors and hypomethylation of oncogenes[12,21]. For instance, Helicobacter pylori can induce methylation of multiple CpG islands in GC patients, which subsequently increases genome instability by stimulating activation-induced cytidine deaminase or altering microRNA expression[23]. Therefore, it is important to identify key mechanisms involved in epigenetic alterations and elucidate the role of DNA methylation in GC development and progression.
Epigenetic changes, including DNA and histone modifications, can result in dysregulated expression of tumor suppressor genes and oncogenes. Aberrant methylation changes occur frequently in human cancers. For instance, the DNA methyltransferase family is responsible for DNA methylation, and altered expression of DNA methyltransferase has been shown to be involved in the pathogenesis of GC[13,24]. There is evidence that altered DNA methylation is an early event in the development and progression of GC[25], and these aberrant DNA methylation events can be targeted by DNA methylation inhibitors[26]. Studies have shown that epigenetic changes occurred prior to genome alterations in normal and nonneoplastic gastric mucosa, and abnormal methylation levels were associated with an increased risk of GC[27-29]. Methylation of tumor suppressor genes, such as RUNX3, CDH1, APC, CHFR, DAPK, and GSTP1, is associated with the onset of GC and plays important roles in the early stages of tumor development. DNA methylation alterations have not only been associated with GC development in the early stage, but can also be useful for survival prognosis. For example, hypermethylation of the tumor suppressor MDGA2 was associated with significantly decreased survival time in GC patients[14]. Clarifying altered DNA methylation can aid in the early diagnosis and survival prognosis of GC. Like most cancers, GC is a heterogeneous disease with distinct phenotypes. Integrative molecular subtype analysis of cancer can provide insights into carcinogenesis, diagnosis, and prognosis. Recent studies have highlighted the predictive role of methylation patterns in different cancers[30-32]. However, the reported association between methylation status and survival prognosis differs among studies. While some studies indicated that GC hypermethylation was associated with a good prognosis[33,34], others reported an association with poor survival[35,36].
A meta-analysis of 918 patients showed that hypermethylation of CpG islands was significantly associated with a poor 5-year survival; however, the results were less convincing due to great heterogeneity among the included studies[37].
Figure 5 Differential analysis of CpG sites for each deoxyribonucleic acid methylation subtype. A: The red and blue bars represent hypermethylated CpG sites and hypomethylated CpG sites, respectively (FDR < 0.05 and |log2 (fold change [FC])| > 1). The vertical bar to the left of the heatmap indicates the significance of methylation sites in each cluster, with the red and blue bars representing significance and insignificance, respectively; B: Heatmap for the annotated genes of specific sites among three deoxyribonucleic acid methylation clusters; C: Kyoto encyclopedia of genes and genomes pathway enrichment analysis of the specific methylation sites; D: Crosstalk analysis of the enriched Kyoto encyclopedia of genes and genomes pathways shown in the enrichment map.
Our study contributed to the understanding of the epigenetic landscape of GC. In this study, we identified three subtypes of GC based on DNA methylation, which were characterized by distinct prognoses and clinical features. These molecular subtypes of GC may shed light on future clinical stratification and subtype-based targeted therapies. We focused on specific DNA methylation markers and analyzed DNA methylation prognosis subgroups of GC. We attempted to address the relations between specific methylation status and prognosis by developing a classification model that integrated two DNA methylation biomarkers for the prognostic evaluation of GC patients. Moreover, our signature is based on two specific methylation sites and is easy to test in clinical practice, with considerable cost-effectiveness. However, our research has limitations because it was retrospective, and our results need to be further confirmed by prospective studies. Moreover, because of the relatively small number of patients, the efficiency of the prognostic model should be further validated in a larger number of GC patients.
Figure 6 Selection of the prognostic methylation sites for gastric cancer patients by least absolute shrinkage and selection operator analysis. A: The changing trajectory of each independent variable. The horizontal axis represents the log value of the independent variable lambda and the vertical axis represents the coefficient of the independent variable; B: Confidence intervals for each lambda. The optimal values of the penalty parameter lambda were determined by ten-fold cross-validation.
In summary, our study identified three molecular subtypes based on DNA methylation in GC and established a prognostic prediction model with prognosis-specific methylation sites. These results may help improve outcome prediction and facilitate precision therapy for patients with GC.
Figure 7 Survival analysis and risk score distribution of the prognostic model for the training and test sets. A and E: K-M curves of the prognostic model in the training set and test set, respectively; B-D: The risk score distribution and heatmap of the methylation site profiles in the training set; F-H: The risk score distribution and heatmap of the methylation site profiles in the test set.
We thank Yu Lin for assistance with the data interpretation.
World Journal of Gastroenterology, 2020, Issue 41