Chengyu Wu, Qiang Shi, Dinh Pham, Afzal Nikaein (1. Texas Medical Specialty, Inc., Dallas 7520, Texas, USA; 2. Baylor Scott & White Medical Center, Temple 76508, Texas, USA; 3. Division of Transplantation, Department of Surgery, School of Medicine and Public Health, University of Wisconsin-Madison, Madison 5792, Wisconsin, USA)
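【Editor's note】 Next-generation sequencing (NGS) has been shown to reduce both the ambiguity and the cost of human leukocyte antigen (HLA) typing, while also revealing detailed information about HLA gene regions that were not previously sequenced. This study describes the performance characteristics of an HLA typing assay developed with NGS on the Illumina MiSeq platform. A total of 288 samples were included; they had previously been characterized at HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQA/B and HLA-DPA/B by high-resolution typing using Sanger sequencing, sequence-specific primers and sequence-specific oligonucleotide techniques. These samples carried a high proportion of HLA-specific alleles. Sequencing data were analyzed with HLA Twin™ from Omixon. Allele balance, sensitivity, specificity, precision, accuracy and ambiguity were evaluated. The results demonstrate the feasibility and benefits of NGS for HLA typing: the technique is highly accurate, eliminates nearly all ambiguity, provides complete sequence information along the length of the HLA genes, and lays the foundation for immunogenetics laboratories to perform HLA typing with a single method.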
【Key words】 HLA typing; next-generation sequencing; whole genome Illumina data analysis; clinical application
The human leukocyte antigen (HLA) locus is one of the most complex genetic regions, and HLA genes are known to be the most polymorphic genes of the human genome [1-2]. HLA genes are located within one of the most gene-rich regions of the human genome, the major histocompatibility complex (MHC), which is on the short arm of chromosome 6 (6p21.3). Many of the genes in the MHC, including HLA, encode proteins that have critical roles in immune responses. The MHC is divided into three distinct regions referred to as class Ⅰ, Ⅱ and Ⅲ, with the HLA genes being located within the class Ⅰ and class Ⅱ regions.
For clinical purposes, four-digit HLA typing at the amino-acid level is necessary, because amino-acid differences among HLA proteins with the same (two-digit) antigenic peptide can lead to allogeneic responses. Established methods for high-resolution, four-digit HLA typing include polymerase chain reaction (PCR) using sequence-specific oligonucleotides (SSO) or Sanger sequencing-based typing (SBT) [3-4].
Over the past decade, the development of next-generation sequencing (NGS) has paved the way for whole-genome analysis of individuals. In principle, the concept behind NGS technology is similar to capillary electrophoresis sequencing. DNA polymerase catalyzes the incorporation of fluorescently labeled deoxyribonucleotide triphosphates (dNTPs) into a DNA template strand during sequential cycles of DNA synthesis. During each cycle, at the point of incorporation, the nucleotides are identified by fluorophore excitation. The critical difference is that, instead of sequencing a single DNA fragment as in capillary electrophoresis sequencing, NGS covers millions of fragments in a massively parallel fashion [5]. More than 90% of the world's sequencing data are generated by Illumina sequencing by synthesis (SBS) chemistry. It delivers highly accurate reads, with a high percentage of base calls reaching an accuracy of 99.9% (Phred quality score 30).
NGS has rapidly replaced Sanger sequencing as the method of choice for diagnostic gene-panel testing. For hereditary-cancer testing, the technical sensitivity and specificity of the assay are paramount, as clinicians use results to make important clinical management and treatment decisions. There is significant debate within the diagnostics community regarding the necessity of confirming NGS variant calls by Sanger sequencing, considering that numerous laboratories report having 100% specificity from the NGS data alone [6].
Research on HLA, an extensively studied molecule involved in immunity, has benefitted from NGS technologies. Thus far, several high-throughput NGS-based HLA typing methods have been developed, enabling high-resolution four-digit HLA typing. NGS facilitates complete HLA sequencing and is expected to improve our understanding of the mechanisms through which HLA genes are modulated, including transcription, regulation of gene expression and epigenetics.
Standard Illumina NGS workflows include four basic steps: library preparation, cluster generation, sequencing and data analysis.
1.1.1 Library preparation: The sequencing library is prepared by random fragmentation of the DNA sample, followed by adapter ligation. Adapter-ligated fragments are then PCR amplified and gel purified.
1.1.2 Cluster generation: The library is loaded into a flow cell where fragments are captured on a lawn of surface-bound oligonucleotides complementary to the library adapters. Each fragment is then amplified into distinct clonal clusters through bridge amplification. When cluster generation is complete, the templates are ready for sequencing.
1.1.3 Sequencing: Illumina SBS technology uses a proprietary reversible terminator-based method that detects single bases as they are incorporated into DNA template strands. As all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimizes incorporation bias and greatly reduces raw error rates compared to other technologies.
1.1.4 Data analysis: The newly identified sequence reads are aligned to a reference genome. Following alignment, many variations of an analysis are possible.
A total of 288 samples that were HLA typed by SBT using AlleleSEQR reagents (Abbott Laboratories, Des Plaines, IL, USA) or LABType® SSO probes (One Lambda, Thermo Fisher Scientific, Los Angeles, CA, USA) between December 2016 and October 2017 were selected for the assessment. The ambiguities that arose from SBT were resolved to the greatest extent possible through the use of sequence-specific primer (SSP) kits (Olerup-SSP, Stockholm, Sweden and SSP UniTray™, Invitrogen, Thermo Fisher Scientific, Waltham, MA, USA) and allele-specific sequencing primers (AlleleSEQR HARPs, Abbott Laboratories). All 288 samples were typed for eleven HLA loci: HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DRB3/4/5, HLA-DQA1, HLA-DQB1, HLA-DPA1 and HLA-DPB1.
Sequencing runs ranged in size from 8 to 24 samples per run. All runs were performed using standard or micro flow cells.
Genomic DNA was extracted on a Promega Maxwell® RSC with the Maxwell® RSC Buffy Coat DNA Kit (Promega, Madison, WI, USA). DNA was quantitated by Promega Quantus with the QuantiFluor ONE dsDNA System and adjusted to a concentration of 30 ng/μl for application. The fragment size of the DNA generated, as disclosed by Qiagen, is significantly higher than 1000 bases, rendering this DNA preparation appropriate for long-range PCR.
Samples were amplified at eleven loci (HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DRB3/4/5, HLA-DQA1, HLA-DQB1, HLA-DPA1 and HLA-DPB1) by long-range PCR using Omixon-designed primers that delineate full-length HLA genes [5′ untranslated region (UTR) to 3′ UTR], with the exception of DRB1, for which the amplicon spans intron 1 to intron 4. The DQB1 gene was amplified by two PCR reactions with overlapping segments (5′ UTR to intron 4 and intron 1 to 3′ UTR). Following PCR, the amplicons were cleaned with Exo-SAP (Affymetrix, Santa Clara, CA, USA), quantitated with the QuantiFluor ONE dsDNA System, and normalized to approximately 50 ng/μl.
Libraries from individual HLA amplicons were prepared by enzymatic fragmentation, end repair, adenylation and ligation of indexed adaptors. Amplification and library preparation reagents have subsequently been licensed and are available as HoloType HLA™ kits (Omixon, Inc., Budapest, Hungary). The indexed libraries were pooled and concentrated with Ampure XP beads (Beckman Coulter, Brea, CA, USA) prior to selection of a range of fragments between 650 and 1350 bp using a PippinPrep™ (Sage Science, Beverly, MA, USA). The size-selected library pool was quantitated by quantitative polymerase chain reaction (qPCR) (Kapa Biosystems, Wilmington, MA, USA) and adjusted to 2 nmol/L. The library was then denatured with NaOH and diluted to a final concentration of 8 pmol/L for optimal cluster density, and 600 μl was loaded into the MiSeq Reagent Kit v2 (300 cycle) cartridge (Illumina, San Diego, CA, USA). The reagent cartridge and flow cell were placed on the Illumina MiSeq for cluster generation and 2 × 250 bp paired-end sequencing. Testing was performed using the MiSeq Reagent Kit v2 with full-sized or micro-sized flow cells.
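The dilution from the 2 nmol/L pool to the 8 pmol/L loading concentration follows the usual C1V1 = C2V2 relation. The sketch below only illustrates that arithmetic. The function name `stock_volume_ul` is hypothetical, and the calculation treats the dilution as a single step and ignores the volume added during NaOH denaturation, so it should not be read as the laboratory's actual pipetting scheme.

```python
def stock_volume_ul(stock_pm: float, target_pm: float, final_volume_ul: float) -> float:
    """Return the stock volume (in microlitres) needed so that C1 * V1 = C2 * V2.

    Concentrations are in pmol/L. Assumes a single-step dilution.
    """
    if stock_pm <= 0 or target_pm <= 0 or final_volume_ul <= 0:
        raise ValueError("concentrations and volume must be positive")
    if target_pm > stock_pm:
        raise ValueError("target concentration cannot exceed stock concentration")
    return target_pm * final_volume_ul / stock_pm


STOCK_PM = 2000.0   # 2 nmol/L pool, expressed in pmol/L
TARGET_PM = 8.0     # loading concentration from the text
LOAD_UL = 600.0     # volume loaded into the cartridge

dilution_factor = STOCK_PM / TARGET_PM
library_ul = stock_volume_ul(STOCK_PM, TARGET_PM, LOAD_UL)
diluent_ul = LOAD_UL - library_ul

print(f"dilution factor: {dilution_factor:.0f}x")
print(f"library: {library_ul:.1f} ul, diluent: {diluent_ul:.1f} ul")
```

Under these assumptions the pool is diluted 250-fold. In practice such a small library volume would normally be reached through an intermediate dilution rather than pipetted directly.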
The sequence data produced by the MiSeq were stored as FASTQ files. The FASTQ format includes both sequence and corresponding Phred quality scores in a single file. Through the use of indexed adaptors, the sequence for each indexed library was parsed into two unique FASTQ files (one for each end of the paired-end sequence). Each locus for each sample was indexed with a unique identifying nine-base sequence. FASTQ files were analyzed with HLA Twin™ (Omixon, Inc.) and NGSengine™ version 1.0.0.762 (GenDx, Utrecht, the Netherlands). HLA Twin™ was run with the following parameters: analyzing at least 20000 pairs of reads; examining exons only; ignoring cross-mapped reads; and having a maximum insert size of 1600 bp. At the end of each analysis, the output was compared to produce the final genotype assignment. The sensitivity, specificity, precision and accuracy of the genotyping were calculated as shown in Table 1.
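To make the FASTQ layout concrete, the following minimal sketch reads four-line FASTQ records and reports the fraction of bases with a Phred score of at least 30. It assumes Phred+33 quality encoding and uncompressed text input. The function names and the two example reads are invented for illustration and are not taken from the study data or from either analysis program.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from a four-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def fraction_at_least(handle: TextIO, threshold: int = 30) -> float:
    """Fraction of all bases whose Phred score is >= threshold."""
    total = passing = 0
    for _, _, qual in read_fastq(handle):
        scores = phred_scores(qual)
        total += len(scores)
        passing += sum(1 for s in scores if s >= threshold)
    return passing / total if total else 0.0


# Two invented reads: 'I' = Q40, '?' = Q30, '5' = Q20, '#' = Q2.
EXAMPLE = "@read1\nACGT\n+\nII5?\n@read2\nGG\n+\n#I\n"
q30_fraction = fraction_at_least(io.StringIO(EXAMPLE))
print(f"bases with Q>=30: {q30_fraction:.1%}")
print(f"error probability at Q30: {error_probability(30):.3f}")
```

A score of 30 corresponds to an error probability of 0.001, which is the 99.9% base-call accuracy quoted for Illumina SBS chemistry. For compressed MiSeq output, the same functions could be given a handle opened with `gzip.open(path, "rt")`.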
Sequencing of the 288 samples was spread across 21 sequencing runs. Libraries were prepared for each locus individually, and the sequencing runs ranged from 8 to 24 samples at all nine loci. The samples were run on the standard or micro flow cells, which have 14 times the capacity of a nano flow cell. After selection based on fragment size, the quantitated libraries had an average concentration of 2 nmol/L. The average sequence output for the sequencing runs is shown in Table 1, meeting the expected output per flow cell type as defined by Illumina for the MiSeq. Cluster density ranged from 601 to 1730 K/mm² for standard and micro flow cells. The percentage of clusters that passed filters, which represents the usable reads at the end of the sequencing, ranged from 93.2% to 98.96% (average 96.45%). Finally, this analysis achieved an average of 97.68% of bases with a Phred quality score ≥30 for each run.
Table 1 Sequencing run metrics from experiments performed on flow cells
Sequence reads were first aligned to the whole IMGT/HLA database (all known HLA alleles). Then, the best matching alleles were selected based on various alignment statistics, such as the number of reads covering exons and the extent of exons covered. Only reads that were mappable as homologous to any allele in the IMGT/HLA database with a low number of mismatches were retained. For the analysis, sequence reads from all exons of HLA genes were used. However, the sequence reads were not evenly distributed for each gene region, and the average depth implied that there may be holes in coverage.
After sequencing, all reads were demultiplexed on the MiSeq instrument and analyzed with the genotyping software program HLA Twin™ by Omixon. Confidence in the software program's ability to accurately determine the HLA genotype is partially dependent upon the depth of coverage for each locus (shown in Table 2). Before accepting results, we reviewed the following key quality metrics: read count, noise, key exon spot noise ratio, consensus coverage, key exon minimum depth, and key exon allele imbalance.
Table 2 Quality control metrics
The profile of coverage is highly reproducible across the samples tested, as shown by a low spread in the deviation from the minimum to the maximum depth of coverage for every position in each amplicon. For all genes, the depth of coverage varies most at the beginning and the end of the amplicons, as these regions have a pile-up of reads caused by the nature of the fragmentation enzyme and the large size of fragments selected.
Precision is an important factor to consider, as it shows that the allele is present when the NGS genotyping program produces the allele call. Finally, the accuracy of the analysis program was 99.3% for HLA Twin™.
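The four performance measures can be written out from a table of true and false allele calls. The sketch below uses the standard textbook definitions, on the assumption that the paper's calculations follow them. The class name `CallCounts` and the example counts are hypothetical and are not data from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCounts:
    """Counts of allele calls versus the reference typing (hypothetical layout)."""
    tp: int  # allele present in the reference typing and called by NGS
    fp: int  # allele called by NGS but absent from the reference typing
    tn: int  # allele absent from both
    fn: int  # allele present in the reference typing but missed by NGS


def _ratio(numerator: int, denominator: int) -> float:
    if denominator == 0:
        raise ValueError("metric undefined: denominator is zero")
    return numerator / denominator


def sensitivity(c: CallCounts) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: CallCounts) -> float:
    return _ratio(c.tn, c.tn + c.fp)


def precision(c: CallCounts) -> float:
    # Probability that the allele is truly present when it is called.
    return _ratio(c.tp, c.tp + c.fp)


def accuracy(c: CallCounts) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


# Invented counts, for illustration only.
example = CallCounts(tp=90, fp=2, tn=100, fn=8)
metrics = {
    "sensitivity": sensitivity(example),
    "specificity": specificity(example),
    "precision": precision(example),
    "accuracy": accuracy(example),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision is kept separate from sensitivity here because it answers the question raised in the text: given that the software produced a call, how often is that allele actually present.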
Table 3 General operation parameters
Table 4 Summary of short read usage throughout the steps of the HLA typing process
Tables 3-6 list the relevant QC data based on Illumina data analysis: general operation parameters; short read usage; allele imbalance; mappability of samples. Tables 3 and 4 present a summary of our data analysis during operation.
Table 6 The mappability of samples involved in this study.
2.4.1 Allele imbalance
The ability to accurately detect two alleles when a sample is heterozygous is essential for a reliable HLA genotyping method. During the initial PCR, alleles amplify independently of one another, and if one allele amplifies more than the other, which can occur for a variety of reasons, the resulting sequencing data will show more counts of one allele. In cases where one allele accounts for a significantly larger share of the data than the other (>70% of the total data), we consider the alleles to be imbalanced, as shown in Figure 1.
Figure 1 Allele imbalance in this study. Ratio of the alleles. Values around 50%-50% indicate a balanced heterozygous sample, while values tending to 100%-0% indicate imbalance in the proportion of reads derived from the two chromosomes. Homozygous samples have 50%-50% assigned. Note that distinguishing between homozygous samples with small contamination and heterozygous samples with high imbalance is not always possible.
Note: The first two columns show the number of reads mappable for each locus and corresponding ratios relative to the total number of processed reads. Reads mappable to multiple loci are counted in multiple rows; for detailed crossmapping analysis please use the crossmapping section below. The second two columns show the number of reads best mapping for each targeted locus and corresponding ratios relative to the total number of processed reads.
HLA Twin™ has the ability to detect allele imbalance at a ratio of 10% minor allele to 90% major allele. There were no events of complete allele dropout, in which one allele was completely absent in the NGS data. Thus, detecting cases of true heterozygosity can be resolved by changing the thresholds for determining heterozygous bases in each software program. Subsequent releases of both programs allow for better sensitivity at distinguishing heterozygous samples.
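The imbalance rule described above can be sketched as a simple check on per-allele read counts. The 70% cut-off comes from the text. The function names and the read counts are hypothetical, and the genotyping software may well compute imbalance differently (for example, per key exon rather than per locus).

```python
from typing import Tuple


def allele_fractions(reads_a: int, reads_b: int) -> Tuple[float, float]:
    """Return (major, minor) read fractions for a heterozygous locus."""
    if reads_a < 0 or reads_b < 0:
        raise ValueError("read counts cannot be negative")
    total = reads_a + reads_b
    if total == 0:
        raise ValueError("no reads assigned to either allele")
    major = max(reads_a, reads_b) / total
    return major, 1.0 - major


def is_imbalanced(reads_a: int, reads_b: int, threshold: float = 0.70) -> bool:
    """Flag the locus when the major allele exceeds the threshold share of reads."""
    major, _ = allele_fractions(reads_a, reads_b)
    return major > threshold


# Invented read counts, for illustration only.
samples = {"balanced": (520, 480), "borderline": (700, 300), "skewed": (800, 200)}
flags = {name: is_imbalanced(a, b) for name, (a, b) in samples.items()}
for name, flagged in flags.items():
    major, minor = allele_fractions(*samples[name])
    print(f"{name}: major {major:.0%}, minor {minor:.0%}, imbalanced={flagged}")
```

A flagged locus would then be reviewed manually, since a strongly skewed heterozygote and a lightly contaminated homozygote can look alike in read counts.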
2.4.2 NGS software errors
The largest cause of discrepant results between NGS and older technologies was software related. Each software program has its own set of challenges with respect to determining the proper allele. These challenges are related to the design and implementation of each algorithm. Many of the incorrect allele calls made by HLA Twin™ were due to the algorithm preferentially calling alleles that had more exons characterized, even if there was an allele with better alignment but fewer exons, independent of the locus.
After the completion of typing by NGS, more than 99% of all allele calls were unambiguous with genotypes reported out to the third field. The two major sources of ambiguity in other HLA typing methods are not examining all exons of the gene and not being able to phase polymorphisms within or between those exons. The only ambiguities that were left unresolved at the third field level in the 73 samples were found in the DQA and DPB1 loci (Tables 5 and 6). There is a fourth field ambiguity that does not impact the allele call at three fields, and as such we did not include this in the counting of ambiguous alleles. There remains the possibility for ambiguities if the polymorphic positions occur at long distances (>1000 bp) that the MiSeq system cannot sequence in a paired-end configuration.
Table 7 NGS raw data by loci
Table 8 Summary of NGS data analysis results
We used 73 patient specimens to run full locus typing by NGS using the Illumina MiSeq Dx and SBT using the Genetic Analyzer 3500 xL Dx. For 93 patient specimens, all 11 detectable HLA loci typing results from NGS exactly matched SBT sequencing typing results. DQA1 and DPA1 were compared with PCR-SSO, as we do not perform them by SBT. Twenty of the 93 specimens were blind tested by another laboratory. Summary results are shown in Tables 7 and 8.
Seventy-three samples were processed using the Omixon HLA kit for NGS as well as SBT or SSO for comparison and validation purposes. SBT is the current method used for high-resolution typing in our laboratory, with the exception of DQA1 and DPA1. In some cases, data were insufficient at specific loci to determine typing by NGS (see Tables 7 and 8).
Table 2 describes the concordance between NGS and SBT or SSO for all loci. Overall concordance between NGS and SBT/SSO was 97.6%. Loci with ≤95% concordance were further evaluated. Locus A had one mismatch, and therefore 99.9% concordance. Locus B had 100% concordance. Locus C had 99.9% concordance. Locus DR had 98.5% concordance. Locus DR345 had 97.2% concordance. Locus DQA had 92.4% concordance. Locus DQB had 99.2% concordance. Locus DPA had 100% concordance. Locus DPB had 94.5% concordance.
DQA and DPB were further evaluated to resolve discrepancies. Eight of the ten discordant samples for DQA were run by SSP, and SSP confirmed the NGS result in all eight samples tested. The SSP confirmations increased the concordance for DQA to 97.9% (143/146). Six of the seven discordant samples for DPB were also tested by SSP to resolve the discrepancies. All six matched the NGS results, so the concordance for DPB increased to 98.6% (144/146). After further evaluation of the discordant samples, all loci were determined to have ≥95% concordance with SBT or SSO results. Other discordant alleles were confirmed by SSP, and the results matched NGS.
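The concordance figures above are simple ratios of matching allele calls to total allele calls, and the review rule is a threshold on the per-locus percentage. The sketch below reproduces the two post-SSP values quoted in the text (143/146 and 144/146) and applies the ≤95% rule to the per-locus percentages reported for the NGS versus SBT/SSO comparison. The function names are illustrative only.

```python
from typing import Dict, List


def concordance_pct(concordant: int, total: int) -> float:
    """Percentage of allele calls that agree, rounded to one decimal place."""
    if total <= 0 or not 0 <= concordant <= total:
        raise ValueError("need 0 <= concordant <= total and total > 0")
    return round(100.0 * concordant / total, 1)


def loci_needing_review(per_locus_pct: Dict[str, float], threshold: float = 95.0) -> List[str]:
    """Return loci whose concordance is at or below the review threshold."""
    return [locus for locus, pct in per_locus_pct.items() if pct <= threshold]


# Per-locus concordance (%) as reported in the text before SSP resolution.
REPORTED = {
    "A": 99.9, "B": 100.0, "C": 99.9, "DR": 98.5, "DR345": 97.2,
    "DQA": 92.4, "DQB": 99.2, "DPA": 100.0, "DPB": 94.5,
}

flagged = loci_needing_review(REPORTED)
dqa_after_ssp = concordance_pct(143, 146)
dpb_after_ssp = concordance_pct(144, 146)

print("loci flagged for further evaluation:", flagged)
print(f"DQA after SSP: {dqa_after_ssp}%  DPB after SSP: {dpb_after_ssp}%")
```

Applied to the reported values, the rule selects DQA and DPB, the two loci that were re-examined by SSP.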
DNA isolated from patients' whole blood was used to determine the limits of HLA allele detection (analytical sensitivity) of this assay. Nucleic acid extraction was performed using the Qiagen EZ-1 robot. Concentrations of DNA were 10 ng/μl, 5 ng/μl and 1 ng/μl, with 1 ng/μl also run separately using 3 more amplification cycles. The manufacturer's protocol states that 1-20 ng/μl may be used per reaction. Our results showed that 10 ng/μl and 1 ng/μl had no loss of loci, and results were equal. Seven buccal swab specimens did not produce results. One reason for this is that the concentration of the DNA samples was not exact and the DNA may have degraded over time.
Table 9 NGS blind test raw data by loci
Table 10 NGS summary of blind test analysis results
Twenty-four samples were processed using the Omixon HLA kit for NGS and by Immucor. As the test was blind, the results were obtained from another laboratory after our own results were reviewed. Comparative results are summarized in Tables 9 and 10. Overall concordance between Omixon and Immucor was 99.3%. Loci with ≤95% concordance were further evaluated. Locus A had 100% concordance. Locus B had 100% concordance. Locus C had 100% concordance. Locus DR had 100% concordance. Locus DR345 had 97.9% concordance. Locus DQA had 100% concordance [there were two mismatched loci between Omixon and Immucor at DQA, so the Omixon data were compared to another laboratory's results (blind testing), with which they matched exactly]. DQB had 97.9% concordance (in the first DQB test, Omixon did not reveal DQB alleles in 4 specimens, but on repeating the test, 3/4 revealed the alleles). Locus DPA had 100% concordance. Locus DPB had 97.9% concordance.
NGS for HLA typing faces many challenges for use in clinical laboratories. Those challenges are broken down into pre-analytic, analytic and post-analytic categories.
The most prominent pre-analytic challenges we experienced were HLA typing from low-concentration buccal swabs and rotating the known HLA typed samples to QC all the indices used in library preparation. The shorter DNA fragment sizes often obtained from buccal swabs prevent successful long-range PCR amplification and can lead to allele dropout. Fragmented DNA impacted allele dropout more than improper magnetic bead handling, technical error, or low DNA concentration. DNA being used for NGS should be quantified using a fluorometric dsDNA assay (Qubit, PicoGreen, etc.). UNC uses Qubit (Thermo Fisher Scientific) for DNA quantification prior to long-range PCR. We have successfully HLA typed samples with a concentration of 2 ng/μl and will attempt NGS on lower concentration samples on a case-by-case basis. However, DNA samples with low (less than 10 ng/μl) concentration can lead to allele dropout.
For Illumina sequencers, there is a control material available, PhiX, from which the sequencer will automatically determine an error rate based on the reference genome. PhiX is a small bacteriophage with a known genome that acts as a control for the MiSeq sequencing independent of NGS application [7].
Our typical MiSeq runs average an error rate of 0%-0.39%, which we monitor for instrument QA. The error rate of the system is related to the quality of the data being generated, but not specifically related to HLA genotype. As shown in Table 1, monitoring parameters such as error rate and cluster density can provide effective QA and QC. However, when over-clustering occurs, the number of clusters passing the filter decreases. Over-clustering prevents accurate determination of individual base pairs because the clusters of single molecules overlap. This prevents the Illumina sequencer from accurately identifying which base pair was incorporated. The impact of over-clustering is reduced overall data quality regardless of application. If over-clustering occurs, there is likely enough high-quality data to provide accurate HLA genotyping; however, utilizing a known HLA typed sample can aid in that determination. These incidences were noticed prior to data analysis, which allowed for expedited troubleshooting and prevented unnecessary delay in repeat testing.
Analytic challenges include determination of HLA alleles susceptible to allelic dropout (mostly HLA-DQA1 and DPB1 in our test), rare alleles, novel alleles, and homozygous samples. Homozygous specimens are especially difficult given the sensitivity of NGS to detect small amounts of DNA.
A typical NGS run on the MiSeq platform generates about 18 GB of total data. Most of the data is in the form of raw bcl files generated by the sequencer during each cycle and demultiplexed FASTQ files. A clinical laboratory must save the FASTQ files, InterOp folder, run info, run parameters, and analyzed data. These files, which account for approximately 7 GB of data when using a standard flow cell on the MiSeq platform, must be stored long-term. Each NGS platform will generate varying amounts of data; however, the need for a long-term storage solution is consistent. To address data storage challenges, laboratories should work with their information technology departments to obtain secure, hospital-controlled storage (4 TB) for long-term storage that is backed up daily. The amount of storage required depends on the number of NGS runs expected and the length of time raw data are stored (minimum of 2 years, ASHI standard). An alternative to onsite storage is cloud-based storage of data, which has varying pricing options depending on vendor and amount of storage required. Clinical application of NGS for HLA typing generates not only data storage challenges, but also turn-around time challenges.
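A rough storage estimate follows directly from these figures. The sketch below uses the approximately 7 GB retained per run, the 4 TB allocation and the two-year minimum retention period from the text. The run frequency of two runs per week is a hypothetical workload, 1 TB is taken as 1000 GB, and backup copies are not counted.

```python
def runs_that_fit(capacity_gb: float, per_run_gb: float) -> int:
    """Number of whole runs whose retained files fit in the given capacity."""
    if capacity_gb < 0 or per_run_gb <= 0:
        raise ValueError("capacity must be >= 0 and per-run size must be > 0")
    return int(capacity_gb // per_run_gb)


def retention_storage_gb(runs_per_week: float, retention_years: float, per_run_gb: float) -> float:
    """Storage needed to keep every run for the whole retention period."""
    return runs_per_week * 52 * retention_years * per_run_gb


PER_RUN_GB = 7.0       # retained files per standard flow cell run (from the text)
CAPACITY_GB = 4000.0   # 4 TB allocation, assuming 1 TB = 1000 GB
RETENTION_YEARS = 2.0  # minimum retention period cited in the text
RUNS_PER_WEEK = 2.0    # hypothetical workload, not from the study

max_runs = runs_that_fit(CAPACITY_GB, PER_RUN_GB)
needed_gb = retention_storage_gb(RUNS_PER_WEEK, RETENTION_YEARS, PER_RUN_GB)

print(f"runs that fit in 4 TB: {max_runs}")
print(f"storage for {RETENTION_YEARS:.0f} years at {RUNS_PER_WEEK:.0f} runs/week: {needed_gb:.0f} GB")
```

Under these assumptions the allocation comfortably covers the minimum retention period. Keeping the full 18 GB of raw output per run, or retaining data for longer, would change the estimate proportionally.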
NGS can reduce the ambiguity rate and cost of HLA typing. Interpretation of NGS data for clinical HLA typing is challenging due to a number of issues, including sample type, complexity of the HLA genes, and reliance on software for accuracy [8-9]. Like the application of any new technology, the best policies and practices for NGS-based HLA typing are evolving. Open discussion among regulatory agencies and clinical laboratories will facilitate standardization and implementation of newer technologies. HLA typing by NGS allows the histocompatibility and immunogenetics fields to ask new and important questions regarding HLA gene expression, regulation, and their impact on patient outcomes. These exciting areas will drive not only the future of the HLA field but also improvements in patient HLA matching.
This work was supported by Tianjin Science and Technology Commission (17ZXSCSY00100).