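Methods and applications for microbiome data analysis
Yongxin Liu1,2, Yuan Qin1,2,3, Xiaoxuan Guo1,2, Yang Bai1,2,3
1. State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101; 2. CAS–JIC Centre of Excellence for Plant and Microbial Science, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101; 3. College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing 100049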
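Advances in high-throughput sequencing have given rise to a series of microbiome research techniques, such as amplicon, metagenomic and metatranscriptomic sequencing, which have rapidly advanced the field. Microbiome data analysis draws on a large body of background knowledge, software and databases, which makes it difficult for researchers to learn the methods and choose suitable ones. This review outlines the basic ideas and background knowledge of microbiome data analysis, summarizes and compares the software and databases commonly used in amplicon and metagenomic analysis, and surveys several methods commonly used in downstream analysis of high-throughput data, including statistics and visualization, network analysis, phylogenetic analysis, machine learning and association analysis, in terms of usability, software selection and applications. By organizing and summarizing the current mainstream methods, we aim to provide a reference that helps researchers carry out data analysis more conveniently and flexibly, choose analysis tools quickly, and efficiently uncover the biological meaning behind their data, thereby further promoting microbiome research in biology.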
microbiome; data analysis; amplicon; metagenome; analysis pipeline
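The microbiome is the entirety of the genes and genomes of microorganisms, including bacteria, archaea, lower and higher eukaryotes, and viruses, together with their surrounding environment[1]. Studies have shown that the microbiome plays important roles in nutrient uptake[2], disease resistance[3] and environmental adaptation[4,5] in humans, animals and plants.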
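In recent years, the development of second-generation sequencing (next generation sequencing, NGS) has made culture-independent study of the microbiome possible and has brought microbiome research into a golden age[6]. Current studies of microbiome samples focus on three levels (Figure 1A): (1) The microbial culture level: culturomics (culturome) is the most important approach at this level. Culturable members of a community are obtained by picking single colonies from solid plates or by high-throughput liquid culture in 96-well plates, and are then identified and preserved using marker gene sequencing, isolation and purification. This approach has been applied and reported in humans[7], Arabidopsis (Arabidopsis thaliana)[8], rice (Oryza sativa)[9] and other species. (2) The DNA level: because DNA is easy to extract and store, researchers have successively developed amplicon, metagenome[10] and metavirome sequencing approaches[11]. Marker genes commonly used for amplicon sequencing include the 16S rRNA gene of prokaryotes, the 18S rRNA gene of eukaryotes and the internal transcribed spacers (ITS). Because amplicon sequencing yields only the taxonomic composition of the sample, metagenomic sequencing and analysis are needed to study the other functional genes carried by those taxa. (3) The mRNA level: metatranscriptome sequencing of RNA extracted from microbiome samples reveals the in situ function of the community from its gene expression profile[12]. Viruses comprise both DNA and RNA viruses, so a comprehensive metavirome study requires metagenomic sequencing combined with metatranscriptomic sequencing (Figure 1A).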
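Because the microbiome encodes nearly ten million genes[13], extracting useful information from this mass of data requires understanding and mastering the relevant software and databases, so that reproducible analyses can be run on a computer or server. Traditional biologists, however, often have limited bioinformatics knowledge and little experience with microbiome data, and so frequently struggle with using Linux, reusing code and choosing software. This review outlines the basic ideas and steps of current mainstream microbiome data analysis, offers suggestions for carrying it out, and summarizes the strengths, weaknesses and scope of commonly used methods, in the hope of helping colleagues analyze microbiome data more efficiently and uncover the biological patterns behind big data.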
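Figure 1 Overview of microbiome research methods
A: Common levels of microbiome research and the corresponding methods. By level, microbiome research is divided into microbial culture, DNA and mRNA; by technique, it includes culturomics (culturome), amplicon, metagenome, metavirome and metatranscriptome sequencing[1,12]. B: Basic steps of microbiome research. Sequencing-based microbiome research is divided into four stages: sample preparation, sequencing, data processing and statistical analysis. C: Basic steps, common environments and guiding idea of microbiome data analysis. Omics data analysis has three main steps; the text above each arrow gives the language environment commonly used for that step (Shell and/or R), and the text below each arrow gives the purpose of the step, namely turning big data into readable figures and tables through dimensionality reduction and visualization.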
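Microbiome research is divided into four stages (Figure 1B): (1) Sample preparation: based on a sound experimental design, microbiome samples are collected from humans, animals, plants or the environment, and DNA or RNA is extracted according to the aim of the study. (2) Meta-omics data generation: after DNA or RNA extraction, sequencing libraries are constructed and sequenced at high throughput. For example, 16S rRNA gene amplicons are mainly sequenced with paired-end 250 bp reads (PE250) at a depth of 30,000 to 50,000 sequences per sample, whereas metagenomes are mostly sequenced with PE150, with at least 20 million sequences from the microbial fraction (150 bp × 2 × 20 M = 6 Gb). (3) Data processing (quality control and quantification): the first step is quality control, which removes the primers and adapters added during library construction and sequencing, as well as low-quality sequences produced during sequencing. In addition, host-associated microbiome data contain large amounts of host sequence, which is removed by alignment to the host genome. The resulting clean data are then aligned to a reference database or to a de novo assembled reference gene set and quantified as a feature table; depending on the type of annotation, the feature table is either a taxonomic or a functional gene composition table. (4) Statistical analysis and visualization: the feature table is combined with sample metadata for statistical analysis and visualized with suitable plots, which helps in observing and summarizing biological patterns and makes the results easier to read and communicate (Figure 1B). This review focuses on stages 3 and 4.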
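The sequencing-depth arithmetic in stage (2) can be checked with a short calculation. This is a minimal sketch: the function name `sequencing_yield_gb` is hypothetical, 1 Gb is taken as 10^9 bases, and the amplicon example assumes that the quoted depth of up to 50,000 sequences per sample refers to read pairs.

```python
def sequencing_yield_gb(read_length_bp, reads_per_fragment, fragments):
    """Total sequenced bases in gigabases (assumes 1 Gb = 1e9 bases)."""
    return read_length_bp * reads_per_fragment * fragments / 1e9

# PE150 metagenome: 150 bp x 2 reads x 20 million fragments
metagenome_gb = sequencing_yield_gb(150, 2, 20_000_000)

# PE250 amplicon: 250 bp x 2 reads x 50,000 fragments (assumed to be read pairs)
amplicon_gb = sequencing_yield_gb(250, 2, 50_000)

print(metagenome_gb)  # 6.0
print(amplicon_gb)    # 0.025
```

Under these assumptions a single metagenome sample carries roughly 240 times more sequence than a single amplicon sample, which is one reason the downstream tooling for the two data types differs.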
Once raw microbiome data are in hand, how are they turned into readable, publication-quality figures and tables? For ease of understanding, we divide microbiome data analysis into three main steps (Figure 1C):
Step 1: converting raw data into a feature table. Microbiome data are usually fastq sequence files produced by NGS, containing base sequences and quality values, on the order of 10^6 to 10^9 sequences. Such data need to be quality-controlled and quantified with command-line tools in the efficient Shell environment, reducing them to a feature table on the order of 10^3 to 10^5. Feature tables are usually count data, such as taxonomy tables, operational taxonomic unit (OTU) tables, amplicon sequence variant (ASV) tables, gene abundance tables and pathway abundance tables.
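Step 2: converting the feature table into diversity measures and/or differential features. ASV tables and gene abundance tables are still large, so researchers commonly use Alpha or Beta diversity analysis, taxonomic or functional hierarchy annotation, and differential comparison to reduce the table further to the order of 10^1 to 10^3. Results at this scale are easier for researchers to interpret with domain knowledge in order to find patterns and answer biological questions.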
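Step 3: visualizing the data as publication-quality figures. Recent advances in visualization languages and tools have made data mining and interpretation more efficient; widely used plots such as line charts, bar charts, box plots, scatter plots and heat maps make it easier to find patterns in the data (Figure 1C).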
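The reduction in step 1, from fastq records to a count-type feature table, can be sketched on a toy scale. This is an illustrative sketch only, not how mothur, QIIME or USEARCH work internally: it assumes four-line fastq records with Phred+33 quality encoding, uses a mean-quality cutoff of 20 chosen purely for illustration, and treats each distinct sequence as one feature. The function names `read_fastq`, `mean_quality` and `count_features` are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record fastq stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file (or a blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header[1:], seq, qual

def mean_quality(qual, offset=33):
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def count_features(handle, min_mean_q=20):
    """Drop low-quality reads, then count identical sequences (a toy feature table)."""
    counts = {}
    for _, seq, qual in read_fastq(handle):
        if mean_quality(qual) >= min_mean_q:
            counts[seq] = counts.get(seq, 0) + 1
    return counts

# Four toy reads; r3 has quality '!' (Phred 0) and is filtered out.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\n!!!!\n"
    "@r4\nGGCC\n+\nIIII\n"
)
table = count_features(demo)
print(table)  # {'ACGT': 2, 'GGCC': 1}
```

Real pipelines add primer removal, denoising or clustering, and chimera filtering before counting, but the shape of the result is the same: many reads collapse into far fewer features with counts.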
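Across the whole analysis, dimensionality reduction and visualization are the core guiding ideas of big-data analysis: the data are reduced to a readable size and then visualized so that other researchers can read and share them. Two language environments are mainly involved. The Shell language on Linux, together with command-line tools, is used for processing and reducing the big data, and the R language (https://www.r-project.org) is then used for statistics and visualization based on the feature table. Familiarity with the basics of Shell and R therefore covers the great majority of needs in microbiome data analysis. Perl, Python, Java and other languages are also common in microbiome analysis, but mostly as software and scripts run from the Shell environment; users can choose the language environment for analysis and visualization that suits their background and habits.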
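The second reduction, from a feature table to diversity measures, can be sketched in the same spirit. The example below computes the Shannon index as one Alpha diversity measure and Bray-Curtis dissimilarity as one Beta diversity measure; these two metrics are chosen here for illustration and are not prescribed by the text. The toy feature table and the function names are hypothetical, and in practice such calculations would normally be done with an established package such as vegan.

```python
import math

def shannon(counts):
    """Shannon index (natural log) of one sample's feature counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples over the same features."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    return numerator / (sum(a) + sum(b))

# Toy feature table: one list of counts per sample, features in the same order.
feature_table = {
    "sampleA": [10, 10, 10, 10],  # four evenly abundant features
    "sampleB": [40, 0, 0, 0],     # one dominant feature
    "sampleC": [10, 10, 10, 10],  # identical to sampleA
}

alpha = {name: shannon(counts) for name, counts in feature_table.items()}
beta_ab = bray_curtis(feature_table["sampleA"], feature_table["sampleB"])
beta_ac = bray_curtis(feature_table["sampleA"], feature_table["sampleC"])

print(alpha["sampleA"], alpha["sampleB"])
print(beta_ab, beta_ac)  # 0.75 0.0
```

Each sample is reduced to a single Alpha diversity number and each pair of samples to a single distance, which is what makes the results small enough to plot and interpret.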
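Microbiome data analysis is carried out in dedicated language environments, and familiarity with the common ones helps in making good use of existing tools. Most analysis tools in the field currently run in either Shell or R. Almost all servers run Linux; the default Shell environment provides hundreds of built-in commands, and Bioconda offers nearly ten thousand bioinformatics packages, so analysis pipelines can be assembled quickly[14]. R is free and open source: CRAN (https://cran.r-project.org/) hosts 14,767 statistics and visualization packages, and Bioconductor (http://www.bioconductor.org) hosts a further 1741 biology-specific packages (counts as of 20 August 2019), allowing highly flexible statistical analysis. With a grounding in both languages, existing software can be used efficiently for data analysis, statistics and visualization. We focus on Shell and R because so many biological tools and packages are available in these two environments that users can chain existing tools together with very little code, which makes them comparatively easy for beginners to learn and apply.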
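Shell is the set of commands for interacting with a Linux system. Almost all microbiome analysis tools can run in the Shell environment of a Linux server, whereas building analysis pipelines in other environments is very difficult. Windows users need to install software for remote access to a Linux server, such as XShell, putty or ssh secure shell; we recommend XShell, which is commercially developed and free for schools. Mac systems have a UNIX-like kernel, and the built-in Terminal program provides remote access to Linux. R ships with the graphical interface RGui for interactive statistical analysis and visualization, and the Rscript command runs R scripts from the command line. The integrated development environment RStudio (https://www.rstudio.com/), which has developed rapidly over the past two years, has supported editing and running both Shell and R scripts since its upgrade to version 1.1 in 2018. RStudio is cross-platform and installs easily on Windows, Mac and Linux; a server edition runs in a web browser, so the same working environment is available from any terminal without installing additional programs. Researchers new to data analysis can use RStudio to learn data analysis, code management, debugging, and adjusting and saving result figures.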
Beyond a good tool for managing analysis code, one also needs enough grounding in the languages to read the code before existing pipelines and methods can be used and modified. Researchers whose work centers on data analysis are advised to study the basics of Shell and R systematically. For Shell, we recommend 《鳥哥的LINUX私房菜基礎學習篇(第四版)》, focusing on basic Linux commands, the file system and Shell scripting; server administrators also need the material on system and user management. For R, we recommend ggplot2: Elegant Graphics for Data Analysis (2nd edition)[15], a reference that is very helpful for understanding the range of plot types, the principles of plotting and how to visualize data. In addition, for beginners and occasional users, the background notes and annotated code compiled online by other researchers may be more targeted and up to date.
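Over the past 10 years, with the development and application of high-throughput sequencing, analysis methods and tools for microbiome research have also developed rapidly, and a large number of excellent programs, pipelines and visualization tools have been released, further advancing the field.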
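Amplicon analysis is the most widely used technique in the microbiome field and quickly reveals the microbial diversity of a sample. Here we focus on three amplicon analysis packages (mothur, QIIME and USEARCH) that were published within the past 10 years and have each been cited more than ten thousand times (Figure 2); other related software is described in Table 1.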
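(1) mothur: the first amplicon analysis pipeline, released in 2009 by the group of Professor Patrick D. Schloss at the University of Michigan, USA[16]. It integrated the previously published OTU definition program DOTUR[17], the OTU comparison tool SONS[18] and other available tools into the first reasonably complete pipeline, making amplicon analysis feasible for a broad range of researchers (Figure 2).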
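(2) QIIME: in 2010, the group of Professor Rob Knight at the University of Colorado, USA (now at the University of California San Diego) released the QIIME (pronounced "chime") pipeline[19]. It runs on Linux or Mac and has several advantages over mothur: it integrates more than 200 related programs and packages, offering a wider choice of software and methods at each step; it provides more than 150 scripts for customized analyses that handle different data types and experimental designs; it is highly open, so new software and methods are easy to integrate; and it has stronger statistics and visualization, covering diversity, taxonomic composition, differential comparison, networks and many other methods, with publication-quality figures. Because QIIME lets researchers customize the analysis and visualization of amplicon data with considerable freedom, it gradually became the most popular software in the field (Figure 2). To meet growing data volumes and the need for reproducible computation, Professor J. Gregory Caporaso launched the QIIME 2 project in 2016, rewritten from scratch in Python 3[20]. The project makes the analysis workflow traceable to meet the requirements of reproducible research, and introduces a series of new algorithms, such as Striped UniFrac, a fast algorithm for phylogenetic distances[21], and q2-feature-classifier, a new method for taxonomic classification[22]. More importantly, the software is extensible and has received broad support from international colleagues, with tools such as cutadapt for removing adapters and primers[23], the R package DADA2 for sequence quality control[24], VSEARCH for clustering and dereplication[25] and longitudinal for longitudinal and paired-sample analysis[26], and even metagenomic and metabolomic analysis and Chinese-language documentation, all of which greatly broaden the scope and usability of the pipeline.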
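Figure 2 Important software and algorithms in the microbiome field over the past 10 years
Orange: the mothur pipeline developed by Professor Patrick D. Schloss; green: the QIIME series of pipelines developed under the direction of Professor Rob Knight; blue: software and algorithms written by the independent researcher Robert Edgar.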
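(3) USEARCH-based amplicon analysis pipeline. Although two reasonably complete amplicon pipelines had been released, many problems in the field remained unresolved. Robert Edgar, an independent researcher and bioinformatician with a background in physics, developed a series of classic algorithms and programs, including the high-speed sequence search tool USEARCH[27], the chimera detection tool UCHIME[28], UPARSE for identifying OTU representative sequences[29], and UNOISE for error filtering and denoising of sequencing data[30]. These greatly improved the speed and accuracy of amplicon data analysis. Building on them, Edgar gradually developed USEARCH into a complete amplicon pipeline with nearly 200 commands that is cross-platform, compact, dependency-free and easy to install. The 32-bit version is free, while the 64-bit commercial and non-profit versions cost US$1485 and US$885, respectively. Laboratories that can afford it are encouraged to buy a licence: the software is fast and easy to use, which lowers the cost of getting started and saves valuable time. Similar tools have also appeared, such as VSEARCH[25], which is 64-bit and completely free and implements the core functions of USEARCH, although with somewhat fewer downstream analysis functions.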
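In terms of ease of use, newcomers to amplicon analysis are advised to start with USEARCH[27] or VSEARCH[25], which allow projects of up to several hundred samples to be analyzed on a Windows or Mac laptop. Researchers with some experience and access to a Linux server can go on to learn QIIME 2 for a wider range of analysis methods.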
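Statistical analysis and visualization are usually done in R. R packages commonly used for amplicon data include vegan[31], phyloseq[32] and microbiome[33]. vegan is a community ecology package that supports diversity analysis, principal coordinates analysis and related methods; it is widely used in microbial ecology and has even given rise to ggvegan, a ggplot2-based version[31]. The phyloseq package[32] mainly provides diversity analysis, differential comparison and visualization. For users without R experience, phyloseq also offers the web tool shiny-phyloseq[34], which enables interactive analysis of amplicon data in a browser. The microbiome package[33] includes more than 80 functions for diversity, core OTUs, taxonomic composition, correlation and format conversion, improving the efficiency of microbiome analysis.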
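In recent years, shotgun metagenomic sequencing has developed further as throughput has risen and prices have fallen, accompanied by the development and publication of a large number of related programs (Table 2). Compared with amplicon sequencing, metagenomic sequencing yields not only an unbiased taxonomic composition but also the functional composition of the sample, and can even assemble draft genomes for some of the microorganisms.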
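For well-studied systems such as the human gut, taxonomic and functional composition can be quantified quickly by alignment to reference databases, for example with MetaPhlAn2[47] or Kraken2[48] for taxonomic classification of sequences and HUMAnN2[49] for functional quantification. For systems lacking high-quality metagenomic reference databases, the data must be assembled de novo, followed by gene prediction. Commonly used metagenome assemblers include MEGAHIT[70] and metaSPAdes[50], and gene annotation tools include Prokka[51] and GeneMarkS-2[52] (Table 2). When metagenomic data from multiple samples or batches are analyzed together, CD-HIT[53] is usually used to build a non-redundant gene catalog, so that all samples are quantified and compared against the same reference sequences. The resulting gene set is then aligned to several protein function databases, which offer additional perspectives on the biological meaning of the data; commonly used ones include the carbohydrate-active enzyme database CAZy[54], the Comprehensive Antibiotic Resistance Database CARD[55] and the virulence factor database VFDB[56].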
Table 2 Software and databases commonly used in metagenomic analysis
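Beyond revealing taxonomic and functional composition, metagenomic sequencing can also recover genomes of individual organisms through binning. Binning software has developed rapidly in recent years, making it possible to obtain genomes of unculturable microorganisms. Commonly used binning tools include MetaBAT 2[57], MaxBin 2[58] and CONCOCT[59], but their results differ considerably. Two bin refinement tools published in 2018, metaWRAP[60] and DAS_Tool[61], address the difficulty of choosing a binning tool and the variability of the results: they typically integrate the output of three to five binning tools, then filter and combine it to obtain higher-quality genomes, and also provide common functions such as bin quantification and annotation. Note that genomes recovered by binning can be incomplete and highly contaminated. Improving the completeness of genomes assembled from metagenomes will therefore depend on improved experimental methods paired with dedicated analysis methods; for example, single-cell sorting by flow cytometry[62], 10× library construction[63] and hybrid second- and third-generation sequencing[64,65] have given good results in metagenome assembly and binning. Software and databases commonly used in metagenomic analysis are summarized in Table 2.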
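The taxonomic and functional composition tables produced by amplicon and metagenomic analysis are collectively called feature tables. They are a common format for second-generation sequencing results, and in downstream analysis they can be transformed and presented with a wide choice of R packages, graphical interfaces, command-line tools and web tools. Bioconductor provides more than a thousand R packages for biological data analysis. For example, for count data one can use edgeR[79] or DESeq2[80], which test for differences using negative binomial models; for compositional data one can use limma[81]; and for comparisons that adjust for known covariates one can use lme4[82], which supports generalized linear mixed-effects models. STAMP is a cross-platform statistical analysis tool with a graphical interface developed for microbiome data[83]; it performs principal component analysis and two-group or multi-group comparisons with a range of statistical methods, and presents results as scatter plots, box plots, bar charts, heat maps and extended bar charts. LEfSe is a command-line tool that finds discriminating features using linear discriminant analysis[84]; results can be shown as bar charts or as cladograms drawn with GraPhlAn[85], and researchers without a Linux server or unfamiliar with the command line can use the web version of LEfSe. There are also platforms that collect microbiome tools and offer online analysis and visualization in the browser; MicrobiomeAnalyst[86], for example, takes a feature table and metadata and supports data filtering, normalization, diversity analysis, differential comparison, machine learning and other analyses and visualizations.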
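Network analysis is a discipline based on graph theory, and its distinctive perspective and intuitive visual results have led to wide use in microbiome data analysis. In 2018, one review systematically described the strengths, weaknesses, scope and selection criteria of the current mainstream network analysis methods[87], and another described the role and significance of network graphs in studying community structure[88]. In addition, the tutorial 《Co-occurrence網絡圖在R中的實現》 ("Implementing co-occurrence network graphs in R"), posted by Chen Liang in 2017 on the 宏基因組 public account, introduces the basic concepts and a concrete implementation and is a useful reference. Commonly used methods include the web tool MENAP[89], local similarity analysis LSA[90], SPARCC, a correlation algorithm designed for sparse microbiome data[91], CoNet[93], which is used as a plugin for Cytoscape[92], and the R packages WGCNA[94] and SpiecEasi[95]. The analysis is fairly easy to carry out in practice: for example, in R the WGCNA package[94] can be used to compute network properties and the igraph package[96] to visualize the network. For further analysis and fine adjustment of the visualization, the network data can be imported into Cytoscape[92] or Gephi[97]. Network analysis has been applied in studies of pH and microbial community assembly[98], microbiome structure in women with gestational diabetes versus healthy pregnant women, and recovery of oral microbial community structure after dental cleaning[99,100].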
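The basic idea behind a co-occurrence network can be sketched as follows: compute a pairwise correlation between features across samples and keep the strongly correlated pairs as edges. This sketch uses plain Spearman correlation with a fixed cutoff; it is not the algorithm of SPARCC, CoNet, WGCNA or SpiecEasi, which handle compositionality and sparsity in ways this example does not. The abundance values, the cutoff of 0.8 and the function names are all hypothetical.

```python
import math
from itertools import combinations

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; both inputs must be non-constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def cooccurrence_edges(abundance, threshold=0.8):
    """Return (feature_a, feature_b, rho) for pairs with |rho| >= threshold."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        rho = spearman(abundance[a], abundance[b])
        if abs(rho) >= threshold:
            edges.append((a, b, rho))
    return edges

# Toy abundances of four features across five samples.
abundance = {
    "OTU_1": [1, 2, 3, 4, 5],
    "OTU_2": [2, 4, 6, 9, 20],  # rises with OTU_1
    "OTU_3": [9, 7, 5, 3, 1],   # falls as OTU_1 rises
    "OTU_4": [3, 1, 4, 1, 5],   # only weakly related
}
edges = cooccurrence_edges(abundance)
for a, b, rho in edges:
    print(a, b, round(rho, 3))
```

The resulting edge list (three edges among OTU_1, OTU_2 and OTU_3, with OTU_4 left unconnected) is the kind of table that can be imported into a network viewer for layout and styling. A real analysis would also need significance testing and multiple-testing correction, which this sketch omits.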
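Microbiome data are well suited to phylogenetic analysis. Studies of a single species must collect homologous genes from many related studies, whereas amplicon sequencing in a microbiome study directly yields thousands of homologous gene sequences, which makes it convenient to study phylogenetic relationships. Phylogenetic analysis consists of three basic steps: multiple sequence alignment, tree construction and tree decoration. Because microbiome data contain many, highly complex sequences, fast tools are needed. Multiple sequence alignment can be done with MAFFT[101] or MUSCLE[102]; trees can be built with FastTree[103] or IQ-TREE[104,105]; and the tree can then be visualized and decorated online with Evolview[106] or iTOL[107]. We recommend using the R script table2itol (https://github.com/mgoeker/table2itol) to format the table of taxonomy and abundance for each sequence as iTOL input files. The R package ggtree can also annotate and decorate trees[108]. For cladograms showing the hierarchy of taxonomic annotation, we recommend GraPhlAn[85]. Metagenomic sequencing is shotgun sequencing of random fragments, so phylogenetic analysis requires OrthoFinder[77] to identify single-copy orthologous genes from the binning results and build a multi-gene tree.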
Machine learning is currently the most active area of research in computer algorithms. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge so as to keep improving their own performance[109]. Machine learning methods commonly used in the microbiome field include random forest, support vector machine (SVM) and Adaboost. Random forest classification has been applied to dietary pattern typing[110], disease diagnosis[111] and prediction of plant subspecies[9]; random forest regression has been widely applied to infant nutrition and health[2], forensics[112] and time series prediction[113]. Random forest analysis can be performed in R with the randomForest package[114]. Deep learning is a more recent development in machine learning; a study recently posted on the preprint server bioRxiv reported that deep learning on gut microbiota data can accurately predict a person's chronological age[115], and the study was also covered in journal news.
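Many analysis methods from other fields have also been adopted in microbiome research. Genome-wide association studies (GWAS)[116] have played a major role in identifying genes associated with human diseases and are now being used in the microbiome field for large-scale exploration of the regulatory relationships between humans and their microbiomes[117,118] and between plant microbiomes and yield[119]. Many methods for association analysis with environmental factors are also widely used in microbial ecology, for example to show that temperature[120], pH[121] and salinity[122] are determinants of microbial community structure in different environments. More downstream analysis tools are described in Table 3.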
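Many of the analyses and visualizations in published papers are not based on published software but were programmed by the authors themselves. Reproducing such methods and figures from the methods description alone, by combining tools or writing code oneself, is very challenging. Many papers now provide their analysis code, linked from the "Code Available" section and hosted on GitHub or similar code hosting sites. With the authors' shared code and test data it is much easier to reproduce the published analysis, to substitute one's own data once the method is understood, and even to modify the analysis in the source code to obtain more reasonable results. Reproducible analysis code greatly improves efficiency and saves researchers a great deal of time developing code. Table 4 lists some laboratories that provide reproducible analysis code.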
Table 3 General-purpose tools for downstream microbiome analysis
Table 4 Some laboratories that provide statistical analysis code
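Over the past 10 years, rising throughput and falling prices of second-generation sequencing have greatly advanced the microbiome field, extending the depth and breadth of microbiome research and revealing the microbial composition and function of extreme environments, plants, animals, the human gut, oceans, soils and other systems[6]. Metagenomic studies currently rely mainly on data from short-read platforms, namely the Illumina Seq/Nova series and the BGI Seq series from BGI. These deliver high throughput, but assembly quality still has considerable room for improvement. In recent years, third-generation sequencing technologies from Pacific BioSciences (PacBio) and Oxford Nanopore Technologies (ONT) have developed rapidly; although they are hampered by high error rates and a lack of supporting software, their advantages in read length and sequencing speed are becoming increasingly apparent. Charalampous et al.[134] applied ONT sequencing to bacterial metagenomes from patients' respiratory tracts and identified pathogens within 6 h.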
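Amplicon sequencing is currently the most widely used technique in microbiome research. It quickly reveals the microbial composition of a community and is simple, inexpensive, effective at avoiding host contamination and well suited to large-scale studies. However, amplicon studies are limited to the taxa whose DNA can be amplified by the primers, and are affected by copy number and polymorphism of the amplified gene. For a fuller picture of the microbiome and its functional genes, metagenomics is the more effective approach. Metagenomics not only obtains, without bias, sequence information for bacteria, fungi, archaea, viruses, protozoa and every other organism whose genetic material is DNA, and determines their taxonomic and functional composition, but also has the potential to recover functional genes, and even draft genomes, of uncultured species. Although some tools for metagenomic binning and bin refinement exist, they are still at an early stage of development, with many directions for improvement, such as computing k-mer frequencies at different lengths, reducing complexity by removing known species through alignment to reference databases, and/or incorporating third-generation long-read data[64,135].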
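High-quality reference databases are the foundation for more efficient microbiome data analysis, and progress here depends on large-scale culturomics and the release of more high-quality reference genomes. Curating published data, making it more usable and mining it further are also essential. For example, the R package curatedMetagenomicData curated 8184 metagenomic samples from 46 studies, applying strict quality control to more than 100 TB of raw data to produce taxonomic and functional composition tables that other researchers can mine and query[136]. The ML Repo database collects 33 classification and age-regression datasets on the human microbiome, covering IBD, diabetes, obesity, cancer and more, from 15 papers; researchers can browse and download them by category for further mining and method evaluation[137]. The group of Nicola Segata at the University of Trento, Italy, used 9428 metagenomes from people of different geographic locations, lifestyles and ages to reconstruct 150,000 draft genomes of human-associated microbes, a breakthrough result[138]. These examples of curating and reusing published data offer a model for developing more databases and analysis tools based on published data.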
[1] Marchesi JR, Ravel J. The vocabulary of microbiome research: a proposal.,2015, 3(1): 31.
[2] Subramanian S, Huq S, Yatsunenko T, Haque R, Mahfuz M, Alam MA, Benezra A, DeStefano J, Meier MF, Muegge BD, Barratt MJ, VanArendonk LG, Zhang Q, Province MA, Petri WA Jr, Ahmed T, Gordon JI. Persistent gut microbiota immaturity in malnourished Bangladeshi children.,2014, 510: 417–421.
[3] Bai Y, Qian JM, Zhou JM, Qian W. Crop Microbiome: breakthrough technology for agriculture. Bulletin of Chinese Academy of Sciences, 2017, 32(3): 260–265. 白洋, 錢景美, 周儉民, 錢韋. 農作物微生物組:跨越轉化臨界點的現代生物技術. 中國科學院院刊, 2017, 32(3): 260–265.
[4] Wang J, Jia H. Metagenome-wide association studies: fine-mining the microbiome.,2016, 14: 508–522.
[5] Xie JP, Han YB, Liu G, Bai LQ. Research advances on microbial genetics in China in 2015. Hereditas (Beijing), 2016, 38(9): 765–790. 謝建平, 韓玉波, 劉鋼, 白林泉. 2015年中國微生物遺傳學研究領域若干重要進展. 遺傳, 2016, 38(9): 765–790.
[6] White RA Ⅲ, Callister SJ, Moore RJ, Baker ES, Jansson JK. The past, present and future of microbiome analyses.,2016, 11: 2049–2053.
[7] Zou Y, Xue W, Luo G, Deng Z, Qin P, Guo R, Sun H, Xia Y, Liang S, Dai Y, Wan D, Jiang R, Su L, Feng Q, Jie Z, Guo T, Xia Z, Liu C, Yu J, Lin Y, Tang S, Huo G, Xu X, Hou Y, Liu X, Wang J, Yang H, Kristiansen K, Li J, Jia H, Xiao L. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses.,2019, 37(2): 179–185.
[8] Bai Y, Müller DB, Srinivas G, Garrido-Oter R, Potthoff E, Rott M, Dombrowski N, Münch PC, Spaepen S, Remus-Emsermann M, Hüttel B, McHardy AC, Vorholt JA, Schulze-Lefert P. Functional overlap of the Arabidopsis leaf and root microbiota.,2015, 528(7582): 364–369.
[9] Zhang J, Liu YX, Zhang N, Hu B, Jin T, Xu H, Qin Y, Yan P, Zhang X, Guo X, Hui J, Cao S, Wang X, Wang C, Wang H, Qu B, Fan G, Yuan L, Garrido-Oter R, Chu C, Bai Y. NRT1.1B is associated with root microbiota composition and nitrogen use in field-grown rice.,2019, 37(6): 676–684.
[10] Shi W, Li M, Wei G, Tian R, Li C, Wang B, Lin R, Shi C, Chi X, Zhou B, Gao Z. The occurrence of potato common scab correlates with the community composition and function of the geocaulosphere soil microbiome.,2019, 7(1): 14.
[11] Ma Y, You X, Mai G, Tokuyasu T, Liu C. A human gut phage catalog correlates the gut phageome with type 2 diabetes.,2018, 6(1): 24.
[12] Yu K, Yi S, Li B, Guo F, Peng X, Wang Z, Wu Y, Alvarez-Cohen L, Zhang T. An integrated meta-omics approach reveals substrates involved in synergistic interactions in a bisphenol A (BPA)-degrading microbial community.,2019, 7(1): 16.
[13] Li J, Jia H, Cai X, Zhong H, Feng Q, Sunagawa S, Arumugam M, Kultima JR, Prifti E, Nielsen T, Juncker AS, Manichanh C, Chen B, Zhang W, Levenez F, Wang J, Xu X, Xiao L, Liang S, Zhang D, Zhang Z, Chen W, Zhao H, Al-Aama JY, Edris S, Yang H, Wang J, Hansen T, Nielsen HB, Brunak S, Kristiansen K, Guarner F, Pedersen O, Doré J, Ehrlich SD, MetaHIT Consortium, Bork P, Wang J, Pons N, Le Chatelier E, Batto JM, Kennedy S, Haimet F, Winogradski Y, Pelletier E, LePaslier D, Artiguenave F, Bruls T, Weissenbach J, Turner K, Parkhill J, Antolin M, Casellas F, Borruel N, Varela E, Torrejon A, Denariaz G, Derrien M, van Hylckama Vlieg JET, Viega P, Oozeer R, Knoll J, Rescigno M, Brechot C, M'Rini C, Mérieux A, Yamada T, Tims S, Zoetendal EG, Kleerebezem M, de Vos WM, Cultrone A, Leclerc M, Juste C, Guedon E, Delorme C, Layec S, Khaci G, van de Guchte M, Vandemeulebrouck G, Jamet A, Dervyn R, Sanchez N, Blottière H, Maguin E, Renault P, Tap J, Mende DR. An integrated catalog of reference genes in the human gut microbiome.,2014, 32(8): 834–841.
[14] Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J, Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences.,2018, 15(7): 475–476.
[15] Wickham H. ggplot2: elegant graphics for data analysis., 2016.
[16] Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, van Horn DJ, Weber CF. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities.,2009, 75(23): 7537–7541.
[17] Schloss PD, Handelsman J. Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness.,2005, 71(3): 1501–1506.
[18] Schloss PD, Handelsman J. Introducing SONS, a tool for operational taxonomic unit-based comparisons of microbial community memberships and structures.,2006, 72(10): 6773–6779.
[19] Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R. QIIME allows analysis of high-throughput community sequencing data.,2010, 7(5): 335–336.
[20] Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo- Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS 2nd, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, Caporaso JG. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2.,2019, 37(8): 852–857.
[21] McDonald D, Vázquez-Baeza Y, Koslicki D, McClelland J, Reeve N, Xu Z, Gonzalez A, Knight R. Striped UniFrac: enabling microbiome analysis at unprecedented scale.,2018, 15(11): 847–848.
[22] Bokulich NA, Kaehler BD, Rideout JR, Dillon M, Bolyen E, Knight R, Huttley GA, Gregory Caporaso J. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin.,2018, 6(1): 90.
[23] Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads., 17(1), doi: 10.14806/ej.17.1.200.
[24] Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data.,2016, 13(7): 581–583.
[25] Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: a versatile open source tool for metagenomics.,2016, 4: e2584.
[26] Bokulich NA, Dillon MR, Zhang Y, Rideout JR, Bolyen E, Li H, Albert PS, Caporaso JG. Q2-longitudinal: longitudinal and paired-sample analyses of microbiome data.,2018, 3(6): e00219–00218.
[27] Edgar RC. Search and clustering orders of magnitude faster than BLAST.,2010, 26(19): 2460–2461.
[28] Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. UCHIME improves sensitivity and speed of chimera detection.,2011, 27(16): 2194–2200.
[29] Edgar RC. UPARSE: highly accurate OTU sequences from microbial amplicon reads.,2013, 10(10): 996–998.
[30] Edgar RC, Flyvbjerg H. Error filtering, pair assembly and error correction for next-generation sequencing reads.,2015, 31(21): 3476–3482.
[31] Oksanen J, Kindt R, Legendre P, O’Hara B, Stevens MHH, Oksanen MJ, Suggests M. The vegan package.,2007, 10: 631–637.
[32] McMurdie PJ, Holmes S. Phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data.,2013, 8(4): e61217.
[33] Lahti L, Shetty S. Microbiome R package.,2012-2019. doi: 10.18129/B9.bioc.microbiome.
[34] McMurdie PJ, Holmes S. Shiny-phyloseq: web application for interactive microbiome analysis with provenance tracking.,2014, 31(2): 282–283.
[35] Gonzalez A, Navas-Molina JA, Kosciolek T, McDonald D, Vázquez-Baeza Y, Ackermann G, DeReus J, Janssen S, Swafford AD, Orchanian SB, Sanders JG, Shorenstein J, Holste H, Petrus S, Robbins-Pianka A, Brislawn CJ, Wang M, Rideout JR, Bolyen E, Dillon M, Caporaso JG, Dorrestein PC, Knight R. Qiita: rapid, web-enabled microbiome meta-analysis.,2018, 15(10): 796–798.
[36] Mitchell AL, Scheremetjew M, Denise H, Potter S, Tarkowska A, Qureshi M, Salazar GA, Pesseat S, Boland MA, Hunter FMI, Ten Hoopen P, Alako B, Amid C, Wilkinson DJ, Curtis TP, Cochrane G, Finn RD. EBI Metagenomics in 2017: enriching the analysis of microbial communities, from sequence reads to assemblies.,2018, 46(D1): D726–D735.
[37] Shi W, Qi H, Sun Q, Fan G, Liu S, Wang J, Zhu B, Liu H, Zhao F, Wang X, Hu X, Li W, Liu J, Tian Y, Wu L, Ma J. GcMeta: a global catalogue of metagenomics platform to support the archiving, standardization and analysis of microbiome data.,2018, 47(D1): D637–D648.
[38] McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, Andersen GL, Knight R, Hugenholtz P. An improved greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea.,2012, 6(3): 610–618.
[39] Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.,2013, 41 (Database issue): D590–596.
[40] Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, Brown CT, Porras-Alfaro A, Kuske CR, Tiedje JM. Ribosomal Database Project: data and tools for high throughput rRNA analysis.,2014, 42(Database issue): D633–D642.
[41] Nilsson RH, Larsson K-H, Taylor AFS, Bengtsson-Palme J, Jeppesen TS, Schigel D, Kennedy P, Picard K, Glöckner FO, Tedersoo L, Saar I, Kõljalg U, Abarenkov K. The UNITE database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications.,2019, 47(D1): D259–D264.
[42] Langille MGI, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, Clemente JC, Burkepile DE, Vega Thurber RL, Knight R, Beiko RG, Huttenhower C. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences.,2013, 31(9): 814–821.
[43] Aßhauer KP, Wemheuer B, Daniel R, Meinicke P. Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data.,2015, 31(17): 2882–2884.
[44] Louca S, Parfrey LW, Doebeli M. Decoupling function and taxonomy in the global ocean microbiome.,2016, 353(6305): 1272–1277.
[45] Ward T, Larson J, Meulemans J, Hillmann B, Lynch J, Sidiropoulos D, Spear JR, Caporaso G, Blekhman R, Knight R, Fink R, Knights D. BugBase predicts organism-level microbiome phenotypes.,2017: 133462.
[46] Nguyen NH, Song Z, Bates ST, Branco S, Tedersoo L, Menke J, Schilling JS, Kennedy PG. FUNGuild: an open annotation tool for parsing fungal community datasets by ecological guild.,2016, 20: 241–248.
[47] Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, Tett A, Huttenhower C, Segata N. MetaPhlAn2 for enhanced metagenomic taxonomic profiling.,2015, 12(10): 902–903.
[48] Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments.,2014, 15(3): R46.
[49] Franzosa EA, McIver LJ, Rahnavard G, Thompson LR, Schirmer M, Weingart G, Lipson KS, Knight R, Caporaso JG, Segata N, Huttenhower C. Species-level functional profiling of metagenomes and metatranscriptomes.,2018, 15(11): 962–968.
[50] Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. MetaSPAdes: a new versatile metagenomic assembler.,2017, 27(5): 824–834.
[51] Seemann T. Prokka: rapid prokaryotic genome annotation.,2014, 30(14): 2068–2069.
[52] Lomsadze A, Gemayel K, Tang S, Borodovsky M. Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes.,2018, 28(7): 1079–1089.
[53] Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data.,2012, 28(23): 3150–3152.
[54] Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B. The carbohydrate-active enzymes database (CAZy) in 2013.,2014, 42(Database issue): D490–D495.
[55] Jia B, Raphenya AR, Alcock B, Waglechner N, Guo P, Tsang KK, Lago BA, Dave BM, Pereira S, Sharma AN, Doshi S, Courtot M, Lo R, Williams LE, Frye JG, Elsayegh T, Sardar D, Westman EL, Pawlowski AC, Johnson TA, Brinkman FSL, Wright GD, McArthur AG. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database.,2017, 45(D1): D566–D573.
[56] Liu B, Zheng D, Jin Q, Chen L, Yang J. VFDB 2019: a comparative pathogenomic platform with an interactive web interface.,2019, 47(D1): D687– D692.
[57] Kang D, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies.,2019, 7: e7359.
[58] Wu YW, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets.,2015, 32(4): 605–607.
[59] Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C. Binning metagenomic contigs by coverage and composition.,2014, 11(11): 1144– 1146.
[60] Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis.,2018, 6(1): 158.
[61] Sieber CMK, Probst AJ, Sharrar A, Thomas BC, Hess M, Tringe SG, Banfield JF. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy.,2018, 3(7): 836–843.
[62] Ji P, Zhang Y, Wang J, Zhao F. MetaSort untangles metagenome assembly by reducing microbial community complexity.,2017, 8: 14306.
[63] Bishara A, Moss EL, Kolmogorov M, Parada AE, Weng Z, Sidow A, Dekas AE, Batzoglou S, Bhatt AS. High- quality genome sequences of uncultured microbes by assembly of read clouds.,2018, 36(11): 1067–1075.
[64] Bertrand D, Shaw J, Kalathiyappan M, Ng AHQ, Kumar MS, Li C, Dvornicic M, Soldo JP, Koh JY, Tong C, Ng OT, Barkham T, Young B, Marimuthu K, Chng KR, Sikic M, Nagarajan N. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes.,2019, 37(8): 937–944.
[65] Stewart RD, Auffret MD, Warr A, Walker AW, Roehe R, Watson M. Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery.,2019, 37(8): 953–961.
[66] Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report.,2016, 32(19): 3047–3048.
[67] Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data.,2014, 30(15): 2114–2120.
[68] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2.,2012, 9(4): 357–359.
[69] Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.,2015, 31(6): 926–932.
[70] Li D, Liu CM, Luo R, Sadakane K, Lam TW. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.,2015, 31(10): 1674–1676.
[71] Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies.,2016, 32(7): 1088–1090.
[72] Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression.,2017, 14(4): 417–419.
[73] Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND.,2015, 12(1): 59–60.
[74] Huerta-Cepas J, Szklarczyk D, Heller D, Hernández-Plaza A, Forslund SK, Cook H, Mende DR, Letunic I, Rattei T, Jensen LJ, von Mering C, Bork P. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses.,2019, 47(D1): D309–D314.
[75] Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences.,2016, 428(4): 726–731.
[76] Gibson MK, Forsberg KJ, Dantas G. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology.,2014, 9(1): 207–216.
[77] Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy.,2015, 16(1): 157.
[78] Comeau AM, Douglas GM, Langille MGI. Microbiome helper: a custom and streamlined workflow for microbiome research.,2017, 2(1): e00127-16.
[79] Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.,2010, 26(1): 139–140.
[80] Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.,2014, 15(12): 550.
[81] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies.,2015, 43(7): e47.
[82] Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models using lme4., 2014.
[83] Parks DH, Tyson GW, Hugenholtz P, Beiko RG. STAMP: statistical analysis of taxonomic and functional profiles.,2014, 30(21): 3123–3124.
[84] Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, Huttenhower C. Metagenomic biomarker discovery and explanation.,2011, 12(6): R60.
[85] Asnicar F, Weingart G, Tickle TL, Huttenhower C, Segata N. Compact graphical representation of phylogenetic data and metadata with GraPhlAn.,2015, 3: e1029.
[86] Dhariwal A, Chong J, Habib S, King IL, Agellon LB, Xia J. MicrobiomeAnalyst: a web-based tool for comprehensive statistical, visual and meta-analysis of microbiome data.,2017, 45(W1): W180–W188.
[87] Röttjers L, Faust K. From hairballs to hypotheses–biological insights from microbial networks.,2018, 42(6): 761–780.
[88] Banerjee S, Schlaeppi K, van der Heijden MGA. Keystone taxa as drivers of microbiome structure and functioning.,2018, 16(9): 567–576.
[89] Deng Y, Jiang YH, Yang Y, He Z, Luo F, Zhou J. Molecular ecological network analyses.,2012, 13(1): 113.
[90] Durno WE, Hanson NW, Konwar KM, Hallam SJ. Expanding the boundaries of local similarity analysis.,2013, 14(1): S3.
[91] Friedman J, Alm EJ. Inferring correlation networks from genomic survey data.,2012, 8(9): e1002687.
[92] Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks.,2003, 13(11): 2498–2504.
[93] Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, Raes J, Huttenhower C. Microbial co-occurrence relationships in the human microbiome.,2012, 8(7): e1002606.
[94] Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis.,2008, 9(1): 559.
[95] Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks.,2015, 11(5): e1004226.
[96] Csardi G, Nepusz T. The igraph software package for complex network research.,2006, 1695(5): 1–9.
[97] Bastian M, Heymann S, Jacomy M. Gephi: an open source software for exploring and manipulating networks. In: Third International AAAI Conference on Weblogs and Social Media, 2009.
[98] Fan K, Weisenhorn P, Gilbert JA, Shi Y, Bai Y, Chu H. Soil pH correlates with the co-occurrence and assemblage process of diazotrophic communities in rhizosphere and bulk soils of wheat fields.,2018, 121: 185–192.
[99] Wang J, Zheng J, Shi W, Du N, Xu X, Zhang Y, Ji P, Zhang F, Jia Z, Wang Y, Zheng Z, Zhang H, Zhao F. Dysbiosis of maternal and neonatal microbiota associated with gestational diabetes mellitus.,2018, 67(9): 1614–1625.
[100] Wang J, Jia Z, Zhang B, Peng L, Zhao F. Tracing the accumulation of in vivo human oral microbiota elucidates microbial community dynamics at the gateway to the GI tract.,2019: gutjnl-2019-318977.
[101] Katoh K, Standley DM. MAFFT Multiple sequence alignment software version 7: improvements in performance and usability.,2013, 30(4): 772–780.
[102] Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput.,2004, 32(5): 1792–1797.
[103] Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments.,2010, 5(3): e9490.
[104] Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies.,2015, 32(1): 268–274.
[105] Trifinopoulos J, Nguyen LT, von Haeseler A, Minh BQ. W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysis.,2016, 44(W1): W232–W235.
[106] Subramanian B, Gao S, Lercher MJ, Hu S, Chen WH. Evolview v3: a webserver for visualization, annotation, and management of phylogenetic trees.,2019, 47(W1): W270–W275.
[107] Letunic I, Bork P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments.,2019, 47(W1): W256–W259.
[108] Yu G, Smith DK, Zhu H, Guan Y, Lam TTY. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.,2017, 8(1): 28–36.
[109] LeCun Y, Bengio Y, Hinton G. Deep learning.,2015, 521: 436–444.
[110] Wilck N, Matus MG, Kearney SM, Olesen SW, Forslund K, Bartolomaeus H, Haase S, Mähler A, Balogh A, Markó L, Vvedenskaya O, Kleiner FH, Tsvetkov D, Klug L, Costea PI, Sunagawa S, Maier L, Rakova N, Schatz V, Neubert P, Frätzer C, Krannich A, Gollasch M, Grohme DA, Côrte-Real BF, Gerlach RG, Basic M, Typas A, Wu C, Titze JM, Jantsch J, Boschmann M, Dechend R, Kleinewietfeld M, Kempa S, Bork P, Linker RA, Alm EJ, Müller DN. Salt-responsive gut commensal modulates TH17 axis and disease.,2017, 551(7682): 585–589.
[111] Ren Z, Li A, Jiang J, Zhou L, Yu Z, Lu H, Xie H, Chen X, Shao L, Zhang R, Xu S, Zhang H, Cui G, Chen X, Sun R, Wen H, Lerut JP, Kan Q, Li L, Zheng S. Gut microbiome analysis as a tool towards targeted non-invasive biomarkers for early hepatocellular carcinoma.,2019, 68(6): 1014–1023.
[112] Metcalf JL, Xu ZZ, Weiss S, Lax S, van Treuren W, Hyde ER, Song SJ, Amir A, Larsen P, Sangwan N, Haarmann D, Humphrey GC, Ackermann G, Thompson LR, Lauber C, Bibat A, Nicholas C, Gebert MJ, Petrosino JF, Reed SC, Gilbert JA, Lynne AM, Bucheli SR, Carter DO, Knight R. Microbial community assembly and metabolic function during mammalian corpse decomposition.,2016, 351(6269): 158–162.
[113] Zhang J, Zhang N, Liu YX, Zhang X, Hu B, Qin Y, Xu H, Wang H, Guo X, Qian J, Wang W, Zhang P, Jin T, Chu C, Bai Y. Root microbiota shift in rice correlates with resident time in the field and developmental stage.,2018, 61(6): 613–621.
[114] Liaw A, Wiener M. Classification and regression by randomForest.,2002, 2(3): 18–22.
[115] Galkin F, Aliper A, Putin E, Kuznetsov I, Gladyshev VN, Zhavoronkov A. Human microbiome aging clocks based on deep learning and tandem of permutation feature importance and accumulated local effects.,2018, 507780.
[116] Yang C, Yang RF, Cui YJ. Bacterial genome-wide association study: methodologies and applications. Hereditas (Beijing), 2018, 40(1): 57–65. (in Chinese)
[117] Wang J, Thingholm LB, Skiecevičienė J, Rausch P, Kummen M, Hov JR, Degenhardt F, Heinsen FA, Rühlemann MC, Szymczak S, Holm K, Esko T, Sun J, Pricop-Jeckstadt M, Al-Dury S, Bohov P, Bethune J, Sommer F, Ellinghaus D, Berge RK, Hübenthal M, Koch M, Schwarz K, Rimbach G, Hübbe P, Pan WH, Sheibani-Tezerji R, Häsler R, Rosenstiel P, D'Amato M, Cloppenborg-Schmidt K, Künzel S, Laudes M, Marschall HU, Lieb W, Nöthlings U, Karlsen TH, Baines JF, Franke A. Genome-wide association analysis identifies variation in vitamin D receptor and other host factors influencing the gut microbiota.,2016, 48(11): 1396–1406.
[118] Wang J, Chen L, Zhao N, Xu X, Xu Y, Zhu B. Of genes and microbes: solving the intricacies in host genomes.,2018, 9(5): 446–461.
[119] Jin T, Wang Y, Huang Y, Xu J, Zhang P, Wang N, Liu X, Chu H, Liu G, Jiang H, Li Y, Xu J, Kristiansen K, Xiao L, Zhang Y, Zhang G, Du G, Zhang H, Zou H, Zhang H, Jie Z, Liang S, Jia H, Wan J, Lin D, Li J, Fan G, Yang H, Wang J, Bai Y, Xu X. Taxonomic structure and functional association of foxtail millet root microbiome.,2017, 6(10): 1–12.
[120] Wang Z, Lu G, Yuan M, Yu H, Wang S, Li X, Deng Y. Elevated temperature overrides the effects of N amendment in Tibetan grassland on soil microbiome.,2019, 136: 107532.
[121] Shi Y, Li Y, Xiang X, Sun R, Yang T, He D, Zhang K, Ni Y, Zhu YG, Adams JM, Chu H. Spatial scale affects the relative role of stochasticity versus determinism in soil bacterial communities in wheat fields across the North China Plain.,2018, 6(1): 27.
[122] Zhang K, Shi Y, Cui X, Yue P, Li K, Liu X, Tripathi BM, Chu H. Salinity is a key determinant for soil microbial communities in a desert ecosystem.,2019, 4(1): e00225-18.
[123] Doherty MK, Ding T, Koumpouras C, Telesco SE, Monast C, Das A, Brodmerkel C, Schloss PD. Fecal microbiota signatures are associated with response to ustekinumab therapy among Crohn’s disease patients.,2018, 9(2): e02120-17.
[124] DiGiulio DB, Callahan BJ, McMurdie PJ, Costello EK, Lyell DJ, Robaczewska A, Sun CL, Goltsman DSA, Wong RJ, Shaw G, Stevenson DK, Holmes SP, Relman DA. Temporal and spatial variation of the human microbiota during pregnancy.,2015, 112(35): 11060–11065.
[125] Garrido-Oter R, Nakano RT, Dombrowski N, Ma KW, McHardy AC, Schulze-Lefert P. Modular traits of the Rhizobiales root microbiota and their evolutionary relationship with symbiotic Rhizobia.,2018, 24(1): 155–167.e5.
[126] Castrillo G, Teixeira PL, Paredes SH, Law TF, de Lorenzo L, Feltcher ME, Finkel OM, Breakfield NW, Mieczkowski P, Jones CD, Paz-Ares J, Dangl JL. Root microbiota drive direct integration of phosphate stress and immunity.,2017, 543(7646): 513–518.
[127] Herrera Paredes S, Gao T, Law TF, Finkel OM, Mucyn T, Teixeira PJPL, Salas González I, Feltcher ME, Powers MJ, Shank EA, Jones CD, Jojic V, Dangl JL, Castrillo G. Design of synthetic bacterial communities for predictable plant phenotypes.,2018, 16(2): e2003962.
[128] Almeida A, Mitchell AL, Boland M, Forster SC, Gloor GB, Tarkowska A, Lawley TD, Finn RD. A new genomic blueprint of the human gut microbiota.,2019, 568(7753): 499–504.
[129] Vandeputte D, Kathagen G, D’hoe K, Vieira-Silva S, Valles-Colomer M, Sabino J, Wang J, Tito RY, De Commer L, Darzi Y, Vermeire S, Falony G, Raes J. Quantitative microbiome profiling links gut community variation to microbial load.,2017, 551(7681): 507–511.
[130] Stewart CJ, Ajami NJ, O’Brien JL, Hutchinson DS, Smith DP, Wong MC, Ross MC, Lloyd RE, Doddapaneni H, Metcalf GA, Muzny D, Gibbs RA, Vatanen T, Huttenhower C, Xavier RJ, Rewers M, Hagopian W, Toppari J, Ziegler AG, She JX, Akolkar B, Lernmark A, Hyoty H, Vehik K, Krischer JP, Petrosino JF. Temporal development of the gut microbiome in early childhood from the TEDDY study.,2018, 562(7728): 583–588.
[131] Meadow JF, Altrichter AE, Kembel SW, Moriyama M, O’Connor TK, Womack AM, Brown GZ, Green JL, Bohannan BJM. Bacterial communities on classroom surfaces vary with human contact.,2014, 2(1): 7.
[132] Huang AC, Jiang T, Liu YX, Bai YC, Reed J, Qu B, Goossens A, Nützmann HW, Bai Y, Osbourn A. A specialized metabolic network selectively modulates Arabidopsis root microbiota.,2019, 364(6440): eaau6389.
[133] Chen Q, Jiang T, Liu YX, Liu H, Zhao T, Liu Z, Gan X, Hallab A, Wang X, He J, Ma Y, Zhang F, Jin T, Schranz ME, Wang Y, Bai Y, Wang G. Recently duplicated sesterterpene (C25) gene clusters in Arabidopsis thaliana modulate root microbiota.,2019, 62(7): 947–958.
[134] Charalampous T, Kay GL, Richardson H, Aydin A, Baldan R, Jeanes C, Rae D, Grundy S, Turner DJ, Wain J, Leggett RM, Livermore DM, O’Grady J. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection.,2019, 37(7): 783–792.
[135] Bradley P, den Bakker HC, Rocha EPC, McVean G, Iqbal Z. Ultrafast search of all deposited bacterial and viral genomic data.,2019, 37(2): 152–159.
[136] Pasolli E, Schiffer L, Manghi P, Renson A, Obenchain V, Truong DT, Beghini F, Malik F, Ramos M, Dowd JB, Huttenhower C, Morgan M, Segata N, Waldron L. Accessible, curated metagenomic data through ExperimentHub.,2017, 14(11): 1023–1024.
[137] Vangay P, Hillmann BM, Knights D. Microbiome learning repo (ML Repo): a public repository of microbiome regression and classification tasks.,2019, 8(5).
[138] Pasolli E, Asnicar F, Manara S, Zolfo M, Karcher N, Armanini F, Beghini F, Manghi P, Tett A, Ghensi P, Collado MC, Rice BL, DuLong C, Morgan XC, Golden CD, Quince C, Huttenhower C, Segata N. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle.,2019, 176(3): 649–662.e20.
Methods and applications for microbiome data analysis
Yongxin Liu1,2, Yuan Qin1,2,3, Xiaoxuan Guo1,2, Yang Bai1,2,3
Advances in high-throughput sequencing have given rise to a series of microbiome technologies, such as amplicon, metagenomic, and metatranscriptomic sequencing, which have rapidly advanced microbiome research. Microbiome data analysis draws on a large body of background knowledge, software, and databases, which makes it difficult for researchers in the field to learn the methods and choose appropriate ones. This review systematically outlines the basic ideas of microbiome data analysis and the background knowledge needed to carry it out. It also summarizes the advantages and disadvantages of the software and databases commonly used for comparison, visualization, network analysis, evolutionary analysis, machine learning, and association analysis. The review aims to serve as a convenient and flexible guide for selecting analytical tools and suitable databases when mining the biological significance of microbiome data.
microbiome; data analysis; amplicon; metagenome; pipeline
Received: 2019-07-30; revised: 2019-08-21
Supported by the Key Research Program of Frontier Sciences of the Chinese Academy of Sciences (No. QYZDB-SSW-SMC021), the National Natural Science Foundation of China (No. 31772400), and the Key Research Program of the Chinese Academy of Sciences (No. KFZD-SW-219).
Yongxin Liu, PhD, engineer. Research interests: bioinformatics and metagenomics. E-mail: yxliu@genetics.ac.cn
Yang Bai, PhD, research professor. Research interest: root microbiome. E-mail: ybai@genetics.ac.cn
DOI: 10.16288/j.yczz.19-222
2019/9/2 16:21:23
URI: http://kns.cnki.net/kcms/detail/11.1913.R.20190902.1620.001.html
(Responsible editor: Fangqing Zhao)