賀建波 劉方東 邢光南 王吳彬 趙團(tuán)結(jié) 管榮展 蓋鈞鎰
?
限制性兩階段多位點(diǎn)全基因組關(guān)聯(lián)分析方法的特點(diǎn)與計(jì)算程序
賀建波 劉方東 邢光南 王吳彬 趙團(tuán)結(jié) 管榮展 蓋鈞鎰*
南京農(nóng)業(yè)大學(xué)大豆研究所 / 農(nóng)業(yè)部大豆生物學(xué)與遺傳育種重點(diǎn)實(shí)驗(yàn)室 / 國家大豆改良中心 / 作物遺傳與種質(zhì)創(chuàng)新國家重點(diǎn)實(shí)驗(yàn)室, 江蘇南京 210095
全基因組關(guān)聯(lián)分析(genome-wide association study, GWAS)的理論及應(yīng)用是近十幾年來國內(nèi)外數(shù)量性狀研究的熱點(diǎn), 但是以往GWAS方法注重于個(gè)別主要QTL/基因的檢測(cè)與發(fā)掘。為了相對(duì)全面地解析全基因組QTL及其等位基因構(gòu)成, 本研究提出了限制性兩階段多位點(diǎn)GWAS方法(RTM-GWAS, https://github.com/njau-sri/rtm-gwas)。RTM-GWAS首先將多個(gè)相鄰且緊密連鎖的SNP分組, 成為具有多個(gè)單倍型(復(fù)等位變異)的連鎖不平衡區(qū)段(SNPLDB)標(biāo)記, 然后采用兩階段分析策略, 基于多位點(diǎn)復(fù)等位變異遺傳模型, 在節(jié)省計(jì)算空間的條件下保障全基因組QTL及其復(fù)等位變異檢出的精確度。和以往GWAS方法相比, RTM-GWAS以性狀遺傳率為上限, 能夠較充分地檢測(cè)出QTL及其相應(yīng)的復(fù)等位變異并能有效地控制假陽性的膨脹。由其結(jié)果建立的QTL-allele矩陣代表了群體中所研究性狀的全部遺傳組成。依據(jù)這種QTL-allele矩陣的信息, 可以設(shè)計(jì)最優(yōu)基因型的遺傳組成, 預(yù)測(cè)群體中最優(yōu)化的雜交組合, 并用以進(jìn)行群體遺傳和特有與新生等位變異的研究。本研究首先對(duì)RTM-GWAS方法的特點(diǎn)和計(jì)算程序功能進(jìn)行說明, 然后通過大豆試驗(yàn)數(shù)據(jù)說明RTM-GWAS計(jì)算程序的使用方法。
限制性兩階段多位點(diǎn)全基因組關(guān)聯(lián)分析; 連鎖不平衡區(qū)段; 多位點(diǎn)模型; QTL-allele矩陣; 種質(zhì)資源群體; 優(yōu)化組合設(shè)計(jì)
基于全基因組高密度單核苷酸多態(tài)性(single- nucleotide polymorphism, SNP)分子標(biāo)記的全基因組關(guān)聯(lián)分析(genome-wide association study, GWAS)已經(jīng)成為數(shù)量性狀遺傳基礎(chǔ)解析的重要方法。GWAS充分利用了自然群體大量的歷史重組事件, 具有較高檢測(cè)精度, 已廣泛應(yīng)用于動(dòng)植物復(fù)雜性狀基因的發(fā)掘, 其理論方法與應(yīng)用也是近十幾年來數(shù)量性狀研究的熱點(diǎn)[1]。然而, 自然群體通常具有的群體未知結(jié)構(gòu)會(huì)對(duì)GWAS產(chǎn)生干擾, 從而導(dǎo)致檢測(cè)結(jié)果較高的假陽性。目前, 考慮群體結(jié)構(gòu)矯正的GWAS統(tǒng)計(jì)方法主要包括結(jié)構(gòu)關(guān)聯(lián)法(structured association, SA)[2]、主成分分析法(principal components analysis, PCA)[3]和混合模型方法(linear mixed model, LMM)[4-7]。結(jié)構(gòu)關(guān)聯(lián)法首先利用基于模型的聚類程序如STRUCTURE[8]、ADMIXTURE[9]等推斷獲得群體結(jié)構(gòu), 然后將群體結(jié)構(gòu)作為模型協(xié)變量進(jìn)行關(guān)聯(lián)測(cè)驗(yàn)。主成分分析法是將基于分子標(biāo)記的遺傳關(guān)系矩陣特征向量作為模型協(xié)變量進(jìn)行關(guān)聯(lián)測(cè)驗(yàn)。混合模型方法是在結(jié)構(gòu)關(guān)聯(lián)法和主成分分析法的基礎(chǔ)上再將遺傳背景效應(yīng)作為隨機(jī)效應(yīng)加入分析模型, 并將基于分子標(biāo)記的親屬關(guān)系矩陣作為該隨機(jī)效應(yīng)的協(xié)方差結(jié)構(gòu), 從而同時(shí)控制群體結(jié)構(gòu)和家系結(jié)構(gòu), 該方法也是目前植物GWAS的常用方法[10-13]。
植物中GWAS通常將種質(zhì)資源群體作為試驗(yàn)群體, 此類群體存在廣泛的遺傳變異, 且普遍存在復(fù)等位基因。植物育種實(shí)質(zhì)上是一個(gè)優(yōu)異等位基因的聚合過程, 因此復(fù)等位基因的檢測(cè)及其效應(yīng)估計(jì)對(duì)植物育種極其重要, 優(yōu)異等位基因的發(fā)掘不僅能為分子標(biāo)記輔助選擇提供依據(jù), 更是設(shè)計(jì)育種的前提條件[14]。然而以往GWAS方法使用的SNP分子標(biāo)記在一個(gè)標(biāo)記位點(diǎn)上僅有2個(gè)等位變異, 自然無法估計(jì)資源群體中大量存在的復(fù)等位基因效應(yīng), 從而限制了其在植物育種中的應(yīng)用。盡管GWAS在數(shù)量性狀遺傳研究中發(fā)揮了重要作用, 但由于以往GWAS方法注重于個(gè)別主要QTL/基因的檢測(cè)與發(fā)掘, 通常使用較為嚴(yán)格的顯著水平進(jìn)行多重測(cè)驗(yàn)矯正, 如Bonferroni矯正方法使用0.05/作為全基因組顯著水平, 其中為標(biāo)記數(shù)目。這種嚴(yán)格的顯著水平將可能導(dǎo)致較高的假陰性, 檢測(cè)的關(guān)聯(lián)位點(diǎn)往往僅能解釋表型變異的微小部分, 不能全面解析全基因組遺傳位點(diǎn)。例如水稻中41個(gè)性狀的GWAS結(jié)果顯示平均每個(gè)性狀僅能檢測(cè)到5個(gè)位點(diǎn), 解釋大約22%的表型變異[15-16]。因此, 為提供種質(zhì)資源群體遺傳構(gòu)成信息, 有必要相對(duì)全面地檢測(cè)全基因組QTL。另外, 上述常用GWAS方法均基于單位點(diǎn)模型, 當(dāng)控制數(shù)量性狀的位點(diǎn)有多個(gè)時(shí), 每個(gè)位點(diǎn)的效應(yīng)估計(jì)可能會(huì)受到相鄰位點(diǎn)的影響[17], 最明顯的表現(xiàn)就是導(dǎo)致位點(diǎn)表型變異解釋率估計(jì)的膨脹, 同時(shí)對(duì)小效應(yīng)位點(diǎn)的檢測(cè)功效也可能會(huì)偏低。然而由于全基因組關(guān)聯(lián)分析通常涉及海量的分子標(biāo)記, 直接將單位點(diǎn)遺傳模型擴(kuò)展至多位點(diǎn)遺傳模型的最大難題是模型中變量個(gè)數(shù)遠(yuǎn)遠(yuǎn)超過觀測(cè)值數(shù)目, 無法直接求解, 從而限制了多位點(diǎn)模型在GWAS中的應(yīng)用。針對(duì)上述GWAS在植物應(yīng)用中的局限性, He等[18]通過合并多個(gè)相鄰且緊密連鎖的SNP標(biāo)記組成具有復(fù)等位變異的SNPLDB標(biāo)記, 并基于多位點(diǎn)復(fù)等位基因模型進(jìn)行全基因組QTL檢測(cè), 提出了限制性兩階段多位點(diǎn)全基因組關(guān)聯(lián)分析方法(RTM-GWAS)。該方法解決了GWAS中單個(gè)SNP僅有兩個(gè)等位變異的限制, 更適用于存在廣泛復(fù)等位基因的種質(zhì)資源群體, 多位點(diǎn)模型通過擬合多個(gè)QTL, 提高了檢測(cè)功效, 降低了假陽性。
RTM-GWAS方法通過全面解析資源群體QTL及其復(fù)等基因, 建立資源群體的遺傳構(gòu)成, 以進(jìn)一步應(yīng)用于數(shù)量性狀基因發(fā)掘、群體遺傳分化研究以及最優(yōu)親本組合的全基因組選擇。目前, RTM-GWAS已被應(yīng)用于大豆數(shù)量性狀的全基因組遺傳解析。Zhang等[19]使用RTM-GWAS方法分析了中國大豆地方品種群體的百粒重性狀, 檢測(cè)到55個(gè)顯著關(guān)聯(lián)的SNPLDB標(biāo)記位點(diǎn), 解釋了98.5%的表型變異, 并進(jìn)一步基于55個(gè)位點(diǎn)上的263個(gè)等位變異效應(yīng)估計(jì), 預(yù)測(cè)了生態(tài)區(qū)內(nèi)及生態(tài)區(qū)間育成大粒品種的優(yōu)化親本組合。Meng等[20]使用RTM-GWAS方法分析了中國大豆地方品種異黃酮性狀, 檢測(cè)到44個(gè)顯著關(guān)聯(lián)的SNPLDB標(biāo)記位點(diǎn), 解釋了72.2%的表型變異, 同樣預(yù)測(cè)到培育高異黃酮含量的超親組合。此外, Li等[21]研究發(fā)現(xiàn)RTM-GWAS方法也適用于巢式關(guān)聯(lián)定位(nested association mapping, NAM)群體, 對(duì)包含4個(gè)重組自交家系群體的大豆NAM群體分析顯示, RTM-GWAS方法的應(yīng)用效果優(yōu)于以往方法。以上研究報(bào)告發(fā)表后, 許多讀者來信表示想深入了解RTM-GWAS方法程序及其使用方法, 因此本文說明RTM-GWAS方法的特點(diǎn)和計(jì)算程序功能, 并通過大豆試驗(yàn)數(shù)據(jù)說明RTM-GWAS計(jì)算程序的使用方法。
RTM-GWAS方法可概括為5個(gè)關(guān)鍵點(diǎn): (1)基于全基因組高密度SNP分子標(biāo)記構(gòu)建具有復(fù)等位變異的SNPLDB標(biāo)記; (2)利用SNPLDB標(biāo)記計(jì)算用于群體結(jié)構(gòu)矯正的遺傳相似系數(shù)矩陣; (3)基于兩階段多位點(diǎn)復(fù)等位基因模型檢測(cè)全基因組QTL; (4)使用普通顯著水平, 不需要進(jìn)行多重測(cè)驗(yàn)矯正; (5)性狀遺傳率作為模型位點(diǎn)表型解釋率上限。
首先依據(jù)基于連鎖不平衡置信區(qū)間的區(qū)段劃分方法定義基因組區(qū)段[21]。然后將區(qū)段內(nèi)的所有SNP合并稱為SNPLDB標(biāo)記, 區(qū)段內(nèi)各SNP組成的單倍型作為SNPLDB標(biāo)記的復(fù)等位變異, 群體內(nèi)個(gè)體的基因型由各SNP組成的單倍型確定。為了控制稀有等位基因頻率以便后續(xù)的統(tǒng)計(jì)分析, 將稀有單倍型(頻率小于0.01)替換為與其最為相似的單倍型。此處單倍型間的相似性定義為處于狀態(tài)同樣(identity- by-state) SNP個(gè)數(shù)占區(qū)段內(nèi)總SNP個(gè)數(shù)的比例。此外, 在設(shè)定的連鎖不平衡條件下, 有的區(qū)段僅含單個(gè)SNP, 這種區(qū)段也被視為一個(gè)獨(dú)立的SNPLDB標(biāo)記。因此, SNPLDB標(biāo)記有2種類型, 即包含多個(gè)SNP的SNPLDB標(biāo)記和僅包含一個(gè)SNP的SNPLDB標(biāo)記; 隨著SNP密度的增加, 這類單個(gè)SNP的區(qū)段數(shù)將相應(yīng)減少。
以往用于GWAS群體結(jié)構(gòu)矯正的基于標(biāo)記的親緣關(guān)系矩陣計(jì)算方法僅適用于SNP標(biāo)記[22-24], 不適用于具有多個(gè)等位變異的SNPLDB標(biāo)記。因此, RTM-GWAS方法將基于SNPLDB的遺傳相似系數(shù)矩陣作為群體結(jié)構(gòu)的全面估計(jì)。群體內(nèi)個(gè)體間的遺傳相似系數(shù)可定義為處于狀態(tài)同樣位點(diǎn)所占的比例, 即
其中,c定義為在第個(gè)SNPLDB上個(gè)體與個(gè)體的共有等位基因數(shù)目(取值為0, 1, 2),是SNPLDB總個(gè)數(shù)。該遺傳相似系數(shù)矩陣的特征向量可作為線性模型中的協(xié)變量以降低群體結(jié)構(gòu)對(duì)關(guān)聯(lián)分析的影響。
GWAS通常涉及數(shù)萬或數(shù)百萬的分子標(biāo)記, 直接進(jìn)行多位點(diǎn)模型擬合將導(dǎo)致模型空間過大進(jìn)而計(jì)算困難。而事實(shí)上, 大部分標(biāo)記都與目標(biāo)性狀不相關(guān), 為了有效縮減多位點(diǎn)擬合的模型空間, RTM-GWAS方法采用兩階段分析策略。簡(jiǎn)單起見, 假定群體內(nèi)個(gè)體為純合個(gè)體。第一階段, 利用基于單位點(diǎn)模型的關(guān)聯(lián)分析篩選所有SNPLDB標(biāo)記, 考慮復(fù)等位基因的線性模型可表示如下。
其中,y表示個(gè)體的表型觀測(cè)值;表示總體平均數(shù);w表示遺傳相似系數(shù)矩陣第個(gè)特征向量在個(gè)體上的系數(shù),α為第個(gè)特征向量的效應(yīng),為用于群體結(jié)構(gòu)矯正的特征向量的個(gè)數(shù);x為測(cè)驗(yàn)標(biāo)記位點(diǎn)第個(gè)等位基因?qū)τ趥€(gè)體的基因型指示變量, 取值0或1;β為第個(gè)等位基因的效應(yīng);為測(cè)驗(yàn)標(biāo)記位點(diǎn)的等位基因數(shù)目;ε為假定服從正態(tài)分布的殘差效應(yīng)。
第二階段, 基于第一階段篩選得到的SNPLDB標(biāo)記, 將模型(1)拓展為多位點(diǎn)模型進(jìn)行QTL檢測(cè), 多位點(diǎn)復(fù)等位基因模型如下。
其中,x為第個(gè)位點(diǎn)的第個(gè)等位基因在個(gè)體上的基因型指示變量, 取值0或1;β為第個(gè)位點(diǎn)的第個(gè)等位基因的效應(yīng);L為第個(gè)位點(diǎn)的等位基因數(shù)目;為總QTL數(shù)目。其他符號(hào)含義與模型(1)相同。
模型(1)可以使用回歸分析方法求解, 我們建議第一階段用相對(duì)寬松的顯著水平, 例如不小于0.05, 對(duì)標(biāo)記初步篩選, 以保證真實(shí)的位點(diǎn)不被誤判。模型(2)可以使用逐步回歸分析方法求解, 由于多位點(diǎn)模型內(nèi)建全試驗(yàn)水平誤差控制的特性, 我們建議使用常規(guī)的顯著水平, 例如0.01或0.05, 作為檢測(cè)QTL的顯著水平。由于QTL檢測(cè)基于多位點(diǎn)模型, 因此檢測(cè)的QTL所解釋的總遺傳變異應(yīng)小于群體總遺傳變異或表型變異解釋率不應(yīng)超過性狀遺傳率。
我們編制了實(shí)現(xiàn)RTM-GWAS方法的計(jì)算程序, 可從項(xiàng)目網(wǎng)站https://github.com/njau-sri/rtm-gwas/下載使用。RTM-GWAS計(jì)算程序使用C++編程語言實(shí)現(xiàn), 可運(yùn)行于Microsoft Windows、Linux和Mac OS X等主流操作系統(tǒng)平臺(tái)。借助針對(duì)不同處理器優(yōu)化的高性能線性代數(shù)運(yùn)算庫, RTM-GWAS計(jì)算程序具有較高的計(jì)算效率。RTM-GWAS計(jì)算程序擁有交互友好的圖形用戶界面和用于批量任務(wù)的命令行界面(圖1)。RTM-GWAS計(jì)算程序構(gòu)架如圖2所示, 由SNPLDB標(biāo)記構(gòu)建、遺傳相似系數(shù)矩陣計(jì)算和關(guān)聯(lián)分析三個(gè)核心模塊組成, 用戶可分別通過圖形界面或命令行界面進(jìn)行相應(yīng)的計(jì)算分析, 詳細(xì)操作說明見https://github.com/njau-sri/rtm-gwas/wiki。RTM- GWAS計(jì)算程序輸出結(jié)果以文本文件存儲(chǔ), 可使用任意本文編輯軟件查看輸出結(jié)果。
分子標(biāo)記數(shù)據(jù)采用國際通用的VCF文件格式(https://github.com/samtools/hts-specs), 該文件格式適用于各種標(biāo)記類型, 也是軟件支持較為廣泛的標(biāo)記數(shù)據(jù)格式之一, 因此便于不同軟件間協(xié)同分析。表型數(shù)據(jù)是空格或制表符作為分隔符的文本文件, 圖3所示為3個(gè)性狀的表型數(shù)據(jù)文件(僅顯示前7個(gè)材料), 文件第1行為列名, 從第2行開始每行表示一條觀測(cè)值。其中, 第1列為個(gè)體編號(hào), 其余列為不同性狀觀測(cè)值, 第1列的列名可以任意, 其余列名表示不同性狀的名稱。觀測(cè)值必須使用數(shù)值格式記錄, 缺失值可使用“NaN”、“?”、“NA”或“.”表示。對(duì)于多環(huán)境隨機(jī)區(qū)組設(shè)計(jì)試驗(yàn)數(shù)據(jù), 文件必須包含指示環(huán)境和區(qū)組因子的數(shù)據(jù)列, 列名必須是“_ENV_”和“_BLK_”, 分別表示環(huán)境和區(qū)組指示變量。
圖1 RTM-GWAS方法計(jì)算程序圖形用戶界面
圖2 RTM-GWAS方法計(jì)算程序構(gòu)架
斜體文字為相應(yīng)功能的二進(jìn)制程序名稱。
The characters in italic type are names of binary program.
圖3 表型數(shù)據(jù)文件格式
Indiv表示個(gè)體/材料名稱列名; SW、OC、PR為性狀名稱; NaN表示缺失值。
Indiv represents the name of column containing individual/ accession labels; SW, OC, PR are trait names; NaN represents missing value.
指定VCF格式的全基因組SNP基因型數(shù)據(jù)文件后即可開始計(jì)算, 計(jì)算程序?qū)⑤敵鯲CF格式的SNPLDB標(biāo)記基因型數(shù)據(jù)文件、標(biāo)記位點(diǎn)等位變異編碼信息以及基因組組塊統(tǒng)計(jì)信息。計(jì)算程序默認(rèn)單倍型頻率(Min. minor haplotype frequency)≥0.01, 區(qū)段最大長(zhǎng)度(Max. length of blocks)為200 kb, 建議設(shè)置為群體連鎖不平衡半衰距離。構(gòu)建SNPLDB的3組核心參數(shù)是, LD置信區(qū)間閾值(Lower/Upper limit CI for strong LD), 在定義強(qiáng)LD時(shí)對(duì)LD置信上下限均作最小范圍要求, 即要求下限 >70、上限 >98; 強(qiáng)重組置信區(qū)間閾值(Upper limit CI for strong recombination)上限<90; 區(qū)段內(nèi)有效強(qiáng)LD占比(Min. fraction of informative strong LD) > 0.95 (圖4)。
指定構(gòu)建的SNPLDB標(biāo)記文件(VCF格式)后即可計(jì)算, 計(jì)算程序?qū)⑤敵鲞z傳相似系數(shù)矩陣及其特征向量, 默認(rèn)輸出前10個(gè)特征向量。其中輸出的特征向量文件將作為關(guān)聯(lián)分析的協(xié)變量用于群體結(jié)構(gòu)矯正(圖5)。
圖4 SNPLDB標(biāo)記構(gòu)建對(duì)話框
VCF指以VCF格式存儲(chǔ)的基因型數(shù)據(jù)文件路徑; Min.: 最小值; Max.: 最大值; CI: 置信區(qū)間。
VCF represents the VCF genotype file path; Min.: minimum; Max.: maximum; CI: confidence interval.
圖5 遺傳相似系數(shù)計(jì)算對(duì)話框
VCF指以VCF格式存儲(chǔ)的基因型數(shù)據(jù)文件路徑。
VCF represents the VCF genotype file path.
關(guān)聯(lián)分析功能對(duì)話框需要指定SNPLDB標(biāo)記基因型數(shù)據(jù)文件(VCF格式)、數(shù)量性狀表型觀測(cè)數(shù)據(jù)以及用于群體結(jié)構(gòu)矯正的協(xié)變量(SNPLDB遺傳相似系數(shù)矩陣特征向量)數(shù)據(jù)文件, 計(jì)算程序?qū)⑤敵雠c性狀關(guān)聯(lián)的SNPLDB標(biāo)記位點(diǎn)、多位點(diǎn)模型方差分析、位點(diǎn)等位基因效應(yīng)估計(jì)等結(jié)果文件。計(jì)算程序默認(rèn)用于檢測(cè)QTL的顯著水平(significance level)為0.05, 建議設(shè)為0.01或0.05。用于標(biāo)記初步篩選(第一階段)的閾值默認(rèn)為0.05, 一般不建議修改。為防止模型過度擬合, 計(jì)算默認(rèn)設(shè)置了模型表型變異解釋率上限(Max. model r-square)為0.95, 建議設(shè)為性狀遺傳率估計(jì)值(圖6)。關(guān)聯(lián)分析程序也支持多重測(cè)驗(yàn)矯正(multiple testing correction), 包括Bonferroni (BON)和FDR兩種方法, 通常矯正后檢測(cè)的位點(diǎn)也包含于未矯正的結(jié)果。另外, 關(guān)聯(lián)分析程序還支持多環(huán)境試驗(yàn)原始表型數(shù)據(jù)的計(jì)算, 計(jì)算程序默認(rèn)能夠檢測(cè)QTL與環(huán)境互作效應(yīng), 但是當(dāng)基因型與環(huán)境互作方差相對(duì)較小時(shí), 可以排除QTL與環(huán)境互作效應(yīng)(genotype-environment interaction), 以降低統(tǒng)計(jì)模型的復(fù)雜度。
圖6 關(guān)聯(lián)分析功能對(duì)話框
VCF指以VCF格式存儲(chǔ)的基因型數(shù)據(jù)文件路徑; Max.: 最大值; r-square: 模型決定系數(shù)。
VCF represents the VCF genotype file path; Max.: maximum; r-square: coefficient of determination.
以下以中國栽培大豆資源群體株高試驗(yàn)結(jié)果的全基因組關(guān)聯(lián)分析為例說明RTM-GWAS方法的應(yīng)用, 詳細(xì)應(yīng)用可參考已發(fā)表的文獻(xiàn)[18-21]。
參試的723份栽培大豆來自中國大豆種質(zhì)資源群體, 分別于2013年和2014年進(jìn)行田間試驗(yàn), 采用隨機(jī)區(qū)組試驗(yàn)設(shè)計(jì), 設(shè)3次重復(fù), 品種成熟后測(cè)量株高。試驗(yàn)群體株高變異范圍15~165 cm, 平均62.78 cm, 變異系數(shù)41.57%。2年試驗(yàn)株高誤差變異系數(shù)14.33%, 遺傳率0.921, 基因與環(huán)境互作遺傳率0.049?;蛐蛿?shù)據(jù)來自用RAD-seq (restriction site-associated DNA sequencing)技術(shù)對(duì)該群體進(jìn)行的基因型分型[18]。通過序列比對(duì)將測(cè)序片段比對(duì)到大豆Williams 82參考基因組并進(jìn)行SNP鑒別, SNP質(zhì)量控制采用的過濾標(biāo)準(zhǔn)為缺失和雜合率小于或等于20%, 最小等位基因頻率大于或等于1%。最后使用fastPHASE軟件[26]對(duì)SNP缺失基因型進(jìn)行填補(bǔ), 獲得了145 558個(gè)覆蓋全基因組的高質(zhì)量SNP標(biāo)記。
首先, 基于全基因組145 558個(gè)SNP標(biāo)記, 利用RTM-GWAS計(jì)算程序進(jìn)行SNPLDB標(biāo)記構(gòu)建, 采用程序默認(rèn)參數(shù)。程序輸出了36 952個(gè)SNPLDB標(biāo)記的VCF格式基因型數(shù)據(jù)文件, 用于后續(xù)所有分析。其次, 基于構(gòu)建好的SNPLDB標(biāo)記, 利用RTM- GWAS計(jì)算程序計(jì)算群體內(nèi)個(gè)體間的遺傳相似系數(shù)矩陣, 并提取特征值最大的前10個(gè)特征向量作為控制群體結(jié)構(gòu)的協(xié)變量。最后, 基于大豆株高表型原始觀測(cè)值、SNPLDB標(biāo)記數(shù)據(jù)以及遺傳相似系數(shù)矩陣特征向量, 利用RTM-GWAS計(jì)算程序關(guān)聯(lián)分析功能對(duì)大豆株高進(jìn)行全基因組QTL檢測(cè)并估計(jì)QTL等位基因的效應(yīng)。QTL檢測(cè)的顯著水平設(shè)為0.01, 模型解釋率上限設(shè)置為0.921, 其他參數(shù)保持默認(rèn)。關(guān)聯(lián)分析程序輸出5個(gè)結(jié)果文件, 關(guān)聯(lián)標(biāo)記位點(diǎn)名稱文件assoc.out.loc、I型模型方差分析文件assoc.out.aov1、III型模型方差分析文件assoc.out. aov3、等位基因效應(yīng)估計(jì)文件assoc.out.est、標(biāo)記位點(diǎn)統(tǒng)計(jì)檢驗(yàn)概率值(-value)文件assoc.out.ps。
基于所有標(biāo)記位點(diǎn)關(guān)聯(lián)測(cè)驗(yàn)值結(jié)果數(shù)據(jù)文件, 可使用繪圖軟件, 如R軟件(http://www.r-project. org/), 繪制Q-Q圖(圖7)和Manhattan圖(圖8)。用RTM-GWAS方法共檢測(cè)到114個(gè)SNPLDB位點(diǎn)與株高性狀關(guān)聯(lián), 根據(jù)方差分析結(jié)果數(shù)據(jù)文件, 其中10個(gè)位點(diǎn)主效不顯著, 63個(gè)位點(diǎn)與環(huán)境互作效應(yīng)不顯著。104個(gè)主效顯著的位點(diǎn)總表型變異解釋率為78.103%, 51個(gè)顯著的位點(diǎn)與環(huán)境互作效應(yīng)總表型變異解釋率為10.312%, 其中有21個(gè)位點(diǎn)主效表型變異解釋率高于1%, 將結(jié)果進(jìn)一步整理為表1, 所有位點(diǎn)關(guān)聯(lián)結(jié)果詳見附件表1。RTM-GWAS分析程序不會(huì)輸出位點(diǎn)貢獻(xiàn)率, 但可以根據(jù)輸出的平方和分解計(jì)算位點(diǎn)貢獻(xiàn)率, 即位點(diǎn)平方和占總平方和的比例。
圖7 大豆株高全基因組關(guān)聯(lián)分析Q-Q圖
全國從東北到西南大豆資源的株高在南京表現(xiàn)出有114個(gè)位點(diǎn)的差異, 年份間有波動(dòng)。關(guān)聯(lián)的114個(gè)SNPLDB位點(diǎn)共有442個(gè)等位變異, 其中主效顯著的104個(gè)位點(diǎn)共有417個(gè)等位變異。由于本研究大豆株高遺傳基礎(chǔ)以主效位點(diǎn)為主, 簡(jiǎn)便起見, 本文則以主效位點(diǎn)為例進(jìn)行后續(xù)分析, 如要考慮特定環(huán)境的分析, 可將位點(diǎn)主效應(yīng)與相應(yīng)環(huán)境互作效應(yīng)相加后再進(jìn)行分析。根據(jù)主效顯著的104個(gè)位點(diǎn)效應(yīng)估計(jì)結(jié)果, 等位變異效應(yīng)范圍為-43.55~ +38.26, 結(jié)合群體SNPLDB基因型, 可進(jìn)一步將等位變異效應(yīng)整理為位點(diǎn)×材料(104×417)的QTL-allele矩陣作為群體株高性狀的遺傳構(gòu)成(圖9), 可以進(jìn)一步使用繪圖軟件將該矩陣可視化(圖10)。
對(duì)于723個(gè)材料所有可能261 003個(gè)單交組合, 通過F1連續(xù)自交分別模擬產(chǎn)生2000個(gè)純系后代基因型, 依據(jù)包括104個(gè)位點(diǎn)效應(yīng)的株高QTL-allele矩陣計(jì)算所有QTL基因型值總和, 作為后代基因型值預(yù)測(cè)值。親本和親本組合的表型預(yù)測(cè)值為y=g+ (y?g+y?g)/2, 其中g為組合后代基因型預(yù)測(cè)值,y和y分別為雙親表型觀測(cè)值,g和g分別為雙親基因型值預(yù)測(cè)值。所有模擬計(jì)算通過編制的計(jì)算程序Cross (https://github.com/njau-sri/cross)完成, 計(jì)算程序?qū)⑤敵鏊袉谓唤M合后代純合群體的株高性狀描述統(tǒng)計(jì)數(shù)據(jù)(圖11), 可根據(jù)實(shí)際需求在計(jì)算程序中設(shè)置用于篩選優(yōu)化組合的百分位數(shù)統(tǒng)計(jì)量, 計(jì)算程序默認(rèn)輸出第1 (最小值)、25 (Q1)、50 (中位數(shù))、75 (Q3)、100 (最大值)百分位數(shù), 本文設(shè)置第10、第50、第90百分位數(shù)作為選擇依據(jù)。后代純合群體第10、第50、第90百分位數(shù)使用其他繪圖軟件繪制的散點(diǎn)圖, 可以看出第10、第90百分位數(shù)均有超親組合出現(xiàn)(圖12)。以高桿大豆育種為例, 按照第90百分位數(shù)篩選優(yōu)化組合, 可篩選出101個(gè)親本組合預(yù)測(cè)株高大于165 cm的組合, 詳見附件表2, 其中預(yù)測(cè)株高大于180 cm的親本組合有8個(gè)(表2), 預(yù)測(cè)株高最高可達(dá)183 cm, 相比親本株高165 cm提高了10.9% (18 cm)。
圖8 大豆株高全基因組關(guān)聯(lián)分析Manhattan圖
表1 大豆株高顯著關(guān)聯(lián)的大效應(yīng)SNPLDB標(biāo)記位點(diǎn)
大效應(yīng)位點(diǎn)指表型變異解釋率大于1%的位點(diǎn);2: 表型變異解釋率;a: 為QTL檢測(cè)模型顯著性測(cè)驗(yàn);b: 為QTL與互作效應(yīng)。
Locus with a phenotypic variance explained greater than 1% was considered as a large effect locus;2: phenotypic variance explained;a: statistical hypothesis testing performed in QTL detection model;b: QTL-by-environment interaction effect.
圖9 大豆株高主效QTL-allele矩陣數(shù)據(jù)文件
行表示104個(gè)主效顯著的株高關(guān)聯(lián)位點(diǎn), 列表示723份栽培大豆材料, 數(shù)據(jù)為104×723的主效位點(diǎn)等位基因效應(yīng)矩陣。
Rows represent the 104 loci and columns represent the 723 accessions, the data are allele effects and presented in 104×723 matrix.
圖10 大豆株高QTL-allele可視化矩陣
圖11 株高性狀所有親本組合后代預(yù)測(cè)結(jié)果文件
P1、P2分別表示單交組合的2個(gè)親本; MEAN、SD分別表示組合純合后代群體株高平均數(shù)和標(biāo)準(zhǔn)差; P10、P50、P90分別表示組合純合后代群體株高第10、第50、第90百分位數(shù)。
P1 and P2 are labels of parental accessions; MEAN and SD indicate the mean and standard deviation of homozygous progeny population; P10, P50, and P90 are 10-th, 50-th, and 90-th percentiles of homozygous progeny population.
圖12 株高性狀所有親本組合后代預(yù)測(cè)結(jié)果可視化
虛線表示親本群體株高變異范圍(15~165 cm)。
Dotted lines indicate the range (15–165 cm) of plant height in parental population.
和以往GWAS方法專注于個(gè)別主效QTL的檢測(cè)不同, RTM-GWAS方法能夠相對(duì)全面地解析植物種質(zhì)/育種群體數(shù)量性狀的QTL體系。首先, 以往GWAS均基于僅有的2個(gè)等位變異的SNP標(biāo)記, 而無法檢測(cè)一個(gè)遺傳位點(diǎn)上多個(gè)復(fù)等位基因。對(duì)于一個(gè)遺傳位點(diǎn)上存在多個(gè)復(fù)等位基因的情況, 單個(gè)SNP標(biāo)記測(cè)驗(yàn)僅能解釋遺傳位點(diǎn)的部分遺傳變異, 理論上統(tǒng)計(jì)功效自然會(huì)偏低, 從而會(huì)降低GWAS檢測(cè)功效。RTM-GWAS方法通過構(gòu)建具有復(fù)等位變異的SNPLDB標(biāo)記來匹配具有復(fù)等位基因的位點(diǎn), 理論上使得GWAS更適用于存在廣泛復(fù)等位基因的種質(zhì)資源群體。其次, 以往基于單位點(diǎn)模型的GWAS方法由于忽略了其他位點(diǎn)的影響, 可能導(dǎo)致較高假陽性的檢測(cè)結(jié)果。然而由于GWAS通?;谌蚋呙芏确肿訕?biāo)記, 直接擬合多位點(diǎn)模型將導(dǎo)致計(jì)算量過大, 影響計(jì)算效率。RTM-GWAS方法結(jié)合兩階段分析策略和多位點(diǎn)模型, 不僅能夠同時(shí)擬合多個(gè)具有不等數(shù)目等位基因的遺傳位點(diǎn), 提高了檢測(cè)功效, 還大幅降低了計(jì)算量, 提高了計(jì)算效率, 使得RTM-GWAS方法可以應(yīng)用于大規(guī)模GWAS數(shù)據(jù)的分析。
表2 高桿大豆育種的8個(gè)預(yù)測(cè)最優(yōu)組合
P1、P2分別表示單交組合親本代號(hào); Y1、Y2分別表示P1和P2的株高觀測(cè)平均數(shù); P10、P50、P90分別表示第10、50、90百分位數(shù)。
P1 and P2 indicate the two parent labels of single cross; Y1 and Y2 indicate the means of observed plant height; P10, P50, and P90 are 10-th, 50-th, and 90-th percentiles of homozygous progeny population.
植物中表型鑒定試驗(yàn)通常是多個(gè)環(huán)境的重復(fù)試驗(yàn), 以往主流GWAS方法通常不支持多環(huán)境表型數(shù)據(jù)聯(lián)合分析, 而是將基因型多環(huán)境調(diào)整平均數(shù)作為GWAS分析的表型, 不僅無法分析主效QTL與環(huán)境互作效應(yīng), 更無法檢測(cè)僅有互作效應(yīng)而沒有主效的QTL。RTM-GWAS方法計(jì)算程序支持多環(huán)境隨機(jī)區(qū)組試驗(yàn)設(shè)計(jì)的原始表型數(shù)據(jù)分析, 能夠同時(shí)擬合主效和非主效QTL與環(huán)境互作效應(yīng), 檢測(cè)結(jié)果更加全面。另外, 模擬分析顯示應(yīng)用RTM-GWAS方法的樣本容量需要足夠大(例如, >400)且性狀遺傳率也應(yīng)較高(例如, >0.8), 因此表型鑒定需要進(jìn)行合理的試驗(yàn)設(shè)計(jì)以及精確的試驗(yàn)操作[18]。
RTM-GWAS方法不僅適用于種質(zhì)資源群體, 也適用于多親本的NAM群體[21]。RTM-GWAS方法通過構(gòu)建具有親本單倍型的SNPLDB標(biāo)記, 將NAM群體內(nèi)不同RIL群體視為一個(gè)自然的整體, 每個(gè)標(biāo)記位點(diǎn)具有不同的等位變異類型, 而不是像以往分析將RIL群體視為彼此獨(dú)立的群體[27]。由于NAM群體的遺傳設(shè)計(jì)特點(diǎn), 其群體結(jié)構(gòu)已知, 因此RTM-GWAS方法可以較好地控制群體結(jié)構(gòu), 從而獲得較高的檢測(cè)功效和較低的假發(fā)現(xiàn)率。潘麗媛等(未發(fā)表)也將RTM-GWAS方法用于大豆RIL群體的QTL定位, 比較結(jié)果顯示, RTM-GWAS方法不僅覆蓋了復(fù)合區(qū)間作圖法的定位結(jié)果, 還檢測(cè)到更多的已報(bào)道微效QTL。由于RTM-GWAS方法是對(duì)標(biāo)記位點(diǎn)進(jìn)行檢驗(yàn), 無法對(duì)標(biāo)記區(qū)間內(nèi)的任意位置進(jìn)行檢驗(yàn), 因此NAM和RIL群體必須進(jìn)行全基因組SNP標(biāo)記鑒定才可能進(jìn)行全基因組QTL的檢測(cè)。
RTM-GWAS方法較高的檢測(cè)功效使得其檢測(cè)結(jié)果可以全面反映群體數(shù)量性狀遺傳構(gòu)成, 從而能夠進(jìn)一步從全基因組QTL水平對(duì)育種親本組合進(jìn)行潛力預(yù)測(cè)及優(yōu)化組合設(shè)計(jì), 在實(shí)際育種前直接對(duì)QTL進(jìn)行育種選擇?;赗TM-GWAS方法獲得的QTL-allele矩陣, 可通過設(shè)計(jì)分子標(biāo)記進(jìn)一步應(yīng)用于雙親后代選擇, 從而提高選擇效率、縮短育種周期?;谌蚪MQTL的組合預(yù)測(cè)和后代選擇是對(duì)QTL直接選擇, 不同于傳統(tǒng)全基因組選擇(genomic selection, GS)方法對(duì)全基因組分子標(biāo)記進(jìn)行選擇[28]。傳統(tǒng)GS需要對(duì)選擇世代進(jìn)行全基因組分子標(biāo)記測(cè)定, 目前對(duì)于植物育種花費(fèi)十分高昂。GS訓(xùn)練群體與選擇群體的遺傳關(guān)系以及預(yù)測(cè)模型構(gòu)建方法會(huì)直接影響選擇效率, 主要應(yīng)用于組合后代的選擇, 把GS直接應(yīng)用于優(yōu)化組合設(shè)計(jì)會(huì)由于需要對(duì)組合后代進(jìn)行全基因組標(biāo)記模擬, 進(jìn)而導(dǎo)致難以接受的計(jì)算量。
此外, 基于全基因組QTL及其等位基因構(gòu)成信息, 還可以從QTL水平上刻畫群體遺傳特征, 進(jìn)行群體分化以及群體間進(jìn)化關(guān)系的研究。例如Meng等[20]對(duì)RTM-GWAS檢測(cè)的44個(gè)異黃酮含量QTL進(jìn)行生態(tài)區(qū)基因頻率的分析結(jié)果顯示, 84.1% (37個(gè))的位點(diǎn)基因頻率在生態(tài)區(qū)間存在顯著差異, 而在全基因組29 119個(gè)SNPLDB標(biāo)記水平上, 則只有50.6% (14 735個(gè))的位點(diǎn)基因頻率在生態(tài)區(qū)間存在顯著差異, 進(jìn)一步說明了異黃酮含量遺傳構(gòu)成在生態(tài)區(qū)上發(fā)生了分化。
本研究提出的RTM-GWAS方法通過將多個(gè)相鄰且緊密連鎖的SNP分組, 構(gòu)建具有復(fù)等位變異的SNPLDB標(biāo)記, 然后采用兩階段分析策略, 基于多位點(diǎn)模型檢測(cè)全基因組QTL及其復(fù)等位變異。和以往GWAS方法相比, RTM-GWAS方法能較充分地檢測(cè)出QTL及其相應(yīng)的復(fù)等位變異, 并能有效地控制假陽性。由其結(jié)果建立的QTL-allele矩陣代表了試驗(yàn)群體中所研究性狀的全部遺傳構(gòu)成, 不僅可用于設(shè)計(jì)最優(yōu)基因型的遺傳組成, 預(yù)測(cè)最優(yōu)雜交組合, 還能用于群體遺傳和特有與新生等位變異的研究。
[1] Visscher P M, Wray N R, Zhang Q, Sklar P, McCarthy M I, Brown M A, Yang J. 10 Years of GWAS discovery: biology, function, and translation., 2017, 101: 5–22
[2] Pritchard J K, Stephens M, Rosenberg N A, Donnelly P. Association mapping in structured populations., 2000, 67: 170–181
[3] Price A L, Patterson N J, Plenge R M, Weinblatt M E, Shadick N A, Reich D. Principal components analysis corrects for stratification in genome-wide association studies., 2006, 38: 904–909
[4] Yu J, Pressoir G, Briggs W H, Vroh Bi I, Yamasaki M, Doebley J F, McMullen M D, Gaut B S, Nielsen D M, Holland J B, Kresovich S, Buckler E S. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness., 2006, 38: 203–208
[5] Kang H M, Zaitlen N A, Wade C M, Kirby A, Heckerman D, Daly M J, Eskin E. Efficient control of population structure in model organism association mapping., 2008, 178: 1709–1723
[6] Kang H M, Sul J H, Service S K, Zaitlen N A, Kong S Y, Freimer N B, Sabatti C, Eskin E. Variance component model to account for sample structure in genome-wide association studies., 2010, 42: 348–354
[7] Zhang Z, Ersoz E, Lai C Q, Todhunter R J, Tiwari H K, Gore M A, Bradbury P J, Yu J, Arnett D K, Ordovas J M, Buckler E S. Mixed linear model approach adapted for genome-wide association studies., 2010, 42: 355–360
[8] Pritchard J K, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data., 2000, 155: 945–959
[9] Alexander D H, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals., 2009, 19: 1655–1664
[10] Atwell S, Huang Y S, Vilhjalmsson B J, Willems G, Horton M, Li Y, Meng D, Platt A, Tarone A M, Hu T T, Jiang R, Muliyati N W, Zhang X, Amer M A, Baxter I, Brachi B, Chory J, Dean C, Debieu M, de Meaux J, Ecker J R, Faure N, Kniskern J M, Jones J D, Michael T, Nemri A, Roux F, Salt D E, Tang C, Todesco M, Traw M B, Weigel D, Marjoram P, Borevitz J O, Bergelson J, Nordborg M. Genome-wide association study of 107 phenotypes ininbred lines., 2010, 465: 627–631
[11] Huang X, Wei X, Sang T, Zhao Q, Feng Q, Zhao Y, Li C, Zhu C, Lu T, Zhang Z, Li M, Fan D, Guo Y, Wang A, Wang L, Deng L, Li W, Lu Y, Weng Q, Liu K, Huang T, Zhou T, Jing Y, Li W, Lin Z, Buckler E S, Qian Q, Zhang Q F, Li J, Han B. Genome-wide association studies of 14 agronomic traits in rice landraces., 2010, 42: 961–967
[12] Li H, Peng Z, Yang X, Wang W, Fu J, Wang J, Han Y, Chai Y, Guo T, Yang N, Liu J, Warburton M L, Cheng Y, Hao X, Zhang P, Zhao J, Liu Y, Wang G, Li J, Yan J. Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels., 2013, 45: 43–50
[13] Fang C, Ma Y M, Wu S W, Liu Z, Wang Z, Yang R, Hu G H, Zhou Z K, Yu H, Zhang M, Pan Y, Zhou G A, Ren H X, Du W Q, Yan H R, Wang Y P, Han D Z, Shen Y T, Liu S L, Liu T F, Zhang J X, Qin H, Yuan J, Yuan X H, Kong F J, Liu B H, Li J Y, Zhang Z W, Wang G D, Zhu B G, Tian Z X. Genome-wide association studies dissect the genetic networks underlying agronomical traits in soybean., 2017, 18: 161
[14] Peleman J D, van der Voort J R. Breeding by design., 2003, 8: 330–334
[15] Huang X, Wei X, Sang T, Zhao Q, Feng Q, Zhao Y, Li C, Zhu C, Lu T, Zhang Z, Li M, Fan D, Guo Y, Wang A, Wang L, Deng L, Li W, Lu Y, Weng Q, Liu K, Huang T, Zhou T, Jing Y, Li W, Lin Z, Buckler E S, Qian Q, Zhang Q F, Li J, Han B. Genome-wide association studies of 14 agronomic traits in rice landraces., 2010, 42: 961–967
[16] Zhao K, Tung C W, Eizenga G C, Wright M H, Ali M L, Price A H, Norton G J, Islam M R, Reynolds A, Mezey J, McClung A M, Bustamante C D, McCouch S R. Genome-wide association mapping reveals a rich genetic architecture of complex traits in., 2011, 2: 467
[17] Zeng Z B. Precision mapping of quantitative trait loci., 1994, 136: 1457–1468
[18] He J, Meng S, Zhao T, Xing G, Yang S, Li Y, Guan R, Lu J, Wang Y, Xia Q, Yang B, Gai J. An innovative procedure of genome-wide association analysis fits studies on germplasm population and plant breeding., 2017, 130: 2327–2343
[19] Zhang Y, He J, Wang Y, Xing G, Zhao J, Li Y, Yang S, Palmer R G, Zhao T, Gai J. Establishment of a 100-seed weight quantitative trait locus-allele matrix of the germplasm population for optimal recombination design in soybean breeding programmes., 2015, 66: 6311–6325
[20] Meng S, He J, Zhao T, Xing G, Li Y, Yang S, Lu J, Wang Y, Gai J. Detecting the QTL-allele system of seed isoflavone content in Chinese soybean landrace population for optimal cross design and gene system exploration., 2016, 129: 1557–1576
[21] Li S, Cao Y, He J, Zhao T, Gai J. Detecting the QTL-allele system conferring flowering date in a nested association mapping population of soybean using a novel procedure., 2017, 130: 2297–2314
[22] Gabriel S B, Schaffner S F, Nguyen H, Moore J M, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero S N, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander E S, Daly M J, Altshuler D. The structure of haplotype blocks in the human genome., 2002, 296: 2225–2229
[23] Patterson N, Price A L, Reich D. Population structure and eigenanalysis., 2006, 2: e190
[24] Price A L, Patterson N J, Plenge R M, Weinblatt M E, Shadick N A, Reich D. Principal components analysis corrects for stratification in genome-wide association studies., 2006, 38: 904–909
[25] VanRaden P M. Efficient methods to compute genomic predictions., 2008, 91: 4414–4423
[26] Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase., 2006, 78: 629–644
[27] Buckler E S, Holland J B, Bradbury P J, Acharya C B, Brown P J, Browne C, Ersoz E, Flint-Garcia S, Garcia A, Glaubitz J C, Goodman M M, Harjes C, Guill K, Kroon D E, Larsson S, Lepak N K, Li H, Mitchell S E, Pressoir G, Peiffer J A, Rosas M O, Rocheford T R, Romay M C, Romero S, Salvo S, Sanchez Villeda H, da Silva H S, Sun Q, Tian F, Upadyayula N, Ware D, Yates H, Yu J, Zhang Z, Kresovich S, McMullen M D. The genetic architecture of maize flowering time., 2009, 325: 714–718
[28] Meuwissen T H, Hayes B J, Goddard M E. Prediction of total genetic value using genome-wide dense marker maps., 2001, 157: 1819–1829
附表1 大豆株高顯著關(guān)聯(lián)的SNPLDB標(biāo)記位點(diǎn)
Supplementary table 1 SNPLDBs significantly associated with soybean plant height
SNPLDB染色體Chromosome位置PositionModel aQTLQTL×Env. b–lgP–lgPR2 (%)–lgPR2 (%) LDB_19_449646301944964630–4502958458.26206.017.36152.861.693 LDB_6_44183574644183574–4428124846.46199.507.35911.640.495 LDB_16_8004288168004288–820384542.01171.866.1406.150.280 LDB_13_115392121311539212–1162599028.64120.364.0910.660.053 LDB_17_364748801736474880–3649465222.8196.333.2290.800.059 LDB_4_37782684437782684–3792309322.0683.362.8093.390.170 LDB_8_707513987075139–707709126.1981.672.61616.090.505 LDB_3_26698545326698545–2689826717.2468.852.4234.560.261 LDB_15_167739821516773982–1677401022.0972.652.2735.220.154 LDB_16_288388741628838874–2886811816.4456.111.8871.010.077 LDB_4_11093449411093449–1119212013.6151.181.8003.130.195 LDB_2_586388825863888–597903114.8847.631.4931.060.042 LDB_16_749468116749468116.2148.321.4401.040.018 LDB_1_5027790215027790217.3447.471.4138.980.239 LDB_14_246747514246747515.5745.661.3560.880.014 LDB_4_29936477429936477–2995080314.7438.981.2175.630.186 LDB_3_22147965322147965–2234269911.9832.081.20911.340.516 LDB_7_20253563720253563–2045160713.8336.681.1742.680.108 LDB_14_460956341446095634–4610657012.6735.501.0760.280.008 LDB_3_819777638197776–820246611.5132.011.0261.050.052 LDB_6_22108685622108685–2219136010.9728.351.0045.030.241 LDB_6_127150261271502–127515912.0432.960.9970.610.018 LDB_12_34374105123437410512.3532.820.9560.400.005 LDB_5_27543725527543725–2754737711.2527.310.8511.500.056 LDB_5_408220954082209–412172110.6026.740.8342.200.079 LDB_4_16641482416641482–168397728.7222.210.7862.840.150 LDB_19_392408441939240844–3927696711.0123.490.7595.160.188 LDB_5_1564858851564858810.8525.480.7310.140.001 LDB_4_12429588412429588–126051139.9320.000.7156.460.276 LDB_9_4579134094579134010.3322.420.6380.560.008 LDB_3_36428638336428638–364693368.5619.150.6241.440.066 LDB_16_2807415162807415–28274929.4817.290.5886.020.231
(續(xù)附表1)
SNPLDB染色體Chromosome位置PositionModel aQTLQTL×Env. b–lgP–lgPR2 (%)–lgPR2 (%) LDB_11_356942471135694247–356985849.1318.500.5530.300.009 LDB_7_35303124735303124–353272317.3415.830.5421.430.076 LDB_15_120077681512007768–120115017.6816.510.4920.180.005 LDB_7_32985427732985427–331847126.6813.540.4890.740.057 LDB_2_391138823911388–39281077.1414.070.4661.340.062 LDB_9_20673268920673268–206926096.8713.820.4360.330.016 LDB_16_11517491611517499.3415.530.4324.630.115 LDB_19_446696551944669655–447542878.4013.620.4307.170.233 LDB_5_9070015907001–9070427.8415.210.4230.130.001 LDB_9_806260098062600–81166596.6011.790.3741.120.044 LDB_8_19152075819152075–191521007.4813.340.3670.920.015 LDB_10_5777915105777915–57986267.4610.730.3625.670.204 LDB_19_445505871944550587–445582897.3011.070.3523.420.117 LDB_20_3408918820340891887.5712.420.3401.110.020 LDB_5_2803587528035876.9012.300.3360.350.004 LDB_8_16373446816373446–164208764.9610.500.3340.320.016 LDB_18_574527051857452705–574572396.2610.950.3250.330.010 LDB_7_250968307250968306.8911.750.3200.360.004 LDB_9_38183930938183930–382688005.667.740.3194.070.194 LDB_12_9936416129936416–99945384.458.400.3070.790.050 LDB_3_33432777333432777–334620716.409.520.3051.190.046 LDB_5_1382138513821386.6610.460.2830.700.010 LDB_6_330233216330233216.0310.090.2720.390.004 LDB_11_3516227011351622707.3110.070.2715.100.128 LDB_18_2747814218274781426.1610.060.2710.460.006 LDB_17_337081181733708118–338150887.037.770.2516.960.226 LDB_5_38350041538350041–384323566.767.190.2335.510.182 LDB_11_6370581116370581–64860155.646.030.2313.870.161 LDB_16_319590331631959033–319999634.277.110.2310.460.021 LDB_18_449428781844942878–450645113.285.380.2250.480.044 LDB_2_12669763212669763–126842074.485.690.2041.200.057 LDB_6_415235786415235785.417.710.2030.810.013 LDB_17_157821641715782164–158459773.223.920.1890.780.066 LDB_4_42809656442809656–428096705.275.790.1712.760.081 LDB_9_252442099252442094.206.560.1700.200.002 LDB_3_432453393432453394.916.560.1700.250.002 LDB_15_92026811592026815.406.240.1603.370.079 LDB_20_3498125203498125–36915284.333.820.1592.930.129 LDB_17_113671271711367127–113820093.934.760.1591.850.068 LDB_17_325202751732520275–325528713.032.750.1471.720.107 LDB_16_3291894616329189464.255.590.1421.190.022 LDB_12_34561481234561484.195.440.1371.630.033 LDB_8_6056566860565665.035.370.1361.850.038 LDB_4_955068749550687–95509985.034.570.1353.770.111
(續(xù)附表1)
SNPLDB染色體Chromosome位置PositionModel aQTLQTL×Env. b–lgP–lgPR2 (%)–lgPR2 (%) LDB_5_228526552285265–22852963.314.500.1331.000.030 LDB_15_275233981527523398–277233433.843.020.1323.470.147 LDB_13_45997361345997363.304.780.1190.100.000 LDB_9_10672160910672160–108312075.874.010.1186.770.200 LDB_1_50652928150652928–506691373.702.970.1172.840.113 LDB_18_5266917918526691793.924.710.1171.550.031 LDB_6_188451906188451903.304.490.1111.130.020 LDB_13_1768117013176811703.514.260.1040.250.002 LDB_4_15008106415008106–150302612.722.850.0991.580.059 LDB_20_3955658020395565802.343.880.0940.610.009 LDB_16_2009914116200991412.273.870.0930.180.001 LDB_14_2054166714205416673.053.800.0910.560.008 LDB_15_32428801532428808.413.740.09013.780.380 LDB_1_527503121527503122.933.720.0890.250.002 LDB_16_2124799616212479963.853.590.0851.900.040 LDB_3_284586663284586662.673.260.0760.890.015 LDB_18_39339461839339463.533.140.0731.650.033 LDB_7_15901391715901391–159032814.332.930.0674.150.101 LDB_19_2090261192090261–20906114.432.920.0674.400.108 LDB_7_27922333727922333–281217793.451.470.0673.330.129 LDB_5_416790205416790204.152.900.0673.780.091 LDB_18_574974341857497434–575003293.292.790.0641.070.019 LDB_4_57169345716932.402.630.0591.660.034 LDB_6_46436793646436793–464368332.302.560.0571.050.018 LDB_20_3231470320323147033.452.400.0532.980.069 LDB_8_427525008427525004.232.400.0533.740.090 LDB_4_437577064437577062.862.400.0532.380.052 LDB_11_319943741131994374–319944362.282.310.0510.760.012 LDB_4_164362541643625–17440932.391.650.0493.060.090 LDB_16_82040991682040992.682.080.0441.630.033 LDB_13_3347816413334781642.851.790.0373.210.075 LDB_14_487994911448799491–487994964.121.200.0354.550.134 LDB_7_166304427166304423.581.550.0313.760.090 LDB_3_5207322352073224.231.450.0284.940.123 LDB_7_238699707238699703.281.260.0243.410.081 LDB_1_3351751133517513.420.890.0153.410.081 LDB_7_302788467302788464.230.600.0086.120.157 LDB_1_248958781248958782.400.360.0043.140.073 Total114104c78.10351d10.312
a: QTL檢測(cè)模型顯著性測(cè)驗(yàn);b: QTL與環(huán)境互作效應(yīng);c: 主效測(cè)驗(yàn)顯著位點(diǎn)數(shù)目;d: 互作效應(yīng)測(cè)驗(yàn)顯著位點(diǎn)數(shù)目。
a: statistical hypothesis testing performed in QTL detection model;b: QTL-by-environment interaction effect;c: number of QTL with significant main effect;d: number of QTL with significant QTL-by-environment interaction effect.
附表2 高桿大豆育種化組合設(shè)計(jì)
Supplementary table 2 Optimal cross design of tall soybean breeding
親本 Parent組合 Cross P1P2Y1Y2平均數(shù) Mean標(biāo)準(zhǔn)差 SDP10P50P90 4L0604L311136.3132.3135.436.187.5135.5183.3 4L1194L361125.0138.6132.337.281.6131.8182.2 4L2134L361127.6138.6133.535.984.4134.4181.3 4L0604L119136.3125.0130.637.480.5130.3180.8 4L0544L060133.5136.3134.633.491.3133.7180.8 4L3114L361132.3138.6134.335.287.7134.4180.7 4L0544L361133.5138.6136.333.692.8135.3180.7 4L0604L371136.3137.2138.032.593.9138.8180.5 4L3614L371138.6137.2136.932.493.6137.8179.4 4L0604L213136.3127.6131.736.083.4132.1179.3 4L1594L361143.6138.6141.228.9103.5141.6179.1 4L0604L297136.3131.0133.933.490.5132.6178.9 4L0604L159136.3143.6140.329.5101.2140.2178.9 4L3614L367138.6136.5138.229.899.7137.6178.6 4L2344L361132.0138.6134.831.693.8134.1177.8 4L2974L361131.0138.6134.333.091.6134.1177.5 4L0544L114133.5128.3131.933.785.5131.6177.0 4L2744L361131.5138.6135.431.592.5135.8176.4 4L1144L371128.3137.2133.732.390.0133.8176.3 4L1144L311128.3132.3129.435.781.8128.1176.0 4L1144L213128.3127.6128.235.979.9127.9175.5 4L0604L367136.3136.5136.230.894.1136.8175.4 4L1144L159128.3143.6136.129.897.8135.9175.4 4L0604L274136.3131.5133.631.491.7134.1175.3 4L0604L148136.3124.2130.733.687.2130.2175.1 4L1144L119128.3125.0125.637.774.1125.6175.0 4L2484L361123.4138.6131.033.486.9131.0174.6 4L1144L297128.3131.0129.035.083.7128.9174.6 4L1934L361122.8138.6131.533.188.3130.9173.7 4L0604L234136.3132.0132.731.790.2133.4173.6 4L1144L234128.3132.0131.032.289.0131.0173.3 4L1144L367128.3136.5133.130.692.8133.0173.3 4L0274L361118.0138.6128.033.284.9127.6173.2 4L0604L193136.3122.8129.332.785.5129.4172.8 4L0604L248136.3123.4128.633.284.6128.8172.5 4L2604L361107.0138.6124.336.277.5123.5172.3 4L0604L302136.3115.0125.133.380.6124.1172.1 4L3024L361115.0138.6126.334.482.0125.7172.0 4L0604L111136.3115.0126.733.881.9126.2171.9 4L3154L361120.4138.6130.431.988.3130.9171.8 4L1484L361124.2138.6130.831.689.7130.6171.6 4L1114L361115.0138.6126.533.881.7126.8171.5 4L1464L361112.8138.6126.533.283.2125.4171.5
(續(xù)附表2)
親本 Parent組合 Cross P1P2Y1Y2平均數(shù) Mean標(biāo)準(zhǔn)差 SDP10P50P90 4L0494L361110.6138.6124.635.779.0125.1171.3 4B1814L361118.2138.6128.532.086.4128.9171.3 4L2834L361106.2138.6120.837.472.2120.4171.2 4L0604L315136.3120.4128.831.885.9129.0171.2 4L1144L274128.3131.5130.631.189.6131.5171.2 4L2844L361114.2138.6126.833.782.1127.0170.9 4L1144L193128.3122.8125.633.781.4124.3170.6 4L0274L060118.0136.3126.133.782.1125.4170.6 4B1814L060118.2136.3128.032.284.9128.2170.6 4L0604L284136.3114.2126.034.580.1126.7170.5 4L0604L112136.3119.2127.232.085.0126.9170.5 4L1124L361119.2138.6128.531.287.3128.0170.4 4L2424L361117.6138.6128.431.585.9128.3170.0 4L1914L36192.4138.6114.940.760.0114.9169.8 4L1244L361106.0138.6122.935.077.1122.5169.7 4L1144L248128.3123.4126.433.482.9125.3169.6 4L1454L361119.6138.6127.930.986.3127.6169.5 4L0494L060110.6136.3123.335.876.8123.6169.5 4L0604L145136.3119.6128.531.286.5128.2169.4 4L0014L361120.8138.6129.730.190.4128.9169.4 4L2014L361106.2138.6122.835.176.0121.8169.2 4L0604L260136.3107.0121.836.274.3121.4169.1 4L1544L361118.3138.6128.431.686.4128.3168.9 4L3524L361112.0138.6125.931.785.1125.4168.8 4L0224L15971.5143.6108.745.748.7109.2168.8 4L2244L361109.6138.6123.733.579.6124.1168.8 4L1864L361113.6138.6126.731.285.4125.7168.7 4L2544L361110.2138.6124.433.678.5124.2168.6 4L0604L154136.3118.3127.431.187.1127.2168.5 4L2764L361102.3138.6120.135.872.4119.9168.5 4L0014L060120.8136.3129.230.090.0128.9168.4 4L0604L242136.3117.6125.631.983.3124.4168.3 4B1814L114118.2128.3124.332.881.6124.5168.3 4L0224L06071.5136.3104.646.743.0103.2168.1 4L0604L254136.3110.2122.334.176.5122.7168.0 4L1144L148128.3124.2125.632.683.0125.6168.0 4L0604L146136.3112.8124.032.680.4123.9167.8 4L2964L361121.5138.6129.329.291.3129.3167.8 4L0604L191136.392.4114.740.061.9115.2167.8 4L0604L283136.3106.2120.135.773.3120.7167.7 4L1144L284128.3114.2122.233.777.5121.6167.5 4L0604L352136.3112.0124.332.780.7123.9167.5 4L0604L201136.3106.2120.434.675.5120.1167.2 4L0424L361115.5138.6127.130.585.9127.3167.1
(續(xù)附表2)
親本 Parent組合 Cross P1P2Y1Y2平均數(shù) Mean標(biāo)準(zhǔn)差 SDP10P50P90 4L0604L276136.3102.3119.835.872.2120.2167.0 4L1144L315128.3120.4125.231.782.0125.9167.0 4L0274L114118.0128.3123.433.080.7123.9166.8 4L1144L260128.3107.0117.437.369.0116.9166.7 4L0604L186136.3113.6124.231.483.9123.2166.6 4L0494L114110.6128.3120.235.274.5120.4166.6 4L3614L369138.6112.4125.430.984.6125.0166.4 4L0224L05471.5133.5103.947.940.6104.9166.4 4L0424L060115.5136.3126.330.185.9126.1166.2 4L1174L361117.0138.6128.328.591.6127.2166.2 4L1114L114115.0128.3122.134.276.7122.1166.2 4L0604L124136.3106.0120.434.575.6120.6166.2 4L3604L361111.0138.6124.331.382.8125.6166.1 4L1074L361119.4138.6128.828.690.0129.7166.0
P1、P2分別表示單交組合親本代號(hào); Y1、Y2分別表示P1和P2的株高觀測(cè)平均數(shù); P10、P50、P90分別表示第10、50、90百分位數(shù)。
P1 and P2 indicate the two parent labels of single cross; Y1 and Y2 indicate the means of observed plant height; P10, P500, and P90 are 10-th, 50-th, and 90-th percentiles of homozygous progeny population.
Characterization and Analytical Programs of the Restricted Two-stage Multi- locus Genome-wide Association Analysis
HE Jian-Bo, LIU Fang-Dong, XING Guang-Nan, WANG Wu-Bin, ZHAO Tuan-Jie, GUAN Rong-Zhan, and GAI Jun-Yi*
Soybean Research Institute / National Center for Soybean Improvement, Ministry of Agriculture / Key Laboratory of Biology and Genetic Improvement of Soybean (General), Ministry of Agriculture / State Key Laboratory for Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
Genome-wide association studies (GWAS) have been widely used for genetic dissection of quantitative trait loci (QTL), and the previous GWAS procedures were concentrated on finding a handful of major loci, while the plant breeders are more likely interested in exploring the whole QTL system for both forward selection and background control. We proposed the restricted two-stage multi-locus genome-wide association analysis (RTM-GWAS, https://github.com/njau-sri/rtm-gwas/) for a relatively thorough detection of QTL and their multiple alleles. Firstly, RTM-GWAS groups the tightly linked sequential SNPs into linkage disequilibrium blocks (SNPLDBs) to form genomic markers with multiple haplotypes as alleles. Secondly, it utilizes two-stage association analysis based on a multi-locus multi-allele model to save computer space for focusing on genome-wide QTL identification along with their multiple alleles. Compared with the previous GWAS methods, RTM-GWAS takes the trait heritability as the upper limit of detected genetic contribution, which can avoid a large amount of false positives for a precise detection of the QTL system of the trait. The QTL-allele matrix as a compact form of the population genetic constitution can be used to design optimal genotypes, to predict optimal crosses in plant breeding, and to study the genetic properties of the population as well as the novel and newly emerged alleles. In the present study, we first introduced the function and usage of the RTM-GWAS analytical programs, and then used the experimental data from a research program on soybean to illustrate the application details of the RTM-GWAS.
restricted two-stage multi-locus genome-wide association study; SNP linkage disequilibrium block; multi-locus model; QTL-allele matrix; germplasm population; optimal cross design
2018-03-19;
2018-06-12;
2018-06-29.
10.3724/SP.J.1006.2018.01274
蓋鈞鎰, E-mail: sri@njau.edu.cn
E-mail: hjbxyz@gmail.com
本研究由國家自然科學(xué)基金項(xiàng)目(31701447, 31671718), 國家重點(diǎn)研發(fā)計(jì)劃項(xiàng)目(2017YFD0101500), 教育部111項(xiàng)目(B08025), 教育部長(zhǎng)江學(xué)者和創(chuàng)新團(tuán)隊(duì)項(xiàng)目(PCSIRT_17R55), 國家現(xiàn)代農(nóng)業(yè)產(chǎn)業(yè)技術(shù)體系建設(shè)專項(xiàng)(CARS-04), 江蘇省優(yōu)勢(shì)學(xué)科建設(shè)工程專項(xiàng), 中央高校基本科研業(yè)務(wù)費(fèi)和江蘇省JCIC-MCP項(xiàng)目資助。
This study was supported by the National Natural Science Foundation of China (31701447, 31671718), the National Key R&D Program for Crop Breeding in China (2017YFD0101500), the MOE 111 Project (B08025), the MOE Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT_17R55), the China Agriculture Research System (CARS-04), the Jiangsu Higher Education PAPD Program, the Fundamental Research Funds for the Central Universities and the Jiangsu JCIC-MCP.
URL:http://kns.cnki.net/kcms/detail/11.1809.S.20180629.1035.002.html