Xin Wang, Zhenliang Zhang, Yang Xu, Pengchen Li, Xuecai Zhang, Chenwu Xu*
a College of Information Engineering, Yangzhou University, Yangzhou 225009, Jiangsu, China
b Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, College of Agriculture, Yangzhou University, Yangzhou 225009, Jiangsu, China
c International Maize and Wheat Improvement Center (CIMMYT), Mexico D.F. 06600, Mexico
ABSTRACT
The concept of general combining ability (GCA), originally defined by Sprague and Tatum [1], refers to the average performance of a line in hybrid combinations. It can be estimated as the difference between the average of the line's hybrids and the general average of all crosses. GCA is mainly a measure of additive effects, which can be directly transmitted from parents to offspring. In recent years, doubled haploid (DH) technology has facilitated the generation of large numbers of inbred lines, making GCA evaluation the major bottleneck in hybrid maize breeding [2]. Therefore, the evaluation of GCA is a crucial step in hybrid development, and maize lines with high GCA are essential components for producing elite hybrids [3].
GCA can be estimated easily using the complete diallel cross design or North Carolina Design II (NC II) [4,5]. However, as breeding programs develop, large numbers of inbred lines become available. The number of possible crosses grows very rapidly, making these designs time- and resource-intensive. Partial diallel cross designs, in which only a subset of the possible crosses is performed, are more attractive options [6]. They allow a greater number of inbred lines to be evaluated in crosses [7]. In the classic circulant designs [8], each of n lines is crossed with only s other lines, instead of n-1 lines as in the complete diallel. In this way, each line is guaranteed to be involved in the sampled crosses. Analysis using circulant diallels significantly reduces the number of crosses in which each genitor is involved and enables the participation of a greater number of genitors [9]. However, in breeding practice, only a small proportion of hybrids can be evaluated in the field, making the partial diallel table very sparse. This means that the average value of s is small, and complex factors in field trials may cause s to fluctuate widely among parental lines. Little research has been reported on estimating GCA in this scenario, so there is an urgent need for procedures that allow accurate evaluation of GCA based on sparse partial diallel cross (SPDC) designs [6].
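The growth in the number of crosses can be made concrete with a short calculation. The sketch below is illustrative only: the function names are hypothetical, reciprocals are ignored, and n = 266 is the number of inbred lines used later in this study.

```python
def complete_diallel_crosses(n: int) -> int:
    """Number of distinct single crosses among n lines, ignoring reciprocals."""
    return n * (n - 1) // 2


def partial_diallel_crosses(n: int, s: int) -> int:
    """Number of crosses when each of n lines is crossed with s other lines."""
    if (n * s) % 2 != 0:
        raise ValueError("n * s must be even, since every cross involves two lines")
    return n * s // 2


if __name__ == "__main__":
    n = 266
    print(complete_diallel_crosses(n))    # all possible hybrids among 266 lines
    print(partial_diallel_crosses(n, 4))  # a circulant design with s = 4
```

With n = 266 the complete diallel requires 35,245 crosses, whereas a circulant design with s = 4 needs only 532.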
Previous studies [10–12] have proposed many statistical methods for diallel analysis. Among them, the diallel mating models proposed by Griffing [10] are the most widely used for investigating genetic parameters (general and specific combining ability). For a partial diallel cross scheme, traditional studies [13] used ordinary least squares (OLS) to estimate GCA. However, if the diallel table is very sparse, the proportion of observed hybrids is drastically reduced. Using traditional OLS to evaluate many inbred lines from a small sample of hybrids cannot ensure high accuracy.
Fortunately, with advances in molecular biology, breeders can accurately characterize the genetic structure of breeding populations and thus greatly improve the estimation of genetic parameters using genomic prediction (GP). Estimated breeding values based on the genotypes of individuals have been remarkably accurate [14,15]. Some studies [16,17] used genomic selection (GS) models to directly predict agronomic traits in inbred lines, and others [18–20] used GS to predict hybrid performance. Various methods, such as genomic best linear unbiased prediction (GBLUP), Bayesian methods, the least absolute shrinkage and selection operator (LASSO), and machine learning [21–24], have been developed for GP, and these differ in their assumptions about the marker effects. GBLUP and ridge regression BLUP (RR-BLUP) [21] assign identical variance to all loci and essentially treat all of them as equally important. In BayesA, markers are assumed to have different variances, which follow a posterior scaled inverse chi-square distribution [22]. The prior in BayesB assumes that the variance of a marker is equal to zero with probability π and, with the complementary probability 1 − π, follows a scaled inverse chi-square distribution [22]. In BayesCπ, the mixture probability π has a uniform prior distribution [25]. LASSO is a popular regression method that uses a penalty to achieve a sparse solution; it is somewhat indifferent to closely correlated markers and tends to pick one and ignore the others [26]. Machine learning is another alternative for GS and has been employed to enhance the prediction of genetic values for wheat and maize [27,28]. Although various models have been successfully applied to GS, some studies [29–31] showed little variation in prediction accuracy among the different models.
In terms of GCA analysis and hybrid prediction, Bernardo [32–34] was one of the first to advocate the BLUP model [35] in maize. Since then, many studies have analyzed GCA, especially when predicting hybrids. For predicting the GCA of a maize testcross population, Riedelsheimer et al. [2] proposed a genomic selection (GS) approach based on RR-BLUP, showing that more efficient predictive procedures could be developed using genomic data. With a linear mixed model fitted in the ASReml-R software [36], Kadam et al. [37] evaluated random inbreds derived from biparental families of maize. Using GBLUP and BayesB, Technow et al. [38] investigated genome properties of the parental lines based on the Dent × Flint heterotic pattern. Greenberg et al. [6] developed a hierarchical Bayesian model to estimate quantitative genetic parameters from partial diallel cross designs. Werner et al. [39] considered GCA and specific combining ability (SCA) when applying RR-BLUP and Bayesian models to predict hybrid performance in oilseed rape, using a collection of 220 paternal DH lines and five male-sterile inbred lines. Recently, using GBLUP and a complete diallel cross design with twenty-eight single crosses formed between eight parental lines, Velez-Torres et al. [40] concluded that GS is a more effective and efficient approach for predicting the GCA of maize lines than the phenotyping method. However, most previous studies have paid more attention to the prediction accuracy of hybrids than to the prediction of parental GCA. For an SPDC maize population, few studies have systematically investigated the GP accuracy for GCA.
The purpose of this study is to assess the efficacy of GP for estimating the GCA of maize inbred lines with SPDC systems. As mentioned earlier, the accuracy of various GS models is similar. Considering that GBLUP is well suited to quantitative traits influenced by polygenes and is computationally efficient [15], GBLUP was adopted in this study. Using genome-wide SNPs called from a real maize data set of 266 inbred lines, genetic and phenotypic values of all possible hybrids were simulated. Different hybrid sample sizes and different distributions of the parental lines involved in crossing were then investigated to assess the efficacy of GP in estimating the GCA of lines. The utility of the statistical approaches was then illustrated with an example using an actual maize trait. The methods described here should be useful for various sets of maize and other crops.
2.1.1. Plant materials
The models were fitted to a maize data set from Yangzhou University. Partial diallel crosses among a total of 266 maize inbred lines were performed during the 2017 and 2018 maize growing seasons in field trials on experimental farms in Yangzhou and Taian, China, with two replicates in each environment. Regardless of reciprocals, ear weight (EW) was collected for 945 hybrids to estimate the GCA of the inbred lines. The phenotypes fitted to the statistical model were the average performance of each hybrid from two environments. The 266 inbred lines were genotyped, and 319,668 SNPs evenly distributed on the chromosomes were called initially. After eliminating SNPs with heterozygosity > 0.05 or missing rate > 0.05, 61,468 genome-wide SNPs were retained. Genotypes of the hybrids were inferred from the SNPs of their inbred parents.
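The filtering and inference steps can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the 0/1/2 allele-count coding, the use of `None` for missing calls, the choice to compute heterozygosity over non-missing calls, and the function names are all assumptions.

```python
from typing import List, Optional

# Assumed coding: 0, 1, 2 = copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def keep_snp(calls: List[Genotype], max_het: float = 0.05,
             max_missing: float = 0.05) -> bool:
    """Return True if a SNP passes both the heterozygosity and missing-rate filters."""
    observed = [c for c in calls if c is not None]
    missing_rate = 1.0 - len(observed) / len(calls)
    # Heterozygosity is computed over non-missing calls (an assumption of this sketch).
    het_rate = sum(1 for c in observed if c == 1) / len(observed) if observed else 1.0
    return het_rate <= max_het and missing_rate <= max_missing


def hybrid_genotype(p1: Genotype, p2: Genotype) -> Genotype:
    """Infer the F1 genotype at one SNP from two homozygous inbred parents."""
    if p1 in (0, 2) and p2 in (0, 2):
        return (p1 + p2) // 2  # 0x0 -> 0, 0x2 -> 1, 2x2 -> 2
    return None  # heterozygous or missing parent: the hybrid call is left undetermined
```

For example, `hybrid_genotype(0, 2)` gives the heterozygote `1`, while a SNP at which 20% of the inbred calls are heterozygous is rejected by `keep_snp`.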
2.1.2. Simulations
Based on the genotypes mentioned above, a large number of simulations were performed to assess the prediction accuracy of the different statistical approaches. Considering additive and dominance effects of markers, several traits were randomly simulated for all 35,245 possible hybrids, with 200 QTLs and different heritabilities. The numbers of QTLs on chromosomes 1–10 were 29, 32, 21, 25, 21, 12, 15, 21, 9, and 15, respectively. According to the study of Meuwissen et al. [22], the additive effects of the 200 QTLs were drawn from a gamma distribution with shape parameter α = 0.4 and scale parameter β = 1.66. Half of the additive effects were positive and the other half were negative. The dominance effects were determined as the product of the absolute additive effects and the degree of dominance, which was drawn from a normal distribution with mean and variance equal to 0.193 and 0.312², respectively. Among the 200 QTLs, two loci, on chromosomes 1 and 3, had weak over-dominant effects. For all the simulated hybrids, the ratio of dominance variance to additive variance was 0.160. Independent normal error deviations, with variances calculated to meet the assumed heritability, were added, and the simulated phenotypes were centered and standardized to unit variance. Finally, three traits with heritabilities of 0.7, 0.5, and 0.3, named T7, T5, and T3, were obtained.
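The simulation recipe can be sketched with the Python standard library. The code below is a simplified illustration under stated assumptions: genotypes are coded 0/1/2, the additive covariate is the allele count minus one, the dominance covariate is a heterozygote indicator, and heritability is treated as the ratio of total genetic variance to phenotypic variance. The function names and seeds are hypothetical.

```python
import random
import statistics


def simulate_qtl_effects(n_qtl=200, shape=0.4, scale=1.66,
                         dom_mean=0.193, dom_sd=0.312, seed=1):
    """Draw additive effects (gamma magnitudes, half positive and half negative)
    and dominance effects (|additive effect| times a normal degree of dominance)."""
    rng = random.Random(seed)
    magnitudes = [rng.gammavariate(shape, scale) for _ in range(n_qtl)]
    signs = [1.0] * (n_qtl // 2) + [-1.0] * (n_qtl - n_qtl // 2)
    rng.shuffle(signs)
    additive = [s * m for s, m in zip(signs, magnitudes)]
    dominance = [abs(a) * rng.gauss(dom_mean, dom_sd) for a in additive]
    return additive, dominance


def genetic_value(genotype, additive, dominance):
    """Genetic value of one hybrid from its 0/1/2 QTL genotypes."""
    return sum(a * (g - 1) + d * (g == 1)
               for g, a, d in zip(genotype, additive, dominance))


def add_noise(genetic_values, h2, seed=2):
    """Add normal errors so that var(G) / var(P) is about h2, then standardize."""
    rng = random.Random(seed)
    var_g = statistics.pvariance(genetic_values)
    sd_e = (var_g * (1.0 - h2) / h2) ** 0.5
    pheno = [g + rng.gauss(0.0, sd_e) for g in genetic_values]
    mean, sd = statistics.fmean(pheno), statistics.pstdev(pheno)
    return [(p - mean) / sd for p in pheno]
```

Calling `add_noise` with `h2` set to 0.7, 0.5, and 0.3 on the same genetic values would give analogues of the traits T7, T5, and T3.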
2.1.3. Sampling designs
Two types of sampling designs were used to evaluate differences in predictive accuracy. First, for each trait of the 35,245 hybrids, different numbers of hybrids (m = 500, 1000, and 2000) were sampled as training sets to estimate the GCA of the 266 parental lines. Second, for each hybrid sample size mentioned above, three distributions of parental lines were designed. The first resembled a circulant diallel table and is called balanced sampling in this paper: all parental lines were involved in a nearly equal number of crosses (designated s). The second, called random sampling, assigned a random number of crosses to each line, with a minimum of one. The third, called unbalanced sampling, involved only part of the 266 lines (n = 200 or 150) in random crossing. Each sampling method was randomly repeated 20 times to obtain average results over the 20 replicates. The density plot of the crossing times of the 266 lines derived from the 20 samplings with the different methods is shown in Fig. 1. Taking the first round of sampling with m = 500 as an example, the detailed sampling scheme is illustrated in Table S1. For the random samplings, the crossing times (s) of each inbred line were classified into three levels to compare the accuracy for line subsets with different s values. About a quarter of the lines, those with the lowest s values, were defined as low-frequency; a quarter of the lines, those with the highest s values, were defined as high-frequency; and the middle half was defined as medium-frequency.
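The three sampling designs can be mimicked with the following sketch. It is one plausible implementation, not the procedure used for Table S1: the circulant-style construction of the balanced design, the covering step that guarantees at least one cross per line, and all function names are assumptions.

```python
import itertools
import random
from collections import Counter


def crossing_counts(crosses, n):
    """Number of sampled crosses (s) in which each of the n lines takes part."""
    counts = Counter()
    for i, j in crosses:
        counts[i] += 1
        counts[j] += 1
    return [counts[k] for k in range(n)]


def balanced_sampling(n, m):
    """Circulant-style design: cross line i with line i+d (d = 1, 2, ...) until m crosses."""
    crosses, seen = [], set()
    for d in range(1, n // 2 + 1):
        for i in range(n):
            pair = tuple(sorted((i, (i + d) % n)))
            if pair not in seen:
                seen.add(pair)
                crosses.append(pair)
                if len(crosses) == m:
                    return crosses
    raise ValueError("m exceeds the number of possible crosses")


def random_sampling(n, m, rng):
    """m random crosses in which every line appears at least once."""
    if m < n:
        raise ValueError("this sketch assumes m >= n")
    chosen, covered = set(), set()
    for i in rng.sample(range(n), n):  # covering step: one cross per uncovered line
        if i not in covered:
            j = rng.choice([k for k in range(n) if k != i])
            chosen.add((min(i, j), max(i, j)))
            covered.update((i, j))
    remaining = [p for p in itertools.combinations(range(n), 2) if p not in chosen]
    chosen.update(rng.sample(remaining, m - len(chosen)))
    return sorted(chosen)


def unbalanced_sampling(n, n_involved, m, rng):
    """m random crosses among a random subset of n_involved lines."""
    involved = sorted(rng.sample(range(n), n_involved))
    return sorted(rng.sample(list(itertools.combinations(involved, 2)), m))
```

For instance, `crossing_counts(balanced_sampling(266, 500), 266)` yields values between 2 and 4 for every line, whereas the counts from `random_sampling` vary much more widely.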
The Griffing model [10] was used to analyze our SPDC schemes:

$$y_{ij} = \mu + g_i + g_j + s_{ij} + \varepsilon_{ij}$$

where $y_{ij}$ is the phenotypic value of the hybrid between line i and line j (i, j = 1, 2, …, n); $\mu$ is the overall mean; $g_i$ and $g_j$ are the GCA effects of the ith parent and the jth parent, respectively; $s_{ij}$ is the SCA effect for the cross between the ith and jth parents; and $\varepsilon_{ij}$ is the random error effect.
Based on the above model, three statistical approaches were applied. One obtained estimates through OLS, and the other two used GP to predict the GCA. In our simulations, because the phenotypes of all possible hybrids were known, the true GCA of all 266 inbred lines could be calculated from its definition. The coefficient of determination (the squared Pearson correlation coefficient) between the true GCA and the predicted GCA was adopted to evaluate the accuracy of the different statistical approaches.
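The true GCA and the accuracy measure can be computed as in the sketch below. It assumes the hybrid phenotypes are stored in a dictionary keyed by line-index pairs (a hypothetical layout) and that every line appears in at least one hybrid.

```python
from statistics import fmean


def gca_by_definition(hybrid_values, n):
    """GCA of each line: mean of its hybrids minus the mean of all hybrids.

    hybrid_values maps a pair (i, j) of line indices to the hybrid phenotype.
    """
    overall = fmean(hybrid_values.values())
    sums, counts = [0.0] * n, [0] * n
    for (i, j), y in hybrid_values.items():
        sums[i] += y
        sums[j] += y
        counts[i] += 1
        counts[j] += 1
    return [sums[k] / counts[k] - overall for k in range(n)]


def r_squared(x, y):
    """Coefficient of determination: the squared Pearson correlation of x and y."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)
```

With three lines and hybrid values {(0, 1): 1, (0, 2): 2, (1, 2): 3}, the overall mean is 2 and the GCA values are -0.5, 0, and 0.5.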
2.2.1. OLS approach
In matrix form, the vector for m observed hybrids can be represented by:
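One common way to write such a model, given here as an assumption rather than the authors' exact notation, is $\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{g} + \mathbf{e}$, where $\mathbf{Z}$ is an m × n incidence matrix with two ones per row marking the parents of each hybrid and the SCA effects are absorbed into the residual. The sketch below fits this form by least squares with NumPy and imposes a sum-to-zero constraint on the GCA effects; the function names are hypothetical.

```python
import numpy as np


def incidence_matrix(crosses, n):
    """m x n matrix with ones in the two parent columns of each hybrid."""
    Z = np.zeros((len(crosses), n))
    for row, (i, j) in enumerate(crosses):
        Z[row, i] = 1.0
        Z[row, j] = 1.0
    return Z


def ols_gca(crosses, y, n):
    """Least-squares estimates of the overall mean and the GCA effects."""
    Z = incidence_matrix(crosses, n)
    X = np.hstack([np.ones((len(crosses), 1)), Z])
    # X is rank deficient (the intercept equals half the row sums of Z),
    # so lstsq returns one of many equivalent solutions.
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    mu, g = beta[0], beta[1:]
    shift = g.mean()
    # Re-centre so the GCA effects sum to zero; fitted values are unchanged.
    return mu + 2.0 * shift, g - shift
```

In a sparse table some lines may be only weakly connected to the rest of the design, in which case their estimates rest on very few observations; lines with no crosses at all receive no information from OLS.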
Fig. 1 – Density plot for the crossing times of the 266 inbred lines derived from 20 rounds of random, balanced, and unbalanced samplings, respectively.
2.2.2. GP approach
GBLUP is an efficient method that uses whole-genome markers to predict genetic values and phenotypes of interest. It exploits the genomic relationships between the training population and the testing population to predict the genomic values of unknown individuals [42]. In this study, two GP approaches using GBLUP were applied to predict the GCA of the 266 lines. One, designated GP-I, used the Griffing model to estimate the GCA directly. The other, designated GP-II, estimated the GCA by predicting the phenotypes of all possible hybrids.
The model of GP-I can be described as:
The model of GP-II can be described as:
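The two strategies can be illustrated with a toy GBLUP sketch. It is not the authors' model specification: it assumes an additive VanRaden-type kinship only, a known ratio `lam` of residual to genetic variance (in practice estimated, for example by REML), inbred genotypes coded 0/2, and hypothetical function names. Forming the kinship among all possible hybrids explicitly, as done here, is only feasible for small examples.

```python
import itertools
import numpy as np


def vanraden_kinship(M):
    """Additive genomic relationship from an (individuals x markers) 0/1/2 matrix."""
    p = M.mean(axis=0) / 2.0
    W = M - 2.0 * p
    return W @ W.T / (2.0 * np.sum(p * (1.0 - p)))


def blup(y, K_all_train, K_train, lam):
    """GLS mean and BLUP of genetic effects; lam = sigma_e^2 / sigma_u^2 (assumed known)."""
    y = np.asarray(y, dtype=float)
    V = K_train + lam * np.eye(len(y))
    Vinv_1 = np.linalg.solve(V, np.ones(len(y)))
    Vinv_y = np.linalg.solve(V, y)
    mu = Vinv_1 @ y / Vinv_1.sum()
    return mu, K_all_train @ (Vinv_y - mu * Vinv_1)


def gp1_gca(M_lines, crosses, y, lam):
    """GP-I style: treat GCA effects as random with covariance given by line kinship."""
    n = M_lines.shape[0]
    Z = np.zeros((len(crosses), n))
    for r, (i, j) in enumerate(crosses):
        Z[r, i] = Z[r, j] = 1.0
    K = vanraden_kinship(M_lines)
    _, g = blup(y, K @ Z.T, Z @ K @ Z.T, lam)
    return g


def gp2_gca(M_lines, crosses, y, lam):
    """GP-II style: predict every possible hybrid, then apply the definition of GCA."""
    n = M_lines.shape[0]
    all_pairs = list(itertools.combinations(range(n), 2))
    index = {pair: k for k, pair in enumerate(all_pairs)}
    M_hyb = np.array([(M_lines[i] + M_lines[j]) / 2.0 for i, j in all_pairs])
    K = vanraden_kinship(M_hyb)
    train = [index[tuple(sorted(c))] for c in crosses]
    mu, u = blup(y, K[:, train], K[np.ix_(train, train)], lam)
    pred = mu + u
    line_means = [np.mean([pred[k] for pair, k in index.items() if i in pair])
                  for i in range(n)]
    return np.array(line_means) - pred.mean()
```

Both functions return a value for every genotyped line, including lines that appear in none of the sampled crosses, because the kinship matrix links them to the lines that do.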
For the traits T7, T5, and T3, based on random and balanced samplings, the prediction results of OLS, GP-I, and GP-II using different sample sizes are compared in Table 1. In each case, trait T7 had the highest accuracy and trait T3 the lowest, showing that the estimation of GCA was largely dependent on heritability.
It is notable that the accuracy of each approach was always higher than the heritability of the target trait, showing that the GCA of maize lines can be accurately predicted with SPDC designs. This property is very beneficial for genetic improvement in breeding practice. For T7 under random sampling, when the sample size was 500, GP-II gave the highest accuracy (0.8068), which was 4.7% higher than that of GP-I (0.7709) and 9.6% higher than that of OLS (0.7361). When the sample sizes were 1000 and 2000, GP-II again gave the highest accuracy, but the accuracy of GP-I was only slightly higher than that of OLS and the advantage of GP-II was smaller. For T5, when the sample sizes were 500 and 1000, the statistical approaches showed a similar pattern, with accuracy increasing from OLS to GP-II. When the sample size was 2000, the accuracy of GP-I was once again only slightly higher than that of OLS. For T3, whichever sample size was adopted, GP-II gave much more accurate GCA than GP-I and OLS, and the accuracies obtained by GP-I always substantially exceeded those of OLS.
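The percentage differences quoted here are relative gains, as the following check shows (the helper name is hypothetical):

```python
def relative_gain(new: float, reference: float) -> float:
    """Percentage by which `new` exceeds `reference`."""
    return 100.0 * (new - reference) / reference


if __name__ == "__main__":
    print(round(relative_gain(0.8068, 0.7709), 1))  # GP-II vs GP-I: 4.7
    print(round(relative_gain(0.8068, 0.7361), 1))  # GP-II vs OLS: 9.6
```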
The sample size clearly had a great influence on the accuracy. For all three traits, the statistical approaches with a sample size of 2000 provided the highest accuracies, followed by those with a sample size of 1000, reflecting that a large sample size contributes substantially to the prediction. Additionally, the sample size affected the advantage of the GP approaches over OLS. As mentioned above, the smaller the sample size, the larger the advantage, showing that GP is particularly beneficial for estimating GCA with a sparse partial diallel table. Whichever statistical approach was used, the accuracies of the random and balanced samplings were not significantly different, showing that the random sampling commonly used in breeding practice had little impact on the average accuracy of GCA.
Table 1 – Comparison of accuracy using OLS, GP-I, and GP-II with different sampling designs and training set sizes.
In brief, GP-II performed the best in our research and would be a promising approach for estimating the GCA of maize and other crops.
Sometimes, because of insufficient material resources or a limited experimental budget, the number of inbred lines involved in crossing is limited. Therefore, the effect of the number of lines on prediction was explored in this study. To simplify the problem, only the GP-II approach, which had the highest accuracy in the previous experiments, was investigated. The accuracy for the 266 lines with different numbers of lines (n = 150, 200, and 266) involved in random crossing was plotted against the number of sampled hybrids (Fig. 2). As expected, for each of the three traits, the accuracy was higher for the sampling designs with 266 lines than for those with 200 and 150 lines. For T7, when the sample size was 500, 1000, and 2000, the highest accuracy obtained with n = 266 was 22.9%, 25.2%, and 30.4% higher than that with n = 150, respectively. For T5, the corresponding gains were 20.8%, 22.3%, and 27.7%, respectively. For T3, the corresponding gains were 14.0%, 17.5%, and 24.4%, respectively. It was clear that the advantage of involving more inbred lines in crossing was greatest when predicting a high-heritability trait with a large sample of hybrids.
The results above showed no significant differences between the accuracies of the random and balanced samplings. However, under random sampling, inbred lines involved in crossing at different frequencies must provide different amounts of information. Therefore, taking GP-II as an example, the accuracies for lines involved in crossing at different frequency levels under random sampling are shown in Table 2. The high-frequency lines always had the highest accuracy, and the low-frequency lines performed the worst. For predicting T3 with a sample size of 500, the high-frequency lines achieved the biggest increase (55.3%) in accuracy relative to the low-frequency lines. That is to say, although random sampling gives similar overall accuracy to balanced sampling, the prediction for low-frequency lines may suffer from much lower accuracy, especially for a low-heritability trait with a small sample of hybrids.
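The grouping of lines into frequency levels can be sketched as follows, using the quartile rule given in the sampling design section. How ties in s are broken is not stated in the text; this sketch breaks them by line index, and the function name is hypothetical.

```python
def frequency_levels(counts):
    """Label each line 'low', 'medium', or 'high' by its crossing times s.

    About a quarter of the lines with the lowest s are 'low', a quarter with
    the highest s are 'high', and the middle half is 'medium'.
    """
    n = len(counts)
    order = sorted(range(n), key=lambda k: counts[k])  # stable: ties keep index order
    quarter = n // 4
    levels = ["medium"] * n
    for k in order[:quarter]:
        levels[k] = "low"
    for k in order[n - quarter:]:
        levels[k] = "high"
    return levels
```

Accuracy can then be computed separately within each level by restricting the true and predicted GCA vectors to the lines carrying that label.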
Fig. 2 – Accuracy of GP-II for the 266 inbred lines with different numbers of lines (n = 150, 200, or 266) involved in the random crossing.
The unbalanced sampling design was also considered in this research. The accuracies of GP-II for lines involved and not involved in crossing are summarized in Table 3. As expected, the accuracies were always much higher for the involved lines than for the non-involved lines. In most cases, the accuracy increased with the sample size. It is interesting to note that the accuracy of the non-involved lines was always less sensitive to the sample size than that of the involved lines. For T7 in particular, the accuracy of the 116 non-involved lines with m = 2000 (0.4105) was 6.4% higher than that with m = 500 (0.3858), while the corresponding gain for the 150 involved lines was 10.5%. In the same way, the accuracy of the 66 non-involved lines with m = 2000 (0.4594) was only 4.3% higher than that with m = 500 (0.4405), while the corresponding gain for the 200 involved lines was 14.5%. For T5 and T3, the pattern of accuracy was similar to that for T7.
Table 2 – Accuracy of GP-II for inbred lines involved in crossing with different levels of frequency.
In each case, the accuracy of the involved lines with n = 200 was lower than that with n = 150. A plausible interpretation is that the involved lines with n = 150 had a higher crossing frequency, which again shows the contribution of high frequency to prediction. In contrast, the accuracies of the non-involved lines with n = 200 were almost always higher than those with n = 150. The reason may be that the advantage of high frequency does not apply to the non-involved lines, whereas a sample of hybrids derived from more lines can benefit their prediction.
Table 3 – Accuracy of GP-II for inbred lines involved and not involved in crossing derived from the unbalanced samplings.
In addition to the simulation studies, the actual EW of 945 hybrids was used to estimate the GCA of the 266 inbred lines. The predicted GCA values were sorted in descending order, and the prediction results for the top 20 inbred lines are shown in Table 4. As shown, eight common varieties, B93, B57, B108, B74, B214, B275, B254, and B167, were selected by all three statistical approaches. It is noteworthy that the coefficient of determination between the GCA predicted by GP-II and by OLS was the highest (0.8703), and as many as sixteen common varieties were selected by both approaches, indicating that the intersection between the results of GP-II and OLS was more reliable in this scenario. Moreover, the absolute GCA values predicted by GP-I were lower than those predicted by OLS and GP-II. The reason may be that the non-additive effects were excluded from the estimates of GCA when using GP-I. Although this exclusion may not affect the predictive accuracy, the loss of non-additive variance will inevitably reduce the absolute values of the predicted GCA.
With a complete diallel cross scheme, the GCA of an inbred line can be easily calculated from its definition [1]. However, because of the large number of possible crosses, Kempthorne and Curnow [7] suggested the partial diallel cross for evaluating inbred lines. Each of the n lines is crossed with s other lines, giving ns/2 crosses in the whole set. In this scenario, although OLS has been widely used for estimating the GCA [13,44], genomic information is ignored. GS uses all molecular markers to predict the performance of the candidates, and it has shown tangible genetic gains in maize breeding [14]. Recently, Alves et al. [45] pointed out that GS can be used to estimate genetic parameters accurately in maize hybrids. For maize inbred lines, there is no reason to doubt the advantage of GS over phenotype-based OLS. Comparisons of the three approaches in the present study have shown that GP-II and GP-I are superior to OLS, showing that prediction with genomic data can help improve the estimation of GCA for inbred lines based on SPDC designs.
Riedelsheimer et al. [2] crossed 285 diverse Dent inbred lines with two testers and predicted the GCA using genomic and metabolic information. Predictive accuracies (the Pearson correlation) ranged from 0.72 to 0.81 for GP, which are similar to ours for T3 and T5. However, a fivefold cross-validation scheme was applied in their prediction, and the predicted GCA values of one subset were estimated using the observed GCA values of the other four subsets. In our research, the situation was quite different. The partial diallel table was assumed to be very sparse. Even if the hybrid sample size is equal to 2000, the ratio of the sample size (2000) to the number of possible hybrids (35,245) is only 5.7%, making it impossible to obtain a GCA observation for any line. In other words, our prediction is based on the SPDC. The sampled hybrids are treated as the training set, and their number is often larger than the number of lines with identified GCA. This may be the reason why our accuracies are slightly higher than those of Riedelsheimer et al. [2].
Population design plays a vital role in breeding programs, and the partial diallel cross is preferable in many cases. For instance, Miranda Filho and Vencovsky [13] estimated the GCA for ear length of maize in a partial diallel cross with n = 10 and s = 3. Reis et al. [44] estimated genetic parameters using a partial circulant diallel cross design with n = 34 (two groups of parents) and different values of s (from 2 to 5). Analysis using a cross-validation process showed that the accuracy increased as the value of s increased. Our balanced and unbalanced sampling designs mimicked the partial diallel cross scheme. With n = 266, the hybrid sample size was set to 500, 1000, and 2000, respectively. Correspondingly, the mean of s was 3.8, 7.5, and 15.0. Note that s was much less than n-1. The inadequate phenotypic information for each parental line in crossing could not guarantee accurate estimation of GCA. Although Vivas et al. [9] declared that it is possible to obtain good agreement (correlation coefficient above 0.8) with s = 3, our accuracies with s = 3.8 are still lower than those with s = 7.5 and s = 15.0. When evaluating the efficiency of the circulant diallel, Veiga et al. [46] pointed out that it is advantageous to increase the s value for a low-heritability trait. In our research, with a sample size of 500 for predicting T3, the accuracies of OLS were substantially lower than those of GP-I and GP-II, showing that the reduction in the s value decreased the potential accuracy of OLS. However, it is gratifying that the GP approaches using genomic information have been demonstrated to partly compensate for the "small s" problem.
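The mean crossing times follow from the fact that every hybrid involves two parents, so the mean of s is 2m/n. The short check below reproduces the values given here, together with the sampling fraction of 5.7% quoted above for m = 2000; the function names are hypothetical.

```python
def mean_crossing_times(m: int, n: int) -> float:
    """Average number of crosses per line when m hybrids are sampled from n lines."""
    return 2.0 * m / n


def sampling_fraction(m: int, n: int) -> float:
    """Share of all n(n-1)/2 possible hybrids that are observed."""
    return m / (n * (n - 1) / 2)


if __name__ == "__main__":
    for m in (500, 1000, 2000):
        print(m, round(mean_crossing_times(m, 266), 1),
              f"{sampling_fraction(m, 266):.1%}")
```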
The unbalanced sampling design is also worthy of attention in breeding practice. Because of experimental cost and complex factors in field trials, involving all inbred lines in crossing is often impossible. Previous studies [19,47] have adopted the strategy of predicting untested single-cross hybrids in maize and rice. However, few GS studies have been undertaken to predict the GCA of lines that never participate in crossing. In this case, traditional phenotype-based OLS is impracticable. Our research has demonstrated that the GCA of the non-involved lines could also be estimated using GP approaches based on SPDC designs. This strategy allows a reliable selection of more inbred lines for their potential to create superior hybrids. On the other hand, lines never or seldom involved in crossing were found to have lower accuracy than the involved lines. In this respect, our results agree with those reported by Fritsche-Neto et al. [48], who indicated that the number of parents and the crosses per parent in the training sets should be maximized when predicting maize hybrid performance.
Table 5 – Accuracy of GP-II for the GCA of 266 inbred lines and for all the potential hybrids with the random samplings.
Maize breeding involves two critical steps: developing superior inbred lines from breeding populations and identifying elite combinations of two inbred lines [52]. With the development of DH and other technologies, breeders have been able to develop large numbers of inbred lines, which need to be evaluated by their performance in crosses. However, the number of potential crosses grows rapidly, making the field evaluation of hybrid performance time- and resource-intensive. Sparse partial diallel tables are becoming more and more common in breeding practice. GCA is mainly a measure of additive effects, and it responds to selection of inbred lines. Accurate prediction of the GCA will enhance the efficiency of inbred line selection and thereby accelerate hybrid breeding. In actual breeding projects, especially when only data sets based on SPDC designs are available, breeders can evaluate their inbred lines using the methods proposed in the present study. Then, for different heterotic groups, top inbred lines can be selected, and a few corresponding testers can be used for validation in field trials. We believe that in this way, the efficiency and accuracy of maize breeding can be improved.
Previous studies [3,53,54] have used maize introgression lines or recombinant inbred lines to perform testcrosses for detecting significant loci for GCA. In all these experiments, phenotypes of hybrids were observed and the true GCA value could be calculated with certainty. However, in most breeding programs, only a small fraction of the possible hybrids can be evaluated in field trials, and thus the detection work cannot be carried out. In such cases, the methods proposed in the present study can provide a reliable estimate of the GCA, offering an opportunity for further detection of significant loci. This strategy may open up a promising research direction for inbred line selection in maize and other crops.
Supplementary data for this article can be found online at https://doi.org/10.1016/j.cj.2020.04.012.
Declaration of competing interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Key Research and Development Program of China (2016YFD0100303), the National Natural Science Foundation of China (31801028, 31902101), the Open Research Fund of State Key Laboratory of Hybrid Rice (Wuhan University) (KF201701), the Science and Technology Innovation Fund Project in Yangzhou University (2019CXJ052), and the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Author contributions
Xin Wang performed the analysis and wrote the paper. Zhenliang Zhang, Yang Xu, and Pengchen Li collected the data. Xuecai Zhang assisted with the analysis. Chenwu Xu conceived and designed the analysis.