Genome-wide regression using a number of genome-wide markers as predictors is now widely used for genome-wide association mapping and genomic prediction. We developed novel software for genome-wide regression which we named VIGoR (variational Bayesian inference for genome-wide regression). Variational Bayesian inference is computationally much faster than widely used Markov chain Monte Carlo algorithms. VIGoR implements seven regression methods, and is provided as a command line program package for Linux/Mac, and as a cross-platform R package. In addition to model fitting, cross-validation and hyperparameter tuning using cross-validation can be automatically performed by modifying a single argument. VIGoR is available at https://github.com/Onogi/VIGoR. The R package is also available at https://cran.r-project.org/web/packages/VIGoR/index.html.
Genome-wide association studyGWASgenomic selectionwhole-genome predictionThis work was supported by a Grant-in-Aid for Japan Society for the Promotion of Science (JSPS) Fellows (26.10661).(1) OverviewIntroduction
Linear regression methods where a number of genome-wide markers are used as predictors are currently used in genomic prediction [123] and genome-wide association mapping [456]. Bayesian regression methods have attracted particular attention and a number of variations have been proposed [2], which typically use Markov chain Monte Carlo (MCMC) algorithms for parameter inference (e.g., [178]). Consequently, the software currently available for Bayesian regression (e.g., GenSel [9], AlphaBayes [10], GS3 [11], BayeZ [12], and BGLR [13]) is mainly based on MCMC. However, because of the computational burden associated with MCMC, analyzing huge datasets, such as those consisting of hundreds of thousands of markers, within realistic time scales is often unfeasible. Moreover, intensive cross-validation (CV) for evaluating predictive ability or for tuning hyperparameters is also difficult, even in moderate-sized datasets. These shortcomings of MCMC-based methods have hampered the widespread application of Bayesian methods to genome-wide association mapping and genomic prediction. To tackle the shortcomings of the MCMC-based software, we developed novel software for whole-genome regression in a Bayesian framework, which we named VIGoR (variational Bayesian inference for genome-wide regression). VIGoR is based on variational Bayesian inference (VB), which is computationally much faster than MCMC, and can implement seven regression methods: Bayesian lasso (BL) [14], extended Bayesian lasso (EBL) [15], weighted Bayesian shrinkage regression (wBSR) [16], BayesB [1], BayesC [8], stochastic search variable selection (SSVS) [17], and Bayesian mixture regression (MIX)[18]. BL, EBL, and wBSR implemented by VIGoR were used for genomic prediction of rice agronomic traits [19], and genomic prediction and association mapping in tomato [20]. EBL was also used for genomic prediction of rice heading date [21]. The command line program (CLP) package for the Linux/Mac platform is available at https://github.com/Onogi/VIGoR. A pdf manual is also available at the URL [24]. The R package is cross-platform and is available at https://github.com/Onogi/VIGoR and from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/VIGoR/index.html. The pdf manual contains the explanations of both the CLPs and the R functions.
Regression methods and algorithms
VIGoR assumes the following linear regression model:
where y_{i} is the phenotypic value (response variable) of individual i, F is the number of covariates other than markers, z_{ij} is the covariate corresponding to the effect α_{j}, P is the number of markers, γ_{p} is the indicator variable that takes 0 or 1, x_{ip} is the genotype of marker p, β_{p} is the effect of marker p, and ε_{i} is the residual. The indicator variables are fixed to 1 except in wBSR. The residual, ε_{i}, is assumed to follow a normal distribution with mean = 0 and variance =
1/τ02
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
1/\tau _0^2
\]
\end{document}
. α_{j} is assumed to be proportional to a constant value, i.e., assumed to follow a non-informative prior distribution. The prior distribution of β_{p} differs among the regression methods (Table 1). These prior distributions were proposed to select important markers (i.e., markers strongly associated with phenotypes) efficiently from given markers. Variational Bayesian algorithms for these regression methods which are implemented by VIGoR are illustrated in the pdf manual of VIGoR [24]. All the regression methods require the user specifying hyperparameters to determine the shapes of prior distributions. In Table 2, hyperparameters to be specified by the user are presented for each regression method. We briefly describe how to specify these hyperparameter values in the next section. In Table 3, we present the references of the regression methods and variational Bayesian inference.
Prior distributions of the marker effects and indicator variables.^{a}
Influential hyperparameters determined by hyperpara
BL
φ, ω
φ (1.0)
ω
EBL
φ, ω, ψ, θ
φ (0.1), ω (0.1), ψ (1.0)
θ
wBSR
v, S^{2}, κ
v (5.0)
S^{2}
BayesB
v, S^{2}, κ
v (5.0)
S^{2}
BayesC
v, S^{2}, κ
v (5.0)
S^{2}
SSVS
c, v, S^{2}, κ
v (5.0)
c, S^{2}
MIX
c, v, S^{2}, κ
v (5.0)
c, S^{2}
^{a}For each regression model, the hyperparameters in this table correspond to those listed in Table 1. Among these hyperparameters, κ of wBSR, BayesB, BayesC, SSVS, and MIX is determined by the user. The other hyperparameters are set as default or can be determined by the function hyperpara.
References of the regression methods and variational Bayesian algorithms.
Regression methods
Variational Bayesian algorithm
BL
[14]
[22]
EBL
[15]
[22]
wBSR
[16]
[23]
BayesB
[1]
[24]
BayesC
[8]
[25]
SSVS
[17]
[24]
MIX
[18]
[24]
Implementation and architecture
Both the CLP and R packages consist of two programs/functions, vigor and hyperpara (Fig. 1). Vigor conducts genome-wide regression analyses, and has three main functions: fitting regression models to data (Model fitting), fitting models after tuning hyperparameter values using CV (Model fitting after hyperparameter tuning), and evaluating predictive ability of regression methods by CV (Cross-validation). Using Model fitting, users can peform variable selection (association mapping) by fitting genome-wide regression models to data and estimating marker effects. This is the default function. Using Model fitting after hyperparameter tuning, users can estimate marker effects with the hyperparameters tuned automatically using CV. Using Cross-validation, users can evaluate the predictive ability of regression models using CV.
Overview of analysis by VIGoR. VIGoR consists of two programs/functions, vigor and hyperpara. Vigor conducts genome-wide regression analysis and has three main functions; Model fitting, Model fitting after hyperparameter tuning, and Cross-validation. The former two functions output the estimates of the marker and covariate effects, and the fitted values. The last function outputs the predicted values obtained by cross-validation. Vigor requires three kinds of input information, phenotypic values (response variable), marker genotypes (predictor variable), and hyperparameter values. Hyperparameter values can be determined by the user or by using hyperpara. Hyperpara calculates the values of hyperparameters that influence on inference, based on the assumptions of the genetic architecture and values of hyperparameters that influence less.
Vigor requires the phenotypic values (response variables), and the marker genotypes (predictor variables) as mandatory input information (Fig. 1). In addition, all the regression methods implemented by vigor require hyperparameter values that users should specify (Fig.1 and Table 2). To make specification feasible, we provide the other program/function, hyperpara, which calculates the values of hyperparameters that influence on inference, based on the values of hyperparameters that influence less and several assumptions of the genetic architecture (Table 2). Because the values of less influential hyperparameters are determined by default, users only input the assumptions of the genetic architecture to hyperpara. The required assumptions are (1) proportion of phenotypic variance (variance of response variable) that can be explained by the markers (predictor variables) (referred to as Mvar), and (2) proportion of markers with non-zero effects (referred to as κ). For example, when κ and Mvar are 0.01 and 0.5, respectively, this setting corresponds to an assumption that a half of phenotypic variance is explained by 1 % of markers. Based on this assumption, hyperpara calculates the values of influential hyperparameters. Explanations of the calculation of hyperparameter values are provided in the pdf manual of VIGoR [24].
The CLPs, vigor and hyperpara, were written with C, and are distributed as standalone pre-compiled programs. Source codes are also available at https://github.com/Onogi/VIGoR. The programs can be built, for example, by typing gcc vigor.c -o vigor (Mac) or gcc vigor.c -o vigor -lm (Linux). The default function of vigor is Model fitting. Model fitting after hyperparameter tuning and Cross-validation can be conducted by adding options -t and -c, respectively (Fig. 2). The phenotypic values (response variables), and the marker genotypes (predictor variables) are provided as text files.
Examples of the usage of vigor and hyperpara. “sample.pheno.txt” and “PhenoHeight” are the example file and object that contain the phenotypic values (response variables), and “sample.geno.txt” and “Geno” are the example file and object that contain the marker genotypes (predictor variables). These files and objects are included in the comĀ¬mand line program (CLP) and R packages, respectively. The regression methods are specified by their abbreviations (e.g., BL, BayesB, and wBSR). The argument(s) immediately after the regression methods are the hyperparameter values. Hyperparameters should be ordered as in the second column of Table 2. In the example of Model fitting, 1 and 0.1 are the values of φ and ω of Bayesian lasso, respectively. In the example of Model fitting after hyperparameter tuning, two hyperparameter value sets, [v = 5, S2 = 1, κ=0.01] and [v = 5, S2 = 1, κ = 0.1], are provided using the -v option (CLP) and as a matrix (R function). The better set is chosen using cross-validation (CV), and model fitting is performed automatically with the chosen set. The -t option (CLP) and the argument “tuning” (R function) indicate this procedure. In the example of Cross-validation, a five-fold CV is performed. The -c option (CLP) and “cv” (R function) indicate CV, and the argument immediately after this option/argument (here 5) is the fold number. In the example of hyperpara, the second (0.5) and fourth (0.01) arguments are the values of Mvar and κ, respectively (see the main text for the explanations of Mvar and κ).
The R function vigor calls a C function from C library which is included in the package. Thus, calculation speed of the R function is almost equivalent to that of the CLP vigor. Hyperpara was developed with R. Both vigor and hyperpara have no dependency to other R packages except for those included in system library. The usages of the R functions are similar to those of the CLPs (Fig. 2). The phenotypic values and the marker genotypes are input to vigor as a vector and a matrix objects, respectively. The default function of vigor is Model fitting. Model fitting after hyperparameter tuning and Cross-validation can be executed by adding arguments ”tuning” and “cv”, respectively (Fig. 2).
Both the CLP and R packages have advantages. The advantage of the CLP package is that the CLP vigor can accepts PED files of PLINK [26] and allele dosage files of Beagle [27] as the marker genotype file. Both PLINK and Beagle are popular association mapping and genotype imputation software, respectively. Thus, the users of PLINK or Beagle will easily perform analyses of VIGoR. Meanwhile, the advantage of the R package is that it can visualize the analysis results easily, that is, Manhattan plots can be drawn with a one-row R code (see the pdf manual or R documentation). It is also easy to evaluate prediction accuracy when Cross-validation is executed, by calculating Pearson correlation coefficient between the predicted and true values (see the pdf manual or R documentation). Users can select the packages according to their analysis purposes or environments.
Quality control
The CLPs for Linux were compiled under the Linux kernel release 3.13.0-24-generic with a X86-64 machine. We have not tried the programs in other releases of Linux kernel. The CLPs for Mac were compiled under OS X ver. 10.6.8. We verified that the programs run under a recent version, ver. 10.9.5.
The R functions were made under R version 3.0.2, and the package was build using Mac (OS X ver. 10.6.8). We verified that the package can be loaded and the functions run in Windows 7/8, Mac (OS X ver. 10.6.8 and 10.9.5), and Linux (3.13.0-24-generic).
(2) AvailabilityOperating system
CLP package: Linux (kernel 3.13.0-24-generic), and Mac (OS X ver. 10.6.8 and 10.9.5).
R package: Windows 7/8, Linux (kernel 3.13.0-24-generic), and Mac (OS X ver. 10.6.8 and 10.9.5).
Programming language
C (CLP programs) and C and R (R functions).
Additional system requirements
None.
Dependencies
The R package requires installation of R (http://cran.r-project.org/).
List of contributors
Akio Onogi and Hiroyoshi Iwata
Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo.
Software locationArchive
Name
CLP package: GitHub
R package: GitHub and CRAN
Persistent identifier
CLP package: https://github.com/Onogi/VIGoR
R package: https://cran.r-project.org/web/packages/VIGoR/index.html
Licence
MIT license
Publisher
Akio Onogi
Date published
20/05/2015
Language
English.
(3) Reuse potential
Because both the CLPs and R functions run by specifying only a few arguments, these programs will be approachable for geneticists who are interested in association mapping or genomic prediction. In addition, both the CLP and R functions vigor and hyperpara can accept predictor variables other than the marker genotypes. Therefore, although we focus on genome-wide regression here, VIGoR can be applied into various problems where variable selection is required for huge data. Thus, VIGoR will have a wide reuse potential.
Competing Interests
The authors declare that they have no competing interests.
MeuwissenT HHayesB JGoddardM EPrediction of total genetic value using genome-wide dense marker mapsKarkkainenH PSillanpaaM JBack to basics for Bayesian model building in genomic selectionde los CamposGHickeyJ MPong-WongRDaetwylerHDCalusM PWhole-genome regression and prediction methods applied to plant and animal breedingXuSEstimating polygenic effects using markers of the entire genomeKarkkainenH PSillanpaaM JRobustness of Bayesian multilocus association models to cryptic relatednessHoffmanG ELogsdonB AMezeyJ GPUMA: a unified framework for penalized multiple regression analysis of GWAS datade los CamposGNayaHGianolaDCrossaJLegarraAManfrediEWeigelKCotesJ MPredicting quantitative traits with regression models for dense molecular markers and pedigreeHabierDFernandoR LKizilkayaKGarrickD JExtension of the Bayesian alphabet for genomic selectionFernandoRGarrickD JGenSel: User Manual for a Portfolio of Genomic Selection Related Analyses Animal Breeding and Genetics2008Ames, IAIowa State UniversityAvailable at http://bigs.ansci.iastate.edu/bigsguiHickeyJ MTierBAlphaBayes (Beta): Software for polygenic and whole genome analysis. User Manual2009Armidale, AustraliaUniversity of New EnglandAvailable at https://sites.google.com/site/hickeyjohn/alphabayesLegarraARicardiAFilangiOGS3: Genomic Selection, Gibbs Sampling, Gauss-Seidel (and BayesCπ and Bayesian Lasso)2010Available at http://snp.toulouse.inra.fr/~alegarra/JanssL L GBayz manual version version 2.03 Janss Biostatistics2010Leiden, The NetherlandsAvailable at http://www.bayz.biz/PerezPde Los CamposGGenome-Wide Regression and Prediction with the BGLR Statistical PackageParkTCasellaGThe Bayesian lassoMutshindaC MSillanpaaM JExtended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype predictionHayashiTIwataHEM algorithm for Bayesian estimation of genomic breeding valuesGeorgeE IMcCullochR EVariable selection via Gibbs samplingLuanTWoolliamsJ ALienSKentMSvendsenMMeuwissenT HThe accuracy of Genomic Selection in Norwegian red cattle assessed by cross-validationOnogiAIdetaOInoshitaYEbanaKYoshiokaTYamasakiMIwataHExploring the areas of applicability of whole-genome prediction methods for Asian rice (Oryza sativa L.)YamamotoEMatsunagaHOnogiAKajiya-KanegaeHMinamikawaMSuzukiAShirasawaKHirakawaHNunomeTYamaguchiHMiyatakeAOhyamaKIwataHFukuokaHA simulation-based breeding design that uses whole-genome prediction in tomatoOnogiAWatanabeMMochizukiTHayashiTNakagawaHHasegawaTIwataHToward integration of genomic selection with crop modelling: the development of an integrated approach to predicting rice heading datesLiZSillanpaaM JEstimation of quantitative trait locus effects with epistasis by variational Bayes algorithmsHayashiTIwataHA Bayesian method and its variational approximation for prediction of genomic breeding values in multiple traitsOnogiAIwataHDocuments for VIGoR2015MayAvailable at https://github.com/Onogi/VIGoRCarbonettoPStephensMScalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studiesPurcellSNealeBTodd-BrownKThomasLFerreiraM A RBenderDMallerJSklarPde BakkerP I WDalyM JShamP CPLINK: a toolset for whole-genome association and population-based linkage analysisBrowningS RBrowningB LRapid and accurate haplotype phasing and missing data inference for whole genome association studies by use of localized haplotype clustering