Linear regression methods where a number of genome-wide markers are used as predictors are currently used in genomic prediction [1, 2, 3] and genome-wide association mapping [4, 5, 6]. Bayesian regression methods have attracted particular attention and a number of variations have been proposed , which typically use Markov chain Monte Carlo (MCMC) algorithms for parameter inference (e.g., [1, 7, 8]). Consequently, the software currently available for Bayesian regression (e.g., GenSel , AlphaBayes , GS3 , BayeZ , and BGLR ) is mainly based on MCMC. However, because of the computational burden associated with MCMC, analyzing huge datasets, such as those consisting of hundreds of thousands of markers, within realistic time scales is often unfeasible. Moreover, intensive cross-validation (CV) for evaluating predictive ability or for tuning hyperparameters is also difficult, even in moderate-sized datasets. These shortcomings of MCMC-based methods have hampered the widespread application of Bayesian methods to genome-wide association mapping and genomic prediction. To tackle the shortcomings of the MCMC-based software, we developed novel software for whole-genome regression in a Bayesian framework, which we named VIGoR (variational Bayesian inference for genome-wide regression). VIGoR is based on variational Bayesian inference (VB), which is computationally much faster than MCMC, and can implement seven regression methods: Bayesian lasso (BL) , extended Bayesian lasso (EBL) , weighted Bayesian shrinkage regression (wBSR) , BayesB , BayesC , stochastic search variable selection (SSVS) , and Bayesian mixture regression (MIX). BL, EBL, and wBSR implemented by VIGoR were used for genomic prediction of rice agronomic traits , and genomic prediction and association mapping in tomato . EBL was also used for genomic prediction of rice heading date . The command line program (CLP) package for the Linux/Mac platform is available at https://github.com/Onogi/VIGoR. A pdf manual is also available at the URL . The R package is cross-platform and is available at https://github.com/Onogi/VIGoR and from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/VIGoR/index.html. The pdf manual contains the explanations of both the CLPs and the R functions.
Regression methods and algorithms
VIGoR assumes the following linear regression model:
where yi is the phenotypic value (response variable) of individual i, F is the number of covariates other than markers, zij is the covariate corresponding to the effect αj, P is the number of markers, γp is the indicator variable that takes 0 or 1, xip is the genotype of marker p, βp is the effect of marker p, and εi is the residual. The indicator variables are fixed to 1 except in wBSR. The residual, εi, is assumed to follow a normal distribution with mean = 0 and variance = 1). These prior distributions were proposed to select important markers (i.e., markers strongly associated with phenotypes) efficiently from given markers. Variational Bayesian algorithms for these regression methods which are implemented by VIGoR are illustrated in the pdf manual of VIGoR . All the regression methods require the user specifying hyperparameters to determine the shapes of prior distributions. In Table 2, hyperparameters to be specified by the user are presented for each regression method. We briefly describe how to specify these hyperparameter values in the next section. In Table 3, we present the references of the regression methods and variational Bayesian inference.. αj is assumed to be proportional to a constant value, i.e., assumed to follow a non-informative prior distribution. The prior distribution of βp differs among the regression methods (Table
|Regression||Hyperparametersa||Less influential hyperparameters (default values)||Influential hyperparameters determined by hyperpara|
|BL||φ, ω||φ (1.0)||ω|
|EBL||φ, ω, ψ, θ||φ (0.1), ω (0.1), ψ (1.0)||θ|
|wBSR||v, S2, κ||v (5.0)||S2|
|BayesB||v, S2, κ||v (5.0)||S2|
|BayesC||v, S2, κ||v (5.0)||S2|
|SSVS||c, v, S2, κ||v (5.0)||c, S2|
|MIX||c, v, S2, κ||v (5.0)||c, S2|
Implementation and architecture
Both the CLP and R packages consist of two programs/functions, vigor and hyperpara (Fig. 1). Vigor conducts genome-wide regression analyses, and has three main functions: fitting regression models to data (Model fitting), fitting models after tuning hyperparameter values using CV (Model fitting after hyperparameter tuning), and evaluating predictive ability of regression methods by CV (Cross-validation). Using Model fitting, users can peform variable selection (association mapping) by fitting genome-wide regression models to data and estimating marker effects. This is the default function. Using Model fitting after hyperparameter tuning, users can estimate marker effects with the hyperparameters tuned automatically using CV. Using Cross-validation, users can evaluate the predictive ability of regression models using CV.
Vigor requires the phenotypic values (response variables), and the marker genotypes (predictor variables) as mandatory input information (Fig. 1). In addition, all the regression methods implemented by vigor require hyperparameter values that users should specify (Fig.1 and Table 2). To make specification feasible, we provide the other program/function, hyperpara, which calculates the values of hyperparameters that influence on inference, based on the values of hyperparameters that influence less and several assumptions of the genetic architecture (Table 2). Because the values of less influential hyperparameters are determined by default, users only input the assumptions of the genetic architecture to hyperpara. The required assumptions are (1) proportion of phenotypic variance (variance of response variable) that can be explained by the markers (predictor variables) (referred to as Mvar), and (2) proportion of markers with non-zero effects (referred to as κ). For example, when κ and Mvar are 0.01 and 0.5, respectively, this setting corresponds to an assumption that a half of phenotypic variance is explained by 1 % of markers. Based on this assumption, hyperpara calculates the values of influential hyperparameters. Explanations of the calculation of hyperparameter values are provided in the pdf manual of VIGoR .
The CLPs, vigor and hyperpara, were written with C, and are distributed as standalone pre-compiled programs. Source codes are also available at https://github.com/Onogi/VIGoR. The programs can be built, for example, by typing gcc vigor.c -o vigor (Mac) or gcc vigor.c -o vigor -lm (Linux). The default function of vigor is Model fitting. Model fitting after hyperparameter tuning and Cross-validation can be conducted by adding options -t and -c, respectively (Fig. 2). The phenotypic values (response variables), and the marker genotypes (predictor variables) are provided as text files.
The R function vigor calls a C function from C library which is included in the package. Thus, calculation speed of the R function is almost equivalent to that of the CLP vigor. Hyperpara was developed with R. Both vigor and hyperpara have no dependency to other R packages except for those included in system library. The usages of the R functions are similar to those of the CLPs (Fig. 2). The phenotypic values and the marker genotypes are input to vigor as a vector and a matrix objects, respectively. The default function of vigor is Model fitting. Model fitting after hyperparameter tuning and Cross-validation can be executed by adding arguments ”tuning” and “cv”, respectively (Fig. 2).
Both the CLP and R packages have advantages. The advantage of the CLP package is that the CLP vigor can accepts PED files of PLINK  and allele dosage files of Beagle  as the marker genotype file. Both PLINK and Beagle are popular association mapping and genotype imputation software, respectively. Thus, the users of PLINK or Beagle will easily perform analyses of VIGoR. Meanwhile, the advantage of the R package is that it can visualize the analysis results easily, that is, Manhattan plots can be drawn with a one-row R code (see the pdf manual or R documentation). It is also easy to evaluate prediction accuracy when Cross-validation is executed, by calculating Pearson correlation coefficient between the predicted and true values (see the pdf manual or R documentation). Users can select the packages according to their analysis purposes or environments.
The CLPs for Linux were compiled under the Linux kernel release 3.13.0-24-generic with a X86-64 machine. We have not tried the programs in other releases of Linux kernel. The CLPs for Mac were compiled under OS X ver. 10.6.8. We verified that the programs run under a recent version, ver. 10.9.5.
The R functions were made under R version 3.0.2, and the package was build using Mac (OS X ver. 10.6.8). We verified that the package can be loaded and the functions run in Windows 7/8, Mac (OS X ver. 10.6.8 and 10.9.5), and Linux (3.13.0-24-generic).
CLP package: Linux (kernel 3.13.0-24-generic), and Mac (OS X ver. 10.6.8 and 10.9.5).
R package: Windows 7/8, Linux (kernel 3.13.0-24-generic), and Mac (OS X ver. 10.6.8 and 10.9.5).
C (CLP programs) and C and R (R functions).
Additional system requirements
The R package requires installation of R (http://cran.r-project.org/).
List of contributors
Akio Onogi and Hiroyoshi Iwata
Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo.
CLP package: GitHub
R package: GitHub and CRAN
CLP package: https://github.com/Onogi/VIGoR
(3) Reuse potential
Because both the CLPs and R functions run by specifying only a few arguments, these programs will be approachable for geneticists who are interested in association mapping or genomic prediction. In addition, both the CLP and R functions vigor and hyperpara can accept predictor variables other than the marker genotypes. Therefore, although we focus on genome-wide regression here, VIGoR can be applied into various problems where variable selection is required for huge data. Thus, VIGoR will have a wide reuse potential.
The authors declare that they have no competing interests.