(1) Overview

Introduction

Linear regression methods where a number of genome-wide markers are used as predictors are currently used in genomic prediction [1, 2, 3] and genome-wide association mapping [4, 5, 6]. Bayesian regression methods have attracted particular attention and a number of variations have been proposed [2], which typically use Markov chain Monte Carlo (MCMC) algorithms for parameter inference (e.g., [1, 7, 8]). Consequently, the software currently available for Bayesian regression (e.g., GenSel [9], AlphaBayes [10], GS3 [11], BayeZ [12], and BGLR [13]) is mainly based on MCMC. However, because of the computational burden associated with MCMC, analyzing huge datasets, such as those consisting of hundreds of thousands of markers, within realistic time scales is often unfeasible. Moreover, intensive cross-validation (CV) for evaluating predictive ability or for tuning hyperparameters is also difficult, even in moderate-sized datasets. These shortcomings of MCMC-based methods have hampered the widespread application of Bayesian methods to genome-wide association mapping and genomic prediction. To tackle the shortcomings of the MCMC-based software, we developed novel software for whole-genome regression in a Bayesian framework, which we named VIGoR (variational Bayesian inference for genome-wide regression). VIGoR is based on variational Bayesian inference (VB), which is computationally much faster than MCMC, and can implement seven regression methods: Bayesian lasso (BL) [14], extended Bayesian lasso (EBL) [15], weighted Bayesian shrinkage regression (wBSR) [16], BayesB [1], BayesC [8], stochastic search variable selection (SSVS) [17], and Bayesian mixture regression (MIX)[18]. BL, EBL, and wBSR implemented by VIGoR were used for genomic prediction of rice agronomic traits [19], and genomic prediction and association mapping in tomato [20]. EBL was also used for genomic prediction of rice heading date [21]. The command line program (CLP) package for the Linux/Mac platform is available at https://github.com/Onogi/VIGoR. A pdf manual is also available at the URL [24]. The R package is cross-platform and is available at https://github.com/Onogi/VIGoR and from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/VIGoR/index.html. The pdf manual contains the explanations of both the CLPs and the R functions.

Regression methods and algorithms

VIGoR assumes the following linear regression model:

yi=j=1Fzijαj+p=1Pγpxipβp+εi

where yi is the phenotypic value (response variable) of individual i, F is the number of covariates other than markers, zij is the covariate corresponding to the effect αj, P is the number of markers, γp is the indicator variable that takes 0 or 1, xip is the genotype of marker p, βp is the effect of marker p, and εi is the residual. The indicator variables are fixed to 1 except in wBSR. The residual, εi, is assumed to follow a normal distribution with mean = 0 and variance = 1/τ02 . αj is assumed to be proportional to a constant value, i.e., assumed to follow a non-informative prior distribution. The prior distribution of βp differs among the regression methods (Table 1). These prior distributions were proposed to select important markers (i.e., markers strongly associated with phenotypes) efficiently from given markers. Variational Bayesian algorithms for these regression methods which are implemented by VIGoR are illustrated in the pdf manual of VIGoR [24]. All the regression methods require the user specifying hyperparameters to determine the shapes of prior distributions. In Table 2, hyperparameters to be specified by the user are presented for each regression method. We briefly describe how to specify these hyperparameter values in the next section. In Table 3, we present the references of the regression methods and variational Bayesian inference.

Hierarchical level

1st 2nd 3rd

BL
βpN(0,1τ02τp2)
τp2InvG(1,λ22)
λ2G(φ,ϖ)
EBL
βpN(0,1τ02τp2)
τp2InvG(1,δ2ηp22)
δ2G(φ,ϖ)ηp2G(ψ,θ)
wBSR
βpN(0,σp2)γpBernoulli(κ)
σp2χ2(ν,S2)
BayesB
βpN(0,σp2) if ρp=1βp= 0 if ρp=0
σp2χ2(ν,S2)ρpBernoulli(κ)
BayesC
βpN(0,σ2) if ρp=1βp=0 if ρp=0
σ2χ2(ν,S2)ρpBernoulli(κ)
SSVS
βpN(0,σ2) if ρp=1βpN(0,cσ2) if ρp=0
σ2χ2(ν,S2)ρpBernoulli(κ)
MIX
βpN(0,σA2) if ρp=1βpN(0,σB2) if ρp=0
σA2χ2(ν,S2)σB2χ2(ν,cS2)ρpBernoulli(κ)

Table 1

Prior distributions of the marker effects and indicator variables.a

aPrior distributions of the marker effects have three-level hierarchical structures for BL and EBL, and two-level for the other methods.

BL, Bayesian lasso; EBL, extended Bayesian lasso; wBSR, weighted Bayesian shrinkage regression; SSVS, stochastic search variable selection; MIX, Bayesian mixture regression; N, normal distribution; Inv-G, inverse-gamma distribution; G, gamma distribution; Bernoulli, Bernoulli distribution; χ-2, scaled inverse-chi-square distribution.

Regression Hyperparametersa Less influential hyperparameters (default values) Influential hyperparameters determined by hyperpara

BL φ, ω φ (1.0) ω
EBL φ, ω, ψ, θ φ (0.1), ω (0.1), ψ (1.0) θ
wBSR v, S2, κ v (5.0) S2
BayesB v, S2, κ v (5.0) S2
BayesC v, S2, κ v (5.0) S2
SSVS c, v, S2, κ v (5.0) c, S2
MIX c, v, S2, κ v (5.0) c, S2

Table 2

Hyperparameters required by vigor.

aFor each regression model, the hyperparameters in this table correspond to those listed in Table 1. Among these hyperparameters, κ of wBSR, BayesB, BayesC, SSVS, and MIX is determined by the user. The other hyperparameters are set as default or can be determined by the function hyperpara.

BL, Bayesian lasso; EBL, extended Bayesian lasso; wBSR, weighted Bayesian shrinkage regression; SSVS, stochastic search variable selection; MIX, Bayesian mixture regression.

Regression methods Variational Bayesian algorithm

BL [14] [22]
EBL [15] [22]
wBSR [16] [23]
BayesB [1] [24]
BayesC [8] [25]
SSVS [17] [24]
MIX [18] [24]

Table 3

References of the regression methods and variational Bayesian algorithms.

Implementation and architecture

Both the CLP and R packages consist of two programs/functions, vigor and hyperpara (Fig. 1). Vigor conducts genome-wide regression analyses, and has three main functions: fitting regression models to data (Model fitting), fitting models after tuning hyperparameter values using CV (Model fitting after hyperparameter tuning), and evaluating predictive ability of regression methods by CV (Cross-validation). Using Model fitting, users can peform variable selection (association mapping) by fitting genome-wide regression models to data and estimating marker effects. This is the default function. Using Model fitting after hyperparameter tuning, users can estimate marker effects with the hyperparameters tuned automatically using CV. Using Cross-validation, users can evaluate the predictive ability of regression models using CV.

Figure 1 

Overview of analysis by VIGoR. VIGoR consists of two programs/functions, vigor and hyperpara. Vigor conducts genome-wide regression analysis and has three main functions; Model fitting, Model fitting after hyperparameter tuning, and Cross-validation. The former two functions output the estimates of the marker and covariate effects, and the fitted values. The last function outputs the predicted values obtained by cross-validation. Vigor requires three kinds of input information, phenotypic values (response variable), marker genotypes (predictor variable), and hyperparameter values. Hyperparameter values can be determined by the user or by using hyperpara. Hyperpara calculates the values of hyperparameters that influence on inference, based on the assumptions of the genetic architecture and values of hyperparameters that influence less.

Vigor requires the phenotypic values (response variables), and the marker genotypes (predictor variables) as mandatory input information (Fig. 1). In addition, all the regression methods implemented by vigor require hyperparameter values that users should specify (Fig.1 and Table 2). To make specification feasible, we provide the other program/function, hyperpara, which calculates the values of hyperparameters that influence on inference, based on the values of hyperparameters that influence less and several assumptions of the genetic architecture (Table 2). Because the values of less influential hyperparameters are determined by default, users only input the assumptions of the genetic architecture to hyperpara. The required assumptions are (1) proportion of phenotypic variance (variance of response variable) that can be explained by the markers (predictor variables) (referred to as Mvar), and (2) proportion of markers with non-zero effects (referred to as κ). For example, when κ and Mvar are 0.01 and 0.5, respectively, this setting corresponds to an assumption that a half of phenotypic variance is explained by 1 % of markers. Based on this assumption, hyperpara calculates the values of influential hyperparameters. Explanations of the calculation of hyperparameter values are provided in the pdf manual of VIGoR [24].

The CLPs, vigor and hyperpara, were written with C, and are distributed as standalone pre-compiled programs. Source codes are also available at https://github.com/Onogi/VIGoR. The programs can be built, for example, by typing gcc vigor.c -o vigor (Mac) or gcc vigor.c -o vigor -lm (Linux). The default function of vigor is Model fitting. Model fitting after hyperparameter tuning and Cross-validation can be conducted by adding options -t and -c, respectively (Fig. 2). The phenotypic values (response variables), and the marker genotypes (predictor variables) are provided as text files.

Figure 2 

Examples of the usage of vigor and hyperpara. “sample.pheno.txt” and “PhenoHeight” are the example file and object that contain the phenotypic values (response variables), and “sample.geno.txt” and “Geno” are the example file and object that contain the marker genotypes (predictor variables). These files and objects are included in the com¬mand line program (CLP) and R packages, respectively. The regression methods are specified by their abbreviations (e.g., BL, BayesB, and wBSR). The argument(s) immediately after the regression methods are the hyperparameter values. Hyperparameters should be ordered as in the second column of Table 2. In the example of Model fitting, 1 and 0.1 are the values of φ and ω of Bayesian lasso, respectively. In the example of Model fitting after hyperparameter tuning, two hyperparameter value sets, [v = 5, S2 = 1, κ=0.01] and [v = 5, S2 = 1, κ = 0.1], are provided using the -v option (CLP) and as a matrix (R function). The better set is chosen using cross-validation (CV), and model fitting is performed automatically with the chosen set. The -t option (CLP) and the argument “tuning” (R function) indicate this procedure. In the example of Cross-validation, a five-fold CV is performed. The -c option (CLP) and “cv” (R function) indicate CV, and the argument immediately after this option/argument (here 5) is the fold number. In the example of hyperpara, the second (0.5) and fourth (0.01) arguments are the values of Mvar and κ, respectively (see the main text for the explanations of Mvar and κ).

The R function vigor calls a C function from C library which is included in the package. Thus, calculation speed of the R function is almost equivalent to that of the CLP vigor. Hyperpara was developed with R. Both vigor and hyperpara have no dependency to other R packages except for those included in system library. The usages of the R functions are similar to those of the CLPs (Fig. 2). The phenotypic values and the marker genotypes are input to vigor as a vector and a matrix objects, respectively. The default function of vigor is Model fitting. Model fitting after hyperparameter tuning and Cross-validation can be executed by adding arguments ”tuning” and “cv”, respectively (Fig. 2).

Both the CLP and R packages have advantages. The advantage of the CLP package is that the CLP vigor can accepts PED files of PLINK [26] and allele dosage files of Beagle [27] as the marker genotype file. Both PLINK and Beagle are popular association mapping and genotype imputation software, respectively. Thus, the users of PLINK or Beagle will easily perform analyses of VIGoR. Meanwhile, the advantage of the R package is that it can visualize the analysis results easily, that is, Manhattan plots can be drawn with a one-row R code (see the pdf manual or R documentation). It is also easy to evaluate prediction accuracy when Cross-validation is executed, by calculating Pearson correlation coefficient between the predicted and true values (see the pdf manual or R documentation). Users can select the packages according to their analysis purposes or environments.

Quality control

The CLPs for Linux were compiled under the Linux kernel release 3.13.0-24-generic with a X86-64 machine. We have not tried the programs in other releases of Linux kernel. The CLPs for Mac were compiled under OS X ver. 10.6.8. We verified that the programs run under a recent version, ver. 10.9.5.

The R functions were made under R version 3.0.2, and the package was build using Mac (OS X ver. 10.6.8). We verified that the package can be loaded and the functions run in Windows 7/8, Mac (OS X ver. 10.6.8 and 10.9.5), and Linux (3.13.0-24-generic).

(2) Availability

Operating system

CLP package: Linux (kernel 3.13.0-24-generic), and Mac (OS X ver. 10.6.8 and 10.9.5).

R package: Windows 7/8, Linux (kernel 3.13.0-24-generic), and Mac (OS X ver. 10.6.8 and 10.9.5).

Programming language

C (CLP programs) and C and R (R functions).

Additional system requirements

None.

Dependencies

The R package requires installation of R (http://cran.r-project.org/).

List of contributors

Akio Onogi and Hiroyoshi Iwata

Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo.

Software location

Archive

Name

CLP package: GitHub

R package: GitHub and CRAN

Persistent identifier

CLP package: https://github.com/Onogi/VIGoR

R package: https://cran.r-project.org/web/packages/VIGoR/index.html

Licence

MIT license

Publisher

Akio Onogi

Date published

20/05/2015

Language

English.

(3) Reuse potential

Because both the CLPs and R functions run by specifying only a few arguments, these programs will be approachable for geneticists who are interested in association mapping or genomic prediction. In addition, both the CLP and R functions vigor and hyperpara can accept predictor variables other than the marker genotypes. Therefore, although we focus on genome-wide regression here, VIGoR can be applied into various problems where variable selection is required for huge data. Thus, VIGoR will have a wide reuse potential.

Competing Interests

The authors declare that they have no competing interests.