Patterns of linkage disequilibrium (LD) across the genome result from many contributing factors, including selection and genetic drift. Natural selection can increase LD near individually selected loci, or it can influence LD between epistatically selected groups of loci. Statistics that compare levels of linkage disequilibrium in subpopulations relative to the total population have previously been derived. These statistics can be used to identify loci that may be under selection or epistatic selection. This is a powerful approach, but to date no framework exists to support its use on a genome-wide scale. We present

Pronounced signatures are left in the genomes of species undergoing selection. These telltale signals may reveal selected loci and details regarding the selection pressures that have been applied [^{’2}_{IS}

Several studies have been conducted that utilize Ohta’s D statistics to test for epistatic selection [

Ohta’s D statistics are computed in a pairwise fashion between markers, so evaluating even a relatively small marker set of a few hundred or thousands of SNPs requires an efficient implementation. Therefore, we have developed

Ohta’s D statistics are a set of five statistics, termed D^{2}_{it}, D^{2}_{is}, D^{2}_{st}, D’^{2}_{is}, and D’^{2}_{st}. The specific forms of these statistics have been covered in depth by Ohta [

D^{2}_{it} is the correlation of two alleles occurring on the same gamete in a subpopulation compared to the expectation of them occurring together in the total population

D^{2}_{is} is the expected variance of LD for subpopulations

D^{2}_{st} is the correlation of alleles in a subpopulation relative to their expected correlation in the total population

D’^{2}_{is} is the correlation of the appearance of two alleles on the same gamete in a subpopulation relative to that of the total population

D’^{2}_{st} is the variance of LD in the total population

Consider a comparison between two loci, A and B. Here, x_{i,k} and y_{j,k} are the frequencies of the i^{th} and j^{th} alleles at loci A and B in the k^{th} subpopulation, g_{ij,k} is the frequency of gametes A_{i}B_{j} in the k^{th} subpopulation. Averages of these values are denoted with bars. These statistics may be calculated as follows:
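Using the notation above, the five statistics are commonly written in the following form, after Ohta. This is a restatement from the general literature, not taken from the package source. Here E denotes the average over subpopulations k (an unweighted average is an assumption of this sketch), and the package's exact estimators, including any weighting of subpopulations, may differ.

```latex
\begin{aligned}
D^{2}_{it}  &= E\Big\{ \sum_{i}\sum_{j} \big( g_{ij,k} - \bar{x}_{i}\,\bar{y}_{j} \big)^{2} \Big\} \\
D^{2}_{is}  &= E\Big\{ \sum_{i}\sum_{j} \big( g_{ij,k} - x_{i,k}\,y_{j,k} \big)^{2} \Big\} \\
D^{2}_{st}  &= E\Big\{ \sum_{i}\sum_{j} \big( x_{i,k}\,y_{j,k} - \bar{x}_{i}\,\bar{y}_{j} \big)^{2} \Big\} \\
D'^{2}_{is} &= E\Big\{ \sum_{i}\sum_{j} \big( g_{ij,k} - \bar{g}_{ij} \big)^{2} \Big\} \\
D'^{2}_{st} &= \sum_{i}\sum_{j} \big( \bar{g}_{ij} - \bar{x}_{i}\,\bar{y}_{j} \big)^{2}
\end{aligned}
```

With these forms, D^{2}_{it} = D’^{2}_{is} + D’^{2}_{st} holds exactly, because the cross term between (g_{ij,k} - \bar{g}_{ij}) and (\bar{g}_{ij} - \bar{x}_{i}\bar{y}_{j}) averages to zero over subpopulations.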

The ohtadstats package includes five functions:

The first of these functions,

The

It is important to note that the

The ^{2}, where n is the number of genetic markers represented in the dataset. This means that the number of pairwise comparisons to be made scales quadratically with the number of markers being evaluated. This is not a problem for small datasets. Indeed, we successfully executed the
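To make the scaling concrete: n markers give n(n-1)/2 unordered pairs, which grows quadratically in n. The following Python sketch is purely illustrative and is not part of the R package. The function names `pair_count` and `pair_batches` are hypothetical. It counts the comparisons and shows one simple way to split them into batches that could be handed to separate workers.

```python
from itertools import combinations, islice


def pair_count(n_markers: int) -> int:
    """Number of unordered marker pairs: n choose 2 = n(n-1)/2."""
    return n_markers * (n_markers - 1) // 2


def pair_batches(n_markers: int, batch_size: int):
    """Yield lists of (i, j) index pairs with i < j, at most batch_size each.

    Pairs are generated lazily, so the full list of comparisons is never
    held in memory at once.
    """
    pairs = combinations(range(n_markers), 2)
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


# Growth of the workload as the marker set gets larger.
for n in (100, 1_000, 10_000):
    print(n, "markers ->", pair_count(n), "pairwise comparisons")
```

Because each pair can be evaluated independently of every other pair, batches like these can be processed in any order, which is what makes the problem well suited to parallel execution.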

Given a single matrix of genotypes,

The

The

The

Lastly, the

To ensure that this package accurately calculates Ohta’s D statistics, we simulated a small dataset containing 18 individuals across three subpopulations and three loci. We evaluated this dataset using an implementation of LinkDOS [
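An independent check of this kind can also be sketched in a few lines of Python. The example below is not the package's code and not LinkDOS. It assumes that gamete (haplotype) frequencies are already known for each subpopulation and that subpopulations are weighted equally, and it uses the commonly cited forms of Ohta's statistics. The function name `ohta_d` and the toy frequencies are hypothetical.

```python
def ohta_d(gametes):
    """Compute Ohta's five D statistics for one pair of loci.

    gametes[k][i][j] is the frequency of gamete A_i B_j in subpopulation k
    (each subpopulation's table sums to 1). Subpopulations are weighted
    equally, which is an assumption of this sketch.
    """
    K, I, J = len(gametes), len(gametes[0]), len(gametes[0][0])

    # Allele frequencies are the marginals of each gamete table.
    x = [[sum(g[i]) for i in range(I)] for g in gametes]
    y = [[sum(g[i][j] for i in range(I)) for j in range(J)] for g in gametes]

    # Averages over subpopulations (the "barred" quantities).
    x_bar = [sum(x[k][i] for k in range(K)) / K for i in range(I)]
    y_bar = [sum(y[k][j] for k in range(K)) / K for j in range(J)]
    g_bar = [[sum(gametes[k][i][j] for k in range(K)) / K for j in range(J)]
             for i in range(I)]

    def mean_sq(term):
        # Average over subpopulations of the sum over i, j of term squared.
        return sum(term(k, i, j) ** 2
                   for k in range(K) for i in range(I) for j in range(J)) / K

    return {
        "D2_is": mean_sq(lambda k, i, j: gametes[k][i][j] - x[k][i] * y[k][j]),
        "D2_st": mean_sq(lambda k, i, j: x[k][i] * y[k][j] - x_bar[i] * y_bar[j]),
        "D2_it": mean_sq(lambda k, i, j: gametes[k][i][j] - x_bar[i] * y_bar[j]),
        "Dp2_is": mean_sq(lambda k, i, j: gametes[k][i][j] - g_bar[i][j]),
        "Dp2_st": sum((g_bar[i][j] - x_bar[i] * y_bar[j]) ** 2
                      for i in range(I) for j in range(J)),
    }


# Toy data: three subpopulations, two biallelic loci (values are made up).
subpops = [
    [[0.50, 0.00], [0.00, 0.50]],  # complete LD in one direction
    [[0.25, 0.25], [0.25, 0.25]],  # linkage equilibrium
    [[0.00, 0.50], [0.50, 0.00]],  # complete LD in the other direction
]
stats = ohta_d(subpops)
for name, value in stats.items():
    print(name, round(value, 6))
```

A useful sanity check for any implementation is that D2_it equals Dp2_is + Dp2_st up to floating-point error, since this identity follows directly from the definitions.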

The

Windows: Windows 7

MacOS: MacOS 10.9 (Mavericks)

Ubuntu: 14.04 (Trusty)

R

R requires that 150 MB of disk space be available for installation.

The package depends on the “lattice” and “grDevices” R packages and requires R version 3.0.0 or later.

Paul F. Petrowski, Timothy M. Beissinger, Elizabeth G. King

English

Ohta’s D statistics are useful quantities for assessing linkage disequilibrium in genomic datasets. As such, this package may be useful to anyone looking to quantify linkage disequilibrium in their system of study, including researchers working in population, quantitative, or evolutionary genetics. A typical use case may involve looking across a number of subpopulations of a species in an effort to detect evidence of selection. Other methods of using LD as a measurement of selection have been previously described, including the integrated haplotype score (iHS) [

We would like to thank Jake Gotberg from the Mizzou Research Computing Support Service for contributing his time and expertise in setting up an efficient parallelization workflow. Computation for this work was performed on the high performance computing infrastructure provided by Research Computing Support Services and in part by the National Science Foundation under grant number CNS-1429294 at the University of Missouri, Columbia MO.

The authors have no competing interests to declare.