An R Framework for the Partitioning of Linkage Disequilibrium between and Within Populations

Patterns of linkage disequilibrium (LD) across the genome result from a myriad of contributing factors including selection and genetic drift. Natural selection can increase LD near individually selected loci, or it can influence LD between epistatically selected groups of loci. Statistics have previously been derived which compare levels of linkage disequilibrium in subpopulations relative to the total population. These statistics may be leveraged to identify loci that may be under selection or epistatic selection. This is a powerful approach, but to date no framework exists to support its use on a genome-wide scale. We present ohtadstats, an R package designed to facilitate the implementation of Ohta’s D statistics in a variety of use cases. Statistics calculated by this package can be used to determine whether a locus is under selection or not, and can provide insight into the nature of the selection that is taking place (hard sweep or epistatic selection). This package is available on the Comprehensive R Archive Network (CRAN).


Introduction
Pronounced signatures are left in the genomes of species undergoing selection. These telltale signals may reveal selected loci and details regarding the selection pressures that have been applied [1]. Notably, selection modifies the correlation of alleles at sites near the selected locus. This correlation between alleles is known as linkage disequilibrium (LD). As a rule of thumb, selection reduces genetic variability and increases LD [2]. Tomoka Ohta [3] derived a series of statistics (from here on referred to as Ohta's D statistics) designed to partition LD into between and within population components in a manner analogous to Sewall Wright's F statistics, a measure of inbreeding [4]. She posited that for a pair of loci, deviations from expected levels of LD in a given subpopulation may be indicative of an epistatic selection event. More recently, Beissinger et al. (2016) demonstrated that Ohta's D Statistics, particularly the D '2 IS statistic, can identify traditional hard selective sweeps [2] in addition to epistatic selection, and they developed a null-distribution that enables the genomewide identification of selection candidates.
Several studies have been conducted that utilize Ohta's D statistics to test for epistatic selection [5][6][7], which is selection acting on a favorable combination of loci, rather than a single locus independently. However, software suites for measuring Ohta's D statistics are limited. Programs that evaluate the statistics on a locus-by-locus basis exist [8][9][10], but there is currently no framework available to facilitate the implementation of Ohta's D statistics on a genomewide basis. Furthermore, the web-based platform supplied by [9] and its predecessor [8] have two significantly limiting attributes that are resolved by our implementation. First, they require that input data be limited to no more than 80 SNPs in 30 or fewer subpopulations. Second, they treat sample size as population size, an approach that will only occasionally reflect reality.
Ohta's D statistics are computed in a pairwise fashion between markers, so evaluating even a relatively small marker set of a few hundred or thousands of SNPs requires an efficient implementation. Therefore, we have developed ohtadstats, a freely available R package with convenient, flexible, and powerful tools to perform the computation of Ohta's D statistics in a variety of use cases. By leveraging the R statistical software platform [11], ohtadstats is fast, scalable, and most importantly adaptable to an endless array of system architectures and high-throughput computing systems. Here, we describe the capabilities of the ohtadstats package and demonstrate its applicability to datasets both small scale and genome wide. . The specific forms of these statistics have been covered in depth by Ohta [3] so we will not go in depth here. Briefly, from Beissinger et al. [6]: it is the correlation of two alleles occurring on the same gamete in a subpopulation compared to the expectation of them occurring together in the total population • D 2 is is the expected variance of LD for subpopulations • D 2 st is the correlation of alleles in a subpopulation relative to their expected correlation in the total population • D' 2 is is the correlation of the appearance of two alleles on the same gamete in a subpopulation relative to that of the total population • D' 2 st is the variance of LD in the total population Consider a comparison between two loci, A and B. Here, x i,k and y j,k are the frequencies of the i th and j th alleles at loci A and B in the k th subpopulation, g ij,k is the frequency of gametes A i B j in the k th subpopulation. Averages of these values are denoted with bars. These statistics may be calculated as follows:

Implementation and architecture
The ohtadtats package includes five functions: dstat, dwrapper, dheatmap, dparallel, and dfilter. Detailed descriptions and example code can be found at https://github.com/pfpetrowski/OhtaDStats, as well as in the documentation of the R package.
The first of these functions, dstat, is the workhorse of the package. This function computes each of Ohta's D statistics for a given pair of loci, and returns these results in a vector. The dstat function also returns the number of subpopulations included in the analysis. This number may be less than the total number of subpopulations as a result of filtering. To avoid spurious associations between alleles that are not truly in LD, dstat has an initial allele frequency filtering step designed to remove loci that fall below a specified minor allele frequency threshold. During the filtering step, dstat also removes any subpopulations which have a minor allele frequency below a given threshold. This feature prevents subpopulations which are fixed or nearly fixed at a given locus from appearing to be under selection when in reality the effect is due to small sample size. Both of these minor allele frequency thresholds are modifiable arguments with can be changed by the user on demand. The dstat function returns a vector containing the number of populations included in the calculation and each of Ohta's five D statistics for the specified pair of markers. See Figure 1 for an illustration of the dstat function.
It is important to note that the dstat function returns raw D statistic values. To assess statistical significance, dstat can be used to generate an empirical null distribution, as was used in [6]. Null distributions will vary by organism and model system, but the tools provided in ohtadstats can be used for their creation. This is achieved by implementing the function on a large number of pairs of physically unlinked SNPs.
The dwrapper function computes Ohta's D statistics for all possible pairs of loci in a matrix of genotypes (Figure 2). The result is returned as a list of matrices, with one matrix for each of Ohta's D statistics along with a matrix specifying the number of populations used for each comparison. This output format simplifies the process of looking up a D statistic for a specific pair of loci. It is important to note that because dwrapper evaluates all pairwise combinations of loci, it scales on the order of n 2 , where n is the number of genetic markers represented in the dataset. This means that the number of pairwise comparisons to be made scales exponentially with the number of markers being evaluated. This is not a problem for small datasets. Indeed, we successfully executed the dwrapper function on a dataset that included 100 SNPs from 1417 individuals using a 1.3GHz, 8GB RAM MacBook Air in only two minutes. This dataset is a subset of the data used by Beissinger et al. (2016) [6] and is included in

Matrix of Genotypes
Two loci are specified the package as beissinger_data.rda. For large datasets (ie thousands of markers), we suggest parallelization via the dparallel function.
The dheatmap function provides a convenient tool for data visualization. This function, which is based on the levelplot function from the lattice R package [12] takes any of the matrix outputs provided by the dwrapper function and returns a colorized heat map, making patterns of LD visible. Options are provided which allow the user to modify the colors used and how those colors are scaled. The dheatmap function provides three modes for the scaling of the colors. The first mode, "linear", is appropriate for most use cases. In this mode, the values in the matrix are distributed continuously across the color spectrum. The "truncated" mode is provided for use on the ratio matrices, where division by small numbers may cause certain values to be many orders of magnitude greater than others. In these cases, using "linear" is not ideal because large values will drive mid-magnitude values towards the extreme low end of the color spectrum. The "truncated" mode corrects for this issue by changing values greater than one to a value just higher than one. Colors are then scaled across this new spectrum of values. The final mode, "binned", operates similarly to "truncated", except that colors are not scaled across the new spectrum of values. Instead, values are placed into one of five bins, and colored accordingly.
The dparallel function is designed to facilitate parallelization of dstat on commercial high throughput computing platforms. By pairing a simple R script with a scheduler such as slurm [13], massive datasets on the scale commonly seen today can be analyzed in a reasonable amount of time. This function works by generating a virtual table of locus pairs to compare and executing dstat on each. Using this method, a number of equally sized jobs limited only by time and computational resources can be performed. The dparallel function is not meant to be used directly in the R environment. Instead it is designed to be set up in an R script, multiple instances of which are then spawned using a slurm scheduler array, or equivalent (Figure 3). Example scripts of this setup are available in