CURSAT ver. 2.1 is an open-source code in QB64 basic, compilable into an executable file, that produces n pseudoreplicates of an empirical data set. Both resamplings with and without replacement are allowed by the software. The number (n) of pseudoreplicates is set by the user. Pseudoreplicates can be exported in a file that can be opened by a spreadsheet. Thus, pseudoreplicates are permanently stored and available for the calculation of statistics of interest and associated variance. The software also uses the n pseudoreplicate data to reconstruct n accumulation matrices, appended in an output file. Accumulation has applicability in cases in which repeated sample-based data must be evaluated for exhaustiveness. Many situations involve repeated sampling from the same set of observations. For example, if data consist of species occurrence, the software can be used by a wide spectrum of specialists such as ecologists, zoologists, botanists, biogeographers, conservationists for biodiversity estimation. The software allows performing accumulation irrespectively whether the input data set contains abundance (quantitative) or incidence (binary) data. Accumulation matrices can be imported in statistical packages to estimate distributions of successive pooling of samples and depict accumulation and rarefaction curves with associated variance.

CURSAT ver. 2.1 is released in two editions. Edition #1 is recommended for analysis, whereas Edition #2 generates a log file in which the flow of internal steps of resampling and accumulation routines is reported. Edition #2 is primarily designed for educational purposes and quality check.

Many situations involve repeated sampling from the same set of observations. For example, Ridenhour and Grimmett [

When

In ecology, accumulation procedures associated with resampling methods may offer support for measuring biodiversity. In particular, assessing the number of species that occur in a certain area (species richness) is a central issue in ecology because such information participates in defining a number of diversity estimators used in describing and comparing profiles of biodiversity [

Independently whether it is based on abundance or incidence, the estimation of species richness depends on sample size, due to both sampling effects and intrinsic factors such as seasonality, species turnover, etc. As a consequence of this, repeated sampling is expected to increase the cumulative number of species observed in a certain area. Thus, the successive pooling of samples from a single location produces a species accumulation, and a pattern of the species accumulation and rarefaction curves may be described in several model-based ways [

An alternative to model-based ways to estimate accumulation and rarefaction curves is provided by resampling. Resampling with or without replacement has become one of the most widely used measures of statistical support in several disciplines, especially when the analytical estimation of the variance of statistics of interest is not possible. The second option is usually preferred over the first, because resampling with replacement (bootstrap) may suffer some underestimation respect to resampling without replacement. However, the variance among randomizations estimated via resampling without replacement approaches zero, when reaching the last accumulation level, whereas variance estimated via bootstrap does not suffer from such an effect [

Resampling produces what is defined as a pseudoreplicate data set in which columns of the original data matrix may be represented multiple times, not at all, or simply in a different order. Each pseudoreplicate can be analyzed using the same statistics as for the original data matrix. Reiteration for a number of times creates a distribution of the statistics investigated. Such distribution may serve as a basis for statistical testing.

Three major schemes of resampling are known: bootstrap, jackknife, and shuffling. Bootstrapping implies resampling with replacement, whereas jackknifing and shuffling do not. Bootstrapping produces pseudoreplicate data sets in which columns of the original data matrix may be represented multiple times or not at all. By jackknifing, a proportion of columns in the original data set are deleted and not replaced whereas the shuffling procedure simply changes the order by which columns are represented in the pseudoreplicate data set, without deleting any columns. Bootstrapping and jackknifing tend to produce similar results [

Resampling methods assume that elements in a data set must be identically and independently distributed. Such an assumption may be unrealistic, although random resampling would allow such conditions to be met [

CURSAT ver. 2.1 is a simple open-source software that allows the generation of exportable pseudoreplicate data sets and, in combination with commonly used statistical packages, the construction of accumulation and rarefaction curves and associated sample variance by using a rarefaction-by-resampling approach. Such an approach is not novel and has already been applied in ecology and genetics (e.g., [

Whereas CURSAT ver.2.1 is not meant to compete with other more comprehensive and mature software available, especially when dealing with extremely large data sets, it enables to extract replicated data matrices, while other packages do not. Admittedly, for programmers, R requires minor programming efforts to export replicated data matrices resulting from resampling routines. However, CURSAT ver. 2.1 may still prove handy because it does not require previous experience in R or other programming software. It can be run as an executable file and allows the user to save both pseudoreplicate and accumulation data in separate files. This is a valuable feature because pseudoreplicate data may be imported into a spreadsheet and be subsequently used for the calculation of diversity indices or other statistics with associated resampling-based variance in case analytical variance can not be estimated. By exporting pseudoreplicate data sets, the user has the possibility to estimate different statistics using the same pseudoreplicate data set. Additionally, runs are reproducible because the user can set a seed number.

To promote a full understanding of procedures followed by the software during calculations, extended remarks (REM lines) are provided in the source code. The REM lines in the source code explain and describe what is performed by subsequent lines of instructions. Additionally, a second edition (Edition #2) of the software is provided. Edition #2 generates a log file where intermediate matrices created and used by the software during resampling and accumulation routines are appended and saved at each replicate. This may prove useful for check-quality and educational purposes.

CURSAT ver. 2.1 is a short open-source code written in basic (QB64), compilable into an executable file. It uses a random sampling approach and allows the user to choose between two options: 1) resampling with replacements (bootstrap) and 2) resampling without replacement (shuffling).

The software produces

For example, let’s consider the following case, very common in ecological studies. We know that 12 species (objects) occur at a certain location. We sampled that location 10 times and each time we found a certain number of species, according to the following scheme: in the 12 × 10 matrix, species are a-l (rows) and sampling events are 1–10 (columns). The presence and absence (incidence) of each species in each sampling event is represented by 1 and 0, respectively as in Table

Incidence matrix.

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | |

1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | |

0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | |

0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | |

0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | |

1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | |

0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |

1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | |

1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | |

0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | |

0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | |

1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |

The data matrix is the input information for CURSAT ver. 2.1. Because of the limitation of Basic language, CURSAT ver. 2.1 accepts blanks or commas as separators in the input file. An incidence matrix is used in the example mentioned above, but an abundance matrix may also be used indifferently because the software will handle it when constructing accumulation matrices.

The functional structure of the software is summarized in the flow chart in Figure

Functional structure of CURSAT ver.2.1.

After loading the rectangular input data matrix D, previously dimensioned by the user, CURSAT ver. 2.1 starts the bootstrap by first randomly choosing a number corresponding to one of the columns of D. Subsequently, CURSAT ver. 2.1 extracts from D the elements of that column and starts constructing the bootstrapped data matrix ND using data in D. ND and D matrices have the same dimension. The process is repeated for the number of replicates set by the user.

Similarly to the bootstrap routine, after loading the rectangular input data matrix D, CURSAT ver. 2.1 generates a random number comprised between 1 and the maximum number of columns of D, to construct a row vector F with as many elements as the number of columns of D. Such a process occurs within a cycle that accepts only non-replicated numbers. The vector F stores the shuffling order. The remaining part of the routine is similar to the bootstrap routine. CURSAT ver. 2.1 extracts the elements of each column from D, following the order stored in F, and starts constructing the resampled data matrix ND using data in D. ND and D matrices have the same dimension. Again, the process is repeated as many times as the number of replicates set by the user.

To perform accumulation, for each resampling replicate, CURSAT ver. 2.1 constructs the matrix W where, for each row, each

The number (

The software appends the

File

Correspondent

Accumulation data file

Similar information is reported in the first column of each of the accumulation matrices generated. In these matrices sampling events and associated cumulative number of objects (species in the example) are reported in the second and third columns, respectively. Such a file can be imported in statistical packages to estimate distributions of successive pooling of samples and depict accumulation and rarefaction curves. The graphs in Figure

Accumulation curves (as indicated by black dots) obtained from 100 pseudoreplicates, with seed number = 12348695. The input file was

CURSAT ver. 2.1 is written in basic (QB64). QB64 is not a cross-platform software

Instructions about how to install QB64 on Windows, Linux, and MacOS can be found at the website:

Once that QB64 has been installed on the desired Operating Systems, the source code (file with extension .bas) of CURSAT ver.2.1 can be loaded in QB64 environment and be compiled into an executable file that runs under the specific Operating Systems.

CURSAT ver. 2.1 was originally written using QB64 for Microsoft Windows 64bit. However, I also successfully compiled and run it using QB64 for Apple Mac under MacOS Mojave 10.14.5.

During the run, a bootstrap-replicate counter informs about the status of the run. Once the run has ended, the software informs the user and reports the names of files where output data are saved, along with information about the duration of the run. Files can be opened using a text editor or a spreadsheet and can be inspected for consistency and used for later analyses.

The software is provided in two editions. Edition #1 is recommended for analysis purposes. Edition #2 adds a log file (named by the user) where information used by the software (matrices D, ND, W, B, and F in the source code) generated and used during resampling and accumulation routines is stored and saved. Such information may prove useful for users to identify whether the software is operating as expected. Additionally, the possibility to check on intermediate steps of the software may also be useful for educational purposes.

Figure

An example of a log file generated by running Edition #2, illustrating in detail the algorithm used by CURSAT ver. 2.1. The input data matrix D (top-left) in the file

The software is provided with example sample files, used for testing, and correspondent output files. The input file

The database in the file

Accumulation curves (black dots) obtained from 100 pseudoreplicates. The database as in the file

Output files, including log files, as resulting from the different runs are provided along with the code and Windows executables. Multiple runs using the same seed number produced, as expected, the same results.

The software was also tested by using a 1000 × 1000 data matrix as in the file

Such a matrix is an abundance matrix generated by Microsoft Excel in two steps. In the first step, a random incidence matrix 1000 × 1000 was created. In a second step, elements = 1 were replaced with integer numbers randomly chosen between 1 and 30.

In such a way, the expected accumulation curve for both resampling procedures should predict full saturation after 10 replicates, with the following mean values (rounded to the integer value) from replicate 1 through 10, respectively: 500, 750, 875, 938, 969, 984, 992, 996, 998, 999. One hundred replicates were generated for both bootstrap and shuffling. Accumulation curves are shown in Figure

Accumulation curves (black dots) obtained from 100 pseudoreplicate data. The input data was the abundance 1000 × 1000 matrix as in the file

The Windows executable file was successfully tested for functionality under Microsoft Windows 7 and 10 64 bit. The Apple Mac executable ran properly under MacOS Mojave 10.14.5.

The source code of CURSAT ver. 2.1 is written in QB64 ver.0.960 basic. Syntax used by QB64 may slightly differ from other versions of the basic language (e.g. QB4.5, GWbasic), requiring some adjustments.

Once compiled into a Microsoft Windows 64 bit executable, and irrespective whether running random resampling with or without replacement, Edition #1 required less than 20000 Kb of memory and less than 1 minute to generate 10000 bootstrapped pseudoreplicates of a 12 × 10 incidence matrix, on an Intel® Core™ i7 CPU 860 at 2.80 Ghz, with 4.0 Gb RAM and Windows 7 64 bit. The size of the generated data and accumulation pseudoreplicates files was 3770 Kb and 865 Kb, respectively.

Edition #1 required approximately 10 seconds to generate 100 pseudoreplicates of a 100 × 100 data matrix. Such time increased to approximately 2 minutes and 20 seconds for Edition #2. In these last two cases, the size of pseudoreplicate and accumulation data files was 3975 Kb and 175 Kb, respectively. The log file generated by Edition #2 was 75034 Kb.

When using the abundance dataset in

A 1000 × 1000 data matrix was the largest data set used for the speed test of CURSAT ver.2.1. The software can handle larger matrices of data, but this is time-consuming.

No library is required. No previous knowledge of the BASIC language or QB64 environment is required to run CURSAT ver. 2.1, as the executable file, can run as “stand-alone” software. No official support is provided, but users are welcome to contact the author for trouble shooting.

The software was entirely created by the author.

English

The software can be used by ecologists, biogeographers, zoologists, botanists, conservationists but it also has general applicability in all cases in which data matrices from repeated sampling have to be evaluated by a resampling/rarefaction approach. For example, CURSAT ver. 2.1 has been recently used to account for a possible bias in the number of haplotypes caused by unequal sample size, in a genetic study of hemoparasites of Galápagos iguanas (Fulvo et al., submitted).

CURSAT ver. 2.1 can be used to perform random resampling with and without replacement of every kind of rectangular matrix. Thus, the possibility to extract pseudoreplicates of original data and easily import them into a spreadsheet allows calculating several different kinds of statistics and their associated variance. Variance estimated by a random sampling approach with replacements may provide a valid alternative in case the analytical derivation of asymptotic distribution and variance is not possible.

Edition #2 may be used to illustrate the internal steps of the software. This can be useful for educational purposes.

Following the suggestion of one of the reviewers a command-line version to facilitate the integration of such a program in a pipeline could be implemented in a more advanced version of the software, along with a batch mode allowing the user to run multiple files on the same run.

The additional files for this article can be found as follows:

CURSAT v.2.1 zipped package including README, codes, example files. DOI:

Input and Output files in rar archive. DOI:

CURSAT ver.2.1 Edition 1 source code. DOI:

CURSAT ver.2.1 Edition 2 source code. DOI:

CURSAT ver.2.1 Edition 1 executable. DOI:

CURSAT ver.2.1 Edition 2 executable. DOI:

Readme file for CURSAT ver. 2.1. DOI:

I am grateful to a very special person who provided me a Commodore64 to learn basic programming language.

The author has no competing interests to declare.