%svy_freqs: A Generic SAS Macro for Creating Publication-Quality Three-Way Cross-Tabulations

Cross-tabulations are a simple but important tool for understanding the distribution of socio-demographic characteristics among participants in epidemiological studies. We developed a generic SAS macro, %svy_freqs, to create publication-quality tables from cross-tabulations between a factor and a by-group variable given a third variable, using survey or non-survey data. The macro also performs two-way cross-tabulations and provides extra features not available in existing procedures, such as the ability to incorporate parameters for survey design and replication-based variance estimation methods, validation checks for input parameters, transparent conversion of variable values from character to numeric, and generalizability. We demonstrate the macro using the 2013–2014 National Health and Nutrition Examination Survey (NHANES), a complex survey designed to assess the health and nutritional status of adults and children in the United States.

(2) AVAILABILITY
OPERATING SYSTEM
The SAS macro was developed for the Microsoft Windows platform.
PROGRAMMING LANGUAGE
The code presented here was developed in SAS version 9.3 using the SAS Macro Language.
ADDITIONAL SYSTEM REQUIREMENTS
Base SAS 9.3 installation.
DEPENDENCIES
The macro only requires that the user has the base SAS software installed.
LANGUAGE
The language of repository and supporting files is English.
(3) REUSE POTENTIAL
The source code for the macro is freely available, documented, and extensible by the end user, allowing for further adaptation and reuse. We plan to extend this macro to include other statistical techniques. Our key motivation is to generate tools that automate the process of data analysis. Such tools shorten the time required to prepare output and hence provide quick and well-formatted results for consumption and/or dissemination in support of reproducible science.
ADDITIONAL FILE
The additional file for this article can be found as follows:
• SAS Macro, %svy_freqs.sas. The file contains the complete SAS code for the macro described and demonstrated in this manuscript. DOI: https://doi.org/10.5334/jors.318.s1

INTRODUCTION
Cross-tabulations are a basic but important tool for understanding the distribution of socio-demographic characteristics among study or survey participants in the fields of epidemiology and disease surveillance. They are useful especially when comparisons need to be performed separately for different levels of a by-group variable such as a key demographic characteristic, e.g., sex, or an outcome status such as positive or negative test result for a disease. Cross-tabulations can be even more informative if one is interested in the distribution of disease prevalence among selected factor variables (table rows) and a by-group variable (table columns). This is useful in cases where the association between disease prevalence and risk factors or exposures needs to be stratified, for instance, by sex or geographic region.
Almost all available statistical analysis software can easily perform cross-tabulations; however, the output must be processed further before it is ready for review and use in a publication. In Stata, one can use the table and tabulate [1] commands or community-contributed programs such as tabout [2] or tabmult [3]. In SAS, there exist a limited number of commands or macros for creating publication-quality tables [4][5][6][7][8][9], but they suffer from limitations of flexibility, usability and generalizability. In particular, the available SAS macros do not let the analyst specify replication-based variance estimation methods such as Jackknife (JK) or Balanced Repeated Replication (BRR), which are often used to obtain valid variance estimates for survey estimates in the presence of survey non-response [10][11][12].
We have developed a SAS macro which overcomes the described shortcomings while promoting reproducible research principles [8] such as transparency, reproducibility and reusability, which are attracting increasing attention in epidemiological research [13][14][15][16][17]. It further provides for replication-based variance estimation methods as well as enforced validation checks for input parameters.
The work presented here builds on the development of another SAS macro, %svy_logistic_regression, for producing publication-quality tables from unadjusted and adjusted logistic regression analyses [18].

IMPLEMENTATION AND ARCHITECTURE
The %svy_freqs SAS macro: This macro, written in SAS software version 9.3 [19], uses the SURVEYFREQ and SURVEYMEANS procedures to perform the cross-tabulation and output frequencies, totals and percentages. The macro uses the SAS output delivery system (ODS) to create a publication-quality table, similar to a typical Table 1 or 2 of a manuscript in the epidemiological research field.
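As a rough illustration of the ODS workflow mentioned above, the following sketch captures a SURVEYFREQ cross-tabulation as a dataset and writes it to an RTF file. It is not taken from the macro's source: the dataset `SASHELP.HEART` is used only because it ships with SAS, and the names `work.ct` and `crosstab_sketch.rtf` are hypothetical.

```sas
/* Sketch only: shows the general ODS pattern, not the %svy_freqs code itself. */

/* Capture the cross-tabulation produced by PROC SURVEYFREQ as a dataset.      */
ods output CrossTabs=work.ct;          /* CrossTabs is the ODS table name       */
proc surveyfreq data=sashelp.heart;    /* no WEIGHT statement: weights of 1     */
   tables Smoking_Status*Sex / row cl; /* row percentages with confidence limits */
run;

/* Write the captured table to an RTF file that a word processor can open.     */
ods rtf file="crosstab_sketch.rtf";    /* hypothetical output file name         */
proc print data=work.ct noobs label;
run;
ods rtf close;
```

A table captured this way can be reshaped in a DATA step before it is printed, which is presumably where most of the formatting effort in a publication-quality table lies.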
The macro is composed of seven sub-macros, which are called within the main macro. For three-way cross-tabulations, the _outcome and _outvalue parameters, which define the outcome for which prevalence is to be computed, must be specified, and the analysis type, _cat_type, must be set to PREV. If these are not specified, the macro automatically generates a new variable, _freq, whose value equals 1 for all study subjects in the analysis dataset, and proceeds with the analysis as though it were for two-way cross-tabulations with row percentages. The _outcome and _outvalue parameters may therefore be omitted for two-way cross-tabulations.
The macro enforces in-built SAS validation checks on input parameters and tests for logical errors. When a check fails, the macro halts execution and prints the error to the log window for the user to address. The user should specify the input parameters described in Table 1 unless the description is prefixed by (optional). To achieve the full potential of the SAS macro, the user must ensure that the analysis dataset is clean, that analysis variables are well labelled, and that values of variables have been converted into appropriate SAS formats before they are passed to the macro call.
For two-way cross-tabulations, the choice of percentage depends on the role of the by-group variable. If users are interested in showing the distribution of study participants by a given by-group variable, then column percentages are most appropriate and are obtained using the COL option. If the by-group variable is an outcome of interest, such as positive or negative diagnostic test results, then row percentages are most appropriate and can be obtained using the ROW option. The by-group variable can have more than two categories and can be encoded as either a numeric or character variable. For the distribution of continuous variables, one can specify the type of statistic to compute (mean or median).
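The distinction between column and row percentages can be illustrated with direct calls to PROC SURVEYFREQ, the procedure the macro builds on. This is a minimal sketch, not the %svy_freqs interface: it uses the `SASHELP.HEART` dataset shipped with SAS, and treating `Sex` as the by-group variable and `Status` as the outcome is an assumption made purely for illustration.

```sas
/* Sketch only: direct PROC SURVEYFREQ calls, not a %svy_freqs macro call.     */

/* Column percentages: within each sex, how are participants distributed       */
/* across smoking categories? Suitable when the by-group variable is a         */
/* demographic characteristic.                                                  */
proc surveyfreq data=sashelp.heart;
   tables Smoking_Status*Sex / column cl;
run;

/* Row percentages: within each smoking category, what share falls into each   */
/* outcome level? Suitable when the by-group variable is an outcome.           */
proc surveyfreq data=sashelp.heart;
   tables Smoking_Status*Status / row cl;
run;
```

In both calls the factor is the row variable and the by-group variable is the column variable, which matches the table layout described above.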
Where the data to be analyzed come from a complex survey, our macro allows users to specify study design variables containing strata, cluster, and design weights as well as the variance estimation method and replicate weight variables, if necessary. Data from non-survey settings are analyzed by leaving the survey-design parameters unspecified. The macro also provides for domain analysis for sub-populations, and there are options for specifying how missing values should be represented [11][20][21][22].
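The design features listed above map onto standard PROC SURVEYFREQ statements. The sketch below shows that mapping under stated assumptions; it is not the macro's own call syntax. `SDMVSTRA`, `SDMVPSU`, `WTMEC2YR` and `RIAGENDR` are NHANES variable names, while `work.nhanes`, `in_domain`, `agegrp`, `work.survey`, `fullwt`, `sex` and `repwt1-repwt80` are hypothetical names introduced for illustration.

```sas
/* Sketch only: direct PROC SURVEYFREQ calls, not a %svy_freqs macro call.     */
/* Dataset and variable names other than the NHANES design variables are       */
/* hypothetical.                                                                */

/* Taylor series linearization using strata, clusters (PSUs) and weights.      */
proc surveyfreq data=work.nhanes varmethod=taylor;
   strata  sdmvstra;
   cluster sdmvpsu;
   weight  wtmec2yr;
   /* Domain analysis: cross the domain indicator with the analysis variables  */
   /* instead of subsetting the data, so the full design is kept for variance  */
   /* estimation.                                                               */
   tables  in_domain*agegrp*riagendr / column cl;
run;

/* Replication-based variance estimation from supplied BRR replicate weights.  */
proc surveyfreq data=work.survey varmethod=brr;
   weight     fullwt;
   repweights repwt1-repwt80;
   tables     agegrp*sex / column cl;
run;
```

When replicate weights are supplied through REPWEIGHTS, the STRATA and CLUSTER statements are not needed, because the design information is already carried by the replicate weights.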
If the analysis includes non-coded character variables, the macro automatically encodes them into numeric variables prior to analysis. The macro further provides a natural display of results from epidemiological surveys by processing the final output into a refined publication-quality table, which is output into word processing and spreadsheet programs for immediate use in publications or for additional formatting if needed.
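One simple way to encode a character variable as numeric is sketched below. This is an assumption about how such an encoding could be done, not a description of the macro's internal code. It uses the `SASHELP.CLASS` dataset shipped with SAS; the output names `work.class_sorted`, `work.class_encoded` and `sex_n` are hypothetical.

```sas
/* Sketch only: one possible encoding approach, not the %svy_freqs internals.  */

/* Sort so that each distinct character value forms a contiguous BY group.     */
proc sort data=sashelp.class out=work.class_sorted;
   by sex;
run;

/* Assign consecutive integer codes in sort order.                             */
data work.class_encoded;
   set work.class_sorted;
   by sex;
   /* Sum statement: sex_n starts at 0, is retained across rows, and is        */
   /* incremented at the first row of each new value of sex.                   */
   if first.sex then sex_n + 1;
run;

/* Check the mapping between the original values and the numeric codes.        */
proc freq data=work.class_encoded;
   tables sex*sex_n / list;
run;
```

Codes assigned this way follow the sort order of the character values, so a blank (missing) value would receive the first code. A user-defined format would still be needed to display the original labels against the numeric codes.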
The macro has several limitations. First, it has been developed on Microsoft Windows and code adjustments may be needed to adapt it for other operating systems. Second, it cannot handle arbitrary nesting of by-group variables, such as those supported by PROC TABULATE. Additionally, it does not provide interpretation of results, so users should consult a qualified statistician for any inference. Nonetheless, we feel this macro provides a good tradeoff between simplicity and ease of use, flexibility, and generalizability, and should shorten the analysis period for complex surveys, while supporting generation of high-quality outputs.

Quality Control
Example of macro call to analyze the NHANES dataset: We demonstrate the application of the macro in the analysis of a dataset from the 2013–2014 National Health and Nutrition Examination Survey (NHANES). NHANES is a complex survey designed to assess the health and nutritional status of adults and children in the United States (U.S.). A detailed description of the survey design and contents is available elsewhere [23]. The NHANES dataset [24] is publicly available online for free from the U.S. Centers for Disease Control and Prevention (CDC) at: https://www.cdc.gov/nchs/nhanes/Index.htm. Data used for this demonstration are also available at the GitHub repository (https://github.com/kmuthusi/threeway-crosstabulation-macro).
We used the macro to generate three tables. The main one (Table 4, with prevalence percentages) shows the distribution of hepatitis A prevalence across selected socio-demographic characteristics and by sex. The other two show the distribution of participants' socio-demographic characteristics by sex (Table 2, with column percentages) and by hepatitis A antibody test result (Table 3, with row percentages). The aim of the analysis was to show the distribution of hepatitis A among participants aged 20+ years who had served on active duty in the U.S. Armed Forces. Appropriate survey weights (sample weights for participants with a medical examination) were applied. The working denominator was N = 542. However, 180 observations were dropped during analysis because they had non-positive weights. In addition, the analysis domain sample size was calculated and added at the end of each table title.
The macros were run sequentially after specifying the required parameters, as shown in code Examples 1-3. The results presented here are for illustrative purposes only and do not follow from any specific survey objective. Readers should consult the NHANES analytic guidelines on variable definitions and on analytical and statistical recommendations, which are available online at https://wwwn.cdc.gov/nchs/nhanes/analyticguidelines.aspx.
The SAS output from the macro consists of several tables holding parameter estimates together with the corresponding 95% CI for percentages and means, or IQR for medians. Table 2 displays the distribution of patient characteristics (row variables) by sex (column variable), which was output after running the code in Example 1. The first column contains categories and factor labels, followed by the unweighted sample size for each level of the factor, the weighted column percentages or median, and the corresponding 95% CI or IQR. To compare the distribution of selected factors by sex, we use the 95% CI or IQR. For instance, among participants aged 40-59, there were more females than males 61.8% (95% CI: 38.
Table 3 shows the distribution of patient characteristics by hepatitis A test results, which was obtained after running the code in Example 2. The output presents row percentages and includes columns for missing values. The results show the distribution of hepatitis A status across the given factor variables. It can be seen that there are no significant differences in the distribution of hepatitis A status by sex, with 37.1% (95% CI: 33.6-40.6%) of males and 38.9% (95% CI: 17.3-60.4%) of females reporting positive status, though there are differences between specific age groups. It is important to note that if missing values are suppressed, the estimates will also change since the denominator will have changed.
Table 4 shows the distribution of hepatitis A prevalence by sex, obtained after running the code in Example 3. The columns in Table 4 are organized in the same way as described for the previous tables. The output shows there were no significant differences across each factor by sex, as the confidence intervals overlap due to the small sample size of females.
The macro has been extensively tested by the developer by comparing output to direct tabulation in the underlying SAS software. If desired, the end user can also request the NHANES dataset from CDC in order to reproduce the analyses in this paper and confirm the correct operation of the macro on their system, using the corresponding analysis and output files provided in the GitHub repository (https://github.com/kmuthusi/three-waycrosstabulation-macro).
Table 2 Participants' socio-demographic characteristics by sex (Col %), N = 522
Table 4 Distribution of Hepatitis A prevalence by selected socio-demographic characteristics and sex (Prevalence %), N = 522.