A common problem in regression analysis that requires correction of the estimated standard errors of the regression coefficients is the correlation between the residuals of observations that share some observed grouping features. For instance, people who live in the same city, state, or country can display more similar behavior than people randomly sampled from different cities, states, or countries. The example extends to any data in which some observations have shared characteristics or belong to the same collective entity or institutional setting. For instance, people from the same school, patients from the same hospital, or groups of the same gender or race can behave more similarly than people across those groups. The within-group correlation can be caused by unobserved shared characteristics of the observations in the groups, such as unobserved school-specific educational policies, or the unobserved patterns of behavior of doctors in different hospitals.
Non-zero within-group correlations violate a common assumption of classical multivariate regression models, namely that the residuals are independent, or at least uncorrelated. If one mistakenly assumes the residuals are independent or uncorrelated, the estimated standard errors of the regression coefficients will be biased downward, which leads to narrower estimated confidence intervals and, therefore, a higher chance of rejecting the hypothesis that the coefficients are null. This can mislead researchers into being overconfident that their working hypothesis of a non-zero effect is true. We can see that easily with a simple example.
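The magnitude of the problem can be previewed with a standard back-of-the-envelope calculation that is not part of this paper's derivation: the classic design-effect (Moulton) formula. For G equal-sized clusters of m observations each, a cluster-constant regressor, and within-cluster residual correlation ρ, the true sampling variance of the coefficient is inflated relative to the i.i.d. formula by:

```latex
\operatorname{Var}_{\text{cluster}}(\hat{\beta})
  \;=\; \operatorname{Var}_{\text{iid}}(\hat{\beta})\,
        \bigl[\,1 + (m-1)\rho\,\bigr]
```

For example, with clusters of m = 50 observations and a modest ρ = 0.1, the naive standard error understates the correct one by a factor of √(1 + 49 × 0.1) ≈ 2.4.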
Suppose we estimate the following population regression model:

y = Xβ + ɛ

where X = (1, x₁, …, x_k) ∈ ℝ^(1 × (k+1)) is the vector of covariates with a leading constant, β ∈ ℝ^((k+1) × 1), y ∈ ℝ, and the last element is the error (or deviance) term ɛ ∈ ℝ. We collect i = 1, …, n observations to estimate β, which gives the statistical equation for each i with the following residuals eᵢ:

yᵢ = Xᵢβ̂ + eᵢ
We usually take X as given (measured without error) and use the OLS estimator β̂ of β, which is obtained by finding the argument that minimizes the squared residuals (e) between the observed outcome (y) and the outcome if no error had occurred (Xβ):

β̂ = argmin_b (y − Xb)ᵀ(y − Xb)
Assuming XᵀX is invertible, the first-order condition gives the solution for that optimization problem:

β̂ = (XᵀX)⁻¹Xᵀy    (1)
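As a quick numerical check of this closed-form solution (a sketch in Python/NumPy rather than R, using made-up data), the formula agrees with a generic least-squares solver:

```python
import numpy as np

# Small made-up design matrix: intercept column plus two covariates
rng = np.random.default_rng(0)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with NumPy's generic least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Note that nothing about the distribution of the errors was needed to compute this point estimate, which is the observation the next paragraph builds on.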
Up to this point, if we were simply computing an OLS point estimate of β using (1), no assumptions would be needed about the distribution of the residuals (eᵢ). We impose assumptions about the distribution of e to go one step further, make inferences about β̂, and investigate its statistical properties. The distribution of our estimator β̂, and therefore our inferences, comes from the assumptions about the distribution of e; denote that distribution generically by f(e | θ).
We can easily derive the first and second moments of β̂:

𝔼[β̂ | X, θ] = β + (XᵀX)⁻¹Xᵀ𝔼[e | θ]

Var(β̂ | X, θ) = (XᵀX)⁻¹Xᵀ Var(e | θ) X(XᵀX)⁻¹    (2)
Assumptions about f(e | θ) give the small-sample properties of the estimator. The classical assumption is that all residuals e come from the same normal distribution with mean zero, and that they are uncorrelated. That is:

e | θ ∼ N(0, σ²I)    (3)
If we assume that 𝔼[e | θ] = 0, as in expression (3), then β̂ is unbiased (𝔼[β̂ | X, θ] = β), and its standard error is simply:

se(β̂) = √(diag(σ̂²(XᵀX)⁻¹))    (4)
with the estimated variance of e given by:

σ̂² = eᵀe / (n − k − 1)
Equation (4) provides the exact confidence interval for β̂:

β̂ ± t · se(β̂)    (5)
In expression (5), the value of t is the 1 − α/2 quantile of the t-distribution with n − k − 1 degrees of freedom:

t = t₍n − k − 1, 1 − α/2₎
The common practice is to choose α = 0.05, which gives the 95% confidence interval of β̂.
The standard output of the lm() function to estimate linear models in R assumes the zero-mean normal distribution with uncorrelated residuals, which gives the estimated standard errors shown in equation (4) above [21, 3, 18].
The clustering problem emerges in grouped data. Consider that each observation i belongs to a group g; that there are G groups in the data; and that the error terms (e) of individual observations in the same group are correlated. Following the examples above, say that multiple observations come from the same schools, hospitals, or countries. The assumption of independence of the residuals is likely violated, because individuals in the same group probably share some unobserved characteristics that affect their behavior, which creates a non-zero correlation between the residuals within the observed groups. Then, keeping all the other assumptions of the classical regression model, the distribution of the disturbances can be more generally denoted by:

e | θ ∼ N(0, Σ),  Σ = diag(Σ₁, …, Σ_G)    (6)

where Σ_g is the variance-covariance matrix of the residuals of the observations in group g.
In this case, the standard errors of β̂ under the assumption of independence or zero correlation of the residuals (se(β̂)) differ from the standard errors computed when the within-group correlations are taken into account (se_g(β̂)): se(β̂) ≠ se_g(β̂).
Typically, se(β̂) < se_g(β̂). It means that assuming uncorrelated residuals produces confidence intervals for β̂ that are narrower than the true ones, and that the researcher will be overconfident about the range of values of the linear coefficients that seem consistent with the data.
There are several approaches to deal with that problem. One is to adjust the confidence intervals: Imbens and Kolesar adjust the degrees of freedom of the t-distribution, producing larger values of t used to construct the confidence intervals. Another approach uses bootstrap methods [2, 9, 15, 19]. For lack of space, below we briefly review only two other approaches: the Cluster Robust Standard Errors (CRSE), widely used by practitioners, and the Cluster Estimated Standard Errors (CESE) proposed by Jackson, whose implementation in R is originally presented in this paper, alongside an applied example and a brief discussion of cases in which one of these two methods, CRSE or CESE, may be preferred.
The CRSE is the routine solution used by researchers to deal with the estimation of clustered standard errors in grouped data [5, 20, 13, 7]. If the individual-level observations are divided into groups g = 1, …, G (e.g., schools, countries, etc.), we can rewrite the estimated variance of β̂ in equation (2) as:

Var(β̂ | X, θ) = (XᵀX)⁻¹ (∑_{g=1}^{G} X_gᵀ Σ_g X_g) (XᵀX)⁻¹
The key problem is how to estimate Σ_g, the variance-covariance matrix of the residuals for group g. The CRSE solution is to use the raw estimated residuals from the OLS estimates of β, ê_g = y_g − X_g β̂, and compute Σ̂_g = ê_g ê_gᵀ using y_g and X_g, the outcome variable and the covariates, respectively, of observations in group g. It gives the CRSE estimator as follows:

V̂ar_CRSE(β̂) = (XᵀX)⁻¹ (∑_{g=1}^{G} X_gᵀ ê_g ê_gᵀ X_g) (XᵀX)⁻¹
In practice, to compute the CRSE we do not need to estimate Σ_g itself. We just need to compute the scores u_g = X_gᵀ ê_g for each group g, and use ∑_g u_g u_gᵀ as the middle term of the sandwich formula above. The R package sandwich provides some functions to estimate clustered standard errors using the CRSE solution, and the package clubSandwich provides many other functionalities, including some to improve performance with small samples.
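To make the sandwich computation concrete, here is a minimal sketch in Python/NumPy rather than R (simulated data; all variable names are mine, not the packages'). The cluster-level error component makes the naive standard error of the intercept substantially too small, while the CRSE picks up the extra variability:

```python
import numpy as np

rng = np.random.default_rng(1)
G, m = 30, 10                       # 30 clusters of 10 observations each
n = G * m
groups = np.repeat(np.arange(G), m)

# Clustered errors: shared cluster effect plus idiosyncratic noise
X = np.column_stack([np.ones(n), rng.normal(size=n)])
e = rng.normal(size=G)[groups] + rng.normal(size=n)
y = X @ np.array([1.0, 0.5]) + e

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
bread = np.linalg.inv(X.T @ X)

# "Meat": sum over clusters of X_g' e_g e_g' X_g, via the scores u_g = X_g' e_g
meat = np.zeros((X.shape[1], X.shape[1]))
for g in range(G):
    idx = groups == g
    u_g = X[idx].T @ resid[idx]
    meat += np.outer(u_g, u_g)

V_crse = bread @ meat @ bread
se_crse = np.sqrt(np.diag(V_crse))

# Naive (i.i.d.) standard errors for comparison
sigma2 = resid @ resid / (n - X.shape[1])
se_naive = np.sqrt(np.diag(sigma2 * bread))
```

This sketch omits the finite-sample adjustments (e.g., the G/(G − 1) factor) that production implementations such as sandwich apply.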
Djogbenou et al. demonstrate the asymptotic validity of the CRSE solution under general conditions. Its limits include poor reliability of the estimated errors when the number of clusters is small, and sensitivity both to heterogeneity across clusters and to variability of cluster sizes; Djogbenou et al. provide an extensive treatment of the topic. The CRSE can be biased downward in small (and possibly large) samples and seriously underestimate the true standard errors in many cases [15, 7, 14]. Jackson also shows other conditions that lead the CRSE to underestimate the true standard errors, and therefore the confidence intervals of the regression coefficients. The author proposes an alternative approach to estimate Σ_g called CESE, which I discuss next.
Jackson proposes an approach labeled CESE to estimate the standard errors in grouped data with within-group correlation in the residuals. The approach is based on the estimated expectation of the product of the residuals. Assume that the residuals have the same variance-covariance matrix within the groups, and denote by σ²_ig = σ²_g and ρ_ig = ρ_g the variance and the correlation, respectively, of the residuals within group g. Then the expectation of the product of the residuals is given by (see Jackson for details):

𝔼[e_g e_gᵀ] = σ²_g I_g + ρ_g σ²_g (ι_g ι_gᵀ − I_g)    (7)
where ι_g is an n_g × 1 column vector of ones, I_g is an n_g × n_g identity matrix, and n_g is the number of observations in group g. Equation (7) can be rewritten concisely as:

𝔼[ê_g ê_gᵀ] = σ²_g Q_1g + ρ_g σ²_g Q_2g    (8)

where ê_g are the estimated residuals and Q_1g and Q_2g are known functions of the data (see Jackson for their exact form).
The equation above explicitly shows that the expectation of the cross-product of the residuals is a function of the data, through Q_1g and Q_2g, and of the unknown variance σ²_g and correlation ρ_g of the residuals e_g in each group g. The CESE solution is to exploit the linear structure of equation (8) and to estimate σ²_g and ρ_g as if the estimated values of ê_g ê_gᵀ were random deviances from their expectations. Denote ξ that deviance. Then:

ê_g ê_gᵀ = σ²_g Q_1g + ρ_g σ²_g Q_2g + ξ    (9)
The estimates of σ²_g and ρ_g are obtained using the OLS estimator. That is, denote by q_1g (or q_2g) the vectorized diagonal and lower triangle of Q_1g (or Q_2g) stacked into an n_g(n_g + 1)/2 column vector, let q_g = [q_1g, q_2g], and let s_eg be the corresponding elements of ê_g ê_gᵀ stacked into a column vector as well. Then the OLS CESE estimator of the variance and correlation of the residuals in group g is given by:

(σ̂²_g, ρ̂_g σ̂²_g)ᵀ = argmin_{θ_g} (s_eg − q_g θ_g)ᵀ(s_eg − q_g θ_g)
As pointed out above for the OLS estimator of β, if we assume that q_gᵀq_g is invertible, the first-order condition gives:

(σ̂²_g, ρ̂_g σ̂²_g)ᵀ = (q_gᵀ q_g)⁻¹ q_gᵀ s_eg    (10)
We can rewrite equation (10) in terms of the two parameters of interest as:

σ̂²_g = θ̂_g1,  ρ̂_g = θ̂_g2 / θ̂_g1    (11)

where θ̂_g = (θ̂_g1, θ̂_g2)ᵀ denotes the right-hand side of (10).
As explained above for the OLS estimates of β, the estimators of σ²_g and ρ_g do not require per se any assumption on ξ, unless we want to construct confidence intervals for the estimates of those parameters.
The CESE is attractive when its assumptions hold and the CRSE is believed to be unreliable. Jackson  shows that CESE produces larger standard errors for the coefficients and much more conservative confidence intervals than the CRSE, which is known to be biased downward in the cases mentioned above. CESE is also less sensitive to the number of clusters and to the heterogeneity of the clusters, which can be a problem for both CRSE and bootstrap methods.
However, it is important to note that the CESE is not a replacement for the CRSE, because the two methods are based on different parametric assumptions. The CESE requires some assumptions that can be considered stronger than those of the CRSE approach, as equations (7) to (11) indicate (see Jackson for more details). Each approach may be better suited to different situations. One example is the CESE assumption that the residuals have the same variance-covariance matrix within the groups. For instance, if we cluster by geographic location, but individual data are observed at different points in time, as in Bertrand et al., then the assumption of the same within-cluster residual variation is probably violated, and we would have to cluster the standard errors by time as well. Another example: when one uses fixed-effect models for the clusters, and the correlation of the residuals comes only from a cluster-level effect, the cluster fixed effects explain all the residual variation at the cluster level, and the term ρ_g will be close to zero. In that case, CESE may be a less appealing alternative. However, when the limitations of the CRSE discussed above are a problem, the CESE is a better choice and produces more conservative standard errors.
I implemented the CESE in R. It is available in the package named ceser. The next section presents some details of the implementation, as well as an example illustrating how to use the software in practice.
The package ceser provides a function vcovCESE() that takes the output of the function lm() (or any other that produces compatible outputs) and computes the Cluster Estimated Standard Errors (CESE). The basic structure of the function is:
R> vcovCESE(mod, cluster = NULL, type=NULL)
The parameter mod receives the output of the lm() function. The parameter cluster can receive a right-hand side R formula with the summation of the variables in the data that will be used to cluster the standard errors. For instance, if one wants to cluster the standard errors by country, one can use:
R> vcovCESE(…, cluster = ~ country, …)
To cluster by country and gender, simply use the following (note that each cluster then contains observations for one gender in one country):
R> vcovCESE(…, cluster = ~ country + gender, …)
The parameter cluster can also receive, instead of a formula, a string vector with the name of the variables that contain the groups to cluster the standard errors. If cluster = NULL, each observation is considered its own group to cluster the standard errors.
The parameter type receives the procedure to use for heteroskedasticity correction. Heteroskedasticity occurs when the diagonal elements of Σ are not constant across observations. The correction can also be used to deal with underestimation of the true variance of the residuals due to leverage produced by outliers. The package includes five types of correction; type can be "HC0", "HC1", "HC2", "HC3", or "HC4". Denote eᶜ the corrected residuals. Each option produces the following correction:

HC0: eᶜᵢ = eᵢ
HC1: eᶜᵢ = eᵢ √(n / (n − k − 1))
HC2: eᶜᵢ = eᵢ / (1 − hᵢᵢ)^(1/2)
HC3: eᶜᵢ = eᵢ / (1 − hᵢᵢ)
HC4: eᶜᵢ = eᵢ / (1 − hᵢᵢ)^(δᵢ/2), with δᵢ = min{4, hᵢᵢ/h̄}

where k is the number of covariates, hᵢᵢ is the ith diagonal element of the matrix P = X(XᵀX)⁻¹Xᵀ, and h̄ = (1/n)∑ᵢ hᵢᵢ is the average leverage.
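The role of the leverages hᵢᵢ can be illustrated with a short NumPy sketch (hypothetical data; the HC3 variant is shown, following the formula above). High-leverage observations have their residuals inflated the most:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

# Leverages: diagonal of the hat matrix P = X (X'X)^{-1} X'
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)

# HC3 correction: inflate residuals at high-leverage points
e_hc3 = e / (1 - h)
```

Since every leverage lies in (0, 1) and the leverages sum to the number of model parameters (here k + 1 = 3), the corrected residuals are never smaller in magnitude than the raw ones.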
The estimation also corrects for some degenerate cases; following Jackson [12], we apply the adjustments proposed there in those cases.
In applied regression analyses, the practitioner is usually interested in estimating the linear coefficients and their standard errors to evaluate whether the confidence intervals of the point estimates of the coefficients include the null value. It means that the two quantities of interest are β̂ and se(β̂).
In this section, we compare the standard output of the lm() function with the standard errors of the linear coefficients produced by the CRSE, as computed by the widely used R package sandwich , and those produced by the ceser package, which contains my implementation of the CESE method proposed by Jackson . As discussed in the previous section, in general the CESE should be more conservative, produce larger estimates of the standard errors, and result in wider confidence intervals.
To illustrate how to use the ceser package, and to compare the three estimates of the standard errors (raw, CRSE, and CESE), we use the data set dcese provided with the ceser package. The data set was used in Jackson and comes from Elgie et al. It contains information on 310 (i = 1, …, 310) observations across 51 countries (g = 1, …, 51). The outcome variable is the number of effective legislative parties (enep). The explanatory variables are: the number of presidential candidates (enpc); a measure of presidential power (fapres); the proximity of presidential and legislative elections (proximity); the effective number of ethnic groups (eneg); the log of average district magnitudes (logmag); an interaction term between the number of presidential candidates and presidential power (enpcfapres = enpc × fapres); and another interaction term between the log of the district magnitude and the number of ethnic groups (logmag_eneg = logmag × eneg). Elgie et al. present regression analyses showing a strong relationship between enep and the variables enpc, fapres, and their interaction. The effective number of legislative parties increases with the number of presidential candidates, but decreases with presidential power. The interactive term has a positive coefficient, implying that the negative association between the number of legislative parties and presidential power attenuates as the number of candidates increases. They use a variety of standard error corrections, including CRSE. We reproduce their study here and include the estimation of the standard errors using CESE, as in Jackson.
Let us start with the functions that provide the variance-covariance matrix of the estimated coefficients. For all the examples below, we use the HC3 correction; Table 1 below also uses HC1 for comparison. First, load the package and the data:
R> library(ceser)
R> data(dcese)
Table 1: Point estimates and 95% confidence intervals of the coefficients under the different standard errors (raw lm() output, CRSE with HC1 and HC3 corrections, and CESE with HC3).

| Coefficient | Estimate | lm() | CRSE (HC1) | CRSE (HC3) | CESE (HC3) |
|-------------|----------|------|------------|------------|------------|
| (Intercept) | 2.7043 | (1.558, 3.85) | (1.747, 3.662) | (1.502, 3.907) | (0.227, 5.182) |
| enpc | 0.3040 | (–0.066, 0.674) | (–0.189, 0.797) | (–0.294, 0.902) | (–0.39, 0.998) |
| fapres | –0.6118 | (–0.936, –0.288) | (–1.011, –0.212) | (–1.146, –0.077) | (–1.354, 0.13) |
| enpcfapres | 0.2078 | (0.089, 0.326) | (0.046, 0.37) | (0.004, 0.411) | (–0.013, 0.429) |
| proximity | –0.0224 | (–0.568, 0.524) | (–0.521, 0.476) | (–0.651, 0.606) | (–0.755, 0.71) |
| eneg | –0.0657 | (–0.356, 0.224) | (–0.343, 0.212) | (–0.391, 0.259) | (–0.455, 0.324) |
| logmag | –0.1815 | (–0.664, 0.301) | (–1.041, 0.678) | (–1.206, 0.843) | (–1.11, 0.747) |
| logmag_eneg | 0.3605 | (0.099, 0.622) | (–0.205, 0.926) | (–0.32, 1.041) | (–0.122, 0.843) |
Before estimating the linear model, we need to sort the data using the cluster variables (this is necessary to estimate the CESE using the ceser package, but it is not necessary to estimate the CRSE using the sandwich package). In our example, we will cluster the data by country. Hence:
R> dcese = dcese[order(dcese$country), ]
Estimate the linear model using the lm() function.
R> mod = lm(enep ~ enpc + fapres + enpcfapres + proximity + eneg + logmag + logmag_eneg, data=dcese)
The estimated raw values of the variance covariance matrix obtained by running the standard R function from the stats package  are:
R> vcov(mod)
The CRSE, using countries as the grouping variable, obtained using the vcovCL() function of the sandwich package  are:
R> library(sandwich)
R> vcovCL(mod, cluster = ~country, type="HC3")
In a similar fashion, the CESE are obtained by simply running the function vcovCESE() of the ceser package:
R> vcovCESE(mod, cluster = ~country, type="HC3")
Note that the estimated standard errors are ordered as expected: the raw standard errors are smaller than the CRSE, which in turn are smaller than the CESE for almost all coefficients. The standard errors for each method are obtained by taking the square root of the diagonal of the corresponding variance-covariance matrix:

R> sqrt(diag(vcov(mod)))
R> sqrt(diag(vcovCL(mod, cluster=~country, type="HC3")))
R> sqrt(diag(vcovCESE(mod, cluster=~country, type="HC3")))
Summary tables with the raw standard errors, CRSE, and CESE are easy to produce. The package lmtest is especially useful for that purpose: ceser integrates nicely with lmtest and its function coeftest(), which can be used to create summary tables with the different standard errors. The raw estimates are:
R> summary(mod)

Call:
lm(formula = enep ~ enpc + fapres + enpcfapres + proximity + eneg +
    logmag + logmag_eneg, data = dcese)

(residual summary and coefficient table omitted; columns: Estimate, Std. Error, t value, Pr(>|t|))

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.65 on 291 degrees of freedom
Multiple R-squared: 0.378, Adjusted R-squared: 0.363
F-statistic: 25.3 on 7 and 291 DF, p-value: < 0.0000000000000002
We can obtain the summary with CRSE by country by running:
R> library(lmtest)
R> coeftest(mod, vcov = vcovCL, cluster = ~ country, type="HC3")
t test of coefficients:

(coefficient table omitted)

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Similarly, to use the CESE instead of the CRSE, simply run:
R> coeftest(mod, vcov = vcovCESE, cluster = ~ country, type="HC3")
t test of coefficients:

(coefficient table omitted; columns: Estimate, Std. Error, t value, Pr(>|t|))

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 1 shows how the confidence intervals differ across the different estimates of the standard error of the coefficients. The CRSE are shown with both the HC1 and HC3 adjustments to the residuals. We can see that the CESE is more conservative, particularly for the two covariates fapres (presidential power) and enpcfapres [the interaction between the number of presidential candidates (enpc) and presidential power (fapres)]. For them, the null value is consistent with the data when the CESE is used, but not if the other standard errors are adopted for the computation of the confidence intervals.
The user should note that the performance of the estimation is not yet optimized to handle large data sets. There are two reasons for the suboptimal performance. The first is that the current implementation uses only high-level functions in R. The second is that the software avoids storing large matrices by using some nested loops during the estimation. Future versions of the package will implement the core functions in C++ and ship compiled code with the package to improve performance. Nevertheless, the package is fully functional, and performance tests show that, on average, a data set with 1,000 observations takes 5.3 seconds, one with 3,000 observations takes 85 seconds, and one with 5,000 observations takes around 5.5 minutes to compute the standard errors.
The package has been thoroughly quality checked and tested. The package structure successfully passes all CRAN R CMD checks and all continuous integration checks implemented in Travis CI, including checks that build the package on Windows, Linux, and macOS. The results of the checks can be found on the Travis CI website: https://travis-ci.org/github/DiogoFerrari/ceser.
CESER is written in R (>= 2.1) and runs on any operating system that supports the R Statistical Software. R can be obtained freely from https://www.r-project.org/.
R Statistical Software 2.1 or higher.
There are no additional requirements.
The package depends on the following R packages: magrittr, purrr, dplyr, tibble, lmtest.
Name: Cluster Estimated Standard Errors in R (CESER)
Persistent identifier: 10.5281/zenodo.4107151
Publisher: Diogo Ferrari
Version published: v1.0.0
Date published: 10/19/2020
Firstly, the adoption of methods that deal with clustered standard errors is ubiquitous in the social sciences. Currently available packages in R only provide traditional ways (CRSE) to estimate regression models with clustered standard errors, as discussed above. The CESER package provides an easy-to-use implementation of a new method, namely CESE, as proposed in Jackson. It is important to note that the method implemented in our package is not bound to any specific subfield: the package is of direct interest to any researcher using regression models.
The Cluster Estimated Standard Errors in R (CESER) package is fully compatible with other R packages widely used to compute regression models in economics, psychology, political science, sociology, and many other disciplines. Those packages include the built-in R module stats to compute linear models, as well as extensions such as glm, lmtest, and lme4. Researchers using those packages can seamlessly use our package to deal with clustered standard errors. The CESER package is well documented and contains working examples for copy-and-paste experimentation. Moreover, code examples are provided at the package author's personal website, including a vignette explaining the package usage. As presented in the paper, the output of the main estimation function follows the standard R format and can be manipulated by popular external packages for data visualization and reports, including tidyverse, kable, and ggplot2, as well as pipe-based workflows. Hence, our package can easily be reused or extended.
There are three main options for those interested in extending or contributing to the package, for which we provide full open access to the source code in the package's GitHub repository. First, users can open a ticket requesting extensions or suggesting changes. Second, they can make changes to a local version of the code and open a pull request for software extension or modification using the GitHub website. Finally, users are welcome to e-mail the author and request further enhancements.
The authors have no competing interests to declare.
Bertrand M, Duflo E, Mullainathan S. How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics, 2004; 119(1): 249–275. DOI: https://doi.org/10.1162/003355304772839588
Cameron AC, Gelbach JB, Miller DL. Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 2008; 90(3): 414–427. DOI: https://doi.org/10.1162/rest.90.3.414
Djogbenou AA, MacKinnon JG, Nielsen MØ. Asymptotic theory and wild bootstrap inference with clustered errors. Journal of Econometrics, 2019; 212(2): 393–412. DOI: https://doi.org/10.1016/j.jeconom.2019.04.035
Elgie R, Bucur C, Dolez B, Laurent A. Proximity, candidates, and presidential power: How directly elected presidents shape the legislative party system. Political Research Quarterly, 2014; 67(3): 467–477. DOI: https://doi.org/10.1177/1065912914530514
Esarey J, Menger A. Practical and effective approaches to dealing with clustered data. Political Science Research and Methods, 2018; 1–19. DOI: https://doi.org/10.1017/psrm.2017.42
Harden JJ. A bootstrap method for conducting statistical inference with clustered data. State Politics & Policy Quarterly, 2011; 11(2): 223–246. DOI: https://doi.org/10.1177/1532440011406233
Hayes AF, Cai L. Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation. Behavior Research Methods, 2007; 39(4): 709–722. DOI: https://doi.org/10.3758/BF03192961
Imbens GW, Kolesar M. Robust standard errors in small samples: Some practical advice. Review of Economics and Statistics, 2016; 98(4): 701–712. DOI: https://doi.org/10.1162/REST_a_00552
Jackson J. Corrected standard errors with clustered data. Political Analysis, 2020; 28(3): 318–339. DOI: https://doi.org/10.1017/pan.2019.38
Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika, 1986; 73(1): 13–22. DOI: https://doi.org/10.1093/biomet/73.1.13
MacKinnon JG. How cluster-robust inference is changing applied econometrics. Canadian Journal of Economics, 2019; 52(3): 851–881. DOI: https://doi.org/10.1111/caje.12388
MacKinnon JG, Webb MD. Wild bootstrap inference for wildly different cluster sizes. Journal of Applied Econometrics, 2017; 32(2): 233–254. DOI: https://doi.org/10.1002/jae.2508
Pustejovsky JE, Tipton E. Small-sample methods for cluster-robust variance estimation and hypothesis testing in fixed effects models. Journal of Business & Economic Statistics, 2018; 36(4): 672–683. DOI: https://doi.org/10.1080/07350015.2016.1247004
Roodman D, Nielsen MØ, MacKinnon JG, Webb MD. Fast and wild: Bootstrap inference in Stata using boottest. The Stata Journal, 2019; 19(1): 4–60. DOI: https://doi.org/10.1177/1536867X19830877
White HL. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 1980; 48(4): 817–838. DOI: https://doi.org/10.2307/1912934
Wilkinson GN, Rogers CE. Symbolic descriptions of factorial models for analysis of variance. Applied Statistics, 1973; 22: 392–9. DOI: https://doi.org/10.2307/2346786
Zeileis A. Econometric computing with HC and HAC covariance matrix estimators. Journal of Statistical Software, 2004; 11(10). DOI: https://doi.org/10.18637/jss.v011.i10