Robust inference for linear models
Stata 18 offers more precise standard errors and confidence intervals (CIs) for three commonly used linear models in Stata: regress, areg, and xtreg, fe.
Highlights
- Multiway cluster–robust standard errors
- HC2 standard errors:
  - Degrees-of-freedom adjustment
  - Cluster–robust
  - Cluster–robust and degrees-of-freedom adjustment
- Wild cluster bootstrap confidence intervals and p-values
Small number of clusters? Uneven number of observations per cluster? Use HC2 with degrees-of-freedom adjustment, option vce(hc2 …, dfadjust), or wild cluster bootstrap to obtain valid inference.
Multiple nonnested clusters? Use multiway clustering, option vce(cluster group1 group2 … groupk), to account for potential correlation of observations within different clusters.
Let’s see it work
We have a panel of individuals and would like to study the effect of belonging to a union (union) on the log of wages (ln_wage). We control for whether the individual has a college degree (collgrad), for length of job tenure (tenure), and for time fixed effects.
We compare several methods of computing standard errors: robust, cluster–robust, cluster–robust HC2 with degrees-of-freedom adjustment, and two-way clustering. The second and third methods account for correlation at the industry level. The last method accounts for correlation at both the industry level and occupation level. In our example, we have only 12 clusters, too few to rely comfortably on the usual asymptotic approximation, which assumes that the number of clusters grows with the sample size. We restrict our sample to observations where the industry code ind_code is available, and we store each set of estimation results for later comparison. We type
. webuse nlswork
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)
. keep if ind_code!=.
(341 observations deleted)
. quietly regress ln_wage tenure union collgrad i.year, vce(robust)
. estimates store robust
. quietly regress ln_wage tenure union collgrad i.year, vce(cluster ind_code)
. estimates store cluster
. quietly regress ln_wage tenure union collgrad i.year, vce(hc2 ind_code, dfadjust)
. estimates store HC2
. quietly regress ln_wage tenure union collgrad i.year, vce(cluster idcode ind_code)
. estimates store multiway
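Because the comparison hinges on having few clusters of uneven size, it can be useful to check both directly. The following is a minimal sketch using standard commands; the variables n_in_cluster and first_in_cluster are hypothetical helpers introduced only for illustration. The counts describe the restricted dataset, so the estimation sample may be smaller wherever a regressor is missing.

```stata
* Reload the data and apply the same sample restriction as above
webuse nlswork, clear
keep if !missing(ind_code)

* One row per industry: the number of rows is the number of clusters,
* and the frequencies show how unevenly observations are spread
tabulate ind_code

* Summarize cluster sizes, using one observation per cluster
* (n_in_cluster and first_in_cluster are hypothetical helper variables)
bysort ind_code: generate long n_in_cluster = _N
egen byte first_in_cluster = tag(ind_code)
summarize n_in_cluster if first_in_cluster
```

A large gap between the minimum and maximum cluster sizes reported by summarize is the kind of imbalance that motivates the HC2 adjustment and the wild cluster bootstrap.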
Instead of looking at all the regression output tables, we combine them into an estimates table by using etable.
We asked etable to use the estimates we stored and to present only the CIs, cstat(_r_ci, …), for the coefficient on union, keep(union). We then export the table to the .html table you see on this page, export(setable.html, replace).
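The etable call itself is not reproduced in this text. A call consistent with the description might look like the sketch below, written in do-file form. It assumes the four sets of results are stored under the names used earlier, and it omits the suboptions elided in cstat(_r_ci, …), so the formatting of the resulting table may differ from the one on the original page.

```stata
* Recreate the stored results so the sketch runs on its own
webuse nlswork, clear
keep if !missing(ind_code)
quietly regress ln_wage tenure union collgrad i.year, vce(robust)
estimates store robust
quietly regress ln_wage tenure union collgrad i.year, vce(cluster ind_code)
estimates store cluster
quietly regress ln_wage tenure union collgrad i.year, vce(hc2 ind_code, dfadjust)
estimates store HC2
quietly regress ln_wage tenure union collgrad i.year, vce(cluster idcode ind_code)
estimates store multiway

* Show only the CIs for the coefficient on union and export to HTML
* (/// continues a line in a do-file; type it on one line interactively)
etable, estimates(robust cluster HC2 multiway) ///
    cstat(_r_ci)                               ///
    keep(union)                                ///
    export(setable.html, replace)
```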
The CIs are the narrowest with robust standard errors. They are the widest with HC2 degrees of freedom–adjusted standard errors. In the latter case, 0 is inside the CI, which suggests we should be careful when interpreting the effect of belonging to a union on wages. This is in contrast with the conclusion we would have made had we used only robust standard errors. Finally, there appears to be little difference between clustering at the industry level and clustering at both industry and occupation levels.
We can also use wild cluster bootstrap to account for a small number of clusters and an unequal number of observations per cluster. It is implemented in the new wildbootstrap command. We describe this feature in detail in Wild cluster bootstrap, but let’s also use it here for comparison.
. wildbootstrap regress ln_wage tenure union collgrad i.year, cluster(ind_code) coefficients(union) rseed(111)
wildbootstrap calls regress, so after it finishes, you can still access the regress results. In addition, wildbootstrap computes wild cluster bootstrap p-values for the null hypothesis that a coefficient is 0, together with the corresponding CIs. By default, it reports these for all coefficients, but you may select which ones you would like to study. We focus on union. Because we are resampling at the cluster level, we specify the ind_code variable in cluster(), and we set a seed for reproducibility.
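To illustrate the point about results remaining accessible, the sketch below reruns the command and then inspects what is left in memory. It assumes, as stated above, that the regress coefficients are still available afterward, so that _b[union] resolves; the exact set of stored results is best checked with ereturn list.

```stata
* Same data and sample restriction as above
webuse nlswork, clear
keep if !missing(ind_code)

* Wild cluster bootstrap for the coefficient on union
wildbootstrap regress ln_wage tenure union collgrad i.year, ///
    cluster(ind_code) coefficients(union) rseed(111)

* List the stored estimation results left behind by the command
ereturn list

* Assumption: the regress point estimates remain accessible
display "Point estimate for union: " _b[union]
```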
The CI reported by wildbootstrap is almost as wide as the one we obtained with HC2 standard errors. Although 0 is not in the CI, the width of the interval suggests considerable uncertainty about the point estimate.