Generalized Variance Functions in Stratified Two-Stage Sampling

Generalized variance functions (GVF's) are used in a number of sample surveys as a convenient method of publishing sampling errors. The method estimates the relative variance (relvariance) of an estimator $\hat{T}$ of a total $T$ with a model of the form $a + b/T$. Using the prediction approach to finite population sampling, some asymptotic theory is presented for estimators of totals that are linear combinations of sample cluster means from stratified, two-stage cluster samples. One choice of GVF estimator is shown to be consistent under a particular class of prediction models. The theory is illustrated by an empirical study in which two-stage stratified samples are selected from a population of households. In the prediction model, units within a stratum have a common mean and variance, units in the same cluster are correlated but units in different clusters are not, and the common variance is a quadratic function of the common mean in a stratum. Bernoulli and Poisson random variables, for example, have the mean-to-variance relationship studied here. Under the model, the approximate prediction-relvariance of $\hat{T}$ has the form $a + b/T$, and the parameters $a$ and $b$ can be estimated by least squares. The theory leads to guidelines for selecting a set of survey variables to use in estimating $a$ and $b$. The theory was tested in a simulation study using household data collected in the U.S. Current Population Survey. Totals of binary variables derived from labor force and demographic data were estimated in 2,000 stratified two-stage samples. Using 45 variables, I estimated GVF's of the form $a + b/T$ and of the form $cT^d$, the latter suggested elsewhere in the literature, from each sample and summarized the results over all samples.
The results illustrate that some GVF's can, for many variables, produce unbiased estimators of relvariance that are more precise than a direct point estimator and that have reasonably good confidence interval coverage properties. The best choice of GVF was based on a weighted least squares fit of the model $a + b/T$. Two limitations of GVF's are that they may not perform particularly well for rare characteristics and that there will inevitably be survey variables for which no GVF will be appropriate.
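As a minimal sketch of how a weighted least squares fit of the model $a + b/T$ might be carried out, the Python fragment below regresses direct relvariance estimates on $1/\hat{T}$ for a set of survey variables. The function names `fit_gvf` and `predict_relvar`, the default weights (inversely proportional to the squared relvariance estimates), and the input numbers are all assumptions made for illustration; they are not taken from the study and need not match the weighting used there.

```python
def fit_gvf(totals, relvars, weights=None):
    """Fit relvar ~ a + b / T by weighted least squares; return (a, b).

    totals  : estimated totals, one per survey variable
    relvars : direct relvariance estimates for those totals
    weights : optional WLS weights; the default 1 / relvar**2 is an
              assumed choice, not necessarily the one used in the study
    """
    if weights is None:
        weights = [1.0 / v ** 2 for v in relvars]
    x = [1.0 / t for t in totals]          # regressor is 1 / T
    sw = sum(weights)
    # Weighted means of the regressor and the response.
    xbar = sum(w * xi for w, xi in zip(weights, x)) / sw
    vbar = sum(w * vi for w, vi in zip(weights, relvars)) / sw
    # Weighted sums of squares and cross-products about the means.
    sxx = sum(w * (xi - xbar) ** 2 for w, xi in zip(weights, x))
    sxv = sum(w * (xi - xbar) * (vi - vbar)
              for w, xi, vi in zip(weights, x, relvars))
    b = sxv / sxx                          # slope on 1 / T
    a = vbar - b * xbar                    # intercept
    return a, b


def predict_relvar(a, b, total):
    """GVF-based relvariance estimate for a given estimated total."""
    return a + b / total


# Hypothetical inputs: estimated totals and direct relvariance estimates.
example_totals = [1500.0, 4000.0, 12000.0, 30000.0, 90000.0]
example_relvars = [0.140, 0.052, 0.019, 0.0075, 0.0031]

a_hat, b_hat = fit_gvf(example_totals, example_relvars)
gvf_relvar = predict_relvar(a_hat, b_hat, 20000.0)
```

Passing `weights=[1.0] * len(totals)` gives the ordinary least squares fit instead. Once `a_hat` and `b_hat` are available, a relvariance for any published total can be read off the fitted curve rather than computed directly.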