Bayesian Inference for the Finite Population Total from a Heteroscedastic Probability Proportional to Size Sample

Inference for the population total from probability proportional to size (PPS) sampling provides a comparison of design-based and model-based approaches to survey inference for an important practical design. The usual design-based approach weights sampled units by the inverse of their inclusion probabilities, using the Horvitz-Thompson or Hájek estimators. The model-based approach predicts the outcome for non-sampled units based on a regression of the outcome on the size variable. Zheng and Little (2003) showed that this regression approach, based on a flexible penalized spline regression model, can provide superior inferences to Horvitz-Thompson or generalized regression, in terms of both precision and confidence coverage. However, this approach exploits the sizes of non-sampled units, and this information is rarely included in public-use data files. Little and Zheng (2007) showed that when the sizes of non-sampled units are not available, the spline model, combined with a Bayesian bootstrap (BB) model for predicting the non-sampled sizes, can still provide superior inferences, though the gains were reduced and less consistent. We further develop these methods by (a) including an unknown parameter to model heteroscedastic error variance in the spline model, an important modeling feature in the PPS setting; and (b) providing an improved Bayesian method for including summary information about the aggregate size of non-sampled units. Simulation studies suggest that the resulting Bayesian method, which includes information on the number and total size of the non-sampled units, recovers most of the information in the individual sizes of the non-sampled units and provides significant gains over the traditional Horvitz-Thompson estimator. The method is applied to two public-use data sets from the U.S. Census Bureau as well as a data set from the U.S. Energy Information Administration.
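
For orientation, the design-based baselines named above can be written in a few lines. The Python sketch below is illustrative only and is not the spline-based Bayesian method developed in this paper. The function names are hypothetical, and the inclusion probabilities are assumed to take the form pi_i = n * x_i / sum(x), which holds for fixed-size PPS designs only when no unit has n * x_i larger than the total size.

```python
from typing import List, Sequence


def pps_inclusion_probs(sizes: Sequence[float], n: int) -> List[float]:
    """Inclusion probabilities pi_i = n * x_i / sum(x) for a fixed-size PPS design.

    Assumption: no unit is large enough to give pi_i > 1.
    """
    total_size = sum(sizes)
    probs = [n * x / total_size for x in sizes]
    if any(p > 1.0 for p in probs):
        raise ValueError("some pi_i exceed 1; such units need separate handling")
    return probs


def horvitz_thompson_total(y: Sequence[float], pi: Sequence[float]) -> float:
    """Horvitz-Thompson estimate of the total: sum of y_i / pi_i over the sample."""
    return sum(yi / p for yi, p in zip(y, pi))


def hajek_total(y: Sequence[float], pi: Sequence[float], N: int) -> float:
    """Hajek estimate of the total: N times the inverse-probability-weighted mean."""
    weights = [1.0 / p for p in pi]
    weighted_mean = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
    return N * weighted_mean


# Toy population of N = 4 units with known sizes; a sample of n = 2 is drawn.
sizes = [10.0, 20.0, 30.0, 40.0]
pi_all = pps_inclusion_probs(sizes, n=2)

# Suppose units with sizes 20 and 40 were sampled, with made-up outcomes.
sample_pi = [pi_all[1], pi_all[3]]
sample_y = [22.0, 38.0]

ht_estimate = horvitz_thompson_total(sample_y, sample_pi)
hajek_estimate = hajek_total(sample_y, sample_pi, N=len(sizes))
print(ht_estimate, hajek_estimate)
```

When the outcome is exactly proportional to size, the Horvitz-Thompson estimate reproduces the population total for any sample under these inclusion probabilities. That is why PPS designs are attractive, and why departures from proportionality, including heteroscedastic errors, matter for the comparison with model-based prediction.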