January 21, 2019

## Calibrating Non-probability Samples using Probability Samples

Dr. Jack Kuang Tsung Chen (Program in Survey Methodology PhD 2016), along with SRC researchers Drs. Richard Valliant and Michael Elliott, conducted a study that explores and evaluates approaches to combining data from probability and non-probability surveys. Recently published in the Journal of the Royal Statistical Society: Series C (Applied Statistics) and based on Chen’s dissertation, this work uses data from high-quality probability samples, which produce unbiased estimates but are time-consuming and expensive to collect, to adjust data from non-probability samples, which are cheaply and easily obtained but may be tainted by selection bias.

As senior author Dr. Elliott explains, “… a major issue implied by this type of work is that probability and non-probability sampling are synergistic rather than antagonistic — the proliferation of administrative and non-probability sample data for social science research means that having a small “corral” of high-quality probability samples is more important than ever to assist with calibration, imputation, and other adjustment methods.”

This article developed and advocated *estimated control least absolute shrinkage and selection operator* (ECLASSO) regression. This approach extends the existing LASSO approach by incorporating sampling-related measurement error in benchmark probability-sample data into the variance component of model-assisted calibration estimators. LASSO is useful for calibration because it is a penalized regression: it automates both variable selection and parameter estimation for the calibration model while guarding against overfitting. That model is then used to calibrate a given (likely non-probability) sample to control totals from the probability sample.
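To make the "automated variable selection" point concrete, here is a minimal sketch of an ordinary LASSO fit by coordinate descent. It is not the authors' ECLASSO code; the function names, the toy data, and the penalty value are all assumptions chosen for illustration, and the columns are deliberately centred and uncorrelated so the behaviour is easy to follow.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within the penalty becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimise (1/(2n)) * sum((y - X b)^2) + penalty * sum(|b|).

    X is a list of rows whose columns are centred; y is a centred list.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0   # covariance of column j with the partial residual
            norm = 0.0  # sum of squares of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return beta


# Hypothetical toy data: the outcome depends strongly on the first
# covariate and only very weakly on the second.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-5.9, -3.1, 0.0, 2.9, 6.1]

beta = lasso_coordinate_descent(X, y, penalty=0.5)
print(beta)
```

With this penalty the weak second coefficient is set exactly to zero while the first is kept (and shrunk), which is the selection behaviour that lets a calibration model start from many candidate covariates.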

As the authors explain, their approach “combines both quasi-likelihood and modelling approaches by utilizing a probability-based benchmark sample … (together with) an assisting model to predict an outcome of interest, given a set of calibration variables that exists in both probability and non-probability samples. The outcome variable in the non-probability sample is then calibrated to the predicted outcome total in the probability sample, given the probability sampling weights in the benchmark data.” Simple, right?
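The calibration step in that description can be sketched in a few lines. The example below is a simplified linear (GREG-type) calibration under stated assumptions, not the estimator or variance formula from the paper: the function name, the equal starting weights, and every number are hypothetical. It assumes an assisting model has already produced a predicted outcome for each unit in both samples.

```python
def model_calibration_weights(y_hat, pop_size, y_hat_total):
    """Linear calibration weights w_i = base * (1 + lam0 + lam1 * y_hat_i).

    The weights are forced to satisfy two constraints:
      sum(w_i)            == pop_size     (estimated population size)
      sum(w_i * y_hat_i)  == y_hat_total  (benchmark total of predictions)
    """
    n = len(y_hat)
    base = pop_size / n  # equal starting weights: no design weights exist
    s1 = sum(y_hat)
    s2 = sum(v * v for v in y_hat)
    # 2x2 normal equations for the auxiliary vector x_i = (1, y_hat_i).
    m00, m01, m11 = base * n, base * s1, base * s2
    r0 = pop_size - base * n
    r1 = y_hat_total - base * s1
    det = m00 * m11 - m01 * m01
    lam0 = (m11 * r0 - m01 * r1) / det
    lam1 = (m00 * r1 - m01 * r0) / det
    return [base * (1.0 + lam0 + lam1 * v) for v in y_hat]


# Hypothetical benchmark (probability) sample: design weights and
# model predictions for each respondent.
benchmark_weights = [30.0, 30.0, 40.0]
benchmark_y_hat = [0.3, 0.5, 0.525]
pop_size = sum(benchmark_weights)
y_hat_total = sum(d * v for d, v in zip(benchmark_weights, benchmark_y_hat))

# Hypothetical non-probability sample: predictions and observed outcomes.
nonprob_y_hat = [0.2, 0.4, 0.6, 0.8]
nonprob_y = [0, 0, 1, 1]

weights = model_calibration_weights(nonprob_y_hat, pop_size, y_hat_total)
estimate = sum(w * y for w, y in zip(weights, nonprob_y)) / sum(weights)
print(weights, estimate)
```

In this toy case the unweighted share of positive outcomes is 0.5, and the calibrated estimate moves toward the lower level implied by the benchmark predictions.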

LASSO can accommodate numerous predictors in the calibration model, even when using a small probability sample for calibration. The authors note that “(b)ecause we are relying so heavily in non-probability samples on models that can approximate the expected value of (an outcome variable of interest) to compensate for the lack of design weights, a large number of covariates and, consequently, control totals may be required to obtain accurate models.” Both the non-probability data source and the benchmark probability sample data must contain the same outcome variable and substantively and empirically related covariates.

ECLASSO, like all such calibration approaches, assumes that the full spectrum of values of the model variables in the population has a non-zero probability of being observed in both the analytical and benchmark samples. This places a quality constraint on the non-probability sample: its under-coverage must not be so extreme that some covariate values go unrepresented.
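A rough empirical check of this assumption is to look for covariate combinations that appear in the benchmark sample but never in the non-probability sample. The sketch below is only an illustration: the function name, the two covariates, and the records are hypothetical, and an empty observed cell is a warning sign rather than proof that the selection probability is zero.

```python
def unrepresented_cells(benchmark_rows, nonprob_rows):
    """Return covariate combinations seen in the benchmark sample
    that never occur in the non-probability sample."""
    return sorted(set(benchmark_rows) - set(nonprob_rows))


# Hypothetical records: (age group, education level) per respondent.
benchmark_rows = [
    ("18-34", "college"),
    ("18-34", "no college"),
    ("65+", "college"),
    ("65+", "no college"),
]
nonprob_rows = [
    ("18-34", "college"),
    ("18-34", "college"),
    ("18-34", "no college"),
    ("65+", "college"),
]

missing = unrepresented_cells(benchmark_rows, nonprob_rows)
print(missing)
```

Here the online sample contains no older respondents without a college education, so no reweighting of the observed cases can speak for that group.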

The authors used large non-probability samples from SurveyMonkey with data on voting preferences for US Senate and state gubernatorial candidates in the 2014 midterm elections, along with probability sample data from the same time period collected by the Pew Research Center, to compare the accuracy and efficiency of election outcome estimates using a number of calibration techniques.

They compared ECLASSO to 1) unweighted estimates from the non-probability data, 2) a weighting-class adjustment using census data, 3) propensity score weighting using the Pew data, and 4) the estimated control generalized regression estimator (ECGREG). With the actual election results as the gold standard, ECLASSO-derived point estimates had the smallest average absolute bias and relative error of the approaches compared. This accuracy, paired with standard errors comparable to those obtained via weighting-class adjustment, gave ECLASSO the lowest root mean squared error (RMSE) and the best confidence interval coverage of the five approaches considered.
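For readers unfamiliar with these yardsticks, the sketch below shows how average absolute bias, RMSE, and confidence interval coverage could be computed across a set of races once the true results are known. The function name and all figures are made up for illustration and are not taken from the study; the intervals assume a normal approximation with a 1.96 multiplier.

```python
import math


def evaluate(estimates, truths, std_errors, z=1.96):
    """Summarise accuracy of estimates against known true values."""
    errors = [e - t for e, t in zip(estimates, truths)]
    mean_abs_bias = sum(abs(err) for err in errors) / len(errors)
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    # An interval covers the truth when the error is within z standard errors.
    covered = [abs(err) <= z * se for err, se in zip(errors, std_errors)]
    coverage = sum(covered) / len(covered)
    return mean_abs_bias, rmse, coverage


# Hypothetical vote-share estimates for three races (not study data).
estimates = [0.52, 0.47, 0.55]
truths = [0.50, 0.49, 0.51]
std_errors = [0.02, 0.015, 0.01]

mean_abs_bias, rmse, coverage = evaluate(estimates, truths, std_errors)
print(mean_abs_bias, rmse, coverage)
```

The third race illustrates why accuracy and standard errors have to be judged together: its standard error is the smallest of the three, but the interval misses the true value because the point estimate is off by more than the interval allows.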

Chen and his co-authors also conducted a simulation study using the 2013 National Health Interview Survey (NHIS) as the benchmark study, while also using NHIS to simulate internet-based non-probability samples. Again, ECLASSO outperformed naïve (equal probability) treatment of the biased sample data, as well as GREG and propensity score approaches.

The authors demonstrated that contemporary approaches such as ECLASSO, although complex, can provide improvement over standard linear regression and weighting-class approaches to calibration and imputation.

Although there is much promise in using data from non-probability samples, any sound method of doing so may need to rely on high-quality probability surveys in specific research areas as a source of calibration measures for adjusting the non-probability data. This suggests that the lights will burn brightly in the SRC sampling section for many years to come.

Chen, Jack Kuang Tsung; Valliant, Richard L.; Elliott, Michael R. (2018). Calibrating non‐probability surveys to estimated control totals using LASSO, with an application to political polling. Journal of the Royal Statistical Society: Series C (Applied Statistics).