We propose to develop and evaluate simple, variable-specific indices of non-ignorable selection bias for researchers in the health sciences working with data collected from non-probability samples. Classical methods of scientific probability sampling and corresponding “design-based” frameworks for making statistical inferences about populations have long been used in the health sciences to advance scientific knowledge. The random selection of elements from a population of interest into a probability sample, where all population elements have a known non-zero probability of selection, ensures that elements included in the sample mirror the population in expectation. That is, for all variables of interest, the mechanism of selection of a subset of elements into the sample is ignorable, in the sense of the theoretical framework for missing data mechanisms originally introduced by Donald Rubin. Unfortunately, the modern survey research environment has had a severe negative impact on these “tried and true” methods of survey research that rely on probability samples: sampled units have become increasingly difficult to contact, survey response rates continue to decline in all modes of administration (face-to-face, telephone, etc.), and the costs of collecting and maintaining scientific probability samples are steadily rising. Health science researchers are thus turning to volunteers and relatively inexpensive sources of “big data” collected from samples where the probabilities of selection are unknown (e.g., commercial databases, or social media platforms like Twitter and Facebook).
A key question that arises from analyses of these non-probability samples is how accurate the resulting population inferences are. If the mechanisms underlying selection into the non-probability sample depend on the variables of research interest, then estimates of population parameters may well be biased. The proposed research aims to draw on recent developments in the survey statistics literature on assessing the bias that arises from non-ignorable nonresponse in surveys, and to develop simple but novel model-based indices of non-ignorable selection bias for non-probability samples, along with methods for adjusting population inferences based on those indices. The proposed indices offer an advantage over competing indices in that they focus on specific survey variables. They are also entirely model-based, enabling researchers to develop appropriate models relating their key survey measures to auxiliary variables that are known for the responding persons and available in aggregate for the target population. The expected impact of this research is broad: it will enable researchers to quantify (and adjust for) the bias in estimates arising from non-probability samples.