Big Data for Finite Population Inference: Applying Quasi-Random Approaches to Naturalistic Driving Data Using Bayesian Additive Regression Trees

Big Data are a “big challenge” for finite population inference. Lack of control over data-generating processes by researchers in the absence of a known random selection mechanism may lead to biased estimates. Further, larger sample sizes increase the relative contribution of selection bias to squared or absolute error. One approach to mitigate this issue is to treat Big Data as a random sample and estimate the pseudo-inclusion probabilities through a benchmark survey with a set of relevant auxiliary variables common to the Big Data. Since the true propensity model is usually unknown, and Big Data tend to be poor in such variables that fully govern the selection mechanism, the use of flexible non-parametric models seems to be essential. Traditionally, a weighted logistic model is recommended to account for the sampling weights in the benchmark survey when estimating the propensity scores. However, handling weights is a hurdle when seeking a broader range of predictive methods. To further protect against model misspecification, we propose using an alternative pseudo-weighting approach that allows us to fit more flexible modern predictive tools such as Bayesian Additive Regression Trees (BART), which automatically detect non-linear associations as well as high-order interactions. In addition, the posterior predictive distribution generated by BART makes it easier to quantify the uncertainty due to pseudo-weighting. Our simulation findings reveal further reduction in bias by our approach compared with conventional propensity adjustment method when the true model is unknown. Finally, we apply our method to the naturalistic driving data from the Safety Pilot Model Deployment using the National Household Travel Survey as a benchmark.