Sampling as an Epidemic Process

Tags: Statistics

Respondent-driven sampling (RDS) is an approach to sampling design and analysis that uses chain referral to exploit the networks of social relationships connecting members of the target population. It is especially useful when sampling stigmatized groups, such as injection drug users, sex workers, and men who have sex with men. In our latest contribution, just published in Biometrics, Yakir Berchenko, Simon Frost, and I take a look at RDS and cast the sampling as a stochastic epidemic. This view allows us to analyze RDS using the likelihood framework, which was previously impossible. In particular, it allows us to debias population prevalence estimates and to estimate the population size! The likelihood framework also allows us to add Bayesian regularization and to debias risk estimates à la AIC or cross-validation, none of which was possible without the sampling distribution.

I particularly like this project because it is a real end-to-end statistical challenge, with nice theory, computational considerations, and a deliverable R package:

A widely applicable problem: sampling hidden populations is both very important and a real challenge to classical sampling techniques. RDS is also a potential tool for analyzing “Facebook samples”, which are becoming more prevalent.
The theory: viewing the sampling as a stochastic epidemic, an idea due to Yakir, allows us to link the sampling literature to the vast corpus of knowledge on epidemics, software reliability, and counting processes.
A computational challenge: the likelihood function implied by the stochastic epidemic is essentially a stochastic differential equation. The counting-process literature allowed us to state the likelihood directly, observe that it is separable, and solve the maximum-likelihood problem efficiently.
An R package: our RDS estimator, with the numerical “tricks” above, is implemented in the chords package, available from CRAN.
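
To make the counting-process view a bit more concrete, here is a minimal Python sketch of a toy version of the idea. It is not the chords implementation and not the model from the paper. It assumes a single homogeneous population of size N, one active recruiter at all times, and a recruitment rate of beta * (N - j) once j people have been sampled (the same form as the Jelinski-Moranda model from software reliability). Under these assumptions the waiting times between recruitments are exponential, beta can be profiled out in closed form for any fixed N, and N can then be found by a one-dimensional search. All function names are hypothetical.

```python
import math
import random


def simulate_sampling(N, beta, n_samples, seed=0):
    """Simulate waiting times between recruitments in the toy model.

    Assumption (not from the paper): after j recruits, the next one
    arrives at rate beta * (N - j), i.e. one recruiter, one degree class.
    """
    rng = random.Random(seed)
    return [rng.expovariate(beta * (N - j)) for j in range(n_samples)]


def profile_loglik(N, waits):
    """Profile log-likelihood of N, with beta maximized out analytically.

    Full log-likelihood: sum_j [log(beta) + log(N - j)] - beta * exposure,
    where exposure = sum_j (N - j) * waits[j]. For fixed N it is maximized
    at beta_hat = n / exposure.
    """
    n = len(waits)
    exposure = sum((N - j) * w for j, w in enumerate(waits))
    beta_hat = n / exposure
    loglik = n * math.log(beta_hat) + sum(math.log(N - j) for j in range(n)) - n
    return loglik, beta_hat


def estimate_population_size(waits, max_N):
    """Grid search over integer N in [n, max_N] for the profile maximum."""
    n = len(waits)
    N_hat = max(range(n, max_N + 1), key=lambda N: profile_loglik(N, waits)[0])
    return N_hat, profile_loglik(N_hat, waits)[1]


if __name__ == "__main__":
    waits = simulate_sampling(N=500, beta=0.01, n_samples=300, seed=1)
    N_hat, beta_hat = estimate_population_size(waits, max_N=5000)
    print("estimated N:", N_hat, "estimated beta:", beta_hat)
```

Because beta drops out in closed form, the search is over a single integer parameter, which is a small-scale analogue of the separability mentioned above. In this toy model the maximum can sit at the upper end of the grid for some samples, so the estimate should be read together with the chosen max_N.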

Written on February 22, 2017