I have a Ph.D. in Statistics & Methodology. I can help you collect high-quality data and better understand the quality of the data you already have.
Surveys are a vital tool for understanding public opinion and knowledge, but they can also yield biased estimates of behavior. Here we explore a popular and important behavior that is frequently measured in public opinion surveys: news consumption. Previous studies have shown that television news consumption is consistently overreported in surveys relative to passively collected behavioral data. First, we validate these earlier findings, showing that they continue to hold despite large shifts in news consumption habits over time, while also adding some new nuance regarding question wording. Second, we extend these findings to survey reports of online and social media news consumption, with respect to both levels and trends. Third, we demonstrate the usefulness of passively collected data for measuring a quantity such as “consuming news” for which different researchers might reasonably choose different definitions. Finally, recognizing that passively collected data suffer from their own limitations, we outline a framework for using a mix of passively collected behavioral and survey-generated attitudinal data to accurately estimate consumption of news and related effects on public opinion and knowledge, conditional on media consumption.
Motivated misreporting occurs when respondents give incorrect responses to survey questions to shorten the interview; studies have detected this behavior across many modes, topics, and countries. This paper tests whether motivated misreporting affects responses in a large survey of household purchases, the U.S. Consumer Expenditure Interview Survey. The data from this survey inform the calculation of the official measure of inflation, among other uses. Using a parallel web survey and multiple imputation, this paper estimates the size of the misreporting effect without experimentally manipulating questions in the survey itself. Results suggest that household purchases are underreported by approximately 5 percentage points in three sections of the first wave of the survey. The approach used here, involving a web survey built to mimic the expenditure survey, could be applied in other large surveys where budget or logistical constraints prevent experimentation.
The U.S. Consumer Expenditure Interview Survey asks many filter questions to identify the items that households purchase. Each reported purchase triggers follow-up questions about the amount spent and other details. We test the hypothesis that respondents learn how the questionnaire is structured and underreport purchases in later waves to reduce the length of the interview. We analyze data from 10,416 four-wave respondents over two years of data collection. We find no evidence of decreasing data quality over time; instead, panel respondents tend to give higher quality responses in later waves. The results also hold for a larger set of two-wave respondents.
Several studies have shown that high response rates are not associated with low bias in survey data. This paper shows that, for face-to-face surveys, the relationship between response rates and bias is moderated by the type of sampling method used. Using data from Rounds 1 through 7 of the European Social Survey, we develop two measures of selection bias, then build models to explore how sampling method, response rate, and their interaction affect selection bias. When interviewers are involved in selecting the sample of households or respondents for the survey, high reported response rates can in fact be a sign of poor data quality. We speculate that the positive association detected between response rates and selection bias is because of interviewers’ incentives to select households and respondents who are likely to complete the survey.
Administrative data are increasingly important in statistics, but, like other types of data, may contain measurement errors. To prevent such errors from invalidating analyses of scientific interest, it is therefore essential to estimate the extent of measurement errors in administrative data. Currently, however, most approaches to evaluate such errors involve either prohibitively expensive audits or comparison with a survey that is assumed perfect. We introduce the “generalized multitrait-multimethod” (GMTMM) model, which can be seen as a general framework for evaluating the quality of administrative and survey data simultaneously. This framework allows both survey and administrative data to contain random and systematic measurement errors. Moreover, it accommodates common features of administrative data such as discreteness, nonlinearity, and nonnormality, improving on similar existing models. The use of the GMTMM model is demonstrated by application to linked survey-administrative data from the German Federal Employment Agency on income from employment, and a simulation study evaluates the estimates obtained and their robustness to model misspecification. Supplementary materials for this article are available online.
Helping data scientists collect more accurate training data, decreasing the cost and time needed to train models
Alternative sampling approaches which do not depend on up-to-date census data or interviewer involvement
The incentives of those producing data impact the quality of the data