``Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records.''



Since most social science research relies upon multiple data sources, merging data sets is an essential part of researchers' workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data sets may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. Existing algorithms that automate the merging process do not scale, fail to identify many matches, and require arbitrary decisions by researchers. We develop a fast and scalable algorithm to implement the canonical probabilistic model of record linkage. The proposed methodology can efficiently handle millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merge campaign contribution records, survey data, and nationwide voter files. Open-source software is available for implementing the proposed methodology. (Last revised, July 2017)
Our method has been used to validate self-reported turnout in the 2016 American National Election Study (ANES). The turnout validation data are available at the ANES website. See this paper for details.
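
To make the "canonical probabilistic model of record linkage" concrete, here is a minimal Python sketch of a two-class mixture over field-agreement patterns, fitted by EM. It is an illustration under simplifying assumptions, not the fastLink implementation: agreement is binary per field, fields are treated as conditionally independent given match status, and a missing field (coded None) is simply skipped, which corresponds to a missing-at-random assumption. The function name fit_fellegi_sunter, the starting values, and the toy counts are all hypothetical.

```python
import math


def _clip(p, eps=1e-6):
    """Keep a probability away from 0 and 1 so that log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def fit_fellegi_sunter(patterns, n_iter=200):
    """Fit a match / non-match mixture to agreement patterns by EM.

    patterns: one tuple per record pair; each entry is 1 (field agrees),
              0 (field disagrees), or None (field missing in either record).
    Returns (lam, m, u, post): the estimated match share, per-field agreement
    probabilities among matches (m) and non-matches (u), and the match
    probability of each pair from the final E-step.
    """
    k = len(patterns[0])
    lam = 0.1            # assumed starting share of true matches
    m = [0.9] * k        # assumed starting P(agree | match)
    u = [0.1] * k        # assumed starting P(agree | non-match)
    post = []
    for _ in range(n_iter):
        # E-step: posterior match probability of every pair.
        post = []
        for g in patterns:
            log_m, log_u = math.log(lam), math.log(1.0 - lam)
            for j, a in enumerate(g):
                if a is None:        # missing field: contributes nothing
                    continue
                log_m += math.log(m[j] if a else 1.0 - m[j])
                log_u += math.log(u[j] if a else 1.0 - u[j])
            d = log_u - log_m
            post.append(0.0 if d > 700 else 1.0 / (1.0 + math.exp(d)))
        # M-step: weighted agreement frequencies over observed fields only.
        lam = _clip(sum(post) / len(post))
        for j in range(k):
            obs = [(p, g[j]) for p, g in zip(post, patterns) if g[j] is not None]
            w_m = sum(p for p, _ in obs)
            w_u = sum(1.0 - p for p, _ in obs)
            if w_m > 0 and w_u > 0:
                m[j] = _clip(sum(p * a for p, a in obs) / w_m)
                u[j] = _clip(sum((1.0 - p) * a for p, a in obs) / w_u)
    return lam, m, u, post


# Made-up toy counts for three fields (e.g. first name, last name, birth year).
pairs = ([(1, 1, 1)] * 30 + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 5 + [(0, 1, 1)] * 5
         + [(1, None, 1)] * 3 + [(1, 0, 0)] * 50 + [(0, 1, 0)] * 50
         + [(0, 0, 1)] * 50 + [(0, 0, 0)] * 800)
lam, m, u, post = fit_fellegi_sunter(pairs)
```

The posterior probabilities in post are what a post-merge analysis could use as weights to carry forward uncertainty about the merge, rather than treating every declared match as certain. This sketch loops over every record pair, so it does not scale to millions of observations the way the proposed algorithm is designed to.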


Enamorado, Ted, Benjamin Fifield, and Kosuke Imai. ``fastLink: Fast Probabilistic Record Linkage.'' Available through the Comprehensive R Archive Network (CRAN).

© Kosuke Imai
 Last modified: Thu Mar 1 08:03:40 EST 2018