Counting the Invisible
Even in the big data era, access to information can be limited by a variety of factors, ranging from political to practical. This is the case, for instance, of casualty records in war scenarios, which often consist of multiple, incomplete and potentially inaccurate lists (compiled, for example, by different NGOs) rather than a single, exhaustive official registry. Giacomo Zanella, BIDSA Affiliate and Assistant Professor of Statistics at the Bocconi Department of Decision Sciences, has developed advanced methodologies that can be used to estimate the total number of victims from such incomplete datasets.
Over the decades, statisticians have developed a broad set of methods for this problem, which is known as population size estimation. For example, capture-recapture methods estimate the population size by examining the intersection between datasets from different sources. Intuitively, if two independently collected lists have few records in common, then each list has probably captured only a small fraction of the population, and we should expect substantial under-reporting: the total population size is likely to be much larger than the number of reported individuals.
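To make the idea concrete, here is a minimal sketch (not drawn from Zanella's work) of the classical two-list Lincoln-Petersen estimator; the list sizes and overlap below are made-up numbers:

```python
# Illustrative two-list capture-recapture (Lincoln-Petersen) estimate.
# The counts below are invented for the example, not real casualty data.

n1 = 1200   # records in list A
n2 = 900    # records in list B
m = 150     # records appearing in both lists (identified via record linkage)

# Lincoln-Petersen estimator: N_hat = n1 * n2 / m.
# A small overlap m relative to n1 and n2 signals heavy under-reporting.
N_hat = n1 * n2 / m
print(f"Estimated population size: {N_hat:.0f}")   # 7200, far above the 1950 reported records
```

Note that the overlap m is itself the output of a record-linkage step, which is why uncertainty in the matching propagates into the final population estimate.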
To apply this capture-recapture approach to the estimation of war casualties, we first need to identify records referring to the same individual across multiple databases, a procedure known as record linkage or entity resolution. When data are potentially inaccurate and unique identifiers are not available, this task is far from trivial and requires a statistical approach. In particular, Bayesian methods are valuable because they quantify uncertainty about the record matching and hence about downstream quantities, in this case the estimated number of victims.
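As a rough illustration of how probabilistic record matching works, the following sketch computes the posterior probability that two noisy records refer to the same person under a simple Fellegi-Sunter-style model; the fields, agreement probabilities and prior are invented for the example and are not those of Zanella's model:

```python
# Minimal sketch of probabilistic record matching (Fellegi-Sunter flavour).
# Two records agree or disagree on a few noisy fields; we compute the
# posterior probability that they refer to the same person.

# Assumed parameters (illustrative only).
m_probs = {"name": 0.95, "date": 0.90, "location": 0.85}   # P(agree | same person)
u_probs = {"name": 0.10, "date": 0.05, "location": 0.30}   # P(agree | different people)
prior_match = 0.01                                          # prior P(same person)

def posterior_match(agreements):
    """agreements: dict field -> bool, whether the two records agree on that field."""
    odds_m, odds_u = prior_match, 1.0 - prior_match
    for field, agree in agreements.items():
        odds_m *= m_probs[field] if agree else 1.0 - m_probs[field]
        odds_u *= u_probs[field] if agree else 1.0 - u_probs[field]
    return odds_m / (odds_m + odds_u)

# Two records agreeing on name and date but not on location:
print(posterior_match({"name": True, "date": True, "location": False}))
```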
Together with an international network of coauthors, Professor Zanella has contributed to the development of Bayesian methods for entity resolution, from both a theoretical and computational point of view.
«Entity resolution», explains Zanella, «can be seen as a clustering task, with clusters consisting of records associated with the same person. In this context, the number of records in each cluster tends to be extremely small compared to the size of the dataset. For example, one might have hundreds of thousands of records partitioned into clusters containing at most five records each. Such a microclustering behaviour is not well captured by traditional Bayesian clustering models, which assume that each cluster contains a non-negligible fraction of the whole population».
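The contrast can be illustrated with a small simulation: under a Chinese restaurant process, a traditional Bayesian clustering prior, the largest cluster keeps growing with the dataset, whereas entity-resolution data look like the toy microclusters below. All numbers are purely illustrative:

```python
# Contrast between a traditional Bayesian clustering prior and microclustering.
import random
random.seed(0)

def crp_cluster_sizes(n, alpha=1.0):
    """Simulate cluster sizes under a Chinese restaurant process with concentration alpha."""
    sizes = []
    for i in range(n):
        # Open a new cluster with probability alpha / (alpha + i),
        # otherwise join an existing cluster proportionally to its size.
        if random.random() < alpha / (alpha + i):
            sizes.append(1)
        else:
            j = random.choices(range(len(sizes)), weights=sizes)[0]
            sizes[j] += 1
    return sizes

n = 100_000
crp_sizes = crp_cluster_sizes(n)
print("CRP prior: largest cluster =", max(crp_sizes), "out of", n, "records")

# Entity-resolution data instead look like this: roughly n records split into
# clusters of size 1-5, so the largest cluster stays tiny even as n grows.
micro_sizes = [random.randint(1, 5) for _ in range(n // 3)]
print("Microclustering: largest cluster =", max(micro_sizes))
```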
This has motivated Zanella to propose new models for microclustering, study their theoretical properties and apply them to entity resolution. Moreover, since traditional computational techniques performed poorly on this new class of models, he has developed and analyzed novel Markov chain Monte Carlo algorithms that have proven to be orders of magnitude more efficient in exploring the discrete space of record linkage configurations. This opens up the possibility of performing Bayesian microclustering with big data, not only in entity resolution, but also in DNA sequencing, language processing and sparse network analysis, among other applications.
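For a sense of what exploring the discrete space of record linkage configurations means in practice, here is a toy single-site Gibbs sampler over partitions of a handful of fake records. It is a generic baseline sketch, not one of the faster algorithms developed in Zanella's work, and the scoring function is invented for illustration:

```python
# Toy Gibbs sampler over record-linkage partitions (generic baseline, illustrative model).
import math, random
random.seed(1)

# Fake records: observed names; exact duplicates are noisy reports of the same person.
records = ["maria", "maria", "john", "john", "john", "ali", "maria"]
n = len(records)

def cluster_log_score(idxs):
    """Log-score of one cluster: reward name agreement, penalize large clusters."""
    names = [records[i] for i in idxs]
    pairs = [(a, b) for k, a in enumerate(names) for b in names[k + 1:]]
    agreements = sum(a == b for a, b in pairs)
    size_penalty = -1.5 * (len(idxs) - 1)   # microclustering-style penalty on cluster size
    return 2.0 * agreements + size_penalty

def log_target(assign):
    """Log posterior (up to a constant) of an assignment record -> cluster label."""
    clusters = {}
    for i, c in enumerate(assign):
        clusters.setdefault(c, []).append(i)
    return sum(cluster_log_score(idxs) for idxs in clusters.values())

assign = list(range(n))                     # start with every record in its own cluster
for sweep in range(200):
    for i in range(n):
        # Candidate moves: join any cluster formed by the other records, or open a new one.
        others = {assign[j] for j in range(n) if j != i}
        labels = sorted(others) + [max(assign) + 1]
        logps = []
        for lab in labels:
            assign[i] = lab
            logps.append(log_target(assign))
        mx = max(logps)
        weights = [math.exp(lp - mx) for lp in logps]
        assign[i] = random.choices(labels, weights=weights)[0]

print(assign)   # records inferred to refer to the same person share a label
```

Single-record updates like this mix slowly when datasets reach hundreds of thousands of records, which is exactly the kind of limitation that more efficient proposal mechanisms are designed to overcome.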
«This project exemplifies my research activity», says Zanella, «which is aimed at a rigorous mathematical understanding of modern statistical and computational methods motivated by real-world applications, in order to develop more effective and reliable methodologies».
Find out more on the Bocconi Knowledge website.