Scaling up Data Science: Turning an Art into Science
In the “good ol’ days” – if they ever existed – we used to have models that aimed to explain phenomena through a limited number of parameters, and we could test them on a small amount of data. When we collected new data, we only had to feed it into the model and compute the outcome.
Nowadays, statistical and machine learning models can have millions of parameters, and we can collect billions of heterogeneous datapoints from different sources; no computer in the world can process such quantities in a reasonable amount of time. That is what computational algorithms are for: they are procedures that arrive at approximately the same results as the original model, but in a simpler and faster way.
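As a simple illustration of this trade-off (a generic textbook example, not one drawn from Zanella's project), a Monte Carlo method replaces an exact but potentially intractable computation with a fast approximation whose accuracy improves as more random samples are drawn:

```python
import numpy as np

# Toy illustration: estimate E[f(X)] for X ~ N(0, 1).
# The "exact" route would integrate f(x) * pdf(x) over the real line;
# the Monte Carlo route approximates it with a sample average, and the
# error shrinks like 1/sqrt(n) no matter how complicated f is.

rng = np.random.default_rng(0)

def f(x):
    return np.exp(-x**2)  # any integrand of interest

n = 100_000
samples = rng.standard_normal(n)
estimate = f(samples).mean()                      # fast approximation
std_error = f(samples).std(ddof=1) / np.sqrt(n)   # uncertainty of the estimate

print(f"Monte Carlo estimate: {estimate:.4f} ± {std_error:.4f}")
# Exact value is 1/sqrt(3) ≈ 0.5774, for comparison.
```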
There are some issues, though. We don't always understand exactly why a computational algorithm works and, even when it does, we can't be sure it will perform as well on different or considerably larger datasets.
“This lack of understanding results in the routine use of inefficient and largely suboptimal algorithms, and makes the design of efficient algorithms for practically used models something of an art,” said Giacomo Zanella, Assistant Professor at the Bocconi Department of Decision Sciences.
Zanella obtained a €1.5 million Starting Grant from the European Research Council (ERC) to better understand computational algorithms for large-scale probabilistic models, thus turning their design from an art into a science. The project (PrSc-HDBayLe – Provable scalability for high-dimensional Bayesian Learning) aims to single out the most promising algorithms using rigorous and innovative mathematical techniques, and to produce guidelines for improving them and developing new ones.
The algorithms Zanella studies have three properties: they are commonly used (“I want to develop knowledge relevant to practitioners,” he said), provably scalable and reliable. In a scalable algorithm, the computing time needed to produce a result increases only linearly, i.e. in proportion to the number of datapoints or parameters: twice the data, twice the time. Such algorithms promise to stay manageable even as the number of parameters and datapoints continues to increase.
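To make the notion of scalability concrete, here is a toy Python sketch (purely illustrative, not an algorithm from the project) comparing a linear-cost computation with a quadratic-cost one as the dataset doubles in size:

```python
import time
import numpy as np

# A linear-cost computation stays manageable as the data grow,
# while a quadratic-cost one quickly does not.

def linear_cost(x):
    return x.sum()                 # O(n): touches each datapoint once

def quadratic_cost(x):
    return np.outer(x, x).sum()    # O(n^2): touches every pair of datapoints

for n in (1_000, 2_000, 4_000):    # doubling the data each time
    x = np.random.default_rng(0).standard_normal(n)
    t0 = time.perf_counter(); linear_cost(x); t_lin = time.perf_counter() - t0
    t0 = time.perf_counter(); quadratic_cost(x); t_quad = time.perf_counter() - t0
    print(f"n={n:>5}  linear: {t_lin:.5f}s  quadratic: {t_quad:.5f}s")

# Twice the data roughly doubles the linear time but quadruples the quadratic time.
```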
Reliability can only be guaranteed by a correct understanding of an algorithm's workings. This includes providing a rigorous quantification of the uncertainty associated with the result of the analysis, as is commonly done in Bayesian statistical models, which will be the focus of the project.
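The following minimal sketch, a standard Beta-Binomial textbook example rather than a model from the project, shows what quantifying uncertainty means in a Bayesian analysis: the output is a posterior distribution and a credible interval, not just a single number:

```python
from scipy import stats

# Estimating a success probability from 12 successes in 40 trials
# with a uniform Beta(1, 1) prior; the posterior is Beta(13, 29).

successes, trials = 12, 40
posterior = stats.beta(1 + successes, 1 + trials - successes)

point_estimate = posterior.mean()
lower, upper = posterior.interval(0.95)   # 95% credible interval

print(f"Posterior mean: {point_estimate:.3f}")
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
# The interval, not just the point estimate, is reported: this is the
# rigorous quantification of uncertainty a Bayesian model provides.
```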
“My field is Computational Statistics,” Zanella said, “an intrinsically interdisciplinary field at the interface of Statistics, Machine Learning and Applied Mathematics. My research approach is at the intersection of methodology (designing algorithms that are both scalable and reliable) and theory (proving they are scalable).”
The results of the project will help address the statistical and computational challenges posed by high dimensionality (the increasing number of features recorded per individual); the potential presence of interactions (the virtually infinite combinations of features that could influence the actual outcome); missing data and sampling bias; and the need to combine data from different sources (e.g. multiple databases with varying degrees of reliability, individual- vs aggregate-level data, etc.).
These challenges routinely arise in real-life data science problems, with examples ranging from estimating the number of war victims from incomplete reports to predicting election outcomes by combining different sources of big, wide, and dirty data.
Source: Bocconi Knowledge