Biases in Data

This post is dedicated to biases – systematic errors, or deviations, from the truth. We have them in psychology, in politics, in sociology, and we have them in data. They are the nemeses of any statistical analysis, about as annoying as trying to play Anivia against a Fizz. Some biases can be small, and it may be reasonable to ignore them, but some may be big, and it’s these that we need to identify and try to account for.

During development of our model we spent several months trying to identify and account for biases in the data that we thought might affect our results. Careful consideration of these factors is a vital stage in getting valid results from any analysis.

Increasingly, there is a tendency to ignore biases, particularly as we have access to larger and larger sets of data. People assume that with more observations and more variables, biases will disappear, but the very nature of biases means that they don’t. In statistics, there are two explanations for why a particular estimate may deviate from the “true” value.

Random Error: The first of these explanations is random error, and this can be fixed by looking at a bigger sample. As we increase the size of a sample, the error of an estimate due to random variation tends to zero. Intuitively, this increases our confidence in our estimate – if we only look at two games and see that Zed beats Azir in mid, we are less confident that he’s a good counter for Azir than if we look at 100 games.
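We can see this shrinking random error in a quick simulation. This is a hypothetical sketch, not real match data: we assume a champion wins 52% of games and repeatedly estimate that win rate from samples of different sizes.

```python
import random
import statistics

random.seed(42)

TRUE_WIN_RATE = 0.52  # assumed "true" value for this toy example

def estimate_win_rate(n_games):
    """Estimate the win rate from n_games simulated coin-flip games."""
    wins = sum(1 for _ in range(n_games) if random.random() < TRUE_WIN_RATE)
    return wins / n_games

# Repeat the experiment many times at each sample size and look at
# how much the estimates vary around the truth.
small_samples = [estimate_win_rate(10) for _ in range(1000)]
large_samples = [estimate_win_rate(1000) for _ in range(1000)]

print("spread with 10 games per sample:  ", statistics.stdev(small_samples))
print("spread with 1000 games per sample:", statistics.stdev(large_samples))
```

The spread of the estimates falls by roughly a factor of ten as the sample grows from 10 to 1000 games – the usual $1/\sqrt{n}$ behaviour of random error.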

Systematic error (bias): The second explanation is bias, or error in a particular direction away from the “truth”. Let’s say we have a particular bias (we’ll call it $\theta$), and we have a statistic that we want to estimate (such as a mean), which we will call ${X}$. This bias consistently deviates from the truth, so what we observe is: ${X} + \theta$

No matter how big we make our sample, we cannot remove the directional effect of the bias. Our estimate will become more precise, but even as the random error tends to zero, our bias still exists. This means that even if we had an infinitely large sample size we would still observe ${X} + \theta$, meaning that we could never know the true value of ${X}$ as we could never separate it from the bias.
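A small simulation makes the contrast concrete. Here the true value $X$ and the bias $\theta$ are invented for illustration: every observation is shifted by the same $\theta$, so the estimate converges to $X + \theta$ no matter how large the sample gets.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 0.50  # the "truth" X (hypothetical)
THETA = 0.05       # a systematic bias, e.g. data drawn only from high-MMR games

def biased_sample(n):
    # Every observation carries the same directional shift THETA,
    # regardless of how many observations we collect.
    return [random.gauss(TRUE_VALUE, 0.1) + THETA for _ in range(n)]

for n in (100, 10_000, 1_000_000):
    estimate = statistics.fmean(biased_sample(n))
    print(f"n = {n:>9}: estimate = {estimate:.4f}")
```

As $n$ grows the random wobble dies away, but the estimates settle on $X + \theta = 0.55$, not on the true value $0.50$ – extra data buys precision, not correctness.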

Issues with ignoring biases continue to arise within Big Data analyses, such as during Google’s flu prediction project, where researchers repeatedly overestimated flu prevalence due to the ways in which their data were collected. “Big Data” is a confusing term that gets bandied around a lot, but it generally refers to massive numbers of observations with massive numbers of variables (don’t ask me who decides how many counts as “massive”). We want to believe that these huge swathes of data can be used to answer all of life’s problems, and perhaps they can, but our eagerness to play with these numbers and pieces of information can lead us to ignore often very simple and important biases.

For further reading on what big data can do, and on what it can’t, there’s a great blog post here.