The post Indirectly we can make it count appeared first on χplain.

Let’s illustrate this with an easy football example – I’ll move on to a League example later, but that’s a little more complex! We have three football teams – Barcelona, Manchester United, and Scunthorpe United. As our direct evidence:

- Man United has played Scunthorpe before and consistently beats them
- Barcelona has played Man United before and consistently beats them

However, Scunthorpe has never played Barcelona before. So we don’t have any direct evidence for comparing them…but does that mean we have to conclude that we don’t know who would win? Of course not – we would say that we know Barcelona would win against Scunthorpe, *even* if they’ve never played before.

One way to formalise this mathematically is by defining a consistency equation:

d(A vs B) = d(A vs C) − d(B vs C)

This means that the relative effect of A vs B is equal to the effect of A vs C minus the effect of B vs C, **provided all other things are equal** (this is the key assumption behind the formula).
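As a sketch of how this works in practice (the numbers here are invented purely for illustration, not real estimates), suppose we had log-odds effects from the direct evidence above:

```python
import math

# Invented log-odds effects from the direct evidence (illustration only):
d_barca_manu = 1.5   # Barcelona vs Man United (Barcelona consistently wins)
d_scun_manu = -2.0   # Scunthorpe vs Man United (Scunthorpe consistently loses)

# Consistency equation: d(A vs B) = d(A vs C) - d(B vs C), with C = Man United
d_barca_scun = d_barca_manu - d_scun_manu   # 1.5 - (-2.0) = 3.5

# Convert the indirect log-odds effect into a win probability for Barcelona
p_barca_wins = 1 / (1 + math.exp(-d_barca_scun))
print(round(p_barca_wins, 3))  # 0.971
```

Even with no direct Barcelona–Scunthorpe games, the indirect route through Man United gives us an estimate – and a confident one at that.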

It seems simple enough, and it is with just three “nodes”. However, it gets increasingly complicated as we build up larger networks of direct and indirect evidence. Below are all the LoL matches played during Worlds 2015. The blue nodes represent the different teams (their size is proportional to the number of games that team played in the tournament). The lines represent that there is **direct evidence** connecting two teams (i.e. they played each other at least once in the tournament).

By using indirect evidence, we can get an idea of the outcome between teams that haven’t played (e.g. H2k and CLG). But we can also use the indirect evidence to supplement our direct evidence, and give us more information about the relationship between say, OG and FW. This means we can make predictions about the outcome with greater certainty.

Of course, as I mentioned before, the consistency equations require that all other things are equal…and in League they very rarely are from game to game! But the things that aren’t equal can, over multiple matches, be expected to be distributed randomly, meaning that our consistency equation still holds. We just have to account for the uncertainty due to the randomness, which we can do during modelling.

This is the type of analysis I use in my job when comparing multiple scientific studies that look at multiple treatments (known as *network meta-analysis*). It’s how we identify which treatments work, and which ones don’t. Indirect evidence is all around us – we just have to find new ways to make use of it! After all, indirect evidence is just extra information, and information is beautiful.


The post Alpha-release sign-up appeared first on χplain.

After many moons of tears, tantrums and tribulations, we’ve got ourselves a working model. But, as any statistician worth his salt knows, no predictive model is complete without validation. And that’s where we need your help.

To be able to participate in our data analysis study, participants must:

- Be League of Legends players of any skill level (but at least level 30)
- Play regularly with a registered team
- Play a minimum of 4 games per week (or 18 games per month) with at least 3 members of your team playing with you in those games

All we need, as a minimum, is for you to register your team with the service so that we can collect your data (no personal data, only League data!). This will really help us improve the model where needed, which will benefit not only League players but also other applications for which this model may be used in the future.

However, as an extra, it would be great if you can:

- Report any bugs you encounter in the service
- Give us feedback on the model and the platform as a whole

In return we can offer:

- First look at a fantastic new platform for team management and analysis
- A first invite to our closed beta launching this Summer
- Pioneer status – we’ll list your summoner name on our website and acknowledge you as the ones who helped us get it all off the ground
- Acknowledgement in any papers we publish in academic journals on the model
- Science collaboration points (cos let’s be honest they’re the most important bit of it all!)

Obviously if you change your mind at any point we’ll remove your details from our database.

So if you want to help us develop League of Legends data analysis and modelling, whilst having an early look at the new platform we’ll be releasing, please sign up!


The post Biases in Data appeared first on χplain.

During development of our model we spent several months trying to identify and account for biases in the data that we thought might affect our results. Careful consideration of these factors is a vital stage in getting valid results from any analysis.

Increasingly, there is a tendency to ignore biases, particularly as we have access to larger and larger sets of data. People assume that with more observations and more variables, biases will disappear, but the very nature of biases means that they don’t. In statistics, there are two explanations for why a particular estimate may deviate from the “true” value.

**Random error:** The first of these explanations is random error, and this can be fixed by looking at a bigger sample. As we increase the size of a sample, the error of an estimate due to random variation tends to zero. Intuitively, this increases our confidence in our estimate – if we only look at two games and see that Zed beats Azir in mid, we are less confident that he’s a good counter for Azir than if we look at 100 games.

**Systematic error (bias):** The second explanation is bias, or error in a particular direction away from the “truth”. Let’s say we have a particular bias (we’ll call it *β*), and we have a statistic that we want to estimate (such as a mean), which we will call *θ*. This bias consistently deviates from the truth, so what we observe is:

*θ* + *β*

No matter how big we make our sample, we cannot remove the directional effect of the bias. Our estimate will become more precise, but even as the random error tends to zero, the bias still exists. This means that even if we had an infinitely large sample size we would still observe *θ* + *β*, meaning that we could never know the true value of *θ*, as we could never separate it from the bias.
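A quick simulation (with made-up numbers) shows both behaviours at once: the random error shrinks as the sample grows, but the bias never goes away.

```python
import random

random.seed(42)

theta = 10.0  # the "true" value we want to estimate
beta = 2.0    # a constant systematic bias in every observation

def estimate(n):
    # Each observation is theta, plus random noise, plus the bias beta
    samples = [theta + random.gauss(0, 5) + beta for _ in range(n)]
    return sum(samples) / n

# As n grows the estimate settles down -- but on theta + beta, not on theta
for n in (10, 1000, 100000):
    print(n, round(estimate(n), 2))
```

However large we make `n`, the printed estimates converge towards 12 (that is, *θ* + *β*), never towards the true value of 10.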

Issues with ignoring biases continue to arise within Big Data analyses, such as during Google’s flu prediction project, where researchers often overestimated flu prevalence due to the ways in which their data were collected. “Big Data” is a confusing word that gets bandied around a lot, but it is generally thought to refer to massive numbers of observations with massive numbers of variables (don’t ask me who decides how many “massive” has to be). We want to believe that these huge swathes of data can be used to answer all of life’s problems, and perhaps they can, but our eagerness to play with these numbers and pieces of information can lead us to ignore often very simple and important biases.

For further reading on what big data can do, and on what it can’t, there’s a great blog post here.


The post The LoL Synergy Model appeared first on χplain.

**Synergy: The interaction of two or more agents or forces so that their combined effect is greater than the sum of their individual effects.**

We’ve spent the past six months developing and implementing a model to do this. Countless nights poring over messy diagrams, confusing algorithms and (as a contrast) reasonably elegant code. Sometimes we loved it, sometimes it made our heads hurt, but we finally have a product that we believe achieves our objectives.

To outline some of the benefits of our model over other LoL team statistical analyses – we use indirect evidence on synergistic relationships:

- to help give more information to champion relationships where data are available
- to give estimates of champion relationships where there are as yet no data available
- to account for relationships within team compositions which might hide specific effects that we want to know about

All this means that we can predict the success of specific team comps which have not been played before within the dataset.

We’ve used one thousand games from solo queue (mixed tiers), available from Riot’s API, as our input for the model. We’d ideally like a larger set of data, but we’re happy to make do with what we’ve got for now! We look at the synergy between every available champion in every lane on every available team, and we inform these relationships from our thousand games of data.

We then pool these synergistic effects to identify the team with the most synergy, based on the data. Doing this also means accounting for various uncertainties and dependencies. For instance, if there were less data to inform a particular team composition, then we had to give our estimate less “weight” in the analysis. This is best achieved by calculating a *confidence interval*, which gives an indication of how confident we are that our estimate is the “true” value.
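As a rough sketch of the idea (using a simple normal-approximation interval for illustration, not our actual model), the interval around a win rate narrows as the number of games informing it grows:

```python
import math

def win_rate_ci(wins, games, z=1.96):
    """95% normal-approximation confidence interval for a win rate."""
    p = wins / games
    se = math.sqrt(p * (1 - p) / games)  # standard error shrinks with more games
    return p - z * se, p + z * se

# Same observed win rate (0.7), very different amounts of data behind it:
print(win_rate_ci(7, 10))      # wide interval -- deserves little weight
print(win_rate_ci(700, 1000))  # narrow interval -- deserves much more weight
```

Team comps with wide intervals contribute less certainty to the final ranking, which is exactly the “weighting” behaviour described above.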

The tricky part of assessing this confidence is that to calculate it accurately we need to consider which of our inputs can be thought of as independent and which cannot. An easy way to illustrate this is that multiple measurements of anything (e.g. blood pressure taken at several time points) are likely to be more similar than multiple measurements from different people. These dependencies therefore need to be properly quantified so that they can be accounted for when calculating the confidence of the estimates.

Once we’d managed this, all we had to do was to rank the teams in order of which we predicted would be the best, and to look at how much confidence we had in these predictions. These two things were the results of our model.


The post What’s so great about video games? appeared first on χplain.

Video games, and in particular e-sports, have huge potential for data analysis. We can collect vast quantities of data on all sorts of attributes…after all, video games are made from ones and zeros, the same ones and zeros that can be used in an analysis. So all we have to do is collect it and utilise it. In this post, I’m going to χplain a little more about why e-sports are so suited to statistical analysis.

Statistical modelling in sport really took off in the 90s, and quite quickly people began to see the huge benefit of this approach to team management in terms of selecting strategies, team composition and new players. The book (and film) *Moneyball: The Art of Winning an Unfair Game* is a great example of this. Just like in conventional sports, we can make use of statistics in e-sports to help improve our game, and to select the best players (and champions in the case of LoL). However, there are a couple of big advantages that e-sports have over conventional sports in terms of the data that we collect.

**Ease of data collection**

Every attribute, position and status of a player can (in theory) be programmatically collected at any time point during a game, which gives us an almost inexhaustible quantity of data to work with. Of course, all this data is not always available, but some game developers, like Riot, are beginning to understand the value of this and to provide detailed datasets from individual games, allowing teams to make personalised analyses of their own data.

In contrast, collecting a lot of the data from conventional sports requires real people to make decisions and classifications and to record these decisions. This brings us to our next issue.

**Measurement error and bias**

In conventional sports, there is always a certain degree of error when measuring a particular variable (or attribute). The speed of a ball, the position of a player – for some sports, many variables like these are recorded using electronic measurements, but even these will always have a certain degree of error. These errors can be random, in which case they reduce the accuracy of our statistical estimates, or they can systematically vary in one direction (e.g. over-estimation of a value), in which case they introduce a bias to the results. Instead of observing the true value X, we observe W, which is equal to X plus some sort of random or non-random measurement error, U.
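A small simulation (with invented numbers) illustrates the difference: purely random measurement error averages out across many measurements, while a systematic error shifts every recorded value in the same direction.

```python
import random

random.seed(1)

# X: the true ball speeds (which, in a conventional sport, we never observe)
true_speeds = [random.uniform(20, 40) for _ in range(5000)]

# W = X + U: the recorded speed is the truth plus measurement error U
noisy = [x + random.gauss(0, 2) for x in true_speeds]         # random error only
biased = [x + random.gauss(0, 2) + 1.5 for x in true_speeds]  # plus a +1.5 bias

mean_true = sum(true_speeds) / len(true_speeds)
mean_noisy = sum(noisy) / len(noisy)
mean_biased = sum(biased) / len(biased)

print(round(mean_noisy - mean_true, 2))   # close to 0: noise cancels out
print(round(mean_biased - mean_true, 2))  # close to +1.5: bias persists
```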

For less objective variables that require classification by a human (such as the type of shot played or the severity of an injury), the potential for measurement error and bias is even greater.

In video games there can be no measurement error. Any and all data on a game can be recorded *exactly *as it is in the game, because the data defines the game. This prevents additional random error from imprecise measurements and, more importantly, also prevents biases that can be introduced during measurement. For example, a referee in football may (consciously or unconsciously) more readily give a yellow-card to a player if they dislike them. These types of biases cannot exist in video games, and therefore cannot confuse our analyses.

So overall, this is why we love video games. As a statistician, having access to large datasets of precise, comparatively unbiased data is a prospect that makes my calculator display error messages of excitement. Having said that, collecting the data is just the first stage of the process. Correctly analysing and interpreting the data, and the relationships that exist within it, are what really make the difference. This is what we aim to achieve through the models we develop and publish, models that you can run on your own personalised data. We’ll post more about that later though…


The post Summing Probabilities – Linear Log-Odds appeared first on χplain.

One of the key calculations underpinning our model is that we need to be able to add together probabilities of winning a game, so that we can combine results from different sets of champions to make a team. But you can’t simply add probabilities, because they are bounded by 0 and 1, meaning that you cannot have a probability greater than 1 or less than 0.

Let’s say the probability of winning in one game with a specific set of champions (set A) was 0.7, and the probability with another set of champions (set B) was 0.6. If we add 0.6 + 0.7 we get 1.3…which is not a valid probability!

Set A

| Champion 1 (lane) | Champion 2 (lane) | Champion 3 (lane) | p Win |
|---|---|---|---|
| Garen (TOP) | Kalista (BOTTOM) | Annie (BOTTOM) | 0.7 |

Set B

| Champion 1 (lane) | Champion 2 (lane) | p Win |
|---|---|---|
| Sejuani (JUNGLE) | Katarina (MIDDLE) | 0.6 |

Full Team

| Champion 1 (lane) | Champion 2 (lane) | Champion 3 (lane) | Champion 4 (lane) | Champion 5 (lane) | p Win |
|---|---|---|---|---|---|
| Garen (TOP) | Kalista (BOTTOM) | Annie (BOTTOM) | Sejuani (JUNGLE) | Katarina (MIDDLE) | 1.3 |

The problem is caused by the non-linearity of the probability scale. The magnitude of the difference between 0.5 and 0.6 is smaller than that between 0.8 and 0.9. On a graph of probability against X, a greater range of X values (2 to 4) occurs between 0.8 and 0.9 than between 0.5 and 0.6 (0 to 0.25) on the X axis.

**Odds not probabilities**

An improvement on this is to consider a win/loss ratio. This is a form of odds – the probability of you winning versus the probability of you not winning. Winning half of your games gives you a win/loss ratio of 1. If you lose more, you go below 1 (but stay above 0); if you win more, you go above 1, with no upper limit. So on this scale the boundaries are 0 and infinity, which means we could keep adding odds together without ever reaching a limit. But the problem is that half the distribution is squeezed between 0 and 1 and the other half is stretched out between 1 and infinity – things here are also non-linear, so they cannot simply be added together.

**A Solution**

So how do we fix this? Well, if we take the natural logarithm of the win/loss ratio (the odds), then the scale becomes linear! The natural logarithm is the logarithm of a number to the base *e* (also written as ln). This might seem a little confusing, and it isn’t particularly relevant exactly what it means here, but if you’re interested then have a Google around for it. In this instance, and in most of statistics, statisticians use the term “log” to mean the natural logarithm (unlike in conventional mathematics).

This handy little fix can be shown by considering the distribution across our new scale. The log of 1 is 0, the log of 0 is negative infinity, and the log of infinity is infinity. So now half of our distribution is between negative infinity and 0, and the other half is between 0 and infinity, meaning that things are evenly spread.

This change from probability to log-odds turns the relationship with X values from a non-linear one into a linear one. Because the distribution is now linear, the gradient is constant, so an increase of 1 in the log-odds is the same at any point along the graph, meaning that we can add and subtract different log-odds and still get valid and meaningful results.
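Putting it all together, here is a sketch of the fix applied to the Set A and Set B example above (assuming, as described, that the two effects can simply be added once they are on the log-odds scale):

```python
import math

def to_log_odds(p):
    # Probability -> natural log of the win/loss ratio (the odds)
    return math.log(p / (1 - p))

def to_probability(log_odds):
    # Inverse transformation, back to a valid probability between 0 and 1
    return 1 / (1 + math.exp(-log_odds))

p_a, p_b = 0.7, 0.6  # win probabilities for champion sets A and B

combined = to_log_odds(p_a) + to_log_odds(p_b)  # 0.847 + 0.405 = 1.253
print(round(to_probability(combined), 3))       # 0.778 -- a valid probability
```

Instead of the impossible 1.3 from the table above, adding on the log-odds scale and transforming back gives a combined win probability of about 0.778, which stays safely between 0 and 1.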

Best wishes,

Hugo – Chinchillarama – Pedder

**Links**

Natural logarithms – http://www.purplemath.com/modules/logs3.htm

