Can statistical analysis of match data be used to gain a deeper understanding of football? Sachin Aggarwal

Post on 14-Feb-2017

Introduction

“In sports, what is true is more powerful than what you believe, because what is true will give you an edge.”

Bill James

Football is by far the most popular sport on the planet. Set foot in a football-mad country and you’ll never be far away from a group of fans having a heated discussion about the game. Every football fan has their own opinions, ideas and theories on pretty much every single aspect of it, such as which player/manager/team is better, or which tactics should be used in the match.

You’ll often find that these discussions involving football fans continually go round in circles; we never seem to reach a definitive conclusion on a topic. The reason for this is simple: everyone’s opinion is distorted due to personal and emotional bias. We all want our team or favourite player to be the best, so we will continue to argue with other fans about it until we get bored. Which happens to be never.

To completely settle a football debate is to do the impossible. However, if we look at the matter objectively, we can go some way closer to proving or disproving our opinions and ideas. And the only way we can do this is by analysing the cold, hard statistics. People can have their own ideas, but the statistics can show something completely different.

The problem with using statistics is that, until recently, people have always looked down on them in football. Things in football are done ‘traditionally’, and that’s the way they should be done because it’s the way they’ve always been done. Data analytics has had no place in football, because people believe the game is too ‘fluid’ and has too many variables to be broken down to just numbers and graphs.

Other sports, most notably American sports like baseball and basketball, have embraced the use of statistics to gain an advantage. ‘Moneyball’ is probably the most famous example of using statistics to get ahead of other teams. It was first used by the general manager of the Oakland A’s baseball team, and was subsequently used to great effect by many others in Major League Baseball. Football has lagged behind in this respect.

People in the UK view the use of statistics in football as ‘Americanizing the sport’. However, the real issue is that the statistics that football viewers are bombarded with lack context, as they appear to have no relevance to the game. The data needs to be properly processed and looked into with much more depth in order to gain a deeper understanding of what is actually happening.

Data analysis should be used in conjunction with the more traditional methods of analysis rather than being used instead of them. Looking at the game objectively, as well as with standard human intuition, can provide us with a richer, fuller picture of the game. It would be foolish to think that statistics can replace what has been done for years, but it would be more foolish to not use them at all. Consequently, it is something that is increasingly being used in the betting industry and by clubs for individual player performance analysis.

By rigorously analysing data from Europe’s 4 major leagues using statistical techniques and methods, I hope to gain a deeper understanding of factors leading to a team’s success, such as goalscoring, and also how the different leagues have their own distinctive characteristics¹. Most importantly, I hope to bring some relevance and context to the vast amount of seemingly meaningless data we’re given, and show that there is a lot to find from it. It’s just a case of knowing how to find it.

1 All data used and referred to in this essay can be found on the accompanying ‘Data and Analysis’ CD.

Section 1: Modelling the Distribution of Goals

“Sometimes in football, you have to score goals.”

Thierry Henry

On the 22nd May 2010, in his last match as manager of Inter Milan, José Mourinho made history. By beating Bayern Munich 2-0 in the Champions League final, his side became the first Italian team to ever win the treble. That night, Opta (a major football statistics company) logged a total of 2842 events², things like passes, tackles, shots, saves etc. But out of all 2842 events that happened in the game, there were 2 significant events that led to history being rewritten: the two goals.

It would be safe to say that goals are the single most important statistic in football. They are the difference between winning and losing, the difference between elation and sadness, the difference between champions and the rest. Goals are also relatively rare and random events; in 90 minutes of football, you will only witness, on average, 2.77 goals. This rarity and randomness only add to their importance. Yet, by looking at the distribution of goals across different leagues and seasons, you find that they are largely quite predictable.

Modelling goals using the Poisson distribution

To model the distribution of goals, I have looked at the frequency of matches with 0 goals, 1 goal, 2 goals, and so on, for the 4 major European leagues³ over the past two completed seasons. By plotting the frequency against the number of goals in the match, it is clear to see that there is a discernible curve joining these points. This curve appears to be very similar to a special type of probability distribution, called the Poisson distribution (see Appendix 1).

At first glance, it appears that the Poisson distribution is an ideal model, due to goals being rare and random events, and the number of goals being discrete. In our case, the rare events are goals, the fixed interval of time is a 90 minute match, and the base rate is the mean number of goals per match. A property of the Poisson distribution is that the mean and variance are equal, and the mean and variance of the number of goals in each league happen to be quite similar.

Applying the Poisson distribution to each league using the mean as the base rate, we can predict the frequency of matches with each number of goals. By plotting these expected values on the same graph as the observed values, as shown in Figure 1 (see Appendix 2), it appears that the Poisson distribution is a good fit to the observed distribution.
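The calculation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the spreadsheet work on the accompanying CD: the function names are hypothetical, the overall mean of 2.77 goals per match is taken from the essay, and the match count of 760 (two 380-match seasons) is an assumption made purely for the example.

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k goals when goals follow a Poisson(mean) distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def expected_frequencies(mean: float, n_matches: int, max_goals: int) -> list[float]:
    """Expected number of matches with 0, 1, ..., max_goals - 1 goals,
    followed by one open-ended class for 'max_goals or more'."""
    freqs = [n_matches * poisson_pmf(k, mean) for k in range(max_goals)]
    freqs.append(n_matches - sum(freqs))  # tail class, so the total is n_matches
    return freqs

# 2.77 is the overall base rate quoted in the essay;
# 760 matches is an ASSUMED sample size for illustration only.
for goals, freq in enumerate(expected_frequencies(2.77, 760, 8)):
    label = "8+" if goals == 8 else str(goals)
    print(f"{label:>2} goals: {freq:6.1f} matches expected")
```

Plotting these expected values next to the observed counts gives the kind of comparison shown in Figure 1.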

So even though goals are rare and random events, using only the mean number of goals per match, over the length of the season goals are actually quite predictable. We don’t need to know anything about the teams, the managers, the tactics; all we need to know is the base rate⁴. It’s no fluke either, as can be seen by the remarkable consistency of fit across each league.

2 Chris Anderson and David Sally, The Numbers Game (Penguin, 2013), p. 10

3 English Premier League, Spanish La Liga, German Bundesliga, Italian Serie A

One aspect that has proved to be consistent across all leagues is that goalless draws are hard to predict. In each league over the past two seasons, there have been more observed goalless draws than we would expect using the Poisson model. This could be due to the fact that there is only one scoreline combination that can produce a goalless draw. As the number of goals in the match increases, so does the number of different scoreline combinations that give that number of goals, which may explain why the observed and expected values are closer.

A major implication of using the Poisson model for the number of goals in a match is its use in the betting and match prediction industry. With millions of pounds going into it each week, the gambling market in the UK is huge. Most fans like to put a bet on to make the match more interesting, whereas some attempt to develop sophisticated systems in order to make a profit. The Poisson model I have shown here is a simplistic model based on a relatively small dataset, and as such has many practical limitations when predicting match outcomes.

4 Chris Anderson and David Sally, The Numbers Game (Penguin, 2013), pp. 41-44

Figure 1: Goal distributions

However, it does form the basis of more sophisticated models that are developed in academic papers. For example, a model developed by Dixon and Coles aims to take into account the different abilities of both teams in a match to attack and defend, measured using their recent performances. It uses two Poisson models, one for the home team and one for the away team. The base rates used in these models are derived from each team’s respective ‘attack’ and ‘defence’ ratings, and also the fact that the home team has an advantage⁵. This is a way I would improve my model if I had the resources and software to better analyse the data.
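To give a flavour of how such a two-Poisson model might work, here is a simplified sketch in the spirit of the idea described above. It is not the Dixon and Coles model itself: combining the ratings by simple multiplication is an assumption, the ratings in the example are invented, and refinements from the published paper are left out. Home and away goals are treated as independent.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson(mean) distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def outcome_probabilities(home_attack, home_defence, away_attack, away_defence,
                          home_advantage, max_goals=10):
    """Return (home win, draw, away win) probabilities.

    ASSUMPTION: each base rate is the product of one team's attack rating
    and the other team's defence rating, with an extra factor for playing
    at home. Scorelines above max_goals are ignored and the result is
    renormalised.
    """
    home_rate = home_attack * away_defence * home_advantage
    away_rate = away_attack * home_defence
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    total = home_win + draw + away_win
    return home_win / total, draw / total, away_win / total

# Hypothetical ratings, for illustration only.
win, draw, loss = outcome_probabilities(1.4, 0.9, 1.1, 1.0, home_advantage=1.3)
print(f"home win {win:.2f}, draw {draw:.2f}, away win {loss:.2f}")
```

A lower defence rating here means a tighter defence, since it scales down the opponent's expected goals.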

How good a fit is the model?

As well as looking at the graphs to compare the Poisson distribution with the actual distribution, it is possible to get a quantitative measure of how good the fit is by using a chi-squared goodness of fit test. This involves using the observed and expected values for each number of goals to calculate a test statistic. The chi-squared probability distribution is then applied to the test statistic in order to test whether the proposed model can be applied (see Appendix 3).

The classes in the test are the number of goals in the match, 0, 1, 2… and so on. The observed values are the actual number of matches over the season in each class, while the expected values are the number of matches we expect using the Poisson model for each class. For a chi-squared test to be valid, none of the expected values can be less than 5. For higher numbers of goals in the match, the expected numbers of matches are quite small, so classes have to be combined to make sure all expected values are at least 5. For example, 8, 9, and 10 goals become 8+. The value for p (the number of parameters estimated from the data) is 1, as the mean is the only parameter taken from the data sample for the Poisson distribution.
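The procedure just described (compute expected values, combine sparse classes, then sum the contributions) might be sketched as follows. The observed counts are invented for illustration and are not the essay's data, the function names are hypothetical, and classes are only merged from the high-goal end. The degrees of freedom are taken as the number of classes minus 1 minus p.

```python
import math

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean ** k / math.factorial(k)

def merge_small_classes(observed, expected, minimum=5.0):
    """Combine classes from the high-goal end until the last expected value is >= minimum."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_exp, last_obs = exp.pop(), obs.pop()
        exp[-1] += last_exp
        obs[-1] += last_obs
    return obs, exp

def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def goodness_of_fit(observed_counts, n_estimated_params=1):
    """observed_counts[k] is the number of matches with k goals.

    The mean is estimated from the data (treating the last class as exactly
    that many goals, a small approximation) and used as the Poisson base rate.
    Returns the test statistic and the degrees of freedom.
    """
    n = sum(observed_counts)
    mean = sum(k * c for k, c in enumerate(observed_counts)) / n
    expected = [n * poisson_pmf(k, mean) for k in range(len(observed_counts) - 1)]
    expected.append(n - sum(expected))  # open-ended final class
    obs, exp = merge_small_classes(observed_counts, expected)
    return chi_squared_statistic(obs, exp), len(obs) - 1 - n_estimated_params

# HYPOTHETICAL counts of matches with 0, 1, 2, ... goals (760 matches in total).
counts = [60, 130, 185, 170, 115, 60, 25, 10, 4, 1]
statistic, dof = goodness_of_fit(counts)
print(f"test statistic = {statistic:.2f} on {dof} degrees of freedom")
```

The significance level would then be read from chi-squared tables, or computed with a library routine such as `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available.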

Chi-squared      Premier League   La Liga   Bundesliga   Serie A   Overall
Test statistic   11.44            14.05     8.28         14.25     28.52
Sig. level       12.1%            5%        30.9%        4.7%      0.018%

Table 1: Chi-squared goodness of fit test

Table 1⁶ shows that the model is a reasonable fit for all 4 leagues, although La Liga (5%) and Serie A (4.7%) are borderline at the conventional 5% level. The Bundesliga has by far the best fit, with a significance level of 30.9%, while Serie A is the ‘least good’ fit. This could indicate that over the course of a season, matches in the Bundesliga are least affected by external factors that aren’t taken into account by the base rate for the Poisson model, such as tactics or relative team strengths.

As I previously mentioned, there are more goalless draws in each league compared to what would be expected using the model, and this is reflected in the test statistics. Apart from the Bundesliga, the chi-squared value for the 0 goals class is one of the major contributors to the overall test statistic for each league, particularly for Serie A (8.16 out of a total of 14.25). The chi-squared value for the 6 goals class is the biggest contributor for La Liga (4.16 out of 14.05). This could indicate an attacking style of football or certain teams having domination over the rest, therefore contributing to more matches than expected having a high number of goals in them.

5 Dixon and Coles, “Modelling Association Football Scores and Inefficiencies in the Football Betting Market”, Appl. Statist. 46 No. 2 (1997), pp. 265-280

6 Data from the past two complete seasons of each league was tested.

It is quite interesting to see that when data from all 4 leagues are combined, the goodness of fit of the model suffers quite considerably. This may seem counter-intuitive, given that in statistical analysis a larger sample size tends to give a truer result (although a larger sample also makes the test more sensitive to small departures from the model). It would therefore follow that each league may have its own distinctive characteristics. When considering each league on its own, the Poisson model is quite good, but when they are combined, the characteristics of each league do not ‘come together’ well to give a good fit.

This then poses the question: what are these unique characteristics that each league has?

How do the European leagues differ?

There are some widely accepted footballing ‘truths’ amongst fans when discussing football. For example, the Italians are often derided for being notoriously defensive, the Spaniards are envied for their abundance of attacking flair, and the English are hailed for having the most competitive and entertaining league. But what evidence are these ‘truths’ based on, and just how true are they?

To test these theories out, let’s first compare the mean number of goals per match for each league (the same mean used as the base rate for the Poisson distributions). Table 2 shows the mean goals per match, the mean home/away goals per match, and the ratio of the home goals to away goals.

Mean goals   Premier League   La Liga   Bundesliga   Serie A   Overall
Mean         2.8              2.82      2.9          2.6       2.77
Home mean    1.57             1.68      1.63         1.5       1.59
Away mean    1.23             1.13      1.27         1.1       1.18
H:A Ratio    1.28             1.48      1.28         1.36      1.35

Table 2: Mean goals per match

The table would suggest that fans are right in saying Italian football is defensive, as the average number of goals per match is noticeably lower than in the other 3 leagues. Not only that, but in a graph of the goal distributions of all 4 leagues (Figure 2), the Serie A curve has a notably higher peak towards the left of the graph compared to the other leagues. Again, this indicates that Serie A is more defensive, with a lot more games seeing 2 goals or fewer compared to the others. Also, as I previously showed, Serie A saw by far the most goalless draws out of the 4 leagues.

Surprisingly, it is the Bundesliga that comes out on top with the highest number of goals per match on average, not La Liga. The difference between the means is around 0.08, which may appear quite small, but over the course of a 380 game season it would equate to around 30 extra goals. Figure 2 would further support the idea that it is in fact the Germans who play the most attacking brand of football. The Bundesliga had more games with 2 or more goals, and fewer games with 1 goal or fewer, than the other leagues.

Attempting to determine the competition level and excitement of the leagues is not as straightforward, as these characteristics are more open to each person’s own interpretation. One way we can attempt to look at it is by comparing the ratio between home and away goals per match, to see which league has the closest matches. The Premier League and Bundesliga have the lowest ratio, and therefore the closest contest between the home and away teams. This would suggest that they are the two most competitive leagues. However, it could be argued that the Bundesliga is the more exciting league as it has more goals per match on average. It’s goals that make the game exciting, and that’s what the fans want to see. Not many fans enjoy watching a gritty, defensive chess match, and those that do are probably Italian!

It appears that La Liga is the least competitive of the 4 leagues, with the highest ratio at 1.48. As we saw previously with the chi-squared tests, La Liga had a higher number of 6 goal matches than would be expected. If we break it down further and perform a goodness of fit test on the La Liga home goals distribution, we find that the test statistic comes to 23.28 (0.03% significance level). This suggests that the Poisson model is not a very good fit, and one of the major abnormalities is games with 4 or more home goals.

Figure 2: Goal distributions of European leagues

So what is causing these abnormalities in La Liga? Well, it just so happens that two very special teams play in La Liga: Barcelona and Real Madrid.

The Barça-Madrid effect

Barcelona and Real Madrid are possibly the two biggest clubs on Earth, and are without doubt two of the best. To say their respective teams are star-studded would be an understatement. In fact, these two teams are so good that they contributed 10 out of the 11 players in the 2012 FIFA World XI⁷. Not only that, they have arguably the two greatest players of all time, Cristiano Ronaldo (Madrid) and Lionel Messi (Barcelona), scoring goals for them at an unprecedented rate. This has led to such domination of their league that the rest of the teams start the season knowing that the best they can possibly do is snatch 3rd place.

To put their effect on La Liga into perspective, I have compared the statistics for the league with and without them, as shown in Table 3.

La Liga                Mean   Home mean   Away mean
With                   2.82   1.68        1.13
Without                2.55   1.53        1.02
Barça/Madrid matches   4.22   2.49        1.72

Table 3: La Liga with and without Barcelona and Real Madrid

Taking Barcelona and Real Madrid out of the mix has an astounding effect on the data. The mean goals per match drops from 2.82 to 2.55, a lower value than even Serie A. Both the mean home goals and mean away goals drop significantly also. Matches involving Barcelona or Real Madrid have an average of 4.22 goals per match, nearly 1.7 goals per match more than matches that don’t involve them.
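The comparison in Table 3 amounts to splitting the match list in two and recomputing the means for each part. A minimal sketch of that step is shown below; the five results are invented purely to make the example runnable, and the tuple layout and function names are assumptions rather than the format of the data on the CD.

```python
def mean_goals(matches):
    """matches: list of (home_team, away_team, home_goals, away_goals).
    Returns (mean total goals, mean home goals, mean away goals) per match."""
    n = len(matches)
    home = sum(m[2] for m in matches) / n
    away = sum(m[3] for m in matches) / n
    return home + away, home, away

def split_by_teams(matches, teams):
    """Separate matches involving any of `teams` from those that involve none of them."""
    involving = [m for m in matches if m[0] in teams or m[1] in teams]
    without = [m for m in matches if m[0] not in teams and m[1] not in teams]
    return involving, without

# HYPOTHETICAL results, for illustration only.
results = [
    ("Barcelona", "Getafe", 5, 1),
    ("Sevilla", "Real Madrid", 1, 3),
    ("Getafe", "Sevilla", 1, 0),
    ("Valencia", "Getafe", 2, 1),
    ("Sevilla", "Valencia", 0, 0),
]
involving, without = split_by_teams(results, {"Barcelona", "Real Madrid"})
print("Barça/Madrid matches:", mean_goals(involving))
print("Without:             ", mean_goals(without))
```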

Due to their abnormal powers of goalscoring, you would expect the goodness of fit tests for the Poisson model to improve once you remove them from the sample. This is the case, but the extent to which the tests improve without them is staggering. Table 4 shows that the fits for all 3 distributions improve, but the Home and Away fits improve the most. The Home distribution has gone from a poor fit to a very good fit, and the Away distribution has gone from a good fit to a near-perfect fit, simply by taking only 2 teams out of the dataset.

Chi-squared (sig. level)   Overall         Home             Away
With                       14.05 (5%)      23.28 (0.03%)    3.32 (34.5%)
Without                    9.77 (13.4%)    2.61 (62.4%)     0.36 (94.7%)

Table 4: Goodness of fit test, with and without Barcelona and Real Madrid

7 FIFA FIFPro World XI Award 2012, http://www.fifa.com/mm/document/ballond'or/fif/proworldxi/02/19/60/12/fifa%5ffifpro%5fworld%5fxi%5f2012%5fneutral.pdf (2012)

Figure 3⁸ gives a visual representation of the impact on the goal distribution. It is clear to see that without Barcelona and Madrid, the number of matches with 3 goals or fewer increases, while the number of matches with 4 goals or more decreases.

Given that only 2 teams are being removed from the dataset, it is stunning to see how drastically the picture changes without them. The fact that Barcelona and Real Madrid are able to skew the data for the whole league so much is testament to just how far ahead of the rest they are.

8 The ‘Without’ data has been adjusted so that the frequencies are for 760 matches

Figure 3: Goal distribution of La Liga, with and without Barcelona and Real Madrid

Section 2: How to be Successful

“It is better to win ten times 1-0 than to win once 10-0.”

Vahid Halilhodžić

The main aim of any team in any league is to score as many points over the course of the season as possible. Depending on the quality of the team, the target may be to avoid relegation, take a solid mid-table finish, secure a lucrative European spot, or challenge for the title. Every team has its own tactics and style of play that they believe will be most effective for them to win as many points as possible, while every fan will have their own opinion on why their team’s tactics are wrong.

As we have seen previously, Barcelona and Real Madrid have their own recipe for success: you score, we’ll score more. They bully their opponents by scoring a lot of goals, often making the number of goals scored by their opponent just another meaningless statistic. That’s not to say they neglect their defence; in fact, they usually have very tight defences, making their opponents’ task even harder. It’s close to being the perfect strategy.

Unfortunately, the rest of the teams in the world are not lavished with such footballing talent in the way these two Spanish giants have been. The vast majority of teams do not even have one player in their squad with the sort of ability that Barcelona and Madrid take for granted (and none have the genius of Lionel Messi or Cristiano Ronaldo). The rest of the footballing world has to make do and find a way to do well with what they have at their disposal. But what is the best way to approach each match so that you can be successful?

‘Make your stadium a fortress’

People often talk about how playing at home gives you an advantage, as you have the partisan crowd on your side and you are familiar with the conditions. The analysis in the previous section showed that this is indeed the case, with goalscoring being significantly easier playing at home compared to playing away. But in terms of finishing high in the league table, how important is a team’s home form compared to their away form?

Most fans will argue that their team should, first and foremost, turn their stadium into a ‘fortress’ where visiting teams will find it very difficult to win. To test this, I have calculated the Spearman’s rank correlation coefficient (see Appendix 4) for both home and away performance and final league position. The data are ranked based on number of points scored home/away, and number of points scored overall over the past two seasons in Europe’s four major leagues.

Home/Away Effect   Premier League   La Liga   Bundesliga   Serie A
Home               0.92             0.94      0.93         0.93
Away               0.86             0.84      0.83         0.88

Table 1: Spearman's rank correlation coefficient for home/away performance and league position

Table 1 shows that across all four leagues, there is very strong positive correlation between a team’s home performance and their final league position, and that the correlation is stronger for home performance than for away performance. This suggests that fans are right in wanting their teams to prioritise getting the job done at home, as good home performance sets the tone for overall league performance.
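For readers who want to reproduce this kind of figure, Spearman's rank correlation can be sketched as below. Teams are ranked on each measure (highest points first, with tied teams sharing the average rank) and the ordinary correlation of the two sets of ranks is taken. This is one standard way of computing the coefficient and may differ in detail from the formula in Appendix 4. The points totals are invented.

```python
import math

def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions in the tie
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# HYPOTHETICAL home points and total points for six teams.
home_points = [48, 44, 39, 30, 31, 22]
total_points = [86, 79, 70, 58, 52, 36]
print(f"Spearman's rank correlation: {spearman(home_points, total_points):.2f}")
```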

It is common to see teams play in quite a defensive, cagey manner away from home, whereas at home they will play more freely. Although this can be attributed to the fact that goalscoring is easier at home, the data in Table 1 suggest that teams should attempt to play more attacking football in order to win away from home rather than play for a draw, as the effect of losing is not as detrimental as it would be if they were playing on their own turf.

Attack or defence?

One of the biggest tactical questions in the game is whether a team should play exciting, aggressive football in order to score goals, or play solid, defensive football to ensure they don’t concede. At the international level, the Italians have most notably employed the latter tactic, and have been very successful, but in doing so they’ve acquired a reputation for being boring. On the other hand, the Brazilians have been exponents of the former tactic, again with great success, and are probably the reason football is called ‘The Beautiful Game’.

There may be no definitive right or wrong answer to this debate, but the statistics certainly lean more towards one school of thought than the other. By looking at the correlation⁹ between goals scored/conceded and points scored over the course of the season, it appears that scoring more goals is better than conceding fewer, as shown in Table 2.

Attack or Defence?   Premier League   La Liga   Bundesliga   Serie A
Goals scored         0.92             0.95      0.91         0.91
Goals conceded       -0.82            -0.77     -0.84        -0.80

Table 2: Product-moment correlation coefficient for goals scored/conceded and points scored

As you would expect, the data show that the more goals you score, and the fewer goals you concede, the more points you will score. The correlation between goals scored and points scored is very strong across all four leagues, and the correlation is stronger for goals scored than for goals conceded. This would suggest that over the course of a season, it is more beneficial to play with an attacking mind-set and try to score goals rather than focus on preventing the other team scoring them. It makes for better viewing for the fans as well, which is always a bonus!
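The product-moment correlations compared here can be computed directly from a season table. The sketch below uses an invented eight-team table (not real data) and simply checks which of the two correlations is stronger in absolute terms. The function name is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# HYPOTHETICAL end-of-season figures for eight teams.
scored = [86, 75, 66, 55, 47, 45, 41, 34]
conceded = [43, 34, 39, 40, 53, 60, 69, 70]
points = [89, 78, 75, 63, 46, 41, 39, 28]

r_scored = pearson(scored, points)
r_conceded = pearson(conceded, points)
print(f"goals scored vs points:   {r_scored:+.2f}")
print(f"goals conceded vs points: {r_conceded:+.2f}")
print("stronger link:", "scoring" if abs(r_scored) > abs(r_conceded) else "conceding")
```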

La Liga appears to be the most goal-driven league, with the strongest correlation between goals scored and points scored out of the four leagues. It also has the weakest correlation between goals conceded and points scored. This suggests that defensive play is not at the forefront of the managers’ minds, and given the unprecedented success of the Spanish national team over this period, it would seem that tactically they know what they’re doing.

9 Pearson product-moment correlation coefficient has been used for this analysis (see Appendix 5).

As it happens, football isn’t quite so simple that success can be boiled down to only goals scored or to only goals conceded. When it comes to affecting a team’s success, goals scored and conceded tend to go hand in hand. For example, a team that can score a lot of goals may have very little success if their defence can’t back it up. Likewise, a team that is rock solid at the back won’t do well if they don’t have an attack that can capitalise on it. Because of this, it would be better to analyse the effect of scoring and conceding on points scoring by taking all three variables into account at the same time.

To do this, I have calculated a multiple regression equation (see Appendix 6) using goals scored and goals conceded over the course of the season as the independent variables, and points scored as the dependent variable. The regression equation is the equivalent of a ‘line of best fit’ (strictly, a plane of best fit, since there are two independent variables) if you were to plot points scored against goals scored and goals conceded for each team on a graph.

Attack or Defence?      Premier League   La Liga¹⁰   Bundesliga   Serie A
Goals scored coeff.     0.76             0.73        0.59         0.79
Goals conceded coeff.   -0.54            -0.47       -0.53        -0.58

Table 3: Goals scored and conceded coefficients

Table 3 shows the coefficients for goals scored and conceded that were calculated for the multiple regression equation. They show, on average, the number of points a team is likely to gain or lose for every goal they score or concede. For example, a team in the Premier League will on average gain 7.6 points for every 10 goals they score, and lose 5.4 points for every 10 goals they concede.
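A regression of this form can be fitted by ordinary least squares. The sketch below solves the normal equations by hand so that it needs no external libraries. Including a constant term is an assumption (the essay does not say whether one was used), the function names are hypothetical, and the five-team data set is constructed artificially so that the known coefficients can be recovered.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_points_model(scored, conceded, points):
    """Least-squares fit of: points = b0 + b1 * scored + b2 * conceded.
    Returns [b0, b1, b2]. The constant b0 is an ASSUMPTION of this sketch."""
    rows = [[1.0, s, c] for s, c in zip(scored, conceded)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * p for r, p in zip(rows, points)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# ARTIFICIAL data built from points = 50 + 0.7 * scored - 0.5 * conceded.
scored = [80, 60, 50, 40, 70]
conceded = [30, 45, 50, 60, 55]
points = [50 + 0.7 * s - 0.5 * c for s, c in zip(scored, conceded)]

b0, b1, b2 = fit_points_model(scored, conceded, points)
print(f"points = {b0:.2f} + {b1:.2f} * scored + {b2:.2f} * conceded")
```

With real season data the fit would not be exact, and the two fitted coefficients would play the role of the values in Table 3.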

In all four leagues, the numerical value for the goals scored coefficient is greater than that of the goals conceded coefficient. This suggests that scoring goals is more beneficial than preventing the opposition scoring them, which is the same conclusion drawn from the goals scored/conceded correlations shown previously. The data reinforce the idea that over the season, the reward of playing attacking football to score goals outweighs the risk of conceding that comes with it.

We saw in the Goal Distributions section that Serie A was the most defensive of all 4 leagues, with goals harder to come by in Italy than in the other leagues. The data in Table 3 reflect this idea, as Serie A has the highest goals scored coefficient and the lowest (most negative) goals conceded coefficient. Because goals are rarer in Serie A, they are more valuable when you score them and more detrimental when you concede them. The Bundesliga looks to have the greatest balance between attack and defence, as the value of scoring is only marginally higher than the value of not conceding, compared to the other leagues.

10 Due to the disparity between the goals scored by Barcelona and Real Madrid, and the rest of La Liga, I have considered their values to be outliers, and as such I have not included them in the calculation of the regression equation.

Consistency and impact are vital

By substituting the number of goals scored and conceded by a team in a season into the regression equation, we can work out the number of points you would expect the team to score. When you compare the predicted points to the actual number of points the team scored, you see that most are fairly similar. However, for some teams, the values differ significantly. This is because sometimes, it’s not about how many goals you score; it’s about how you score them.

One way to look at this is by considering consistency. Sometimes, teams will hit a purple patch and score a lot of goals in just a few games, usually when they have a kind run of fixtures against weaker teams. At other times, these same teams find it very difficult to score, resulting in them dropping points. Come the end of a season, fans look at their team’s goal tally and are left wondering why their team are not higher up the table, especially when some of the teams above them don’t seem to have done as well.

When looking at how many goals a team has scored per match, it is better to consider the harmonic mean (see Appendix 7) as opposed to the arithmetic mean. This is because the harmonic mean is less affected by a few unusually high-scoring matches than the arithmetic mean, giving a truer measure of how consistently a team scores.

The 2012/13 season performances of fierce Merseyside rivals, Everton and Liverpool, illustrate this idea perfectly. Table 4 shows that the number of goals conceded by both teams was fairly similar, but Liverpool comfortably outscored Everton during the season, giving them a far superior goal difference. The most telling statistics, however, are the harmonic mean goals and points scored.

Season 2012/13   Scored   Conceded   Goal Diff.   Harmonic Scored   Points
Everton          55       40         +15          0.912             63
Liverpool        71       43         +28          0.9               61

Table 4: Performance of Liverpool and Everton, Season 2012/13

Looking at the number of goals scored, it appears that Liverpool scored quite well during the season, whereas Everton lacked firepower up front, considering their league position. But despite Liverpool scoring 16 more goals than Everton, the Blues actually had the higher harmonic mean out of the two clubs. Everton did not score 4 goals or more in a single match during that season, while Liverpool netted 4 or more times in six of their matches. These big wins inflated Liverpool’s goal tally, and disguised the fact that they played too many games with too few goals.
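The contrast between a steady scorer and a streaky one can be illustrated with a small sketch. One practical wrinkle is that the harmonic mean is undefined when any match is goalless, and the essay's own treatment of such matches is in Appendix 7, which is not reproduced here. The version below therefore adds 1 to every match total before averaging and subtracts 1 afterwards. That shift is purely an assumption of this sketch, so its output should not be compared with the figures in Table 4. The goal sequences are invented.

```python
def arithmetic_mean(goals):
    """Ordinary mean goals per match."""
    return sum(goals) / len(goals)

def shifted_harmonic_mean(goals, shift=1.0):
    """Harmonic mean of (goals + shift), minus shift.

    ASSUMPTION: the shift is only a workaround so that goalless matches do
    not make the harmonic mean undefined; it is not the essay's method.
    """
    reciprocal_sum = sum(1.0 / (g + shift) for g in goals)
    return len(goals) / reciprocal_sum - shift

# HYPOTHETICAL goals per match over six games.
steady = [2, 1, 2, 1, 2, 1]    # scores in every match
streaky = [0, 5, 0, 4, 0, 1]   # two big wins, three blanks

for name, goals in (("steady", steady), ("streaky", streaky)):
    print(f"{name:>7}: arithmetic {arithmetic_mean(goals):.2f}, "
          f"harmonic (shifted) {shifted_harmonic_mean(goals):.2f}")
```

The streaky team has the higher arithmetic mean but the lower harmonic mean, which is the same pattern the essay describes for Liverpool and Everton.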

The other way of looking at the idea of how a team scores goals is by considering impact, and in many ways, impact and consistency go hand in hand. In a football match, the aim of a team isn’t just to score goals; rather, the aim is to score at least one more goal than your opponent. Whether you win by one goal or by five goals, the end result of gaining 3 points will always be the same. Therefore, the first few goals scored in a match have an impact as they are helping the team win. But in a rout, by the time the fourth, fifth, sixth etc. goals are scored, the match has already been won, so these goals have little or no impact.

In the previous example, it is clear to see that although Liverpool scored more goals, Everton’s consistency meant they scored more goals that had an impact on the final result than Liverpool. The Reds’ inconsistency meant that a relatively high proportion of goals did not have an impact on the result, and that there were not enough goals in their season that did have an impact. Consequently, Everton scored more points, finished higher in the league table, and claimed the bragging rights in Merseyside for another year.

In order to quantify impact, I have plotted each team’s arithmetic mean against harmonic mean for goals scored, and calculated a linear regression equation for the graph, as shown in Figure 1. The regression line shows that as the arithmetic mean increases, so does the harmonic mean, and using the regression equation we can work out an expected value for the harmonic mean given a known value for the arithmetic mean.

Figure 1: Arithmetic and harmonic means of goals scored

By working out the residual (actual harmonic mean minus expected harmonic mean) for each point on the graph, we can see which teams’ goals have the most impact. This residual is shown as ‘Scored Impact’ in Table 5. The teams with a positive residual (the points on the graph that lie above the regression line) have a higher harmonic mean than expected, which indicates a lot of their goals were spread over their matches rather than clumped together in a few big wins, and as such have had a greater impact on final results. The opposite is true for teams that lie below the line.
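The residual calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four teams and their means are made up, and `fit_line` and `scored_impact` are hypothetical helper names rather than anything used in the original analysis.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def scored_impact(arithmetic_means, harmonic_means):
    """Residuals: actual harmonic mean minus the value expected from the line."""
    slope, intercept = fit_line(arithmetic_means, harmonic_means)
    return [h - (intercept + slope * a)
            for a, h in zip(arithmetic_means, harmonic_means)]


# Hypothetical teams and values, for illustration only.
teams = ["A", "B", "C", "D"]
arithmetic_means = [1.0, 1.5, 2.0, 2.5]
harmonic_means = [0.7, 0.8, 1.3, 1.2]

for team, impact in zip(teams, scored_impact(arithmetic_means, harmonic_means)):
    print(team, round(impact, 3))
```

A positive value places the team above the regression line (goals spread across matches); a negative value places it below. For goals conceded, the sign would be flipped, as noted in the footnote on the defence case.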

This method can also be employed when analysing goals conceded. It is more beneficial for a team to have a low harmonic mean of goals conceded, indicating that the team consistently defends well. Teams would want their residuals [11] to be high, meaning that most of the goals they concede have a low impact on match results.

The impact of goals scored and goals conceded helps explain the difference between the actual number of points scored in a season by a team and the predicted number of points using the multiple regression equation. Those teams that have a high residual points value (more points than predicted) are the ones that have scored the most high-impact goals and conceded the fewest, whereas the opposite is true for the teams with a low residual points value.

Team and season      Scored   Conceded   Scored Impact   Conceded Impact   Residual Points
Newcastle 11/12          56         51           0.077             0.142              9.72
Man United 12/13         86         43           0.163            -0.003              6.70
Tottenham 12/13          66         46           0.170            -0.056              6.44
Everton 12/13            55         40           0.076            -0.090              2.49
Man City 11/12           93         29          -0.091            -0.105             -6.20
Liverpool 12/13          71         43          -0.143             0.037             -9.97

Table 5: Team performances in the Premier League, 2011-2013

Table 5 shows a selection of teams’ performances in the Premier League from the past two seasons. In the 2011/12 season, Newcastle United were the overachievers of the year, finishing 5th in the league. For a team that finished 5th, their goals scored and conceded totals were not particularly impressive. However, they were an excellent example of scoring goals that contributed to scoring points, and preventing teams from doing it to them. This is reflected in their residual points total (the highest in the league in the past two seasons), which shows that they were certainly overachievers, scoring nearly 10 points more than we would predict.

In contrast, Liverpool in 2012/13 were the underachievers of the league, finishing with 10 points fewer than they should have had given the number of goals they scored and conceded. The goals they scored also had the lowest impact of any team. The results are surprising in a way, as most Liverpool fans would not say that their team’s problems lay in their goalscoring. For Everton, it was their goalscoring that helped them through their season rather than their defence. This is also surprising, given that Everton under David Moyes were more renowned as a team that was difficult to break down defensively than as a team that scored well.

[11] Residual in the defence case is expected harmonic mean minus actual harmonic mean.

An interesting inclusion in the table is the Manchester City team of 2011/12. It seems odd that the data suggests they were underachievers, when in actual fact they won the title that season. The reason for this is that, given how prolific their attack and how tight their defence were that season, we would have predicted them to have scored 6 more points and won the title comfortably. Instead, they won the title on goal difference on arguably the most thrilling final day of a season of all time.

The key to Manchester United’s sustained success over the years has been their ability to score a lot of goals, score them consistently, and score them so that they have a high impact. This has helped them win many league titles, despite having a defence that was always good, but never brilliant.

This method of goal analysis could potentially be used when looking for players to buy. The best strikers would be those who have the highest harmonic mean for goals scored per match, as they would be the ones who scored most consistently. The strikers with the highest goals scored impact value would be the best value for money. They may not necessarily be the most prolific goalscorers, but the goals they do score are very beneficial to their team in terms of gaining points.

Conclusion

“Toeval is logisch [coincidence is logical].”

Johan Cruijff

In recent seasons, there has been a noticeable rise in the amount of data and statistics that are presented to football fans. But the problem of the data being meaningless to the fans still remains to some extent. As such, one of the main aims of the analysis was to extract meaning from the data and use it to explain some of the things that happen in football.

For example, analysing goals scored/conceded and home/away form gave an insight into how teams should approach football matches so that they can maximise their points totals for a season. Looking at goal distributions across different leagues helped back up theories about different football cultures, and also helped explain betting/prediction markets which are so prevalent in football. It has shown that there is often statistical logic behind ‘coincidence’, as Johan Cruijff said.

On a club level, football is some way behind American sports in using statistics to improve and develop teams. I aimed to show that analytical methods can be used by teams to their advantage, for example, using the ‘scored impact’ and ‘consistency’ analysis when scouting players. Most teams do not have billionaire owners and consequently have to act shrewdly on the transfer market to buy the best players. I believe that with greater resources, the analysis I showed could be used to great effect by these clubs to identify potential players who would be value for money.

With greater resources and more data, the analyses I have done could be improved in different ways. For example, people looking to formulate betting systems using goal distributions could use a technique called Poisson regression to build a more accurate model. Major football clubs have access to vast amounts of individual player data and employ teams of dedicated data analysts. It is here that statistical analysis has the greatest potential. With all the data at their disposal, clubs can analyse many different aspects of each player’s game, rather than just the number of goals they score. In doing this, clubs can identify their teams’ strengths and weaknesses, and subsequently identify suitable transfer targets based on these weaknesses and their respective transfer budgets.

As with all statistical analysis, there is an element of uncertainty, so it cannot be used to prove theories and ideas, especially in such a fluid game as football, where it is difficult to quantitatively record many aspects. It has been the case in the past that when teams have used statistics, it has not worked because they have not conducted analyses with enough rigour. A prime example of this is Liverpool attempting to imitate the ‘Moneyball’ strategy that was so successful in baseball (see Appendix 8).

However, statistical analysis should be used properly to provide evidence and back up theories in the game. It has been pushed aside by football for too long, with clubs, coaches and fans preferring traditional methods of analysing a game, but I hope I have shown that using statistics alongside tried and tested practices could prove to be very valuable for clubs around the world.

Appendices

1. The Poisson distribution is a discrete probability distribution developed by the French mathematician Siméon Denis Poisson to model the probability of a given number of rare and random events occurring in a fixed interval of time, using a known average base rate of occurrence. The probability of x events occurring with a base rate of λ is given by [12]:

$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

2. As the Poisson distribution is used to calculate probabilities, the calculated values are all less than one. To make the comparison with the observed values, the probabilities have all been multiplied by 760, the number of games over the past 2 seasons. As the Bundesliga has only 306 games per season, both the observed values and Poisson values have been adjusted to be for 760 games.
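The scaling described in Appendices 1 and 2 might look like the following Python sketch. The base rate `lam = 2.7` and the Bundesliga counts are hypothetical placeholders, and the sketch assumes the quantity being modelled is the number of matches containing a given number of goals.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with base rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def expected_matches(lam, n_matches=760, max_goals=8):
    """Expected number of matches with 0, 1, ..., max_goals goals."""
    return [n_matches * poisson_pmf(x, lam) for x in range(max_goals + 1)]


def rescale(counts, target_total=760):
    """Scale counts from a shorter season so they sum to target_total matches."""
    total = sum(counts)
    return [c * target_total / total for c in counts]


lam = 2.7  # hypothetical average goals per match
for goals, expected in enumerate(expected_matches(lam)):
    print(goals, round(expected, 1))

# Hypothetical observed counts over 612 matches (two 306-match seasons).
bundesliga_observed = [40, 110, 150, 140, 90, 50, 32]
print([round(c, 1) for c in rescale(bundesliga_observed)])
```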

3. The chi-squared distribution is a probability distribution commonly used for tests of significance and goodness of fit tests. By applying the distribution to the calculated test statistic for a goodness of fit test, the proposed model of fit can be tested at different significance levels. The higher the significance level that the model can be tested at, the better the fit. The test statistic is given by:

$$\chi^{2} = \sum \frac{(O - E)^{2}}{E}$$

where O is the observed value and E is the expected value. The chi-squared distribution that is applied to the test statistic has k − p − 1 degrees of freedom, where k is the number of classes and p is the number of parameters taken from the data sample [13].
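As a small worked illustration, the test statistic and degrees of freedom can be computed as below. The observed and expected counts are invented, and the function names are hypothetical.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(k, p):
    """k classes, p parameters estimated from the sample."""
    return k - p - 1


# Hypothetical counts for three classes.
observed = [50, 30, 20]
expected = [40, 40, 20]

print(chi_squared_statistic(observed, expected))
# For a Poisson fit, one parameter (the base rate) is taken from the data.
print(degrees_of_freedom(k=7, p=1))
```

The statistic would then be compared against a chi-squared table at the chosen significance level for that number of degrees of freedom.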

4. Spearman’s rank correlation coefficient measures the level of association between two variables by ranking the values of each variable in the sample. The correlation coefficient is calculated by:

$$r_s = 1 - \frac{6 \sum d^{2}}{n(n^{2} - 1)}$$

where d is the rank difference between the variables in each pair, and n is the number of pairs. $r_s$ can take any value between -1 and 1, where -1 indicates a perfect negative correlation and 1 indicates a perfect positive correlation [14].
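A minimal Python sketch of this formula follows. It assumes there are no tied values (ties need averaged ranks, which this sketch does not handle), and the sample data are made up.

```python
def ranks(values):
    """Rank 1 for the largest value; assumes all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman's rank correlation from the sum of squared rank differences."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired values.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(spearman(xs, ys))
```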

[12] Cyril H. Goulden, Methods of Statistical Analysis, 2nd Edition (Wiley Publications in Statistics), pp. 42-44
[13] Analysing Data, Unit C2, Nonparametrics (The Open University, 2003), pp. 23-26
[14] Roger Porkess, Dictionary of Statistics (Collins, 1988), p. 56

5. The Pearson product-moment correlation coefficient also measures the level of association between two variables, but uses the actual values of the variables, rather than ranks. The correlation coefficient is calculated by:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^{2} \sum (y_i - \bar{y})^{2}}}$$

where $x_i$ and $y_i$ are the pairs of variables, and $\bar{x}$ and $\bar{y}$ are the sample means [15].
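The formula translates directly into Python. As sample input, the sketch below uses the goals scored and conceded columns for the six teams listed in Table 5; this is only a convenient illustration, not a calculation taken from the analysis itself.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return numerator / denominator


# Goals scored and conceded for the six teams in Table 5.
scored = [56, 86, 66, 55, 93, 71]
conceded = [51, 43, 46, 40, 29, 43]
print(round(pearson(scored, conceded), 3))
```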

6. The multiple regression equation is calculated so as to minimise the sum of squared errors of prediction. The equation is given by:

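$$y = a + b_1 x_1 + b_2 x_2$$

where y is the dependent variable (predicted number of points scored), a is the intercept, $x_1$ and $x_2$ are the independent variables (number of goals scored and conceded over a season), and $b_1$ and $b_2$ are the coefficients of the independent variables. $b_1$, $b_2$ and a are calculated by [16]:

$$b_1 = \frac{\sum (x_2 - \bar{x}_2)^{2} \sum (x_1 - \bar{x}_1)(y - \bar{y}) - \sum (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) \sum (x_2 - \bar{x}_2)(y - \bar{y})}{\sum (x_1 - \bar{x}_1)^{2} \sum (x_2 - \bar{x}_2)^{2} - \left(\sum (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\right)^{2}}$$

$$b_2 = \frac{\sum (x_1 - \bar{x}_1)^{2} \sum (x_2 - \bar{x}_2)(y - \bar{y}) - \sum (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) \sum (x_1 - \bar{x}_1)(y - \bar{y})}{\sum (x_1 - \bar{x}_1)^{2} \sum (x_2 - \bar{x}_2)^{2} - \left(\sum (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\right)^{2}}$$

$$a = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2$$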
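For readers who want to reproduce the fit, here is a Python sketch of the standard least-squares solution for two independent variables, written in terms of sums of deviations from the means. I am assuming this matches the formulae used in the analysis; the season totals in the example are invented, and `fit_two_predictors` and `residual_points` are hypothetical names.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = a + b1 * x1 + b2 * x2; returns (a, b1, b2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero if x1 and x2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2


def residual_points(x1, x2, y):
    """Actual points minus predicted points for each team."""
    a, b1, b2 = fit_two_predictors(x1, x2, y)
    return [w - (a + b1 * u + b2 * v) for u, v, w in zip(x1, x2, y)]


# Hypothetical season totals for five teams.
goals_scored = [40, 50, 60, 70, 80]
goals_conceded = [60, 55, 40, 45, 30]
points = [35, 44, 60, 61, 80]

print(fit_two_predictors(goals_scored, goals_conceded, points))
print([round(r, 2) for r in residual_points(goals_scored, goals_conceded, points)])
```

A positive residual corresponds to a team with more points than its goals scored and conceded would predict, which is the ‘Residual Points’ idea used in Table 5.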

7. The harmonic mean of a dataset, H, is calculated by:

$$H = \frac{n}{\sum \frac{1}{x_i}}$$

where n is the number of values in the dataset, and $x_i$ are the values in the dataset [17].

For the analysis, I have given a value of 1/3 to games in which 0 goals were scored, as it is not possible to calculate the harmonic mean with 0 as a value (division by zero).

8. The ‘Moneyball’ strategy was used by the struggling, cash-strapped Oakland A’s baseball team in order to try and compete with bigger, richer clubs in the league. It involved using data analysis to recruit players that were overlooked by other teams, and as such these players were very cheap. The Oakland A’s went on to win 20 consecutive games, an American League record.

[15] T.D.V. Swinscow, Statistics at Square One (British Medical Association, 1980), p. 65
[16] Regression with Two Independent Variables, http://luna.cas.usf.edu/~mbrannic/files/regression/Reg2IV.html
[17] Croxton, Cowden & Klein, Applied General Statistics (Pitman, 1968), p. 182
