
Wednesday, September 6, 2017

Explaining My Simulation Methodology

I have been meaning to get around to this for a while now, and with a break in fixtures for international games this seems like a good time to go over my simulation methodology.

 

Basics:

The model is built on this logic: a soccer match result is determined by goals, and goals are determined by the number of shots and the quality of those shots. So I have built the model to estimate 1) the number of shots each team will have and 2) roughly where those shots will be taken and how good they are.

For simplicity I group shots into three location buckets: Danger Zone (6 yard box + middle of the 18 yard box), Wide Box (wide areas of the box) and Outside the Box.

I also estimate the number of headers that will occur in the game, and I assume that all headers come from the danger zone (about 95% of headers occur in this area). All other shots are treated as shots from feet.

Lastly I estimate the number of Big Chances each team will have per game. For simplicity I also assume that all of these will occur in the danger zone (about 75% of big chances occur in the danger zone).

 

Determining Values:

To arrive at the values for each I have taken data from the last two seasons plus the current season. I then weight the data to get to a single value. The current weighting is 1 for 2016-17, 0.5 for 2015-16 and Games Played/19 for the current season. This week there have been 3 matches, so the current-season weighting is 0.16, and it will go up every week.

You could certainly pick different weights, but my thinking is this: last season is the baseline for each team. Two years ago is half as important, because there can be quite a bit of turnover in a squad in that time, but it still provides information. The current season gets a sliding scale that puts it on equal terms with the previous season at the halfway point, after which it has the largest weighting.

I use this weighting on both offensive and defensive statistics, for overall as well as home and away splits.
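As a rough illustration, the weighting could be sketched in Python like this. It assumes the weighted seasons are combined as a weighted average (the exact combination is not spelled out above), and the function name and the per-season input values are made up for the example.

```python
def blended_value(current, last_season, two_seasons_ago, games_played):
    """Blend one stat across three seasons into a single value.

    Weights follow the description above: games_played / 19 for the
    current season, 1 for last season and 0.5 for two seasons ago.
    Combining them as a weighted average is an assumption.
    """
    weights = [games_played / 19, 1.0, 0.5]
    values = [current, last_season, two_seasons_ago]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical Danger Zone shots per game for one team after 3 matches.
print(round(blended_value(7.0, 6.5, 6.0, games_played=3), 2))
```

After 19 matches the current season carries the same weight as last season, and from then on it carries the most.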

These values then feed into the simulation model to determine the number and quality of shots for each team.

Using Arsenal vs Bournemouth as an example:

Arsenal have 6.42 Danger Zone shots overall and 7.41 at home, while Bournemouth allow 5.59 overall and 3.61 on the road. Averaging all of these I have Arsenal with a raw value of 5.69 Danger Zone shots. Taking out the expected headers and Big Chances (same methodology as above), Arsenal are left with 1.58 regular Danger Zone shots from feet. The decimal portion of the shot total is then compared to a random number, and if the decimal is higher than the random number the shot total is rounded up.
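The rounding step can be sketched as follows. This is only a minimal Python illustration of the idea; the function name is hypothetical and the seed is just there to make the example repeatable.

```python
import math
import random


def stochastic_round(value, rng=random):
    """Round up when the decimal part beats a random draw, else round down."""
    whole = math.floor(value)
    decimal = value - whole
    # Round up when the decimal part is higher than the random number.
    return whole + (1 if decimal > rng.random() else 0)


rng = random.Random(42)
draws = [stochastic_round(1.58, rng) for _ in range(10_000)]
# Every draw is 1 or 2, and the long-run average sits close to 1.58.
print(min(draws), max(draws), sum(draws) / len(draws))
```

Rounding this way keeps the average number of shots equal to the estimate while still giving a whole number of shots in each simulation.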

Doing this for all the different shot categories, Arsenal are estimated to have 15.03 shots (1.58 DZ, 4.37 WB, 4.97 OB, 2.3 headers, 1.81 BC), which can vary between 12 and 17 in a given simulation, compared to 10.65 shots (0.98 DZ, 2.29 WB, 4.32 OB, 1.78 headers, 1.27 BC) for Bournemouth, which can vary between 8 and 13.

Once the number of shots is determined, each one is assigned an xG value: Danger Zone shots are 0.17 xG, Wide Box 0.06 xG, Outside the Box 0.024 xG, headers 0.08 xG and Big Chances 0.45 xG.

 

Simulating the match:

These values are assigned to each shot and compared to a random number. The shot counts as a goal (Result 1) when the random number is lower than the shot's value. Again to our example:

Arsenal
Shot Type   Value   Random     Result
DZ          0.17    0.610036   0
DZ          0.17    0.172277   0
WB          0.06    0.303131   0
WB          0.06    0.267087   0
WB          0.06    0.068808   0
WB          0.06    0.6799     0
OB          0.024   0.715029   0
OB          0.024   0.012071   1
OB          0.024   0.577135   0
OB          0.024   0.657269   0
OB          0.024   0.936911   0
H           0.08    0.356094   0
H           0.08    0.022968   1
BC          0.45    0.657358   0
BC          0.45    0.545432   0

Based on these results Arsenal would have scored 2 goals in this simulation.
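To make the comparison concrete, here is a small Python sketch that replays the table above. The variable and function names are made up for the example; the values and random draws are copied from the table.

```python
# (shot type, xG value, random draw) for Arsenal, copied from the table.
ARSENAL_SHOTS = [
    ("DZ", 0.17, 0.610036), ("DZ", 0.17, 0.172277),
    ("WB", 0.06, 0.303131), ("WB", 0.06, 0.267087),
    ("WB", 0.06, 0.068808), ("WB", 0.06, 0.6799),
    ("OB", 0.024, 0.715029), ("OB", 0.024, 0.012071),
    ("OB", 0.024, 0.577135), ("OB", 0.024, 0.657269),
    ("OB", 0.024, 0.936911),
    ("H", 0.08, 0.356094), ("H", 0.08, 0.022968),
    ("BC", 0.45, 0.657358), ("BC", 0.45, 0.545432),
]


def goals_scored(shots):
    """Count the shots whose random draw falls below the shot's xG value."""
    return sum(1 for _, xg, draw in shots if draw < xg)


print(goals_scored(ARSENAL_SHOTS))  # 2
```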

This is done for both teams, the goals scored are compared and the result is recorded. The simulation is then run another 9,999 times, each time comparing the decimals and shot values to a new set of random numbers to capture the randomness that happens in a match. The odds that I present are the count of each result divided by the number of simulations.

Again to the example of Arsenal vs Bournemouth, Arsenal won 5,327 of the simulations, there was a draw 2,190 times and Bournemouth won 2,483 times. So the odds for the match would be presented as 53.3% for Arsenal, 21.9% Draw, 24.8% Bournemouth.
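The whole match loop could be sketched in Python along these lines. This is a simplified illustration using the shot estimates and xG values quoted earlier, not the actual model, so it will not reproduce the exact percentages above. All names are hypothetical.

```python
import math
import random

XG_PER_SHOT = {"DZ": 0.17, "WB": 0.06, "OB": 0.024, "H": 0.08, "BC": 0.45}
ARSENAL = {"DZ": 1.58, "WB": 4.37, "OB": 4.97, "H": 2.3, "BC": 1.81}
BOURNEMOUTH = {"DZ": 0.98, "WB": 2.29, "OB": 4.32, "H": 1.78, "BC": 1.27}


def simulate_goals(expected_shots, rng):
    """Simulate one team's goals in one match."""
    goals = 0
    for shot_type, expected in expected_shots.items():
        whole = math.floor(expected)
        # Round the shot count up when the decimal beats a random draw.
        shots = whole + (1 if expected - whole > rng.random() else 0)
        for _ in range(shots):
            # A shot is a goal when the random draw is below its xG value.
            if rng.random() < XG_PER_SHOT[shot_type]:
                goals += 1
    return goals


def match_odds(home, away, n_sims=10_000, seed=None):
    """Share of simulations ending in a home win, draw and away win."""
    rng = random.Random(seed)
    counts = {"home": 0, "draw": 0, "away": 0}
    for _ in range(n_sims):
        home_goals = simulate_goals(home, rng)
        away_goals = simulate_goals(away, rng)
        if home_goals > away_goals:
            counts["home"] += 1
        elif home_goals == away_goals:
            counts["draw"] += 1
        else:
            counts["away"] += 1
    return {result: count / n_sims for result, count in counts.items()}


odds = match_odds(ARSENAL, BOURNEMOUTH, seed=1)
print(odds)
```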

 

Simulating the Season:

For each of the remaining matches in the season the odds are determined using the above method and a similar exercise is performed to simulate the season. I use this to give odds for each team winning the league or finishing top 4 and other targets.

To help illustrate I will again use the Arsenal vs Bournemouth example. A random number is generated and compared to the odds laid out as ranges: home win from 0 to 0.533, draw from 0.533 to 0.752 and away win from 0.752 to 1. I got 0.5058 as my random number, so Arsenal have been simulated as the winner.
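Turning a random number into a result is a lookup against cumulative odds. A minimal Python sketch, with a hypothetical function name:

```python
def pick_outcome(draw, p_home, p_draw):
    """Map a random number in [0, 1) to a result using cumulative odds."""
    if draw < p_home:
        return "home"
    if draw < p_home + p_draw:
        return "draw"
    return "away"


# Arsenal vs Bournemouth with the random number from the example.
print(pick_outcome(0.5058, p_home=0.533, p_draw=0.219))  # home
```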

This is done for each match, and the number of wins, draws and losses is recorded as well as the points and where each team finished in the table. This is done another 9,999 times to simulate the season 10,000 times, and the results are then presented as the simulated odds.
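A stripped-down version of the season loop might look like this in Python. The fixtures, team names and odds are invented for the example, points already earned are ignored, and ties at the top are broken at random, which is a simplification (a real table would use goal difference).

```python
import random
from collections import Counter

# Hypothetical remaining fixtures: (home, away, p_home_win, p_draw).
FIXTURES = [
    ("Team A", "Team B", 0.533, 0.219),
    ("Team B", "Team C", 0.40, 0.27),
    ("Team C", "Team A", 0.30, 0.25),
]


def simulate_season(fixtures, rng):
    """Play every fixture once and return the points per team."""
    points = Counter()
    for home, away, p_home, p_draw in fixtures:
        draw = rng.random()
        if draw < p_home:
            points[home] += 3
        elif draw < p_home + p_draw:
            points[home] += 1
            points[away] += 1
        else:
            points[away] += 3
    return points


def title_odds(fixtures, n_sims=10_000, seed=None):
    """Share of simulated seasons in which each team finishes top."""
    rng = random.Random(seed)
    teams = sorted({team for fixture in fixtures for team in fixture[:2]})
    titles = Counter()
    for _ in range(n_sims):
        points = simulate_season(fixtures, rng)
        best = max(points[team] for team in teams)
        leaders = [team for team in teams if points[team] == best]
        titles[rng.choice(leaders)] += 1  # random tie-break (assumption)
    return {team: titles[team] / n_sims for team in teams}


print(title_odds(FIXTURES, seed=0))
```

The same counting works for top 4 or any other target by recording finishing positions instead of only the winner.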

The latest simulation makes Manchester City the title favorites winning the title in 32.7% of simulations.

 

Team Strength:

Earlier today on Twitter I posted something new that is related to my simulation work. I called it team strength rankings.
Here is how that is derived.

Each team's overall shot spread is multiplied by the assigned values (basically a simplified xG value per game) and then compared to league average. Using Arsenal as an example, they have an estimated 1.8 xG per game overall compared to 1.3 for the league average. I then divide the team value by the league average and multiply by 100 to give the value in the tweet, where 100 is league average and every point above or below represents 1% above or below the league average.
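As a quick worked example of the index, using the rounded Arsenal figures from the paragraph above (the function name is hypothetical):

```python
def strength_index(team_value, league_average):
    """Team value as a percentage of league average (100 = average)."""
    return team_value / league_average * 100


# Arsenal: 1.8 xG per game against a league average of 1.3.
print(round(strength_index(1.8, 1.3)))  # 138
```

So on these rounded inputs Arsenal's offense would sit about 38% above league average.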

For the overall ranking, I average the offense and defense values and compare that to league average to determine overall rank.

This is a new thing for me so this might need tweaking as I continue on.

Please let me know if things need further clarification or if I missed anything.

Monday, August 14, 2017

My xG models

As I restart my weekly updates of the stats that I post on Tableau, I wanted to get around to writing about my different xG models.

I have already written about where I get my data from (that post doesn't cover StrataBet data, but that isn't in the xG model or any of the stuff on Tableau right now) and wanted to go into depth on how these numbers are derived.

First, I have 4 different xG models. They each vary slightly and depend on the information at hand. They are all derived using a logistic regression based on Premier League data from 2015-16 and data from the big 5 European leagues + Champions League from 2016-17.

The first and most used is what I started out calling Chance Quality last season, before I realized that name would never catch on and switched to following everyone else in calling it xG. In Tableau this will be labeled xG.

Here is the formula for this:

1-(1/(1+e^(-2.5+(Feet*-0.29)+(Head*-0.71)+(Very Close Range*1.28)+(Six Yard Box*0.39)+(Center of the Box*0.45)+(Wide Box*-0.24)+(Outside Box*-0.86)+(Long Range*-1.09)+(More than 35 Yards*-1.59)+(Difficult Angle*0.16)+(Set Piece*0.88)+(Direct Free Kick*1.91)+(Corner Kick*0.71)+(Throughball Assist*0.31)+(Cross Assist*-0.25)+(Headed Pass Assist*-0.28)+(Fast Break*1.59)+(Opta Big Chance*1.94))))

All of the variables are either 1 for yes or 0 for no. At its simplest, the model looks at where a shot was taken (based on fairly large buckets), whether it came from a set piece or open play, how it was assisted and whether it was a fast break or big chance. Overall I have found that this pretty accurately reflects the work done by others and gives numbers that are very reasonable.

The next xG model looks at Shots on Target. It uses the same variables as above but is based only on shots on target (very simply, saved shots or goals, no blocks). I created this because I believe that shooting accuracy is a skill (how much of a skill compared to luck or random variation ¯\_(ツ)_/¯) and also as a way to get a rough measure of goalkeepers.
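For anyone who wants to play with the numbers, the formula above can be written as a short Python function. This is just a sketch: the coefficients are copied from the formula, while the names and the example shot are made up.

```python
import math

INTERCEPT = -2.5
COEFFICIENTS = {
    "Feet": -0.29, "Head": -0.71,
    "Very Close Range": 1.28, "Six Yard Box": 0.39,
    "Center of the Box": 0.45, "Wide Box": -0.24,
    "Outside Box": -0.86, "Long Range": -1.09,
    "More than 35 Yards": -1.59, "Difficult Angle": 0.16,
    "Set Piece": 0.88, "Direct Free Kick": 1.91, "Corner Kick": 0.71,
    "Throughball Assist": 0.31, "Cross Assist": -0.25,
    "Headed Pass Assist": -0.28, "Fast Break": 1.59,
    "Opta Big Chance": 1.94,
}


def shot_xg(flags):
    """xG for one shot; `flags` holds the variables that are 1 (yes)."""
    z = INTERCEPT + sum(COEFFICIENTS[name] for name in flags)
    return 1 - 1 / (1 + math.exp(z))


# Hypothetical shot: with the foot, center of the box, Opta big chance.
print(round(shot_xg({"Feet", "Center of the Box", "Opta Big Chance"}), 3))  # 0.401
```

Variables left out of the set count as 0, so a shot with no flags at all comes out at the intercept-only value of roughly 0.076.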

The formula for this is:

On Target*(1-(1/(1+e^(-1.64+(Feet*-0.54)+(Head*-1.23)+(Very Close Range*1.99)+(Six Yard Box*1.13)+(Center of the Box*1.15)+(Wide Box*0.45)+(Outside Box*-0.07)+(Long Range*-0.95)+(More than 35 Yards*-1.28)+(Difficult Angle*0.46)+(Set Piece*1.33)+(Direct Free Kick*1.95)+(Corner Kick*1.32)+(Throughball Assist*0.05)+(Cross Assist*-0.004)+(Headed Pass Assist*-0.37)+(Fast Break*2.78)+(Opta Big Chance*1.33)))))

The next xG is for when I do not have Big Chance data. Collecting the big chance data is a big pain in the ass. It is very helpful for building the model, but it is all collected by hand and it isn't always available at the exact moment I want it. There are also competitions for which it isn't published, and I wanted to be able to do analysis on those competitions as well. This uses the same variables except for big chances.

The formula for this is:

1-(1/(1+e^(-2.5+(Feet*0.09)+(Head*-0.46)+(Very Close Range*2.58)+(Six Yard Box*1.18)+(Center of the Box*0.84)+(Wide Box*-0.32)+(Outside Box*-1.07)+(Long Range*-1.29)+(More than 35 Yards*-1.8)+(Difficult Angle*0.11)+(Set Piece*0.86)+(Direct Free Kick*1.88)+(Corner Kick*0.51)+(Throughball Assist*1.19)+(Cross Assist*-0.33)+(Headed Pass Assist*-0.29)+(Fast Break*1.9))))

The last model is the newest one that I have created, and I hope to be able to replace it with the very first one (or even combine the two). It is used to create the xG Chain stat that I just introduced. This is a very rough model with just six variables, but it is good enough that I felt okay publishing it and continuing to work on it in the meantime.

The formula for this is:

1-(1/(1+e^(-8.26+(Feet*-0.54)+(Head*-1.23)+(X Coordinate*0.06)+(Y Coordinate*0.000074)+(Fast Break*0.93)+(Opta Big Chance*2.37))))

At the very least I think that I should be changing the coordinates to distance from the center of the goal and maybe adding the angle. We will see how things work out and I am open to suggestions.
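One possible way to make that change is sketched below in Python. It assumes the coordinates are percentages of the pitch (0 to 100) with the goal being attacked centered at (100, 50), and a 105 by 68 meter pitch; neither is stated above, so treat both as assumptions. The function name is hypothetical.

```python
import math


def distance_and_angle(x, y, pitch_length=105.0, pitch_width=68.0,
                       goal_width=7.32):
    """Return (distance to goal center in meters, goal angle in radians).

    Assumes x and y are percentages of the pitch with the attacked goal
    centered at (100, 50). The angle is the one the two posts make as
    seen from the shot location.
    """
    dx = (100 - x) / 100 * pitch_length  # meters from the goal line
    dy = (y - 50) / 100 * pitch_width    # meters from the center line
    distance = math.hypot(dx, dy)
    angle = math.atan2(goal_width * dx,
                       dx ** 2 + dy ** 2 - (goal_width / 2) ** 2)
    return distance, angle


# Penalty spot: 11 meters out, level with the center of the goal.
penalty_x = 100 - 11 / 105 * 100
print(distance_and_angle(penalty_x, 50))
```

Distance and angle could then take the place of the raw X and Y terms when refitting the regression.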