Showing posts with label Admin. Show all posts

Thursday, October 12, 2017

Creating Radar Charts in Tableau - A How To

During the international breaks I like to try creating something new. It is nice to get away from the day to day of the club schedule and play around with data.

During this international break I have worked on figuring out a way to replicate Ted Knutson's very popular Radar charts.

My knockoff seems to be fairly popular as well, so I wanted to give a bit of a rundown on how to create them should you ever want to try it yourself.

This is based on this template that was posted on the Tableau blog.

The first step is getting your data ready. For this walk through I am using the creation of my midfield template.

To have the data play nicely with the radar it all must be normalized to the same scale. To do this I will make everything run between 0 and 100. So I identify each set of data and the minimum and maximum value to be used to calculate the value for the radar.

Stat Name M Min Max
Pass% Value M 74 90
Key Passes Value M 0.7 2.5
PPVA Value 0.03 0.55
xG Buildup Value 0.1 0.6
xG+xA Value 0.1 0.5
Drib Value M 0.5 2.1
Disp Value M 2.4 0.5
Fouls Value 2.4 0.6
DribPast% Value 60 20
Suc Tackles Value 1 3
Int Value 1 3
Suc Long Balls Value 0.53

Narrowing the list down to 12 values (I even threw in one more than Ted!) is very tough, but I feel that this gives a good overview of the different things you want to measure from a midfielder. It has passing, chance creation, overall scoring contribution, dribbles and ball retention, and some defense. It isn't perfect because nothing is, and when you make your own you can go crazy with whatever you want to include. The minimum and maximum are set at the 5th and 95th percentiles of each stat. For stats where lower is better (Disp, Fouls and DribPast%) the two are flipped, which is why the Min is larger than the Max in those rows.


Above is an example of normalizing Pass% to a 0 to 100 scale. First I have an IF statement for values greater than 0.9 (capped at 100), then for values below 0.74 (floored at 0), and then finally for values in between. For those in-between values you subtract the minimum value, divide by the spread between minimum and maximum, and multiply by 100.
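If it helps to see the normalization outside of Tableau, here is a minimal Python sketch of the same idea. The function name `normalize` is my own, and it uses a clamp instead of the three-branch IF statement, which should give the same result. It also handles the flipped stats where the Min is larger than the Max.

```python
def normalize(value, min_value, max_value):
    """Scale a stat to 0-100 between min_value and max_value, capping both ends.

    Also works when min_value > max_value (stats where lower is better),
    because the sign of the spread flips along with the numerator.
    """
    scaled = (value - min_value) / (max_value - min_value) * 100
    return max(0.0, min(100.0, scaled))


# Pass% uses 0.74 as the minimum and 0.90 as the maximum
print(normalize(0.82, 0.74, 0.90))  # halfway between the two bounds
print(normalize(0.95, 0.74, 0.90))  # above the maximum, so capped
# Disp is a "lower is better" stat: Min 2.4, Max 0.5
print(normalize(1.45, 2.4, 0.5))
```

The clamp is just a shorter way of writing the "greater than the max" and "less than the min" branches.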

When done it should look something like this:

So you go through and do this for all of the stats you want to include in the radar. Once you are done with that, go to Analysis -> View Data and select all of the data to copy into an Excel sheet.


In the Excel sheet you will add a column for the order in which you want each stat to show up on the radar. For this template each stat has this order:

Stat Name M Order
Pass% Value M 1
Key Passes Value M 2
PPVA Value 3
xG Buildup Value 4
xG+xA Value 5
Drib Value M 6
Disp Value M 7
Fouls Value 8
DribPast% Value 9
Suc Tackles Value 10
Int Value 11
Suc Long Balls Value 12
The next step is to create the radar in Tableau. So create a new worksheet and add the newly created Excel data as a data connection.

Next is to create the X value we will use. Create a calculated field (I am naming mine X-M) and enter the X coordinate for each stat:

// X coordinate: value times the sine of the spoke angle (12 spokes, 30 degrees apart, clockwise from the top)
CASE [Stat Name M]
WHEN "Pass% Value M" THEN 0
WHEN "Key Passes Value M" THEN [Per90 Stat Value] *(1/2)
WHEN "PPVA Value" THEN [Per90 Stat Value] *(sqrt(3)/2)
WHEN "xG Buildup Value" THEN [Per90 Stat Value]
WHEN "xG+xA Value" THEN [Per90 Stat Value] *(sqrt(3)/2)
WHEN "Drib Value M" THEN [Per90 Stat Value] *(1/2)
WHEN "Disp Value M" THEN 0
WHEN "Fouls Value" THEN [Per90 Stat Value] *(-1/2)
WHEN "DribPast% Value" THEN [Per90 Stat Value] *(-sqrt(3)/2)
WHEN "Suc Tackles Value" THEN [Per90 Stat Value] *(-1)
WHEN "Int Value" THEN [Per90 Stat Value] *(-sqrt(3)/2)
WHEN "Suc Long Balls Value" THEN [Per90 Stat Value] *(-1/2)
END


And then the same with the Y values (I am naming this one Y-M):

// Y coordinate: value times the cosine of the same spoke angle
CASE [Stat Name M]
WHEN "Pass% Value M" THEN [Per90 Stat Value]
WHEN "Key Passes Value M" THEN [Per90 Stat Value] *(sqrt(3)/2)
WHEN "PPVA Value" THEN [Per90 Stat Value] *(1/2)
WHEN "xG Buildup Value" THEN 0
WHEN "xG+xA Value" THEN [Per90 Stat Value] *(-1/2)
WHEN "Drib Value M" THEN [Per90 Stat Value] *(-sqrt(3)/2)
WHEN "Disp Value M" THEN [Per90 Stat Value] *(-1)
WHEN "Fouls Value" THEN [Per90 Stat Value] *(-sqrt(3)/2)
WHEN "DribPast% Value" THEN [Per90 Stat Value] *(-1/2)
WHEN "Suc Tackles Value" THEN 0
WHEN "Int Value" THEN [Per90 Stat Value] *(1/2)
WHEN "Suc Long Balls Value" THEN [Per90 Stat Value] *(sqrt(3)/2)
END
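The multipliers in the two CASE statements are just the sine and cosine of each stat's angle around the circle, starting at the top and moving clockwise in 30 degree steps. Here is a small Python sketch of that relationship, in case you want a different number of stats. The function name `radar_xy` and its parameters are my own for illustration; `order` is the 1-based Order value from the Excel sheet.

```python
import math


def radar_xy(order, value, n_stats=12):
    """Return (x, y) for a stat placed clockwise from the top of the radar.

    order: 1-based position of the stat around the radar
    value: the normalized 0-100 stat value
    """
    angle = 2 * math.pi * (order - 1) / n_stats
    return value * math.sin(angle), value * math.cos(angle)


# Order 2 (Key Passes) sits 30 degrees from the top:
# x = value * 1/2, y = value * sqrt(3)/2, matching the CASE statements.
x, y = radar_xy(2, 100)
print(round(x, 2), round(y, 2))
```

With 12 stats this gives back the same 0, 1/2, sqrt(3)/2 and 1 multipliers used in the calculated fields.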
Next you will add the X-M to the columns aggregated as an average and Y-M to the rows also aggregated as an average.

Next, drag the Stat Name M dimension to the Marks card.

Then convert the mark type from Automatic to Polygon.

Then drag the Order measure to Path on the Marks card so the points are connected in order, which fixes the weird shapes.


Now we have something that looks like a radar! However, right now it is showing all of the values and we need to fix it to show only one player at a time. To do this drag the Players dimension to the Filters shelf. I then show the filter on the worksheet as well, change it to select only a single value, and customize it to not show the "All" value.

It should look something like this now:


We are very close now. Now we will add the background image for the radar. For this I took the blank template from here and then added the values around the circle for the stats. It looks like this:
Next it will be added into Tableau. To do this go to Map -> Background Images and select your data sheet. Then select where you saved the image and put in the matching coordinates.

Now we should have a real radar, with just some minor formatting left to make it look pretty!

The formatting that I like to do is to change the color opacity to 65% to be able to see the numbers on the image behind, have each team be assigned a color, remove the axis labels and the center lines. I then put them all in a dashboard with the Per 90 Stats and Minutes information to complete everything. The final product should look like this:


You can find the dashboard and play around with it here

Wednesday, September 6, 2017

Explaining My Simulation Methodology

I have been meaning to get around to this for a while now and with a break in fixtures for international team games this seems like a good time to go over my simulation methodology.

 

Basics:

The model is built on this logic: a soccer match result is determined by goals, and goals are determined by the number of shots and the quality of those shots. So I have built the model to estimate 1) the number of shots each team will have and 2) roughly where those shots will be taken and the quality of the shots.

For simplicity I group shots into three location buckets: Danger Zone (6 yard box + middle of the 18 yard box), Wide Box (wide areas of the box) and Outside the Box.

I also estimate the number of headers that will occur in the game and I assume that all headers will be from the danger zone (about 95% of headers occur in this area). Shots from all other areas are assumed to be from feet.

Lastly I estimate the number of Big Chances each team will have per game. For simplicity I also assume that all of these will occur in the danger zone (about 75% of big chances occur in the danger zone).

 

Determining Values:

To arrive at the values for each I have taken data from the last two seasons plus the current season. I then weight the data to get to a single value. The current weighting is 1 for 2016-17, 0.5 for 2015-16 and Games Played/19 for the current season. This week there have been 3 matches, so the current season's weighting is 3/19, or about 0.16, and this will go up every week.

You could certainly pick different weights for this. My thinking is that last season is the baseline for each team. Two years ago is half as important, because there can be quite a bit of turnover in a squad in that time, but it can still provide information. The current season gets a sliding scale that puts it on equal terms with the previous season at the halfway point, after which it has the largest weighting.
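Here is a rough Python sketch of that weighting. The post only says the data is weighted to get to a single value, so combining the seasons as a weighted average is my assumption, and the function names and season labels are made up for illustration.

```python
def season_weights(games_played):
    """Weights described above: 0.5, 1, and Games Played/19 for the current season."""
    return {"2015-16": 0.5, "2016-17": 1.0, "current": games_played / 19}


def weighted_value(values, weights):
    """Weighted average of a per-game stat across seasons (assumed combination rule)."""
    total_weight = sum(weights[season] for season in values)
    return sum(values[season] * weights[season] for season in values) / total_weight


# After 3 matches the current season counts for about 0.16
weights = season_weights(3)
print(round(weights["current"], 2))

# Hypothetical per-game values for one stat, one per season
print(weighted_value({"2015-16": 4.0, "2016-17": 6.0, "current": 8.0}, weights))
```

At 19 games played the current season reaches a weight of 1, the same as last season, and it keeps growing from there.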

I use this weighting on the data for both offensive and defensive statistics as well as overall and home and away.

These values then feed into the simulation model to determine the number and quality of shots for each team.

Using Arsenal vs Bournemouth as an example:

Arsenal have 6.42 Danger Zone shots per game overall and 7.41 at home, while Bournemouth allow 5.59 overall and 3.61 Danger Zone shots on the road. Averaging all of these I have Arsenal with a raw value of 5.69 Danger Zone shots. Taking out the expected headers and Big Chances (same methodology as above), Arsenal are left with 1.58 regular Danger Zone shots from feet. The decimal portion of the shot total is then compared to a random number, and if the decimal is higher than the random number the shot total is rounded up. Otherwise it is rounded down.
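The rounding step can be sketched in a few lines of Python. This is only an illustration of the idea described above, and the function name `random_round` is my own.

```python
import math
import random


def random_round(expected_shots, rng=random):
    """Round up when the decimal part beats a random number, otherwise round down."""
    whole = math.floor(expected_shots)
    decimal = expected_shots - whole
    return whole + (1 if decimal > rng.random() else 0)


# 1.58 Danger Zone shots becomes 2 shots about 58% of the time and 1 shot otherwise
print([random_round(1.58) for _ in range(10)])
```

Over many simulations the average of the rounded values comes back to roughly the original decimal value.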

Doing this for all the different shot categories, Arsenal are estimated to have 15.03 shots (1.58 DZ, 4.37 WB, 4.97 OB, 2.3 headers, 1.81 BC), which can vary between 12 and 17 shots depending on how the decimals are rounded. Bournemouth are estimated to have 10.65 shots (0.98 DZ, 2.29 WB, 4.32 OB, 1.78 headers, 1.27 BC), which can vary between 8 and 13 shots.

Once the number of shots is determined each one is assigned an xG value. Danger Zone shots are 0.17 xG, Wide Box 0.06 xG, Outside the Box 0.024 xG, headers 0.08 xG and big chances 0.45 xG.

 

Simulating the match:

These values are assigned to each shot and compared to a random number. If the random number is lower than the shot's xG value, the shot is counted as a goal. Again to our example:

Arsenal
Shot Type Value Random Result
DZ 0.17 0.610036 0
DZ 0.17 0.172277 0
WB 0.06 0.303131 0
WB 0.06 0.267087 0
WB 0.06 0.068808 0
WB 0.06 0.6799 0
OB 0.024 0.715029 0
OB 0.024 0.012071 1
OB 0.024 0.577135 0
OB 0.024 0.657269 0
OB 0.024 0.936911 0
H 0.08 0.356094 0
H 0.08 0.022968 1
BC 0.45 0.657358 0
BC 0.45 0.545432 0

Based on these results Arsenal would have scored 2 goals in this simulation.

This is done for both teams, the goals scored are compared and the result is recorded. The simulation is then run another 9,999 times, with the shot decimals compared to a new set of random numbers each time to simulate a bit of the randomness that happens. The odds that I present are the count of each result divided by the number of simulations.

Again to the example of Arsenal vs Bournemouth, Arsenal won 5,327 of the simulations, there was a draw 2,190 times and Bournemouth won 2,483 times. So the odds for the match would be presented as 53.3% for Arsenal, 21.9% Draw, 24.8% Bournemouth.
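Putting the pieces together, a single-match simulation could look something like the Python sketch below. It uses the shot estimates and xG values from this post, but the function names and structure are my own, and it will not reproduce the exact percentages above because the random numbers differ from run to run.

```python
import math
import random

# xG assigned to each shot type (from this post)
XG_PER_SHOT = {"DZ": 0.17, "WB": 0.06, "OB": 0.024, "H": 0.08, "BC": 0.45}

# Expected shots by type for the example match (from this post)
ARSENAL = {"DZ": 1.58, "WB": 4.37, "OB": 4.97, "H": 2.3, "BC": 1.81}
BOURNEMOUTH = {"DZ": 0.98, "WB": 2.29, "OB": 4.32, "H": 1.78, "BC": 1.27}


def simulate_goals(expected_shots, rng):
    """Simulate one team's goals in one match."""
    goals = 0
    for shot_type, expected in expected_shots.items():
        # Round the shot count up or down using its decimal part
        whole = math.floor(expected)
        shots = whole + (1 if expected - whole > rng.random() else 0)
        # A shot is a goal when the random number is below its xG
        for _ in range(shots):
            if rng.random() < XG_PER_SHOT[shot_type]:
                goals += 1
    return goals


def match_odds(home, away, n_sims=10_000, seed=None):
    """Share of simulations that end in a home win, draw or away win."""
    rng = random.Random(seed)
    counts = {"home": 0, "draw": 0, "away": 0}
    for _ in range(n_sims):
        home_goals = simulate_goals(home, rng)
        away_goals = simulate_goals(away, rng)
        if home_goals > away_goals:
            counts["home"] += 1
        elif away_goals > home_goals:
            counts["away"] += 1
        else:
            counts["draw"] += 1
    return {result: count / n_sims for result, count in counts.items()}


print(match_odds(ARSENAL, BOURNEMOUTH))
```

Passing a `seed` makes a run repeatable, which is handy when checking changes to the inputs.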

 

Simulating the Season:

For each of the remaining matches in the season the odds are determined using the above method and a similar exercise is performed to simulate the season. I use this to give odds for each team winning the league or finishing top 4 and other targets.

To help illustrate I will again use the Arsenal vs Bournemouth example. For this a random number is generated. I got 0.5058 as my random number, and that is compared to the odds ranges: home win from 0 to 0.533, draw from above 0.533 to 0.752, and away win from above 0.752 to 1. With this random number Arsenal have been simulated as the winner.

This is done for each match and the number of wins, draws and losses are recorded as well as the points and where each team finished in the table. This is done another 9,999 times to simulate the season 10,000 times and then the results are presented as the simulated odds.
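As a rough Python sketch of the season step: each match's random number is mapped to a result using the cumulative odds, points are added up, and the champion is counted over many simulated seasons. The function names are my own, the fixture list is a placeholder, and ties on points are broken arbitrarily here, which is a simplification.

```python
import random
from collections import Counter


def pick_result(odds, draw):
    """Map a random number in [0, 1) to a result using cumulative odds."""
    if draw < odds["home"]:
        return "home"
    if draw < odds["home"] + odds["draw"]:
        return "draw"
    return "away"


def title_odds(fixtures, n_sims=10_000, seed=None):
    """fixtures: list of (home_team, away_team, odds) tuples for remaining matches."""
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_sims):
        points = Counter()
        for home, away, odds in fixtures:
            result = pick_result(odds, rng.random())
            if result == "home":
                points[home] += 3
            elif result == "away":
                points[away] += 3
            else:
                points[home] += 1
                points[away] += 1
        # Simplification: ties on points are broken arbitrarily
        titles[max(points, key=points.get)] += 1
    return {team: wins / n_sims for team, wins in titles.items()}


arsenal_bournemouth = {"home": 0.533, "draw": 0.219, "away": 0.248}
print(pick_result(arsenal_bournemouth, 0.5058))  # the random number from the example
```

The same loop can record top 4 finishes or any other target by looking at the sorted points table instead of only the top team.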

The latest simulation makes Manchester City the title favorites winning the title in 32.7% of simulations.

 

Team Strength:

Earlier today on Twitter I posted something new that is related to my simulation work. I called it team strength rankings.
Here is how that is derived.

Each team's overall shot spread is multiplied by the assigned values (basically it is a simplified xG value per game) and then compared to league average. Using Arsenal as an example, they have an estimated 1.8 xG per game overall compared to 1.3 for League Average. I then took the team value divided by league average times 100 to give the value in the tweet where 100 is league average and every point above or below represents 1% above or below the league average.
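The index itself is simple arithmetic. As a tiny Python sketch (the function name is my own), using the Arsenal numbers above:

```python
def strength_index(team_xg_per_game, league_avg_xg_per_game):
    """Team value divided by league average, times 100 (100 = league average)."""
    return team_xg_per_game / league_avg_xg_per_game * 100


# Arsenal: 1.8 xG per game against a league average of 1.3
print(round(strength_index(1.8, 1.3)))
```

So a team at 1.8 xG per game against a 1.3 league average comes out at roughly 138, or about 38% above average.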

For the overall ranking it is the average of the offense and defense with that compared to league average to determine overall rank.

This is a new thing for me so this might need tweaking as I continue on.

Please let me know if things need further clarification or if I missed anything.

Monday, August 14, 2017

My xG models

As I restart my weekly updates of the stats that I post on Tableau I wanted to get around to writing about my different xG models.

I have already written about where I get my data from (that post doesn't cover StrataBet data, but that isn't in the xG model or any of the stuff on Tableau right now) and wanted to go into depth on how these numbers are derived.

First is that I have 4 different xG models. They each vary slightly and are dependent on the information at hand. They are all derived using a logistic regression based on Premier League data from 2015-16 and the big 5 European leagues + Champions League from 2016-17.

The first and most used is what I started out calling Chance Quality last season. I then gave up, realized that name will never catch on, and switched to following everyone else in calling it xG. In Tableau this will be labeled xG.

Here is the formula for this:

1-(1/(1+e^(-2.5+(Feet*-0.29)+(Head*-0.71)+(Very Close Range*1.28)+(Six Yard Box*0.39)+(Center of the Box*0.45)+(Wide Box*-0.24)+(Outside Box*-0.86)+(Long Range*-1.09)+(More than 35 Yards*-1.59)+(Difficult Angle*0.16)+(Set Piece*0.88)+(Direct Free Kick*1.91)+(Corner Kick*0.71)+(Throughball Assist*0.31)+(Cross Assist*-0.25)+(Headed Pass Assist*-0.28)+(Fast Break*1.59)+(Opta Big Chance*1.94))))

All of the variables are either 1 for yes or 0 for no. At its simplest it looks at where a shot was taken based on fairly large buckets, whether it was from a set piece or open play, how it was assisted, and whether it was a fast break or a big chance. Overall I have found that this pretty accurately reflects the work done by others and gives numbers that are very reasonable.
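For anyone who would rather read this as code than as a spreadsheet formula, here is a Python sketch of the first model using the coefficients above. The dictionary layout and the function name `xg` are my own. A shot is passed in as a dictionary of flags, and any variable left out is treated as 0.

```python
import math

INTERCEPT = -2.5
COEFFICIENTS = {
    "Feet": -0.29, "Head": -0.71,
    "Very Close Range": 1.28, "Six Yard Box": 0.39, "Center of the Box": 0.45,
    "Wide Box": -0.24, "Outside Box": -0.86, "Long Range": -1.09,
    "More than 35 Yards": -1.59, "Difficult Angle": 0.16,
    "Set Piece": 0.88, "Direct Free Kick": 1.91, "Corner Kick": 0.71,
    "Throughball Assist": 0.31, "Cross Assist": -0.25, "Headed Pass Assist": -0.28,
    "Fast Break": 1.59, "Opta Big Chance": 1.94,
}


def xg(shot):
    """xG for one shot; shot maps variable names to 1 (yes) or 0 (no)."""
    z = INTERCEPT + sum(coef for name, coef in COEFFICIENTS.items() if shot.get(name))
    # Same form as the formula above: 1 - 1/(1 + e^z)
    return 1 - 1 / (1 + math.exp(z))


# A shot with the feet from the center of the box, with and without a big chance
print(round(xg({"Feet": 1, "Center of the Box": 1}), 3))
print(round(xg({"Feet": 1, "Center of the Box": 1, "Opta Big Chance": 1}), 3))
```

The other models have the same shape, so swapping in their intercept and coefficients should be all that is needed.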

The next xG model looks at Shots on Target. It uses the same variables as above but it is based on only shots on target (for this it is very simply saved shots or goals, no blocks). I created this because I believe that shooting accuracy is a skill (how much of a skill compared to luck or random variation ¯\_(ツ)_/¯) and also as a way to get a rough measure of goalkeepers.

The formula for this is:

On Target*(1-(1/(1+e^(-1.64+(Feet*-0.54)+(Head*-1.23)+(Very Close Range*1.99)+(Six Yard Box*1.13)+(Center of the Box*1.15)+(Wide Box*0.45)+(Outside Box*-0.07)+(Long Range*-0.95)+(More than 35 Yards*-1.28)+(Difficult Angle*0.46)+(Set Piece*1.33)+(Direct Free Kick*1.95)+(Corner Kick*1.32)+(Throughball Assist*0.05)+(Cross Assist*-0.004)+(Headed Pass Assist*-0.37)+(Fast Break*2.78)+(Opta Big Chance*1.33)))))

The next xG is for when I do not have Big Chance data. Collecting the big chance data is a big pain in the ass. It is very helpful for building the model but it is all collected by hand and it isn't always available at the exact moment. There are also competitions for which it isn't published and I wanted to be able to do analysis on those competitions as well. This uses the same variables except for big chances.

The formula for this is:

1-(1/(1+e^(-2.5+(Feet*0.09)+(Head*-0.46)+(Very Close Range*2.58)+(Six Yard Box*1.18)+(Center of the Box*0.84)+(Wide Box*-0.32)+(Outside Box*-1.07)+(Long Range*-1.29)+(More than 35 Yards*-1.8)+(Difficult Angle*0.11)+(Set Piece*0.86)+(Direct Free Kick*1.88)+(Corner Kick*0.51)+(Throughball Assist*1.19)+(Cross Assist*-0.33)+(Headed Pass Assist*-0.29)+(Fast Break*1.9))))

The last model is the newest one that I have created, and I hope to be able to replace it with the very first one (or even combine the two). It is used to create the xG Chain stat that I just introduced. This is a very rough model with just six variables, but it is good enough that I felt okay publishing it and continuing to work on it in the meantime.

The formula for this is:

1-(1/(1+e^(-8.26+(Feet*-0.54)+(Head*-1.23)+(X Coordinate*0.06)+(Y Coordinate*0.000074)+(Fast Break*0.93)+(Opta Big Chance*2.37))))

At the very least I think that I should be changing the coordinates to distance from center of goal and maybe add the angle. We will see how things work out and I am open to suggestions.

Monday, May 8, 2017

How I Get My Data

One of the most frustrating things about soccer (football) is that there is no easy way to get statistics. Sure, you can find match results pretty easily, but if you want anything more complicated most people are out of luck.

I considered looking at the Opta method, which would give a nice fire hose of data to examine, but that seems expensive. This is a hobby that I was really just getting started in and I wasn't willing to commit to spending that kind of money at this point. Maybe some day.

So I decided that, seeing as this is a hobby I do for fun, I don't really have a problem spending my leisure time working on it.

The initial plan was something similar to the work that I had done in college for a project in one of my econometrics classes, where I looked at NFL decisions on 4th down. For that I copied and pasted all the play by play data into Excel and then cleaned the data very inefficiently. I should have learned to write code to help but I was young and dumb.

For soccer I decided that the commentary data that ends up on places like the BBC and ESPNFC looked like the best option. It includes some general shot location: middle of six yard box, wide six yard box, center of the box, wide box, outside the box and long range. The shot location isn't as specific as the Opta data but the price is right on this data. It also includes who took the shot, which is important, and even better it also includes who set up the shot. The other great thing is that it includes context for the shot: which foot or head, and how it was assisted (cross, set piece, free kick, through ball, headed pass and fast break).

So I started off copying and pasting data by hand and then using Excel formulas to extract the relevant data from the play by play information.

The next step was using the awesome web apps that have been developed to help scrape data from the web. I currently use both import.io and ParseHub to collect the data each week, and then I have a number of macros that I have written that take these big CSV files, break them into each match and extract the data that I need.
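To give a feel for what that extraction involves, here is a rough Python sketch of pulling the shooter, body part, location and assist out of one commentary line. The sample line is made up (with placeholder names) in the general style of that kind of commentary, and the phrases, bucket names and function name are all my assumptions, not the actual macros.

```python
import re

# Hypothetical commentary line with placeholder names
SAMPLE = ("Attempt saved. Player A (Team X) right footed shot from the centre "
          "of the box is saved. Assisted by Player B with a cross.")

# (phrase to look for, bucket name); more specific phrases first. All assumed.
LOCATION_BUCKETS = [
    ("side of the six yard box", "wide six yard box"),
    ("six yard box", "middle of six yard box"),
    ("centre of the box", "center of the box"),
    ("side of the box", "wide box"),
    ("outside the box", "outside the box"),
    ("long range", "long range"),
]


def parse_shot(text):
    """Extract basic shot details from one commentary line."""
    lower = text.lower()
    shooter = re.search(r"([^.()]+) \(([^)]+)\)", text)
    assist = re.search(r"Assisted by ([^.]+?)(?: with a ([^.]+))?\.", text)
    if "header" in lower:
        body_part = "head"
    elif "footed" in lower:
        body_part = "foot"
    else:
        body_part = None
    location = next((bucket for phrase, bucket in LOCATION_BUCKETS if phrase in lower), None)
    return {
        "shooter": shooter.group(1).strip() if shooter else None,
        "team": shooter.group(2) if shooter else None,
        "body_part": body_part,
        "location": location,
        "assist": assist.group(1) if assist else None,
        "assist_type": assist.group(2) if assist else None,
    }


print(parse_shot(SAMPLE))
```

Real commentary has plenty of odd cases (names with periods, own goals, penalties), so treat this as a starting point and not a finished parser.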

The next thing was that I really wanted big chance data, and unfortunately that isn't in the play by play commentary. To accomplish that I still go into WhoScored and manually search through all the matches and enter the big chance information into the previous data. This is still a big time suck for me and I would love a better method, but I have not figured one out at the moment. If you have suggestions please let me know.

So that is how I figured out how to get halfway decent shot information, which I know is always one of the biggest questions for people getting started in analytics.