Monday, May 8, 2017

How I Get My Data

One of the most frustrating things about soccer (football) is that there is no easy way to get statistics. Sure, you can find match results pretty easily, but if you want anything more detailed, you are mostly out of luck.

I considered going the Opta route, which would give a nice fire hose of data to examine, but that seems expensive. This is a hobby that I was really just getting started in, and I wasn't willing to commit to spending that kind of money at this point. Maybe some day.

So I decided that, seeing as this is a hobby I do for fun, I don't really have a problem spending my leisure time collecting the data myself.

The initial plan was something similar to the work I had done in college for a project in one of my econometrics classes, where I looked at NFL decisions on 4th down. For that I copied and pasted all the play-by-play data into Excel and then cleaned the data very inefficiently. I should have learned to write code to help, but I was young and dumb.

For soccer I decided that the commentary data that ends up on places like the BBC and ESPNFC looked like the best option. It includes a general shot location: middle of the six yard box, wide six yard box, center of the box, wide box, outside the box and long range. The shot location isn't as specific as the Opta data, but the price is right. It also includes who took the shot, which is important, and even better, who set up the shot. The other great thing is that it includes context for the shot: which foot (or the head) was used and how the shot was assisted, whether by a cross, set piece, free kick, through ball, headed pass or fast break.
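As a rough illustration, a commentary line like this could be pulled apart with a regular expression. This is only a sketch in Python, not the spreadsheet approach used for this blog. The sample line, the player and team names in it, and the exact wording the patterns expect are all assumptions made up for the example, so real commentary from a given site would likely need different patterns.

```python
import re

# ASSUMPTION: the phrasing matched here is hypothetical and only loosely
# modelled on typical text commentary; adjust it to the real source.
SHOT_RE = re.compile(
    r"^(?P<outcome>Attempt saved|Attempt missed|Attempt blocked|Goal)[.!]\s+"
    r"(?P<player>[^(]+?)\s+\((?P<team>[^)]+)\)\s+"
    r"(?P<body_part>right footed shot|left footed shot|header)\s+"
    r"from\s+(?P<location>.+?)\s+(?:is|misses|to)\b"
)
ASSIST_RE = re.compile(
    r"Assisted by (?P<assister>.+?)"
    r"(?: with (?:a |an )?(?P<assist_type>[^.]+?))?\.\s*$"
)


def parse_shot(line):
    """Return a dict describing the shot, or None if the line is not a shot."""
    text = line.strip()
    shot = SHOT_RE.match(text)
    if shot is None:
        return None
    record = shot.groupdict()
    # The assist sentence is optional, so both fields may end up as None.
    assist = ASSIST_RE.search(text)
    record["assister"] = assist.group("assister") if assist else None
    record["assist_type"] = assist.group("assist_type") if assist else None
    return record


# Made-up sample line (names are placeholders, not real match data).
sample_line = (
    "Attempt saved. A. Striker (Home FC) right footed shot from the centre "
    "of the box is saved. Assisted by B. Winger with a through ball."
)
shot_record = parse_shot(sample_line)
```

Lines that are not shots (fouls, substitutions and so on) simply come back as None, which makes it easy to filter a whole match down to its shots.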

So I started off copying and pasting data by hand and then using Excel formulas to extract the relevant data from the play-by-play information.

The next step was using the awesome web apps that have been developed to help scrape data from the web. I currently use both import.io and ParseHub to collect the data each week, and then I have a number of macros I have written that take these big CSV files, break them up by match and extract the data that I need.
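For anyone who would rather script the splitting step than write macros, here is a minimal sketch in Python of breaking one combined export into per-match groups. It is not the macros mentioned above. The column names (match_url, minute, commentary) and the sample rows are hypothetical; a real import.io or ParseHub export will have whatever columns the scraper was set up to produce.

```python
import csv
import io
from collections import defaultdict


def split_by_match(csv_text, key="match_url"):
    """Group the rows of one combined export into a dict keyed by match.

    ASSUMPTION: `key` names a column that identifies the match; the
    default column name here is made up for the example.
    """
    matches = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        matches[row[key]].append(row)
    return dict(matches)


# Made-up sample standing in for a weekly export with several matches mixed together.
sample_export = (
    "match_url,minute,commentary\n"
    "match-1,12,Attempt saved.\n"
    "match-2,3,Foul.\n"
    "match-1,40,Goal!\n"
)
by_match = split_by_match(sample_export)
```

Each value in by_match is the list of commentary rows for one match, in the order they appeared in the export, ready to be written out to its own file or passed on for shot extraction.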

The next thing was that I really wanted big chance data, and unfortunately that isn't in the play-by-play commentary. To get it I still go into WhoScored, manually search through all the matches and add the big chance information to the data I already have. This is still a big time suck for me and I would love a better method, but I have not figured one out at the moment. If you have suggestions, please let me know.

So that is how I figured out how to get halfway decent shot information, which I know is always one of the biggest questions for people getting started in analytics.
