Creating an expected passes model in soccer

If you’ve read about hockey or soccer analytics lately, you’ve no doubt come across a metric called expected goals (xG). If you haven’t, here’s a brief explanation:

Every shot has some probability of going in. By gathering lots of shot data we can find which kinds of shots are more likely to score. From this data we can assign each shot a probability of being a goal, which is the expected goals from that shot. A player's xG is simply the sum of those probabilities over every shot that player took.

Expected passes (xP) follows the same idea. xP represents the number of completed passes a player should be expected to make, based on the quantity of passes attempted and the safety of those passes. I use the word safety because high-probability passes are often simple passes between defenders, which arguably don't add much to the goal-scoring potential of a possession.

In this post I'll go over the methodology for making an expected pass model in soccer. Then I'll look at the results when it is applied to the 2018 World Cup.

Data Pre-processing

Using the free data provided by statsbomb.ca, I gathered information about every pass made during the 2018 World Cup. From these passes I took the coordinates of where the pass ended up, the angle at which the pass was made, the distance of the pass, and the outcome, which was either completed or not completed.
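Here's a rough sketch of what this extraction step could look like. It assumes each event is a dict shaped like the StatsBomb open-data event JSON, with a `pass` object holding `end_location`, `angle` and `length`, and an `outcome` key that only appears when the pass was not completed. The function name `pass_features` and the sample events are made up for illustration and are not real match data.

```python
def pass_features(events):
    """Turn raw event dicts into (features, label) rows, keeping passes only."""
    rows = []
    for event in events:
        if event.get("type", {}).get("name") != "Pass":
            continue  # skip shots, carries and every other event type
        p = event["pass"]
        end_x, end_y = p["end_location"][:2]
        # Assumption: a missing "outcome" key means the pass was completed.
        completed = 0 if "outcome" in p else 1
        rows.append(([end_x, end_y, p["angle"], p["length"]], completed))
    return rows


# Hand-written events for illustration only.
sample_events = [
    {"type": {"name": "Pass"},
     "pass": {"end_location": [60.0, 40.0], "angle": 0.0, "length": 10.0}},
    {"type": {"name": "Pass"},
     "pass": {"end_location": [100.0, 20.0], "angle": -0.5, "length": 35.0,
              "outcome": {"name": "Incomplete"}}},
    {"type": {"name": "Shot"}},
]

rows = pass_features(sample_events)
```

Each row pairs the four features (end x, end y, angle, distance) with a 1 for completed or a 0 for not completed.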

Training and Choosing a Model

Since a pass is either completed or not completed, I was dealing with a binary classification problem: predicting whether a pass would be a 1 or a 0. Luckily, scikit-learn can predict the probability of an instance belonging to each class, here completed or not completed. So now that I had my data and knew the problem and how to solve it, I only had to choose which machine learning model to use.

Since there are many classification models, I looked for ones that can predict a class probability (the probability of a pass being completed or not) and can learn relationships between the inputs and the target variable. I narrowed it down to three types of models: logistic regression, gradient boosted trees, or a multilayer perceptron.

I chose a multilayer perceptron (MLP) because I felt it would be more flexible and able to capture relationships a bit better than the other two options; an MLP is a neural network, loosely inspired by how neurons in the brain connect. MLPs can learn non-linear relationships and should be better able to pick up connections between the features (x) and the target variable (y).

From there I trained the MLP using the coordinates, distance and angle as features (x variables) and the outcome as the target (y).
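A minimal sketch of this training step with scikit-learn's `MLPClassifier` is below. The data is synthetic and the completion rule is invented just so there is something to learn. The layer sizes, the `max_iter` value and the scaling step are my own assumptions, not the settings used for the actual model.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the pass data: end x, end y, angle, distance.
n = 500
X = np.column_stack([
    rng.uniform(0, 120, n),         # end x
    rng.uniform(0, 80, n),          # end y
    rng.uniform(-np.pi, np.pi, n),  # angle in radians
    rng.uniform(1, 60, n),          # distance
])
# Invented rule: longer passes fail more often (1 = completed, 0 = not).
y = (rng.uniform(0, 1, n) > X[:, 3] / 80).astype(int)

# Scaling the inputs first usually helps an MLP converge.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=500, random_state=0),
)
model.fit(X, y)

# Column 1 of predict_proba is the probability of class 1 (completed).
xp_per_pass = model.predict_proba(X)[:, 1]
```

The `predict_proba` call is what turns the classifier into an xP model: instead of a hard 1 or 0, each pass gets a completion probability.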

Applying Model to Data

After training, I went through every pass in the World Cup dataset and assigned it an xP value, which is the probability of completion predicted by the MLP. Normally I would never evaluate a model on the same data it was trained on, but I made do with the small amount of data available.

Drawing Further Insights

Now I had an xP value for every pass made during the competition and the name of the player that attempted it. So I simply added up the number of passes attempted by every player and their corresponding xP. The result is that every player now had a total passes attempted number and an xP number. From there I divided xP by passes attempted to find an expected completion percentage (xP%).

However, this is not enough, because it rates a player who makes one pass with a 90% xP% higher than a player with an 88% xP% over hundreds of passes, and thus doesn't paint a proper picture of passing ability. To do this properly we have to regress xP% to the mean based on the number of passes made.
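The aggregation described here can be sketched in a few lines of plain Python. The player names and probabilities are invented, and the `totals` structure is just one convenient way to hold the sums.

```python
from collections import defaultdict

# (player, xP of that pass) pairs; names and values are invented.
passes = [
    ("Player A", 0.90), ("Player A", 0.80), ("Player A", 0.70),
    ("Player B", 0.95),
]

totals = defaultdict(lambda: {"attempted": 0, "xP": 0.0})
for player, prob in passes:
    totals[player]["attempted"] += 1  # count every attempt
    totals[player]["xP"] += prob      # sum the completion probabilities

# Expected completion percentage: xP divided by passes attempted.
for stats in totals.values():
    stats["xP%"] = stats["xP"] / stats["attempted"]
```

In this toy data Player B tops the xP% ranking on a single pass, which is the small-sample problem the regression step is meant to address.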

Regressing xP%

The method I used to regress players back to the mean is as follows:

1. Find the positional average for xP%
2. Find the positional maximum number of passes attempted
3. Add xP to the average xP% multiplied by (the maximum number attempted plus 100), then divide by the player's passes attempted plus the positional maximum attempted plus 100

(xP + xP%_av * (max_attempted + 100)) / (player_attempted + max_attempted + 100)

4. Assign the new regressed value as the actual xP% value

This solves the issue of small sample sizes for passes attempted and xP%. Players with many passes are judged mostly on what we know about them, while players we don't know a lot about (not many passes) are assumed to be close to average.
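The regression steps above can be written as one small function. The function and parameter names are mine, and the positional average (83%) and maximum (500 passes) in the example are hypothetical, chosen only to show the effect on the one-pass player from earlier.

```python
def regress_xp_pct(xp, attempted, pos_avg_pct, pos_max_attempted, extra=100):
    """Shrink a player's xP% toward the positional average.

    Equivalent to giving every player (pos_max_attempted + extra)
    additional passes completed at the positional average rate.
    """
    prior_weight = pos_max_attempted + extra
    return (xp + pos_avg_pct * prior_weight) / (attempted + prior_weight)


# Hypothetical inputs: positional average 83%, positional maximum 500 passes.
one_pass = regress_xp_pct(xp=0.90, attempted=1,
                          pos_avg_pct=0.83, pos_max_attempted=500)
many_passes = regress_xp_pct(xp=0.88 * 300, attempted=300,
                             pos_avg_pct=0.83, pos_max_attempted=500)
```

With these made-up inputs the one-pass player ends up at about 83.0% and the 300-pass player at about 84.7%, so the ordering from the raw xP% is reversed.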

[Figure: xP% before and after the regression. The x-axis is passes attempted, the y-axis is xP%.]

xP / GP

We now have xP and xP%, but there's a little more we can draw from the data by introducing rate stats. The first is xP/GP, which is quite intuitive as it's the expected passes per game played. Ideally I would use xP/90, which is xP divided by minutes played, multiplied by 90 for the minutes in a normal game. xP/GP is a good representation of xP in a tournament like the World Cup, where raw xP can be deceiving because some players play only 3 games and others 7, so it compares players better than xP can.
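Both rate stats are one-line calculations. In the sketch below the function names are mine; the xP/GP example uses the xPasses and games values listed for Sergio Ramos in the results of this post, while the xP/90 example uses placeholder numbers because minutes played are not part of the data used here.

```python
def xp_per_game(xp, games_played):
    """Expected passes per game played."""
    return xp / games_played


def xp_per_90(xp, minutes_played):
    """Expected passes per 90 minutes played."""
    return xp / minutes_played * 90


# 445.76 xPasses over 4 games, as listed in the results of this post.
ramos_xp_gp = xp_per_game(445.76, 4)

# Placeholder values: 200 xP over 360 minutes, not taken from the data.
example_xp_90 = xp_per_90(200.0, 360)
```

The first call gives 111.44, which matches the rounded xP/GP of 111 shown in the results.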

xP% above position

The last thing I did with the xP data was to find how much higher or lower a player was compared to everyone else in their position. This was simply the player's xP% minus the positional average xP%. I did this because some positions, by their nature, are predisposed to have a high xP% because they don't take many high-risk passes (defenders specifically).
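As a small sketch of that subtraction, the snippet below computes a positional average and each player's difference from it. The players and percentages are invented, and it assumes the positional average is a plain unweighted mean across players, which may differ from how the average was actually computed.

```python
from collections import defaultdict

# (player, position, xP%) rows; values are invented for illustration.
players = [
    ("Back 1", "Back", 0.90),
    ("Back 2", "Back", 0.84),
    ("Mid 1", "Midfield", 0.86),
    ("Mid 2", "Midfield", 0.82),
]

# Assumption: positional average is an unweighted mean across players.
by_position = defaultdict(list)
for _, position, pct in players:
    by_position[position].append(pct)
position_avg = {pos: sum(v) / len(v) for pos, v in by_position.items()}

# xP% above position: the player's xP% minus the positional average.
above_position = {
    name: pct - position_avg[position] for name, position, pct in players
}
```

A positive value means the player's expected completion percentage is higher than the average for their position, and a negative value means it is lower.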

Results

Here are the top 5 players ranked by xP/GP:

Games	Name	Orig_pass%	Position	Team	num_passes	xCompletion%	xP% above position	xP/GP	xPasses
4	Sergio Ramos García	0.906019	Back	Spain	496	89.87	7.2	111	445.76
4	Isco	0.851788	Midfield	Spain	475	84.99	-0.46	101	403.7
3	Toni Kroos	0.848236	Midfield	Germany	323	84.44	-0.58	91	272.74
4	Jordi Alba Ramos	0.860943	Back	Spain	414	85.54	2.21	89	354.14
1	Djibril Sidibé	0.882352	Back	France	99	84.68	0.75	84	83.83

Top 5 players ranked by xP% above position:

Games	Name	Orig_pass%	Position	Team	num_passes	xCompletion%	xP% above position	xP/GP	xPasses
7	John Stones	0.913549	Back	England	479	90.31	7.67	62	432.58
4	Sergio Ramos García	0.906019	Back	Spain	496	89.87	7.2	111	445.76
4	Gerard Piqué Bernabéu	0.910196	Back	Spain	354	88.43	5.36	78	313.04
5	Philippe Coutinho Correia	0.875754	Wing	Brazil	325	86.05	5.2	56	279.66
5	Vincent Kompany	0.926376	Back	Belgium	277	88.25	5.04	49	244.45

Top 5 players ranked by xP%:

Games	Name	Orig_pass%	Position	Team	num_passes	xCompletion%	xP% above position	xP/GP	xPasses
7	John Stones	0.913549	Back	England	479	90.31	7.67	62	432.58
4	Sergio Ramos García	0.906019	Back	Spain	496	89.87	7.2	111	445.76
6	Axel Witsel	0.926542	Midfield	Belgium	325	89.01	4.25	48	289.28
4	Gerard Piqué Bernabéu	0.910196	Back	Spain	354	88.43	5.36	78	313.04
4	Javier Alejandro Mascherano	0.908707	Midfield	Argentina	348	88.26	3.38	77	307.14

Summary

To recap, I trained a model to predict the probability of a pass being completed. I then evaluated every pass in the 2018 World Cup and gave it a completion probability. From there I added up the probabilities of each player's passes to get an xP value. Next, I divided the xP by the total passes attempted by the player to find an expected completion percentage and then regressed it to the mean. I also found how much higher or lower each player's expected completion percentage is compared to the average for their position, as well as their xP per game.

Issues

Obviously there are issues with evaluating passing ability this way, because it assumes all passes have the same potential goal value, which we know is not true. For that reason, players who are more risk-averse will be rewarded even if riskier players add more value to their team.

I'll explore a way to measure pass value in another blog post.
