Building reproducible expected‑goals models from public football event data: Logistic and mixed-effects analysis using StatsBomb open data
Abstract
A reproducible expected-goals (xG) modelling pipeline is developed and evaluated using large public football event datasets. The analysis uses 10,709 non-penalty shots from La Liga 2015/2016 and the 2018 FIFA World Cup in StatsBomb Open Data, with predictors derived from event locations (shot distance and angle), body part (head vs. foot), and competition indicators. Logistic regression estimates goal probability from these predictors, and a generalized linear mixed-effects model adds shooter-level random intercepts to capture between-player variability. Model performance is assessed using information criteria and the area under the ROC curve (AUC). Scoring probability falls sharply with distance, headers are less likely to result in goals than shots taken with the foot, and World Cup shots have lower baseline conversion than La Liga attempts from comparable locations. AUC increases from 0.75 in the baseline model to 0.78 in the fixed-effects model and 0.79 in the mixed-effects model. These results indicate that open event data can support transparent, statistically defensible, and practically useful xG pipelines for research and teaching.
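The predictors and the evaluation metric named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes the StatsBomb pitch convention of a 120 x 80 coordinate grid with the attacking goal centred at (120, 40) and posts at y = 36 and y = 44. The function names, the `shooter_intercept` argument (standing in for a shooter-level random intercept), and the coefficient values are hypothetical and chosen only to match the signs reported in the abstract; the positive angle coefficient is an additional assumption.

```python
import math

# Assumed StatsBomb pitch geometry: 120 x 80 units, attacking goal centred at
# (120, 40), with the posts at y = 36 and y = 44.
GOAL_X = 120.0
GOAL_CENTRE_Y = 40.0
POST_LOW_Y = 36.0
POST_HIGH_Y = 44.0


def shot_distance(x, y):
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)


def shot_angle(x, y):
    """Angle in radians subtended by the goal mouth at the shot location."""
    to_low = math.atan2(POST_LOW_Y - y, GOAL_X - x)
    to_high = math.atan2(POST_HIGH_Y - y, GOAL_X - x)
    return abs(to_high - to_low)


def xg_logistic(x, y, is_header, is_world_cup, coef, shooter_intercept=0.0):
    """Goal probability from a logistic model on the abstract's predictors.

    shooter_intercept is a hypothetical stand-in for a shooter-level random
    intercept; leaving it at 0.0 gives the fixed-effects prediction.
    """
    eta = (
        coef["intercept"]
        + coef["distance"] * shot_distance(x, y)
        + coef["angle"] * shot_angle(x, y)
        + coef["header"] * is_header
        + coef["world_cup"] * is_world_cup
        + shooter_intercept
    )
    return 1.0 / (1.0 + math.exp(-eta))


def auc(labels, scores):
    """AUC as the probability that a goal outscores a non-goal (ties count half)."""
    goals = [s for label, s in zip(labels, scores) if label == 1]
    misses = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((g > m) + 0.5 * (g == m) for g in goals for m in misses)
    return wins / (len(goals) * len(misses))


# Hypothetical coefficients for illustration only, NOT fitted values from the paper.
ILLUSTRATIVE_COEF = {
    "intercept": -1.0,
    "distance": -0.1,
    "angle": 1.0,
    "header": -0.8,
    "world_cup": -0.2,
}
```

For example, `xg_logistic(108, 40, 0, 0, ILLUSTRATIVE_COEF)` scores a footed La Liga shot taken 12 units from the goal centre. In a real analysis the coefficients and shooter intercepts would be estimated from the shot data rather than set by hand, and the pairwise `auc` above, which is quadratic in the number of shots, would normally be replaced by a library routine.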