Bayes is coming home - Predicting the Euro 2024 with Stan

Author

Oliver Dürr

The Euro 2024 ⚽ is a nice showcase of Bayesian Statistics. In Bayesian statistics, probabilities are seen as a degree of belief, which fits well with the nature of football. Almost everyone has beliefs about the strengths and weaknesses of the teams before seeing any games (based on historical data) and then updates these beliefs as new data comes in (games have been played).

Prediction based on the Historical Data

We load the matches prior to the Euro 2024, see https://github.com/oduerr/da/blob/master/stan/Euro24/Euro_Data.md how to get the data. Note the limitations of using historic; for example, Germany only started finding their form shortly before Euro 2024.

Loading the data

Loaded 116 games with 24 teams.

Expand To Learn How to set up the data for Stan

Preparing the data for Stan

Code

#Some R-Magic to convert the team names to numbers, no need to understand this
ng = nrow(data)
teams = unique(data$Home)
nt = length(teams)
ht = unlist(sapply(1:ng, function(g) which(teams == data$Home[g])))
at = unlist(sapply(1:ng, function(g) which(teams == data$Away[g])))

np=1 #Number games leaving out for prediction
ngob = ng-np #ngames obsered ngob = number of games to fit
#print(paste0("Using the first ", ngob, " games to fit the model and ", np, " games to predict.", "Num teams ",  length(teams)))
my_data = list(
  nt = nt, 
  ng = ngob,
  ht = ht[1:ngob], 
  at = at[1:ngob], 
  s1 = data$score1[1:ngob],
  s2 = data$score2[1:ngob],
  np = np,
  htnew = ht[(ngob+1):ng],
  atnew = at[(ngob+1):ng],
  s1new = data$score1[(ngob+1):ng],
  s2new = data$score2[(ngob+1):ng]
)

Using the first 115 games to fit the model and 1 games to predict. In total we have 24 teams.

A Model for the goals scored 🥅

We will assume that the number of goals scored by the home team \(s_1\) and the away team \(s_2\) follows a Poisson distribution. This has been shown to be a good model for the number of goals scored in a football match. We model the rate parameter \(\theta\) of the Poisson distribution, related to the attack and defense strengths of the teams, as follows:

\[ s_1 \sim \text{Pois}(\theta_1) \quad\text{goals scored by the home team} \]

\[ s_2 \sim \text{Pois}(\theta_2) \quad\text{goals scored by the away team} \]

This is equivalent to performing two separate Poisson regressions, one for each team.

We assume that:

\[ \theta_1 = \exp(\text{home} + \text{att}_\text{ht} - \text{def}_\text{at}) \]

\[ \theta_2 = \exp(\text{att}_\text{at} - \text{def}_\text{ht}) \]

Since there is no home advantage in the Euro (except for Germany), we set \(\text{home} = 0\).

Prior for the attack and defence strength

In Bayesian statistics, we further need to specify a prior for the parameters (our degree of believe in the attack and defense abilities before seeing any data). For that we use a hierarchical model with correlated parameters. Other models are investigated at comp_premier_league for the English Premier League 2019/2020 season and for the German Bundesliga 2000 and 2024 where the hierarchical model have been especially successful. The model is adopted from the blog_post and the paper. We extend the model to include a correlation between the attack and defense strength of the teams, since it is quite reasonable that a team that scores many goals (is above average in offense) is also good in defense.

Conditioning on the data / Fitting the model

After we state the model we fit the model to the data, or in Bayesian parlance, we update our degree of belief after seeing the data.

Details of the model MCMC sampling with Stan

The model is written in the probabilistic programming language Stan and can be found at https://github.com/oduerr/da/blob/master/website/Euro24/hier_model_cor.stan. This model used a Cholesky decomposition to model the correlation between the attack and defense strength of the teams. While this produces very effective sampling and is numerically stable, the Cholesky decomposition adds another layer of complexity to the model. We also provide a model without the Cholesky decomposition at https://github.com/oduerr/da/blob/master/website/Euro24/hier_model_cor_nocholesky.stan which is easier to understand but which is, besides the numerical difficulties, equivalent to the model with the Cholesky decomposition.

Code

library(cmdstanr)
options(mc.cores = parallel::detectCores())
hmodel <- cmdstan_model('~/Documents/GitHub/da/website/Euro24/hier_model_cor.stan')
#hmodel <- cmdstan_model('~/Documents/GitHub/da/website/Euro24/hier_model_cor_nocholsky.stan')
hfit = hmodel$sample(data = my_data)
## Running MCMC with 4 chains, at most 10 in parallel...
## 
## Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 2 Iteration:  100 / 2000 [  5%]  (Warmup) 
## Chain 2 Iteration:  200 / 2000 [ 10%]  (Warmup) 
## Chain 2 Iteration:  300 / 2000 [ 15%]  (Warmup) 
## Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 3 Iteration:  100 / 2000 [  5%]  (Warmup) 
## Chain 3 Iteration:  200 / 2000 [ 10%]  (Warmup) 
## Chain 3 Iteration:  300 / 2000 [ 15%]  (Warmup) 
## Chain 3 Iteration:  400 / 2000 [ 20%]  (Warmup) 
## Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 4 Iteration:  100 / 2000 [  5%]  (Warmup) 
## Chain 4 Iteration:  200 / 2000 [ 10%]  (Warmup) 
## Chain 4 Iteration:  300 / 2000 [ 15%]  (Warmup) 
## Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
## Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
## Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
## Chain 2 Iteration:  400 / 2000 [ 20%]  (Warmup) 
## Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
## Chain 2 Iteration:  600 / 2000 [ 30%]  (Warmup) 
## Chain 2 Iteration:  700 / 2000 [ 35%]  (Warmup) 
## Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
## Chain 3 Iteration:  600 / 2000 [ 30%]  (Warmup) 
## Chain 3 Iteration:  700 / 2000 [ 35%]  (Warmup) 
## Chain 4 Iteration:  400 / 2000 [ 20%]  (Warmup) 
## Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
## Chain 4 Iteration:  600 / 2000 [ 30%]  (Warmup) 
## Chain 4 Iteration:  700 / 2000 [ 35%]  (Warmup) 
## Chain 4 Iteration:  800 / 2000 [ 40%]  (Warmup) 
## Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
## Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
## Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
## Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
## Chain 2 Iteration:  800 / 2000 [ 40%]  (Warmup) 
## Chain 2 Iteration:  900 / 2000 [ 45%]  (Warmup) 
## Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 3 Iteration:  800 / 2000 [ 40%]  (Warmup) 
## Chain 3 Iteration:  900 / 2000 [ 45%]  (Warmup) 
## Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
## Chain 4 Iteration:  900 / 2000 [ 45%]  (Warmup) 
## Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 4 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
## Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
## Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
## Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
## Chain 2 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
## Chain 4 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
## Chain 4 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
## Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
## Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
## Chain 2 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
## Chain 2 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
## Chain 3 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
## Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
## Chain 4 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
## Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
## Chain 4 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
## Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
## Chain 2 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
## Chain 4 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
## Chain 4 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
## Chain 4 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
## Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
## Chain 2 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
## Chain 2 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
## Chain 2 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
## Chain 3 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
## Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 4 finished in 0.7 seconds.
## Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
## Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
## Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 2 finished in 0.8 seconds.
## Chain 3 finished in 0.8 seconds.
## Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 1 finished in 0.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.8 seconds.
## Total execution time: 1.1 seconds.
p1 = bayesplot::mcmc_rhat_hist(bayesplot::rhat(hfit))
p2 = bayesplot::mcmc_neff_hist(bayesplot::neff_ratio(hfit))
ggpubr::ggarrange(p1, p2, ncol=2)

The fitting of the model is good, as the Rhat values are close to 1 and we have no divergent transitions. The effective sample size is also good.

The fitted model

We plot the means of the attack and defense strengths of the teams. Shown are the mean values along with the 25% and 75% quantiles. There is considerable uncertainty in the strengths of the teams, but that’s the nature of the game.

Code

library(tidyverse)
library(tidybayes)

# Step 1: Gather draws and calculate summary statistics with credible intervals
d = hfit %>%
  tidybayes::gather_draws(A[i, j]) %>%
  group_by(i, j) %>%
  summarise(
    average_value = mean(.value),
    lower = quantile(.value, 0.25),  # Lower bound 
    upper = quantile(.value, 0.75),  # Upper bound 
    .groups = "drop"
  )

# Step 2: Create a matrix of the average values
A = xtabs(average_value ~ i + j, data = d)

# Step 3: Plot the average values
plot(A[1,], A[2,], pch=20, xlab='Attack', ylab='Defence', main='Attack vs Defence')

# Step 4: Add team labels
text(A[1,], A[2,], labels=teams, cex=0.7, adj=c(-0.05, -0.8))

# Step 5: Add error bars for 66% credibility intervals
# Reshape data for plotting
d_wide <- d %>% spread(key = j, value = average_value)
d_lower <- d %>% spread(key = j, value = lower)
d_upper <- d %>% spread(key = j, value = upper)

# Convert to matrices for easier plotting
A_lower <- xtabs(lower ~ i + j, data = d)
A_upper <- xtabs(upper ~ i + j, data = d)

# Plot vertical error bars
arrows(A[1,], A_lower[2,], A[1,], A_upper[2,], angle=90, code=3, length=0.05, col="lightblue", alpha=0.5)

# Plot horizontal error bars
arrows(A_lower[1,], A[2,], A_upper[1,], A[2,], angle=90, code=3, length=0.05, col="lightblue")

Detailed explanations for a single match Germany vs Scotland

The opening game of Euro 2024 was Germany vs. Scotland. In the plots below, we show the posterior probabilities for the attack and defense strengths of the teams.

Code

library(dplyr)
library(tidyr)
library(ggplot2)
library(tidybayes)

# Function to calculate probabilities for a pairing
id1 <- which(teams == 'Germany')
id2 <- which(teams == 'Scotland')

# Extract posterior distributions for Attack and Defense
attack_germany <- hfit %>% tidybayes::spread_draws(A[i, j]) %>% filter(j == id1, i == 1) %>% select(A)
attack_scotland <- hfit %>% tidybayes::spread_draws(A[i, j]) %>% filter(j == id2, i == 1) %>% select(A)
defense_germany <- hfit %>% tidybayes::spread_draws(A[i, j]) %>% filter(j == id1, i == 2) %>% select(A)
defense_scotland <- hfit %>% tidybayes::spread_draws(A[i, j]) %>% filter(j == id2, i == 2) %>% select(A)

# Combine data into a tidy data frame
tidy_df <- bind_rows(
  attack_germany %>% mutate(Statistic = "Attack", Country = "Germany"),
  attack_scotland %>% mutate(Statistic = "Attack", Country = "Scotland"),
  defense_germany %>% mutate(Statistic = "Defense", Country = "Germany"),
  defense_scotland %>% mutate(Statistic = "Defense", Country = "Scotland")
)

# Include the mean values in the plot
ggplot(tidy_df, aes(x = A, fill = Country)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ Statistic, scales = "free") +
  labs(title = "Posterior Distributions of Attack and Defense Strengths", x = "Strength", y = "Density") +
  geom_vline(data = tidy_df %>% group_by(Statistic, Country) %>% summarise(mean_A = mean(A)), 
             aes(xintercept = mean_A, color = Country), linetype = "dashed") +
  theme_minimal()

Making predictions

We can now make predictions for the game Germany vs Scotland. Below are the first samples of the posterior distribution for the attack and defense strengths of germany and scotland.

Code

# Extract first five samples for demonstration
df = data.frame(attack_germany = attack_germany$A, defense_germany = defense_germany$A, attack_scotland = attack_scotland$A, defense_scotland = defense_scotland$A) %>% head() 
knitr::kable(df)

attack_germany	defense_germany	attack_scotland	defense_scotland
-0.0442763	-0.0242716	0.1630760	-0.1346680
0.3273970	-0.0198311	-0.0108560	0.0377780
0.2607110	-0.1016880	-0.2155270	-0.4300470
-0.0570626	-0.0008690	0.0580947	0.0156115
0.0885649	0.0009105	0.1246280	0.0193857
0.4273620	-0.0032764	0.1563150	0.0513863

We use the samples for posterior row by row to sample the number of goals for Germany and Scotland. We can use the samples to calculate the probability of a win, draw or loss for Germany.

Probabilities for win/draw/loss

Code

 set.seed(42)
 theta_germany = exp(attack_germany$A - defense_scotland$A)
 theta_scotland = exp(attack_scotland$A - defense_germany$A)
 g_germany = rpois(length(theta_germany), theta_germany)
 g_scotland = rpois(length(theta_scotland), theta_scotland)
 
 
 
 # Alternative way to calculate the probabilities
 calc_prob <- function(observed, theta) {
     mean(dpois(observed, theta))
 }
 
 plot(table(g_germany)/length(g_germany), main='Germany Goals', xlab='Goals', ylab='Probability')
 prob_goals_germany = apply(matrix(0:10, ncol=1), 1, function(x) calc_prob(x, theta_germany))
 points(0:5, prob_goals_germany[1:6])

Code

 prob_goals_scotland = apply(matrix(0:10, ncol=1), 1, function(x) calc_prob(x, theta_scotland))
 plot(table(g_scotland)/length(g_scotland), main='Scotland Goals', xlab='Goals', ylab='Probability')
 points(0:5, prob_goals_scotland[1:6])

Code

 prob_goals = outer(prob_goals_germany, prob_goals_scotland, '*')
 sum(prob_goals) #Should be very close to 1

[1] 0.9999955

Code

 print(paste0('Germany wins (simu) ',
              mean(g_germany > g_scotland), #Probability of Germany winning
              ' probs',
              round(sum(prob_goals[lower.tri(prob_goals, diag = FALSE)]),4)))

[1] "Germany wins (simu) 0.4675 probs0.4613"

Code

 mean(g_germany < g_scotland) #Probability of Scotland winning

[1] 0.28125

Code

 mean(g_germany == g_scotland) #Probability of a draw

[1] 0.25125

Code

 print(paste0('Draw (simu) ',
              mean(g_germany == g_scotland), #Probability of a draw
              ' probs',
              sum(diag(prob_goals))))

[1] "Draw (simu) 0.25125 probs0.259600606460745"

Code

 print(paste0('Scotland wins (simu) ',
              mean(g_germany < g_scotland), #Probability of Germany winning
              ' probs',
              round(sum(prob_goals[upper.tri(prob_goals, diag = FALSE)]),4)))

[1] "Scotland wins (simu) 0.28125 probs0.2791"

Another way to look at is is at the joint distribution of the goals scored

Code

library(tidyverse)
library(ggplot2)

# Define the function
plot_goal_probabilities <- function(attack_team1, defense_team1, attack_team2, defense_team2, team1_name = "Team 1", team2_name = "Team 2") {
  set.seed(42)
  
  # Simulate goals scored using Poisson distribution
  theta_team1 <- exp(attack_team1 - defense_team2)
  theta_team2 <- exp(attack_team2 - defense_team1)
  #g_team1 <- rpois(length(theta_team1), theta_team1)
  #g_team2 <- rpois(length(theta_team2), theta_team2)
  
  # Calculate joint probabilities
  #joint_prob <- table(g_team1, g_team2) / length(g_team1)
  
  prob_g1= apply(matrix(0:10, ncol=1), 1, function(x) calc_prob(x, theta_team1))
  prob_g2= apply(matrix(0:10, ncol=1), 1, function(x) calc_prob(x, theta_team2))
  
  df_joint = outer(prob_g1, prob_g2, '*') %>% as.matrix
  df_joint <- reshape2::melt(df_joint, varnames = c("g_team1", "g_team2"), value.name = "Freq")
  #df_joint <- as.data.frame(as.table(joint_prob))
  colnames(df_joint) <- c("Goals_Team1", "Goals_Team2", "Probability")
  df_joint$Goals_Team1 = df_joint$Goals_Team1 - 1
  df_joint$Goals_Team2 = df_joint$Goals_Team2 - 1
  
  # Ensure all combinations from 0 to 5 are included
  #all_combinations <- expand.grid(Goals_Team1 = 0:5, Goals_Team2 = 0:5)
  #df_joint <- merge(all_combinations, df_joint, by = c("Goals_Team1", "Goals_Team2"), all.x = TRUE)
  #df_joint$Probability[is.na(df_joint$Probability)] <- 0
  
  # Calculate outcomes
  df_joint <- df_joint %>%
    mutate(
      Outcome = case_when(
        Goals_Team1 > Goals_Team2 ~ "Win1",
        Goals_Team1 < Goals_Team2 ~ "Win2",
        TRUE ~ "Draw"
      )
    )
  
  # Calculate probabilities
  prob_team1_win <- sum(df_joint$Probability[df_joint$Outcome == "Win1"])
  prob_team2_win <- sum(df_joint$Probability[df_joint$Outcome == "Win2"])
  prob_draw <- sum(df_joint$Probability[df_joint$Outcome == "Draw"])
  
  # Print probabilities to the console
  # cat("Probability of", team1_name, "winning: ", prob_team1_win, "\n")
  # cat("Probability of", team2_name, "winning: ", prob_team2_win, "\n")
  # cat("Probability of a draw: ", prob_draw, "\n")
  
  # Plot the joint probabilities with labels and different colors for outcomes
  d = df_joint %>% filter(df_joint$Goals_Team1 <= 5)  
  joint_plot <- ggplot(d, aes(x = Goals_Team1, y = Goals_Team2, fill = Outcome)) +
    geom_tile(color = "white", aes(alpha = Probability)) +
    geom_text(aes(label = sprintf("%.2f", Probability)), color = "black", size = 3) +
    scale_fill_manual(values = c("Win1" = "blue", "Win2" = "red", "Draw" = "green"), guide = NULL) +
    scale_alpha(range = c(0.3, 1), guide = NULL) +
    labs(title = paste0(team1_name, " vs. ", team2_name),  x = paste("Goals by", team1_name), y = paste("Goals by", team2_name)) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
      axis.title = element_text(size = 12),
      axis.text = element_text(size = 10)
    ) +
    scale_x_continuous(limits = c(-0.5, 9), breaks = 0:5) +
    scale_y_continuous(limits = c(-0.5, 5.5), breaks = 0:5) +
    annotate("text", x = 5.7, y = 4, label = sprintf("%s Wins: %.2f", team1_name, prob_team1_win), color = "blue", size = 3, hjust = 0) +
    annotate("text", x = 5.7, y = 3, label = sprintf("Draw: %.2f", prob_draw), color = "green", size = 3, hjust = 0) +
    annotate("text", x = 5.7, y = 2, label = sprintf("%s Wins: %.2f", team2_name, prob_team2_win), color = "red", size = 3, hjust = 0)
  
  # Print the joint plot
  return(joint_plot)
}

# Call the function
plot_goal_probabilities(attack_team1=attack_germany$A, defense_team1 = defense_germany$A, 
                        attack_team2=attack_scotland$A, defense_team2 = defense_scotland$A, team1_name="Germany", team2_name="Scotland")

Remember the result? It was 5:1 for Germany, so quite unexpected by the model. So that these predictions with a grain of salt. The model is based on historical data and does not take into account the current form of the teams.

Predictions

Caution

Note that these predictions are based on historical data and might do not take into account the current form of the teams.

Note that there are some wrong dates in the list.

Code

library(tidyverse)
library(tidybayes)
library(ggpubr)
library(knitr)

set.seed(42)


# Function to calculate probabilities for a pairing
calculate_probabilities <- function(team1, team2, teams, hfit) {
  cat(sprintf('## %s vs %s\n', team1, team2))
  id1 <- which(teams == team1)
  id2 <- which(teams == team2)
  
  # Spread draws and filter the relevant data
  As <- hfit %>% tidybayes::spread_draws(A[i, j]) %>% select(i, j, A)
  att_1 <- As %>% filter(j == id1, i == 1)
  att_2 <- As %>% filter(j == id2, i == 1)
  def_1 <- As %>% filter(j == id1, i == 2)
  def_2 <- As %>% filter(j == id2, i == 2)
  
  theta_1 <- exp(att_1$A - def_2$A)
  theta_2 <- exp(att_2$A - def_1$A)
  
  #g_1 <- rpois(length(theta_1), theta_1)
  #g_2 <- rpois(length(theta_2), theta_2)
  
  # Calculate probabilities
  #prob_win <- mean(g_1 > g_2)
  #prob_draw <- mean(g_1 == g_2)
  #prob_lose <- mean(g_1 < g_2)
  
  prob_g1= apply(matrix(0:10, ncol=1), 1, function(x) calc_prob(x, theta_1))
  prob_g2= apply(matrix(0:10, ncol=1), 1, function(x) calc_prob(x, theta_2))
  prob_goals = outer(prob_g1, prob_g2, '*') %>% as.matrix
  prob_draw = sum(diag(prob_goals))
  prob_win = sum(prob_goals[lower.tri(prob_goals, diag = FALSE)])
  prob_lose = sum(prob_goals[upper.tri(prob_goals, diag = FALSE)])
  
  
  # Return the probabilities and goals
  list(
    prob_win = round(prob_win,2),
    prob_draw = round(prob_draw,2),
    prob_lose = round(prob_lose,2)
  )
}

### Group stage matches
# Quite a pain to get the data in the right format from a lying ChatGPT
library(data.table)
# Read the data from the CSV
csv_data <- "MatchNumber, Date, Team1, Team2, KickoffTime
1, 14.06, Germany, Scotland, 21:00
2, 15.06, Hungary, Switzerland, 15:00
3, 15.06, Spain, Croatia, 18:00
4, 15.06, Italy, Albania, 21:00
5, 16.06, Poland, Netherlands, 15:00
6, 16.06, Slovenia, Denmark, 18:00
7, 16.06, Serbia, England, 21:00
8, 17.06, Romania, Ukraine, 15:00
9, 17.06, Belgium, Slovakia, 18:00
10, 17.06, Austria, France, 21:00
11, 18.06, Türkiye, Georgia, 15:00
12, 18.06, Portugal, Czechia, 18:00
13, 19.06, Croatia, Albania, 15:00
14, 19.06, Germany, Hungary, 18:00
15, 19.06, Scotland, Switzerland, 21:00
16, 19.06, Spain, Italy, 21:00
17, 20.06, Slovenia, Serbia, 15:00
18, 20.06, Denmark, England, 18:00
19, 20.06, Poland, Austria, 15:00
20, 20.06, Netherlands, France, 21:00
21, 21.06, Slovakia, Ukraine, 15:00
22, 21.06, Belgium, Romania, 18:00
23, 21.06, Türkiye, Portugal, 21:00
24, 21.06, Georgia, Czechia, 18:00
25, 23.06, Switzerland, Germany, 21:00
26, 23.06, Scotland, Hungary, 21:00
27, 24.06, Croatia, Italy, 21:00
28, 24.06, Albania, Spain, 21:00
29, 25.06, Netherlands, Austria, 18:00
30, 25.06, France, Poland, 18:00
31, 25.06, England, Slovenia, 21:00
32, 25.06, Denmark, Serbia, 21:00
33, 26.06, Slovakia, Romania, 18:00
34, 26.06, Ukraine, Belgium, 18:00
35, 26.06, Czechia, Türkiye, 21:00
36, 26.06, Georgia, Portugal, 21:00"
# Convert CSV data to data table
matches_raw <- fread(text = csv_data)

matches = data.frame(Date=as.Date(paste0('2024-06-', matches_raw$Date)), Time=paste0(matches_raw$KickoffTime, ' CEST'), HomeTeam=matches_raw$Team1, AwayTeam=matches_raw$Team2)

results = data.frame(num=NULL,Date=NULL, Time=NULL, HomeTeam=NULL, AwayTeam=NULL, Win=NULL, Draw=NULL, Lose=NULL)
for (i in 1:nrow(matches)) {
  # i=1
  #cat(sprintf('## %s vs %s\n', matches[i,3], matches[i,4]))
  probs <- calculate_probabilities(matches[i,3,drop=TRUE], matches[i,4,drop=TRUE], teams, hfit)
  cat(sprintf('Win: %.2f, Draw: %.2f, Lose: %.2f\n', probs$prob_win, probs$prob_draw, probs$prob_lose))
  results = rbind(results, data.frame(
    Number = i,
    Date=matches[i,1], 
    Time=matches[i,2], 
    HomeTeam=matches[i,3], 
    AwayTeam=matches[i,4], 
    Win=probs$prob_win, Draw=probs$prob_draw, Lose=probs$prob_lose))
}
## ## Germany vs Scotland
## Win: 0.46, Draw: 0.26, Lose: 0.28
## ## Hungary vs Switzerland
## Win: 0.37, Draw: 0.27, Lose: 0.36
## ## Spain vs Croatia
## Win: 0.52, Draw: 0.23, Lose: 0.25
## ## Italy vs Albania
## Win: 0.42, Draw: 0.28, Lose: 0.30
## ## Poland vs Netherlands
## Win: 0.31, Draw: 0.26, Lose: 0.43
## ## Slovenia vs Denmark
## Win: 0.37, Draw: 0.27, Lose: 0.36
## ## Serbia vs England
## Win: 0.18, Draw: 0.24, Lose: 0.58
## ## Romania vs Ukraine
## Win: 0.44, Draw: 0.28, Lose: 0.28
## ## Belgium vs Slovakia
## Win: 0.51, Draw: 0.26, Lose: 0.23
## ## Austria vs France
## Win: 0.35, Draw: 0.24, Lose: 0.41
## ## Türkiye vs Georgia
## Win: 0.41, Draw: 0.26, Lose: 0.33
## ## Portugal vs Czechia
## Win: 0.44, Draw: 0.25, Lose: 0.31
## ## Croatia vs Albania
## Win: 0.37, Draw: 0.29, Lose: 0.35
## ## Germany vs Hungary
## Win: 0.40, Draw: 0.27, Lose: 0.33
## ## Scotland vs Switzerland
## Win: 0.32, Draw: 0.26, Lose: 0.42
## ## Spain vs Italy
## Win: 0.46, Draw: 0.23, Lose: 0.30
## ## Slovenia vs Serbia
## Win: 0.46, Draw: 0.27, Lose: 0.26
## ## Denmark vs England
## Win: 0.27, Draw: 0.25, Lose: 0.48
## ## Poland vs Austria
## Win: 0.27, Draw: 0.26, Lose: 0.47
## ## Netherlands vs France
## Win: 0.31, Draw: 0.23, Lose: 0.46
## ## Slovakia vs Ukraine
## Win: 0.36, Draw: 0.28, Lose: 0.36
## ## Belgium vs Romania
## Win: 0.43, Draw: 0.27, Lose: 0.30
## ## Türkiye vs Portugal
## Win: 0.24, Draw: 0.23, Lose: 0.53
## ## Georgia vs Czechia
## Win: 0.26, Draw: 0.27, Lose: 0.47
## ## Switzerland vs Germany
## Win: 0.33, Draw: 0.25, Lose: 0.42
## ## Scotland vs Hungary
## Win: 0.31, Draw: 0.28, Lose: 0.41
## ## Croatia vs Italy
## Win: 0.31, Draw: 0.27, Lose: 0.42
## ## Albania vs Spain
## Win: 0.24, Draw: 0.24, Lose: 0.52
## ## Netherlands vs Austria
## Win: 0.33, Draw: 0.24, Lose: 0.43
## ## France vs Poland
## Win: 0.51, Draw: 0.24, Lose: 0.25
## ## England vs Slovenia
## Win: 0.47, Draw: 0.26, Lose: 0.27
## ## Denmark vs Serbia
## Win: 0.47, Draw: 0.26, Lose: 0.27
## ## Slovakia vs Romania
## Win: 0.28, Draw: 0.28, Lose: 0.44
## ## Ukraine vs Belgium
## Win: 0.23, Draw: 0.26, Lose: 0.51
## ## Czechia vs Türkiye
## Win: 0.44, Draw: 0.27, Lose: 0.29
## ## Georgia vs Portugal
## Win: 0.21, Draw: 0.22, Lose: 0.57

# Print the summary table with links
kable(results, caption = "Probabilities of Match Outcomes", escape = FALSE)

Probabilities of Match Outcomes
Number	Date	Time	HomeTeam	AwayTeam	Win	Draw	Lose
1	2024-06-14	21:00 CEST	Germany	Scotland	0.46	0.26	0.28
2	2024-06-15	15:00 CEST	Hungary	Switzerland	0.37	0.27	0.36
3	2024-06-15	18:00 CEST	Spain	Croatia	0.52	0.23	0.25
4	2024-06-15	21:00 CEST	Italy	Albania	0.42	0.28	0.30
5	2024-06-16	15:00 CEST	Poland	Netherlands	0.31	0.26	0.43
6	2024-06-16	18:00 CEST	Slovenia	Denmark	0.37	0.27	0.36
7	2024-06-16	21:00 CEST	Serbia	England	0.18	0.24	0.58
8	2024-06-17	15:00 CEST	Romania	Ukraine	0.44	0.28	0.28
9	2024-06-17	18:00 CEST	Belgium	Slovakia	0.51	0.26	0.23
10	2024-06-17	21:00 CEST	Austria	France	0.35	0.24	0.41
11	2024-06-18	15:00 CEST	Türkiye	Georgia	0.41	0.26	0.33
12	2024-06-18	18:00 CEST	Portugal	Czechia	0.44	0.25	0.31
13	2024-06-19	15:00 CEST	Croatia	Albania	0.37	0.29	0.35
14	2024-06-19	18:00 CEST	Germany	Hungary	0.40	0.27	0.33
15	2024-06-19	21:00 CEST	Scotland	Switzerland	0.32	0.26	0.42
16	2024-06-19	21:00 CEST	Spain	Italy	0.46	0.23	0.30
17	2024-06-20	15:00 CEST	Slovenia	Serbia	0.46	0.27	0.26
18	2024-06-20	18:00 CEST	Denmark	England	0.27	0.25	0.48
19	2024-06-20	15:00 CEST	Poland	Austria	0.27	0.26	0.47
20	2024-06-20	21:00 CEST	Netherlands	France	0.31	0.23	0.46
21	2024-06-21	15:00 CEST	Slovakia	Ukraine	0.36	0.28	0.36
22	2024-06-21	18:00 CEST	Belgium	Romania	0.43	0.27	0.30
23	2024-06-21	21:00 CEST	Türkiye	Portugal	0.24	0.23	0.53
24	2024-06-21	18:00 CEST	Georgia	Czechia	0.26	0.27	0.47
25	2024-06-23	21:00 CEST	Switzerland	Germany	0.33	0.25	0.42
26	2024-06-23	21:00 CEST	Scotland	Hungary	0.31	0.28	0.41
27	2024-06-24	21:00 CEST	Croatia	Italy	0.31	0.27	0.42
28	2024-06-24	21:00 CEST	Albania	Spain	0.24	0.24	0.52
29	2024-06-25	18:00 CEST	Netherlands	Austria	0.33	0.24	0.43
30	2024-06-25	18:00 CEST	France	Poland	0.51	0.24	0.25
31	2024-06-25	21:00 CEST	England	Slovenia	0.47	0.26	0.27
32	2024-06-25	21:00 CEST	Denmark	Serbia	0.47	0.26	0.27
33	2024-06-26	18:00 CEST	Slovakia	Romania	0.28	0.28	0.44
34	2024-06-26	18:00 CEST	Ukraine	Belgium	0.23	0.26	0.51
35	2024-06-26	21:00 CEST	Czechia	Türkiye	0.44	0.27	0.29
36	2024-06-26	21:00 CEST	Georgia	Portugal	0.21	0.22	0.57

Prediction based on the Historical Data

Loading the data

Preparing the data for Stan

A Model for the goals scored 🥅

Prior for the attack and defence strength

Conditioning on the data / Fitting the model

The fitted model

Making predictions

Probabilities for win/draw/loss

Predictions

Details on individual games

Goal Distributions

1 Germany vs Scotland

2 Hungary vs Switzerland

3 Spain vs Croatia

4 Italy vs Albania

5 Poland vs Netherlands

6 Slovenia vs Denmark

7 Serbia vs England

8 Romania vs Ukraine

9 Belgium vs Slovakia

10 Austria vs France

11 Türkiye vs Georgia

12 Portugal vs Czechia

13 Croatia vs Albania

14 Germany vs Hungary

15 Scotland vs Switzerland

16 Spain vs Italy

17 Slovenia vs Serbia

18 Denmark vs England

19 Poland vs Austria

20 Netherlands vs France

21 Slovakia vs Ukraine

22 Belgium vs Romania

23 Türkiye vs Portugal

24 Georgia vs Czechia

25 Switzerland vs Germany

26 Scotland vs Hungary

27 Croatia vs Italy

28 Albania vs Spain

29 Netherlands vs Austria

30 France vs Poland

31 England vs Slovenia

32 Denmark vs Serbia

33 Slovakia vs Romania

34 Ukraine vs Belgium

35 Czechia vs Türkiye

36 Georgia vs Portugal