Expected Learning Week 4 – The Prerequisites of Expected Goals – Part 2 – Rapid Fire Statistics Concepts

So in the previous prerequisite post, we discussed how the field of statistics serves the purpose of estimating parameters of distributions, probabilities, etc. when the full set of events in a sample space is unknown. Those estimates are made from collected samples of data, and the samples aim to reflect the population of the sample space as closely as possible. To get what we need to understand expected goals, we’re going to go rapid-fire through different statistical concepts and then maybe circle back later down the road this summer.

One way to remember the shift from probability to statistics is that we will likely never know the true values of these parameters in sports, so we collect as much data as we can to get estimates that are as accurate as possible.

Random Variable

I’m going to cheat here and plug the second probability in sports video we made at First Line last summer to save typing time. In summary:

First, a variable is just a value or number that changes, or varies. While we’d casually think of the term “random” as referring to a situation in which a bunch of things are equally likely to happen, with random variables we’re referring specifically to something that can take on different values, each value with its own probability. In this context, the word “random” just refers to the fact that the variable can take on any number of values. Random variables serve as the input to probability functions. As an example, if we were to roll a die and ask “what’s the probability of rolling a number less than three?”, what we’re really asking is “what is the probability of the random variable (the number on the die) being a 1 or a 2, given the probability function provided?”

Since the output of a probability function, the probability of something happening, is always between 0 and 1, the probabilities of all values of the random variable within the sample space have to add up to 1, and any value outside of the sample space must have a probability of zero. Meaning, the probability of rolling a 1 through 6 on a die is 1, because those are all of the possible values, but the probability of rolling any other number is zero.
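To make the die example concrete, here’s a minimal Python sketch of that probability function (the fair-die assumption of 1/6 per face is mine, purely for illustration):

```python
# A fair six-sided die as a random variable: six values, each with probability 1/6
from fractions import Fraction

die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# P(roll < 3) = P(1) + P(2)
print(die_pmf[1] + die_pmf[2])      # 1/3

# All probabilities inside the sample space add up to 1
print(sum(die_pmf.values()))        # 1

# Any value outside the sample space (e.g. 7) has probability 0
print(die_pmf.get(7, Fraction(0)))  # 0
```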

Expected Values and Variances

The expected value of a random variable is its long-run average: if the process is repeated over and over, the average of the outcomes settles at the expected value. It’s the go-to summary statistic for describing the center of a distribution. Variance, meanwhile, measures the spread of the values that the random variable can take. Also commonly referenced is standard deviation, which is the square root of the variance. Expected value and variance are the most common parameters used to describe population distributions, which are another way to describe the possible values of a random variable.
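As a quick worked example, here’s what those two parameters look like for the fair die above (a sketch, nothing hockey-specific yet):

```python
# Expected value, variance, and standard deviation of a fair die roll
values = range(1, 7)
p = 1 / 6  # each face equally likely

expected_value = sum(x * p for x in values)                    # 3.5
variance = sum((x - expected_value) ** 2 * p for x in values)  # ~2.92
std_dev = variance ** 0.5                                      # ~1.71

print(expected_value, variance, std_dev)
```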

Distributions Needed for Working with Expected Goals

We’ll start with the Bernoulli distribution, which describes a random variable that can only take the values 0 and 1. The probability of a 1 is the parameter p, and the probability of a 0 is 1-p. Most of the time, Bernoulli distributions are used to model a process with a defined success (1) or failure (0). When a Bernoulli process is repeated n times, each with probability p of success, the total number of successes is modeled with the binomial distribution. The expected value for that one is n*p and the variance is n*p*(1-p). So, this is used for, say, taking 10 shots with probability p and finding out how many successes there are (or, you know, goals).
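A small sketch of that 10-shot example, assuming a made-up shooting probability of p = 0.1 and leaning on scipy’s binomial distribution:

```python
# 10 shots, each an independent Bernoulli trial with success probability p
from scipy.stats import binom

n, p = 10, 0.1  # p is an illustrative number, not a real shooting percentage

print(binom.mean(n, p))  # expected goals on 10 shots: n*p = 1.0
print(binom.var(n, p))   # variance: n*p*(1-p) = 0.9

# Probability of scoring exactly 0, 1, 2, or 3 goals on those 10 shots
for goals in range(4):
    print(goals, round(binom.pmf(goals, n, p), 3))
```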

When there are multiple series of n attempts being taken, those average results will eventually reflect the normal distribution. This of course is the world-famous bell curve, with its parameters being the population mean and the variance (the standard deviation squared).

[Figure: the normal distribution bell curve, per Towards Data Science]

Emphasis on Population Parameters

So I mentioned the population mean and standard deviation because the normal distribution would need the full range of random variable outcomes to properly get the true values of the mean and variance. But how come the normal distribution is still so commonly used in statistical applications?

Calculus explanations omitted: when there’s a sequence of independent and identically distributed random variables, and each of the random variables has a defined population mean and variance, then as the number of random variables approaches infinity, the probability that their average lands on the population mean approaches 1. This is the law of large numbers. In simpler terms, as the sample gets bigger and bigger, the average of the observed values converges to the true expected value. Meanwhile, the same sequence, as its length approaches infinity, will have standardized averages whose probabilities approach those of the standard normal distribution: a very broad statement of the central limit theorem.
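A rough simulation sketch of both ideas using the die again (the sample sizes here are arbitrary):

```python
# Averages of many die rolls cluster tightly around the true mean (law of
# large numbers), and the distribution of those averages is roughly normal
# (central limit theorem).
import numpy as np

rng = np.random.default_rng(0)

# 10,000 samples, each the average of 500 fair die rolls
sample_means = rng.integers(1, 7, size=(10_000, 500)).mean(axis=1)

print(sample_means.mean())  # close to the true expected value of 3.5
print(sample_means.std())   # close to sqrt(variance / 500), about 0.076
```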

So what does this mean? If the number of observations in a dataset is large enough, all we need is the average of those values and the standard deviation to estimate probabilities, using the normal distribution as a guide. When we use averages here as sample means, we are estimating what the true expected value would be if we had a near-infinite number of values in the sample. The estimated parameters are also conventionally paired with confidence intervals, which describe the percentage of repeated samples in which the true parameter would fall between the interval’s lower and upper bounds.
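For instance, here’s a minimal sketch of the usual normal-approximation 95% confidence interval for a sample mean, built on fabricated data:

```python
# 95% confidence interval for a sample mean via the normal approximation
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=3.5, scale=1.7, size=200)  # pretend observations

mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(len(sample))

# 1.96 is the standard normal multiplier for a 95% interval
lower, upper = mean - 1.96 * std_err, mean + 1.96 * std_err
print(f"95% CI for the true mean: ({lower:.2f}, {upper:.2f})")
```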

Quick Hop Over to Regressions

Regression analyses are used for predicting the value of a variable based on other variables that are known about an event. Trendlines on scatter plots are usually where linear regressions are seen, also sometimes referred to as lines of best fit. An example of this could be predicting heights based on age, heights of parents, etc. The formula isn’t an exact science, as there is an error term included in most regressions, but as mentioned before, most statistical methods exist to estimate parameters whose true values we will never have, simply because our knowledge of the full sample space is limited.

[Figure: a linear regression trendline, per Towards Data Science]
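Here’s a hedged sketch of the height example as a simple linear regression; the age and height numbers are invented purely for illustration:

```python
# Fit a line of best fit: height = slope * age + intercept
import numpy as np

ages = np.array([6, 8, 10, 12, 14, 16])             # years
heights = np.array([115, 128, 139, 150, 162, 172])  # cm (made-up data)

slope, intercept = np.polyfit(ages, heights, deg=1)

# Predict height for an 11-year-old from the fitted line
print(round(slope, 2), round(intercept, 2), round(slope * 11 + intercept, 1))
```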

Then, finally, skipping way ahead, there is also regression using the logistic function. This helps to predict the probability of an event occurring based on the different independent variables that describe the event (such as the characteristics of a shot). A logistic function is used because its output is bounded between 0 and 1, so it behaves like a probability.
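A toy sketch of what that looks like in code, with fabricated shot features standing in for the real independent variables an expected goals model would use:

```python
# Logistic regression: predict the probability of a goal from shot features
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated toy data: [shot distance in feet, shot angle in degrees]
X = np.array([[10, 5], [15, 20], [30, 40], [45, 60], [55, 70],
              [8, 10], [20, 30], [60, 75]])
y = np.array([1, 1, 0, 0, 0, 1, 1, 0])  # 1 = goal, 0 = no goal

model = LogisticRegression().fit(X, y)

# Predicted goal probability for a shot from 12 feet at a 15-degree angle
print(model.predict_proba([[12, 15]])[0, 1])
```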

So Where Does This Place Us?

Okay, so I did mention that this is an extremely high-level overview of the concepts it takes to understand how expected goals work, and here we are, ready to talk directly about expected goals in hockey…next time. This isn’t everything there is to know about probability or statistics (or both), but it’s enough to get a sense of what’s coming next time and to grasp what it means for goals to be expected without being too overwhelmed.