Expected Learning - What Factors Make Up Expected Goal Models

We return to summer school a year later for another year of learning the ins and outs of hockey analytics. Read through last summer’s prerequisite posts and more here.

Making the Model – What variables are used to estimate the probability of a shot being a goal?

I’m going to use the Moneypuck expected goals model as the reference for this post because they are saints who publish their methodology and shot data.

First, as a reminder, about 6.6% of shots result in goals. The data for this post will be the collection of over 121,000 shots recorded for the 2021-22 season, and, unless specified otherwise, will omit shots on an empty net.

And interestingly enough, scoring was up in 2021-22 compared to the cumulative sample of shots from 2007-2021.

The Shot Itself

Shot Distance from the Net

Nice and straightforward here: It’s easier to score the closer you are to the net. Since shot distance is a measured numeric value, we can look at the frequency of shots at each distance with a density plot, which is a smoothed version of a histogram that removes the need to choose distance bins. Here are all shots:

And now including a separation of goals and non-goals:

There isn’t anything very surprising about the shapes of these distributions: Shots occur most frequently closer to the net. The blue line is 64 feet away from the goal line in the NHL, so it would make sense that there is a major drop-off after that to only a handful of shots taking place in the neutral zone. It’s also not a shock that shot frequencies tick back up between 55 and 65 feet before this drop-off (the NHL loves itself a good clapper from the stripe). Much more subtly, I love the very small peak back up around 90 feet, since that would represent a shot (usually a dump-in) from center ice that reaches the goalie.

Anyways, once the shot sample is split into goals and non-goals, the two curves intersect around 30 feet. Inside that mark, goals occur relatively more frequently than non-goals; past it, non-goals occur relatively more frequently. So given a goal, it’s substantially more likely that the shot took place within 30 feet, with the frequency peaking around 10 feet. Also, while we’re here, a quick reminder that a straight-on shot from the top of the crease is 6 feet from the goal line, since the crease extends 6 feet from the goal line to its top point.
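To make the density plot idea concrete, here is a minimal sketch of a Gaussian kernel density estimate written with only the Python standard library. The distances, the bandwidth of 5 feet, and the function name `make_density` are all made up for illustration; they are not Moneypuck values, and a real plot would normally come from a plotting library instead.

```python
import math

def make_density(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point.

    Each sample contributes a small Gaussian bump; averaging the bumps gives a
    smooth curve, so no histogram bins need to be chosen.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical shot distances in feet (illustrative only, not real data)
goal_distances = [8, 10, 11, 12, 15, 18, 22, 35]
non_goal_distances = [9, 14, 25, 30, 38, 45, 52, 58, 60, 62]

# Bandwidth is an assumed smoothing choice, in feet
goal_density = make_density(goal_distances, bandwidth=5.0)
non_goal_density = make_density(non_goal_distances, bandwidth=5.0)

# Compare the two curves at a few distances
for d in (10, 30, 60):
    print(d, round(goal_density(d), 4), round(non_goal_density(d), 4))
```

Because each curve is normalized within its own group, comparing them shows where goals are relatively more or less common than non-goals, not the raw counts.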

Shot Angle

Think of shot angle as 0 degrees when the shooter is face to face with the goalie, on the line perpendicular to the goal line. Shooting from the goal line would be 90 degrees in either direction, and everything else falls in between. This is recorded with positive values from 0 to 90 degrees to the right of the net and negative values from -90 to 0 degrees to the left of the net.

Now, this isn’t a perfectly symmetrical plot, but since a density plot is an estimate, the true distribution of shot angle is probably very close to symmetrical. From there, it’s fascinating that three very distinct shooting lanes emerge: straight on and around 35 degrees to each side. The cone made up of the inner 45 degrees on each side of the goalie is where the majority of shots take place. What’s interesting here, though, is that when filtering down to just goals, the symmetry remains, but the highest frequency comes from directly facing the net. It’s not hard logic to follow that, all else constant, a straight-on shot is the most ideal because it offers the most available net for the puck to travel into.

Coordinates of the Shot

The coordinates of the shot are how the distance and angle get calculated and recorded in the dataset. The terminology is based on a horizontal view of the rink, like the broadcast cam. The x-coordinate is the end-to-end point – the North-South measure. The y-coordinate is the box-to-bench point – the East-West measure. Incorporating these individual coordinates into the model supports the distance and angle calculations, especially the East-West one, to give more context to, wait for it, where the shot happened.
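As a rough sketch of how distance and angle can be derived from coordinates, the snippet below assumes x is measured in feet from center ice toward the attacking net, y is measured from the middle of the ice, and the goal line sits at x = 89 (a 200-foot rink with the goal line 11 feet from the end boards). The function name and the choice of which sign of y counts as "right" are assumptions for illustration, not Moneypuck's actual code.

```python
import math

GOAL_LINE_X = 89.0  # assumed: feet from center ice to the goal line

def shot_distance_and_angle(x, y):
    """Return (distance in feet, signed angle in degrees) to the middle of the net.

    0 degrees is straight on; +/-90 degrees is a shot from the goal line.
    Shots from behind the goal line would exceed 90 degrees in this sketch.
    """
    dx = GOAL_LINE_X - x                     # north-south gap to the goal line
    distance = math.hypot(dx, y)             # straight-line distance
    angle = math.degrees(math.atan2(y, dx))  # sign follows the sign of y
    return distance, angle

print(shot_distance_and_angle(83.0, 0.0))    # top of the crease, straight on
print(shot_distance_and_angle(59.0, 30.0))   # 30 feet out and 30 feet wide
print(shot_distance_and_angle(89.0, -10.0))  # on the goal line
```

Feeding the raw coordinates to a model alongside distance and angle lets it distinguish locations that share a distance or an angle but sit in different parts of the zone.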

Like with the shot angle plot, there is symmetry on the east-west plot, reflected across the imaginary straight line running out from the net, perpendicular to the goal line. While the peak in the middle of the ice is similar to that of the shot angle graph, there are no other localized peaks in this graph. So what would that suggest in the context of comparing it to the shot angle? Probably that the wider angle attempts are occurring close to the net.

As for the north-south plot, the graph faces the other direction, but that’s just because center ice is 0, and the value increases when getting closer to the net. There are two local maxima, again taking place close to the net and around the point. Goals were once again overwhelmingly most common in the areas closest to the net, with the goal frequency near the point not reaching as high as the frequency of overall attempts.

Shot Type

Shot types are divided into 7 different categories, as you see below. One piece to remember in model making is that a categorical variable such as shot type will work both independently from and interactively with the other information in the model. The conditional probability aspect will change the context of the percentages below. Nothing about that is too surprising: A backhand in front of the net is way different than one from one of the faceoff dots. Same with wrist shots vs slap shots. This also shows the unpredictable nature of deflected or tipped shots reaching the net, with about 2 of every 5 of them going wide, but when they do reach the net, almost 20% of them go in.
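The two kinds of percentages mentioned above use different denominators: the share of attempts that go wide is out of all attempts, while the share that go in "when they reach the net" is out of shots on net only. Here is a small sketch of that bookkeeping; the records, labels, and function name are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

# Hypothetical records: (shot_type, outcome), outcome is "goal", "save" or "miss"
shots = [
    ("WRIST", "goal"), ("WRIST", "save"), ("WRIST", "save"), ("WRIST", "miss"),
    ("TIP", "goal"), ("TIP", "miss"), ("TIP", "miss"), ("TIP", "save"), ("TIP", "save"),
    ("SLAP", "save"), ("SLAP", "miss"),
]

def shot_type_summary(records):
    """Per shot type: attempts, miss rate, and goal rate on two denominators."""
    counts = Counter(records)
    summary = {}
    for shot_type in sorted({t for t, _ in records}):
        goals = counts[(shot_type, "goal")]
        saves = counts[(shot_type, "save")]
        misses = counts[(shot_type, "miss")]
        attempts = goals + saves + misses
        on_net = goals + saves
        summary[shot_type] = {
            "attempts": attempts,
            "miss_rate": misses / attempts,       # out of all attempts
            "goal_rate_all": goals / attempts,    # out of all attempts
            "goal_rate_on_net": goals / on_net if on_net else None,
        }
    return summary

for shot_type, row in shot_type_summary(shots).items():
    print(shot_type, row)
```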

The Context of the Shot

Isolating rebounds

Shots that take place on a rebound are isolated in the model because the context of the shot changes substantially. The defense is still recovering from defending the previous shot and, more importantly, the goalie is more often than not out of position after having to save the previous shot and is still recovering to stop the rebound. Up to 1 in 5 rebounds resulted in a goal in 2022 and only about 1 in 7 missed the net.

Shot Angle Speed Change after Rebound

In this model, Moneypuck divides the difference in shot angle between the initial shot and the rebound shot by the time since the last shot (formula). The shotAnglePlusReboundSpeed variable accounts for this in the dataset. This is the speed at which the angle changed. With the total sample of rebound shots, the distribution peaks around 5 degrees per second, which would suggest that most immediate rebound shots come from saves where the goalie is positioned to face the original shot and the rebounder is close to the net as well. The goal graph decreases at a slower rate as the degrees per second increase. Theoretically, faster and wider angle changes are the most likely to leave the goalie out of position with more net available to shoot at, so while there aren’t that many shots that exceed a 90-degree-per-second change in shot angle, those are more likely to result in a goal.
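The calculation described above can be sketched in a few lines. This is only an illustration of the idea: the function name is hypothetical, using the absolute difference is an assumption, and returning None when no time has elapsed is a choice made here, not necessarily how Moneypuck handles that case.

```python
def rebound_angle_speed(prev_angle, rebound_angle, seconds_since_last_shot):
    """Degrees per second the shot angle changed between a shot and its rebound.

    Angles are signed degrees (negative to one side of the net, positive to the
    other), so a rebound that crosses the slot produces a large change.
    """
    if seconds_since_last_shot <= 0:
        return None  # assumption: treat the speed as undefined with no elapsed time
    # assumption: the size of the change matters, not its direction
    return abs(rebound_angle - prev_angle) / seconds_since_last_shot

# Initial shot from 30 degrees, rebound shot from -15 degrees, 1.5 seconds later
print(rebound_angle_speed(30.0, -15.0, 1.5))
```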

Opposition’s Number of Skaters On Ice

Teams get better opportunities to shoot and, with that, score when they have more skaters than the defense. Even at 4-on-4 or 3-on-3, most of the time, the offense has the advantage over the defense in being able to use the open ice to attempt a higher-quality profile of shots. One variable in the model that accounts for this is the number of defenders on ice, regardless of how many are on ice for the offense. One thing to note, of course, is that when there are six skaters on the ice, it means the net is empty, so those shots are almost always goals or shots that miss the net. The ones that are saved would represent instances where there are too many men on the ice but the refs missed it (for +/- reasons, this makes sense because there’s no judging which of the 6 was “actually” not supposed to be on the ice). Then we also want to isolate powerplays from even-strength opportunities where there are fewer than 5 skaters, since those contexts can also generate different shot types. The minimal difference between the two charts below mostly shows that there isn’t too much discrepancy, but a big reason for that is the drop-off in volume between powerplays and 4-on-4 or 3-on-3 play throughout a season.

Man Advantage Situation

Again, the extra room on the ice provides more space to get quality shots off. Having a man advantage also ensures at least one open member of the offense for passing. That explanation is again really straightforward, but if it didn’t show a difference from 5-on-5 play, it wouldn’t be in the model.

We can also isolate 6v5 shots to see how having the man advantage against five defenders compares to simply having the strength advantage. Unsurprisingly, it falls between any non-man advantage shot and any man advantage shot.

How do the splits look when breaking Man Advantage shots up by other strength scenarios? Let’s take a look at that as well:

A 6v4 has a slight scoring drop-off from a 5v4. Could be a practice thing, could be a fluke. Who knows?

Time since Powerplay Started

This one took a little bit of data transformation, since the penalty length columns are based on the first penalty in a string of them if there is more than one in that time frame. Because of that, I’m limiting the sample to 2-minute powerplays to investigate this model variable.

Shots go in slightly more often in the first minute of a powerplay, but they are taken at a higher volume later in the powerplay. Both goals and non-goals have a second increase that peaks around 15 seconds left in the powerplay.

The shot-context variables aren’t the most influential in the model, but they end up being most influential in interacting with the variables about on-ice strength.

The Lead-Up to the Shot

The final grouping of variables is the lead-up to the shot, which is where public and private models split apart in terms of quality. Public models are created using data from NHL play-by-play events, which don’t include passes, the most common event before a shot. Here are the events that are available publicly at this point, with counts from the 2021-22 dataset:

Let’s remove challenges, stoppages, and emergency goalie timeouts before we get to the graph:

What stands out first is delayed penalties being the most common event before a goal, which is not surprising at all considering the implied 6-on-5 in most instances after a delayed penalty. Of the shot events, saved shots ranking higher than blocked shots and wide shots for shooting percentage isn’t surprising either, considering that rebound shots usually come with the goalie and defense at their most out of position, as we mentioned earlier in the post. A 1% difference in shooting percentage after giveaways compared to takeaways doesn’t seem like a lot, but giveaways would most likely lead to more odd-man rush opportunities with quality shots, where the now-defense finds themselves needing to change direction to skate backward while also, again, likely being out of position. Now let’s add more context to this previous event:

Time Since Last Game Event

Without passes, the events that are recorded in the play-by-play records can be boiled down into events that either change possession or change the state of the defense. With that being said, it’s fitting that there would be such a higher frequency of quick goals compared to the rest of the shot attempts. The change in “state,” if you will, can also be shown by looking at another measurement:

Speed Since Last Game Event

According to the data dictionary for the model, speed is measured as, “the distance between the shot location and the previous event’s location, divided by the number of seconds between them”. So this graph translates as goals being more likely to occur when the puck gets to the shooting location from the previous event spot more quickly, whether that, on the back end, is being done through passing, carrying, or a little bit of both.
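That definition translates almost directly into code. The sketch below assumes both locations are (x, y) pairs in feet and that the elapsed time is in seconds; the function name and the handling of a zero-second gap are assumptions for illustration.

```python
import math

def speed_from_last_event(shot_xy, prev_xy, seconds_since_prev):
    """Return (distance in feet, speed in feet per second) from the previous event."""
    distance = math.dist(shot_xy, prev_xy)  # straight-line distance between the spots
    if seconds_since_prev <= 0:
        return distance, None  # assumption: speed is undefined with no elapsed time
    return distance, distance / seconds_since_prev

# Previous event at (50, -35), shot from (80, 5) four seconds later
print(speed_from_last_event((80.0, 5.0), (50.0, -35.0), 4.0))
```

Note that this is the speed of the play between the two recorded spots, not the speed of the puck or any skater, since the path taken in between is not recorded.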

Then if we remove the time element and just look at Distance from Previous Event:

Goals are relatively more frequent in roughly the 0 to 30-foot range and again in roughly the 130 to 200-foot range. How these distances interact with the different last event types is handled as part of the model.

The model also looks at the isolated East-West location of the previous event, and interestingly, 5 distinct lanes emerge, with the majority of last events taking place within those lanes, but goals being more frequent in between them.

And finally, yes, the model does account for if the shot was on an empty net.

So there you have it. These are the pieces that make up an expected goals model, and now you can offer some insight if anyone you’re talking to ever says, “I wonder what they consider when determining expected goals” (at least if it’s Moneypuck).

But How Do You Make An Expected Goals Model?

Fortunately, there are a lot of really good tutorials out there on building an xG model and the intricacies of the math behind them. And in Expected Buffalo spirit, one very good one was actually written by Matthew Barlowe. If that name sounds familiar, he is currently part of Sam Ventura’s hockey research and analytics staff as a data engineer. You can find his RPubs post on expected goal models here.