Expected Learning: Data Clustering (Also Intro to Machine Learning)

When we last left off in Expected Learning, we reviewed the components of expected goal models, which include both regression elements (modeling continuous variables) and classification elements (modeling discrete or qualitative variables). But in the world of sports analytics, like most of the real world, data groupings aren't always clear, and finding those classifications becomes its own exercise in prediction and analysis.

The basis is still the same: a series of inputs is used to calculate a predicted output. And while it's an oversimplification, this is the foundation of any statistical learning (*whispers* and yeah, machine learning).

Clustering falls under the umbrella of unsupervised learning. Supervised learning refers to any instance where each measured observation has an associated response (each observation's measurements are tied to a prediction for that response); in unsupervised learning, there is no response to relate the observations to. The analysis is performed to learn more about the observations themselves and the relationships their measurements have with one another.

It's pretty easy to segue to cluster analysis from here, as the desired outcome is to take the things that are known about each observation and see if there are larger groups that can be deciphered.

(Quick Pause)

I started the next section but then I realized that I hadn't used the term "algorithm" yet in any other Expected Learning post, and some people might ask what it means. Well, it's much more straightforward than some may think. An algorithm is a list of steps to do something. So a clustering algorithm is a list of steps to cluster data.

Math Side to Clustering…kind of

Clustering algorithms take known variables as input and return a cluster assignment for each individual data point. Just like this:

Pretty Picture, and, again, kind of…

But how does each algorithm “recognize” and “assign”? There are a few types of algorithms that are more frequently used.

K-Means

At the highest level, K-Means clustering takes the data provided and, with an input number of groups (k), creates groupings of the most similar data points. The method is most commonly shown visually in two dimensions to build an understanding.

The centroids are initially placed at random within the range of the variables. Each data point's "distance" from each centroid is calculated (the more dimensions that are introduced, the less intuitive that measure becomes, especially visually), every point is assigned to its nearest centroid, and each centroid is then moved to the average of its assigned points. This iteration typically continues until the centroids are optimized to the point that data points stop "switching" clusters.
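That loop is compact enough to sketch in plain Python with NumPy. This is a bare-bones illustration, not a production implementation: for simplicity it initializes the centroids at the first k points rather than randomly (real implementations randomize, often with k-means++), and the data is a made-up set of 2-D points.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Bare-bones k-means: assign each point to its nearest centroid,
    re-average the centroids, and repeat until assignments stop changing."""
    # Start the centroids at the first k points (real implementations
    # randomize this, often with k-means++)
    centroids = points[:k].astype(float)
    labels = None
    for _ in range(n_iter):
        # Distance from every point to every centroid (rows: points, cols: centroids)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point "switched" clusters, so we're done
        labels = new_labels
        # Move each centroid to the average of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious made-up blobs of 2-D points
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(pts, k=2)
```

Run on those six points, the algorithm settles quickly: the three points near the origin end up in one cluster and the three near (5, 5) in the other.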

K-means is best used in instances where the number of groups is either known or can be estimated reasonably accurately. At the same time, it's a method that becomes harder to optimize when outliers exist in the data. This is one of the many bias-variance trade-offs that exist in statistical modeling.
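To see why outliers cause trouble, you only need the centroid-update step on a single made-up coordinate: the mean (which is what k-means uses) gets dragged toward the outlier, while a median-style update (the k-medians/k-medoids family of methods) barely moves.

```python
import numpy as np

# x-coordinates of one tight cluster, plus a single outlier at 50
xs = np.array([0.0, 0.1, 0.2, 0.1, 50.0])

centroid = xs.mean()    # the k-means update: dragged toward the outlier (~10.08)
robust = np.median(xs)  # a k-medians-style update: stays with the cluster (0.1)
```

One stray point moved the centroid two orders of magnitude away from the cluster it's supposed to represent, which is exactly why outliers make k-means harder to optimize.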

Here is a good resource that explains the methodology and applications for different clustering methods: https://www.projectpro.io/article/clustering-algorithms-in-machine-learning/842 I wouldn't want to get too mathy and drift away from why you're reading this: how does clustering get used in sports analytics?

Soccer First?

Conveniently, around the time I started compiling this post, Ron Yurko put out a post on his Substack about clustering in soccer, and, hey, look at that: it's a list of reasons clustering has a place in sports analytics. From the post: https://opensourcesports.substack.com/p/careful-clustering-of-soccer-players?r=a5tci&utm_medium=ios&utm_campaign=post

  • The authors consider two different use cases in the context of clustering soccer players:
    • Player types: To help understand how teams are composed, group players into a relatively low number of large clusters that represent broad types of players / archetypes.
    • Comparable players: If you’re instead interested in finding the nearest comparable players, then group players into a much higher number of small clusters. 
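The "comparable players" idea, taken to its extreme (clusters of size one), is just a nearest-neighbor lookup in the same feature space a clustering algorithm would use. A small sketch, with entirely invented per-player rate stats, standardized so no single stat dominates the distance:

```python
import numpy as np

# Hypothetical per-player rates (goals/60, assists/60, shots/60) -- invented numbers
players = {
    "Player A": [1.1, 1.4, 9.0],
    "Player B": [0.3, 0.5, 4.2],
    "Player C": [1.0, 1.5, 8.7],
    "Player D": [0.4, 0.4, 4.0],
}

names = list(players)
X = np.array([players[n] for n in names], dtype=float)
# Standardize each stat so shots/60 (the biggest raw numbers) doesn't dominate
X = (X - X.mean(axis=0)) / X.std(axis=0)

def closest_comp(name):
    """Return the most similar other player by Euclidean distance."""
    i = names.index(name)
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf  # ignore the player themselves
    return names[int(dists.argmin())]
```

With these made-up numbers, `closest_comp("Player A")` returns the other high-volume scorer, Player C, and Player B's comp is the other low-event player, D. The "higher number of small clusters" approach from the post is this same idea with a little more structure.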

Offseason Content!

As Ron pointed out, determining player comps is an extensive application of clustering. And given the time of year, figuring out similarities between groups of players might as well be the foundation of the entire hockey offseason: drafting new players into the organization, re-evaluating a team's prospect pool after each prospect has added another season of data to their profile, and, most importantly, contract projections.

In their contract projection model write-up, the Evolving Wild twins mention how clustering plays a role in assigning player comps and then evaluating what kinds of contracts those comparable players received in the past. Their model uses random forests, a supervised learning method. It's a fairly complex algorithm that many statistics students don't see until late in undergrad, or even grad school, so I'll leave the explanation of that method out of this post. Here is a good post for an overview.

Anyways, back to the twins: their model is the cream of the crop in the public analytics landscape right now, and I'd encourage you all to check out their write-up to see a really good example of how many statistical applications bring multiple parts together for something as small as a single contract offer. (It also highlights the importance of distributions and weighted probabilities, which is an A+ way to capture the context of the population of contracts handed out based on players' profiles.)
