
MLB Regression and Why it Matters


I'm going to show you a series of images regarding MLB regression, and they will all have something in common. It'll be a pretty important thing, and it'll be very apparent.

  • Series of MLB regression images for pitchers over a 5-game sample

Definitions and Math

Did you catch on? If you didn't, that's okay! I won't judge you for it. Basically, what we are looking at here is regression in action. I've taken three unrelated pitchers and run some rolling 5-game graphs showing their batting average allowed as well as their BABIP for the sample. Now, you'll notice that the two graphs look really similar, and that's no coincidence. BABIP, or batting average on balls in play, moves in step with batting average allowed. The difference is that BABIP can be heavily affected by a number of outside factors, and that opens the door for luck to interfere with the results.
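If you want to build these rolling graphs yourself, here's a minimal sketch in Python. It assumes a per-game pitching log in pandas with hypothetical column names (H, HR, AB, SO, SF); adjust those to whatever your data source uses.

```python
# Minimal sketch of the rolling 5-game graphs described above, assuming a
# per-game pitching log with hypothetical column names: H, HR, AB, SO, SF.
import pandas as pd

def rolling_rates(log: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Compute rolling batting average allowed and BABIP from a game log."""
    # Sum the counting stats over the window first, then take the ratio --
    # averaging per-game rates directly would overweight short outings.
    r = log[["H", "HR", "AB", "SO", "SF"]].rolling(window).sum()
    out = pd.DataFrame(index=log.index)
    out["AVG_allowed"] = r["H"] / r["AB"]
    # BABIP strips out homers and strikeouts: (H - HR) / (AB - SO - HR + SF)
    out["BABIP"] = (r["H"] - r["HR"]) / (r["AB"] - r["SO"] - r["HR"] + r["SF"])
    return out
```

Plot the two columns side by side and you'll see the same near-identical movement as in the images above.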

See, we in the statistics world don't really like luck. It's a bit messy. It causes issues in projections and predictions. It's unruly and inherently difficult to manage. We can't count on it to do what we want. But what we can count on is regression to step in and work its magic, letting luck ebb and flow around the mean instead of running wild.

Regression, by definition, is “a statistical method that attempts to determine the strength and character of the relationship between dependent variables and other independent variables.” Frankly, I don't care much for formal definitions, so in layman's terms: regression is used to make assumptions about something based on outside influences or variables. We use it a lot in the DFS and betting world. We look to predict future performances based on stats and metrics that we know end up at a certain point. Typically, the way you'll hear it phrased is that “this or that will regress to the mean,” which just means it'll end up at the average point eventually.
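To see why short samples drift so far from the mean before snapping back, here's a quick illustrative simulation. The .290 true-talent BABIP is an assumed number for demonstration, not any real pitcher's figure.

```python
# Toy simulation: even a pitcher with a fixed "true" .290 BABIP will post
# wild 5-game samples, while longer samples crowd tightly around the mean.
import random

TRUE_BABIP = 0.290  # assumed true talent level for this demo
random.seed(7)

def sample_babip(balls_in_play: int) -> float:
    """Simulate BABIP over a fixed number of balls in play."""
    hits = sum(random.random() < TRUE_BABIP for _ in range(balls_in_play))
    return hits / balls_in_play

for n in (75, 750):  # roughly a 5-game sample vs. a full season of balls in play
    samples = [sample_babip(n) for _ in range(1000)]
    print(f"{n:>4} BIP: min={min(samples):.3f}  max={max(samples):.3f}")
```

The 75-ball-in-play samples swing all over the place; the 750-ball samples barely leave the neighborhood of .290. That gap between the short-run chaos and the long-run anchor is the whole game here.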

Applications and Strategy

So, why does this matter? Well, when something is considered a “luck” stat, we generally assume that it is regressive. BABIP is one of those stats, and we know it will eventually regress to league average. Other instances of this are stats like home runs per fly ball (HR/FB) and left-on-base percentage (LOB%). We know that, for the most part, these stats are going to end up around league average or within a standard deviation of it.

The last point that I'll make about regression is that it doesn't happen slowly over a long sample. Regression is a vengeful bitch, and it happens quickly and with bravado. So we know that regression to the mean will come, and we know it will come quickly. With that info, we can do some pretty radical stuff.

Namely, when we see a stat that we know will regress to the mean sitting at a large deviation from that mean, we can leverage the upcoming correction. Specifically for MLB, we can look at stats like BABIP, HR/FB, and LOB% and attack pitchers who have been far outperforming the league averages.
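As a rough sketch of what that screening could look like in code: the league-average figure and the deviation threshold below are illustrative assumptions, not values pulled from any model on the site.

```python
# Hedged sketch: flag pitchers whose rolling BABIP sits well below an assumed
# league average, i.e. candidates for an upcoming correction.
LEAGUE_BABIP = 0.290    # illustrative; league average typically lands near here
THRESHOLD = 0.060       # illustrative deviation cutoff

def regression_candidates(babips: dict[str, float]) -> list[str]:
    """Return pitchers whose rolling BABIP is THRESHOLD or more below league average."""
    return [name for name, babip in babips.items()
            if LEAGUE_BABIP - babip >= THRESHOLD]

# Example with made-up numbers:
print(regression_candidates({"Pitcher A": 0.176, "Pitcher B": 0.301, "Pitcher C": 0.225}))
# -> ['Pitcher A', 'Pitcher C']
```

In DFS terms, those flagged pitchers are the ones whose opposing hitters start looking attractive.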

Example of MLB Regression

You'll notice in the images above that I highlighted a specific point for each pitcher's graph. For each one, I chose a point in time when their BABIP was well below the league average (usually between .280 and .295) and right before the correction came. Let's take a look at Jose Berrios again:

Jose Berrios' 5-game rolling averages for 2021

Coming into that point on the graph, he had a stretch of three games with just one earned run. His BABIP was at .176. His next three games? He faced the Angels, Nationals, and White Sox and gave up 12 runs with a .488 BABIP. This is a perfect illustration of regression to the mean: you can get away with something for a while, but the longer you do, the more violent the correction.

Notes and caveats

This is a really powerful method for making predictions, but it's not an exact science. To be clear, random variation is always going to be random. Just because we know that something is very likely to happen doesn't mean that we can predict when it will happen. The crux of this exercise is that you can line things up perfectly in your assessment of a situation and still get nothing from it. That's just how things go sometimes. The goal here is to leverage outcomes for maximum expected value, not to hit the predictions frequently. After all, the highest payouts come on the most unlikely hits.

This kind of method can be used for a wide range of applications. Any time you can measure how likely something is to happen against how often it has actually happened, you can make assumptions about future performance. Granted, not everything will regress to league averages. However, if something regresses to a career average and has a long sample behind it, you can utilize that in similar ways. Look for things that ebb and flow and you may have something to leverage with regression.
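One simple, hedged way to put a number on that "how likely vs. how often" comparison is a z-score of the observed rate against its baseline. The figures below are illustrative, reusing the Berrios-style numbers from earlier.

```python
# Rough signal: how far is an observed rate from its baseline, given the
# sample size? A binomial standard error is a simplifying assumption here.
import math

def regression_signal(observed: float, expected: float, n: int) -> float:
    """Approximate z-score of an observed rate vs. its baseline over n trials."""
    se = math.sqrt(expected * (1 - expected) / n)  # binomial standard error
    return (observed - expected) / se

# A .176 BABIP over ~75 balls in play vs. an assumed .290 baseline:
print(round(regression_signal(0.176, 0.290, 75), 2))  # -> -2.18
```

The further that number sits from zero, the more lopsided the luck, and the harder the eventual snap back tends to be.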

This type of method is the engine behind the True AVG models on the site, as well as a driving factor in the MLB simulations. The MLB Range of Outcomes algorithms utilize this logic to better predict future performance. If this stuff sounds powerful and you don't already have a subscription to the site, it's in your best interest to get one!

Regardless, I hope you learned something reading this and if you didn't I hope you enjoyed it anyway! Join the community, grab a subscription, and let's bake some bread.