The Stuff of Greatness
On FIP, pitch modeling, and player evaluation in an age of determinism
Eduardo Rodríguez is one of the best pitchers in baseball right now. At least if you measure run-prevention in its simplest form. As of this writing, Rodríguez ranks among the Top 5 in the National League in both ERA (2.27) and RA9 (2.46). He has successfully suppressed run scoring better than almost every other starting pitcher in baseball, and at the end of the day that’s what matters most.
Anyone familiar with modern sports analytics can infer the significance of the weaselly wording above. The rest of Rodríguez’ statline suggests that he will come back down to earth in due time. He has struck out just 18 percent of the batters he’s faced while walking 10 percent — both of those rates are worse than the league average, and his resulting 8 percent differential is the second-lowest among qualified MLB starters. A pitcher so mediocre at controlling the zone would need to be exceptional at preventing hard contact for such ace-worthy performance to be expected, and to my eye nothing in Rodríguez’ present metrics nor his decade-plus Major League track record suggests he is such an outlier.
The field of DIPS (Defense-Independent Pitching Statistics) has been studied for over a quarter of a century. The gist is that a pitcher has more control over some on-field outcomes (like inducing strikeouts) than others (like whether groundballs squeak through the hole for hits), and stripping out exogenous factors provides a better measure of pitching performance. The best-known application of DIPS is FIP (Fielding-Independent Pitching), which estimates what a typical pitcher’s ERA would be based on how many strikeouts, walks, hit-by-pitches, and home runs they gave up. Rodríguez’ FIP is 4.06. The 179-point gap between his ERA and his FIP is the largest in the Majors right now. FIP’s cousins xFIP, kwERA, and SIERA would all tell you a similar story: Rodríguez is pitching well over his head.
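For readers who want to see the arithmetic, here is a minimal sketch of the standard public FIP formula and of the ERA-to-FIP gap quoted above. The function names, the sample stat line, and the default constant of 3.10 are my own assumptions for illustration; the real constant is recalculated every season to put FIP on the league ERA scale, and none of the inputs below are Rodríguez’ actual component stats.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding-Independent Pitching from counting stats.

    `ip` is innings pitched as a true decimal (100 1/3 innings = 100.333),
    not box-score notation. `constant` is an assumed placeholder; the real
    value is recomputed each season to match the league ERA scale.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def gap_in_points(era, fip_value):
    """How far FIP sits above ERA, in hundredths of a run ("points")."""
    return round((fip_value - era) * 100)


# Hypothetical stat line, invented purely for illustration.
example = fip(hr=20, bb=50, hbp=5, k=200, ip=200.0)

# The two figures quoted in the text for Rodríguez.
gap = gap_in_points(era=2.27, fip_value=4.06)
```

The weights show the logic of DIPS at a glance: home runs are penalized most heavily, walks and hit-by-pitches less so, and strikeouts earn credit, while everything that happens on balls in play is left out entirely.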
FIP isn’t cool these days. Newly available data sources have both challenged the underlying assumptions of FIP and its ilk and inspired new paradigms of pitcher evaluation. But that doesn’t mean it is irrelevant. If you asked front office folks around the league whether Rodríguez’ 2.27 ERA or his 4.06 FIP were closer to his true-talent level going forward, I bet it would be near-unanimous for the latter.
There’s a similar, more-modern discussion to be had about Ranger Suárez. His 2.83 ERA and 3.05 RA9 each rank fifth in the American League. And unlike for Rodríguez, FIP is a believer: Suárez’ 2.68 is second-best in the AL. Yet newer-generation pitching models, taking DIPS metrics to their logical next step by evaluating the physical properties of pitches themselves without regard for what the batter does against them, are not. PitchingBot, which estimates a pitcher’s performance based solely on such ball-flight characteristics, retrodicts a 4.18 ERA for Suárez based on the pitches he has thrown. The analogous + models at FanGraphs and Pro metrics from Baseball Prospectus are also unimpressed, pegging him as an average pitcher overall, with a solid ability to locate his pitches but mediocre raw stuff (which is a stabler measurement than command).
It’s not impossible that Suárez could sustain top-of-the-rotation performance with stuff that looks like a fourth starter’s. After all, he’s done so for half a season already. But as with Rodríguez, in cases of such dissonance, the more-intrinsic measurement is right more often than not.
The sabermetric community has spent decades quibbling over how to interpret these discrepancies. If you’re a team analyst or fantasy player trying to project a pitcher’s future effectiveness, the difference between “they successfully demonstrated a skill in the short term that is very hard to sustain going forward” and “they got lucky” is semantics, but it makes a huge difference in the narrative around what they have already achieved.
My read is that the current industry consensus lies somewhere in the middle. It would be perhaps overly curmudgeonly to deny Rodríguez a spot on the NL All-Star team because of his mediocre peripherals, but his elevated FIP would be fully fair game as a detriment in the Cy Young race. Just ask Garrett Crochet, who led all AL pitchers in actual-runs-based wins above replacement last year yet finished in a distant second in the vote, as Tarik Skubal’s success in suppressing scoring was seen as more deserved.
This is a longwinded way of introducing a question I have been chewing on for months: As our capacity for measuring and valuing intrinsic skills improves, how should we weigh our best estimates of a player’s individual contributions against their tangible results? In other words, if Ranger Suárez keeps this up for three more months, ought his pedestrian arsenal metrics be relevant to his Cy Young candidacy?
It sounds like a silly premise, but I think it’s an important one for the baseball community to consider — both because such a situation is sure to arise at some point soon, and as a means of probing a broader rising tension point in the modern game.
I’ll lay my cards on the table here: Intuitively, I am fine with older-generation DIPS metrics playing a role in backwards-looking player evaluation, though I do not think pitch-characteristic-based models should. However, I am struggling to find a satisfactory rationale for drawing that line.
FIP is simple and based on tangible outcomes: Strikeouts, walks, home runs. You can calculate it with simple arithmetic, the math analogous to that of basic ERA. Arsenal metrics are black boxes. The inputs must be measured and tracked digitally — not numbers you could jot down in your scorebook. The outputs come from machine-learning models and must be translated into formats fans would recognize. Surely comprehensibility counts for something. But while ease of understanding is a virtue, it is not a generalizable criterion for accepting a metric as valid. Most fans do not know how to calculate WAR, and clearly I’m not opposed to using that.
For all their flattened assumptions about the parts of the game for which the pitcher is assumed to not be responsible, older DIPS metrics are ultimately based on results. Strikeouts and walks are results. Batted-ball type, which fuels SIERA and xFIP, is a kind of result. So are exit velocity and launch angle for balls in play, which factor into xERA. Contrast this with the newfangled metrics, whose jurisdiction ends before the batter decides whether or not to swing. It’s all farther removed from what ultimately happens on the field. On the other hand, once you decide that a ball in play counts as a result regardless of whether it falls for a hit, why can’t that logic apply further upstream? A pitch thrown from a given arm angle with a given velocity that veers off a straight-line trajectory in a given direction by a given amount on its way to a given location is also a result. It’s just a matter of perception.
Another subjectively crucial difference is how much I feel like I trust a model when it disagrees with an individual pitcher’s results. It is far easier for me to believe that the pitch-based models are struggling to properly evaluate Ranger Suárez, a throwback-style crafty sinkerballer who has always pitched above his stuff, than that Eduardo Rodríguez (who if anything has underperformed his peripherals for most of his career) has suddenly found a way to outwile FIP. Then again, the older metrics have had a 20-year head start to prove their mettle and let their logic seep into our intuitions. It’s also not fair to judge the arsenal metrics by these cherry-picked examples instead of rigorous studies of their predictive power.
There’s a part of me that’s opposed to using ball-flight models for player recognition because it creates perverse incentives. Your job as a pitcher is to prevent runs, not to max out your velocity and spin rate. Optimizing your arsenal should be the means, not the ends. Then I catch myself and realize I sound like an old fogey complaining that no one plays small ball anymore. Even if increasing awareness of Stuff+ were the reason for today’s max-effort pitching approach (which it isn’t), couldn’t you say the same thing about FIP and SIERA nudging pitchers to prioritize getting strikeouts over generally getting outs?
Finally and maybe most importantly, stuff is not a new concept. Driveline did not invent the radar gun. Though the evaluation methods and specific desired traits have evolved over time, the pitcher with the nastiest arsenal has always been picked earlier and gotten the most chances to prove themselves. Yet the Cy Young has never been awarded based on who throws the best bullpen session. If anything, the bias has historically gone the other way, with pitchers who lacked premium velocity or (what was then considered) elite stuff earning extra respect for succeeding through command, sequencing, and feel to pitch.
But this precedential argument elides the subtle but essential philosophical shift that these metrics enable. An evaluation of a pitcher’s stuff is no longer just a placeholder heuristic until you see how it plays against big-league hitters, or a sign of future potential to be unlocked. Now that we have tools that directly model the relationship between the physical characteristics of a pitch and its outcomes — and therefore the empirical value of an extra tick of velocity or inch of movement — we must reconceptualize stuff from a subjective notion correlated with a pitcher’s success to a quantifiable driver of it. It is not merely a scouting report, but the logical conclusion of the decades-old project to strip away external factors from how we evaluate pitchers.
If I’m being honest, it’s this fatalistic view that physics are destiny that makes me most uncomfortable about the new wave of pitching metrics. And I suspect this feeling is lurking behind other manifestations of pearl-clutching about the modern game, too.
One thing that makes baseball distinct is the attenuated relationship between performance and traditional athleticism. Of course it helps to be stronger, to run faster, to throw harder. But the hand-eye coordination it takes to square up a 100 mph fastball, or the dexterity required to run a sinker back over the edge of the plate — these are separate skills. Of all the major sports, baseball players look the most like the rest of us. If you saw an NFL or NBA player walking down the street, you’d recognize them as a professional athlete even if you didn’t know who they were. Baseball doesn’t (necessarily) work that way. It’s part of the sport’s charm.
Thanks to analytical advances in what drives on-field success, the game is moving away from that towards a style based more on pure strength and athleticism. Teams are optimizing for max-effort pitching (fastball velocity has increased over five mph since 2002) and loud contact (the rate of barrels per batted ball exploded from 5 percent to 9 percent within a decade of tracking becoming available). MLB even directly evoked the famous athletic-measurement-collection events from other sports in calling its own recently instituted pre-draft convention the Combine.
In this context, the crotchety paeans to how the game used to be played become more sympathetic. When you hear John Smoltz or the guy at the end of the bar bemoaning that kids these days don’t understand pitching to contact or hitting behind the runner, it’s not (just) that they believe sacrifice bunting is sound strategy, or that they prefer the aesthetics of small ball, or that they are dog-whistling about intertwined issues of race and class. I suspect their back-in-my-day-ism also stems from the dissonance between a long-held vision of baseball where technique and fundamentals matter more than physicality, and the modern understanding that the gentlemanly way to play is not the smart way to run a team. (Except for the one aspect where I think they’re right.)
Ironically, the early analytics movement was on the old-timers’ side here. Plate discipline, which for many years was the skill most-associated with sabermetrics, is not a demonstration of conventional athleticism. “We’re not selling jeans here,” Billy Beane’s retort to a scout who cared more about a prospect’s physical traits than his statistical track record, became the defining credo for a generation of baseball analytics. In a fascinating time-capsule passage from Moneyball, Jamie Moyer’s enduring success is presented as a counterargument against the scouts’ overemphasis on velocity. “This guy wouldn’t get drafted,” Scott Hatteberg is quoted as saying. “He could go out and try out for a team right now and if they didn’t know who he was he wouldn’t get signed.” How things have changed.
The existence of models that translate the flight of a pitch into a reasonably reliable projection of its effectiveness is an escalation of this fatalistic approach to physicality. It implies that you do not, in fact, have to let a pitcher face high-level hitters to know whether or not their arsenal will play in the big leagues. It is a statement that, somewhere in the data, there exists an objective standard for fastball velocity beneath which there isn’t any point in even calling someone up to give them a shot.
As both an avid baseball fan and someone who used to make a living researching questions like this, I don’t doubt that such a threshold exists, or that the models are correct in wherever they identify it for a given pitcher’s movement profile. But on an emotional level, the concept strikes me as callous and antithetical to the ethos of the sport — its magical way of inspiring the imaginations of young and old alike. And whatever its utility in forecasting future performance, it seems anachronous to risk applying such principles to what a pitcher has already achieved.
Here’s a prediction I don’t need a fancy model to make: It won’t be long before we see the first prominent WAR variant based on arsenal metrics. Swapping in PitchingBot ERA (or any like model that can be scaled to such an output) as the primary pitching-performance input would be an easy tweak to the calculations. The flagship models from the major baseball analytics sites already represent different philosophies on how to credit pitchers for the runs they (don’t) allow, from the actuarial RA9 at Baseball-Reference to the classic-DIPS FIP at FanGraphs to the more-complex DRA at Baseball Prospectus. This new version will simply represent a new pole on the existing spectrum.
Whatever your feelings on WAR, it is a highly influential metric. A version that ties into the trendy field of ball-flight modeling will spread quickly. So when it arrives, it will matter, and not just in a navel-gazing sense about its impact on baseball commentary. WAR is a major factor in how the pre-arbitration bonus pool is distributed, and the new version could someday be one of the models included. It may become a discussion point in arbitration salary hearings. It will surely play a role in voting for the All-Star Game and, yes, the Cy Young — recognitions that come with direct monetary incentives for many players.
Should a pitcher’s salary be a function of their velocity beyond how it empirically influences their results on the mound?
Ranger Suárez is probably not on a Hall of Fame trajectory, but it’s not completely out of the question. The bar for modern starting pitchers is getting easier to clear (as it should be), and as someone who gave real thought to checking Gio González’ name on my mock ballot, I can imagine thinking Suárez deserves serious consideration in 15 years or so. When that day comes, will voters retroactively decide that Suárez, who has outperformed his PitchingBot-modeled ERA by 84 points since establishing himself in the Majors in 2021, was overrated in his prime because his arsenal scores didn’t support his box-score results?
After a lot of internal debate, my own answer to the original hypothetical is that ball-flight-based pitch models should not play a role in backwards-looking player honors. The cleanest distinction I can make between these and the traditional DIPS metrics is whether or not they include any batter-involved outcomes. I am open to a range of assumptions for how much of the task of run-prevention is ascribed to the pitcher versus (a variety of contextual factors that are often colloquially referred to as) luck, though after weighing the philosophical merits of retrodicted projections in such cases I find myself increasingly inclined to keep it simple with ERA or RA9. But for now — and maybe this will change in the future — I’m convinced that a fair evaluation must include some measure of how hitters actually fared against them. I would rather accept a less-precise estimate of a player’s in-a-vacuum performance quality than risk a bias against pitchers who know how to get more out of their arsenals than the models think they should.
Yet I think the broader question is what’s more important. It’s been observed that modern sports audiences think increasingly like mock GMs rather than fans: rooting more for their teams to make good-value deals than to acquire good players. If you are interested enough in the nuances of advanced baseball metrics to still be reading 3,000 words into this essay, this is disproportionately likely to apply to you. In that spirit, the question I would pose to you (unless you make your livelihood thinking about baseball) is not whether any given analytical tool is wrong, but whether your primary goal as a sports fan is to be right.
Enjoy a player’s hot streak without worrying about whether their success is sustainable. Appreciate the outliers who make the increasingly homogenous game more interesting and fun. Root for the underdog, the unlikely, the impossible, instead of dismissing the odds. Someday, if I have grandkids, I’ll tell them all about seeing Ranger Suárez throw a Maddux. At the time, it was my job to know what his metrics were. I won’t need those numbers to tell the story.









