Critics of WAR calculation often argue that the formula has too many variations to be consistent. Just about every year, statisticians (or, if some critics are to be believed, cyborgs) announce significant changes to various WAR components. In fact, the two most widely accepted authorities on the subject – Baseball Reference and Fangraphs – can’t seem to agree on the best way of deriving the number. A major step toward WAR uniformity came recently when the sites’ editors agreed to use the same replacement-level record (48-114) in deciding how many Wins Above Replacement to allocate across the whole of baseball (1,000). The basic idea is this: in an entire baseball season, 2,430 games are played, which means 2,430 total wins to go around. If every team were allotted 48 wins, establishing the replacement level, that would account for 1,440 of them, leaving 990 wins above the replacement level to be distributed among the teams. The first thing you likely notice is that 990 and 1,000, while very close, are not the same number. That’s because the agreed-upon performance of a replacement-level team is actually based on a set winning percentage – .294. Over the course of a 162-game season, a .294 winning percentage would actually yield 47.628 wins, which we all know is not possible for any single team. Allot that exact number of wins to all 30 teams, however, and you’re left with 1,001.16 – something much closer to 1,000 – as the pool of wins above replacement to be distributed.
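For those who want to check the math, the whole allocation fits in a few lines (Python here purely for illustration):

```python
# Sketch of the replacement-level arithmetic described above.
GAMES_PER_TEAM = 162
TEAMS = 30
total_wins = TEAMS * GAMES_PER_TEAM // 2      # 2,430 games, one win apiece

# Using a flat 48-win replacement level:
wins_above_replacement = total_wins - TEAMS * 48      # 990

# Using the agreed .294 replacement winning percentage:
repl_wins = 0.294 * GAMES_PER_TEAM                    # 47.628 wins per team
war_pool = total_wins - TEAMS * repl_wins             # 1,001.16
```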
The problem remains that Baseball Reference and Fangraphs use slightly different methods to calculate WAR, and the numbers the two sites come up with are always a little different. For this reason, you’ll often see writers refer to either rWAR (Reference) or fWAR (Fangraphs). In the interest of inclusion, and so I could compare the relative accuracy of each method, I compared both rWAR and fWAR to team records.
The 48-win expectation for a replacement-level team is not meant to be exact. First and foremost, a team comprised entirely of replacement-level players is theoretical. Also, a number of factors, such as luck, strength of schedule, and managerial decision-making, could potentially impact the win total of a replacement-level team (as they do in the “real world”). Perhaps most notably, the win expectancy of a replacement-level team would likely differ depending on whether the team played in the American or National League. Baseball Reference contends that historical interleague win-loss records show the American League to be stronger. Taking this at face value, one might expect a replacement-level team in the American League to win fewer games than one in the comparatively weaker National League. It follows, then, that replacement level in the American League is actually somewhere between 44-48 wins, while a replacement-level team might win 48-52 games in the National League. The data seem to bear this out – generally speaking, American League teams tended to have higher combined WAR than their National League counterparts with similar win-loss records.
Another thing that interested me was whether teams’ WAR totals were more or less consistent with their expected Pythagorean win-loss records than with their actual records. Without getting into the math, a Pythagorean record is an estimation (created by stats O.G. Bill James) of a team’s expected wins and losses based on the total number of runs scored and allowed.
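Sketched in code, the classic James version squares the run totals (Baseball Reference’s published Pythagorean records actually use an exponent of 1.83, and other variants exist, so treat the squared version here as the textbook default):

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2):
    """Bill James's Pythagorean expectation: expected wins from run totals."""
    pct = runs_scored ** exponent / (runs_scored ** exponent + runs_allowed ** exponent)
    return games * pct

# A team scoring 700 runs and allowing 650 "should" win about 87 games.
```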
Putting all of this together, I compared the following for all 30 teams:
– rWAR: total combined Wins Above Replacement, as calculated by Baseball Reference
– fWAR: total combined Wins Above Replacement, as calculated by Fangraphs
– Expected Win-Loss Record: total expected wins based on adding team WAR to 48-win replacement level
– Actual Win-Loss Record: total games won and games lost during 2013 regular season
– Pythagorean Record: expected win-loss record based on runs scored and runs allowed
– Difference between Actual Record and Expected Range (rWAR, fWAR)
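The “Expected Win-Loss Record” above is nothing fancier than the replacement level plus a team’s combined WAR. A minimal sketch:

```python
GAMES = 162
REPLACEMENT_WINS = 48

def expected_record(team_war, replacement_wins=REPLACEMENT_WINS):
    """Expected W-L record: replacement-level wins plus combined team WAR."""
    wins = round(replacement_wins + team_war)
    return wins, GAMES - wins

# A team with 0.0 combined WAR projects to the replacement-level
# record itself: expected_record(0) -> (48, 114)
```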
Despite WAR’s reliance on the concept and likely winning percentage of a replacement-level team, both proponents and critics almost exclusively discuss the calculation as a measure of individual player value. Most obviously, it is enormously difficult to predict team success mid-season based on aggregated individual value. Injuries, regressions in individual performance, and schedule variation all contribute to the problem. Retroactively, though, it’s a fairly easy proposition to look at a team’s cumulative WAR and win-loss record to see how well the expectation matched the results.
The first thing I noticed was that even with varying results for individual teams, fWAR and rWAR produced strikingly similar overall results. At the uniform replacement level of 48 wins, Fangraphs, on average, missed teams’ actual win-loss records by 4.54 wins (4.59 for the National League, and 4.49 for the American League). Baseball Reference did slightly better, with an average win/WAR differential of 4.32 (4.26 for the NL and 4.37 for the AL). With the average miss for both fWAR and rWAR falling between 4 and 5 wins, the overall rate of predictive success for WAR was consistently between 94-95%. The same was true for both the American League (where the average team won 81.3 games) and the National League (80.7 average wins per team).
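Since the jump from “4.5 wins off” to “94-95% accurate” may not be obvious, here’s the reading I’m using – the average miss taken as a share of the average team win total (an interpretation of the arithmetic, not a standard formula):

```python
# Predictive success rate: one minus the average miss, expressed as a
# share of the average team win total (about 81 wins league-wide).
# This is an interpretation, not an official WAR calculation.
def success_rate(avg_miss, avg_wins=81.0):
    return 1 - avg_miss / avg_wins

# success_rate(4.54) is about 0.944 (fWAR);
# success_rate(4.32) is about 0.947 (rWAR)
```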
When I adjusted the replacement level to reflect a stronger American League, the initial results were not significantly different from those achieved with the uniform, 48-win standard. When I added four wins to the NL replacement level (bringing it to 52) and subtracted four from the AL level (44), the accuracy of both Fangraphs and Baseball Reference held steady at 94-95%. Looking at the actual difference in wins between the National League and American League, which was present but not enormous, I wondered whether an 8-win gap between AL and NL replacement-level teams was a bit extreme. So I softened the adjustment (to 50 wins for NL teams and 46 wins for AL teams). While the change made a small, uniformly positive difference in accuracy across all data sets, the rate of predictive success for both Fangraphs and Baseball Reference remained just under 95%.
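For completeness, the three replacement-level schemes can be lined up side by side (the league splits are the hypothetical adjustments described above):

```python
# The three replacement-level schemes tried above; the adjusted
# numbers are hypothetical league-strength corrections.
REPLACEMENT = {
    "uniform":  {"AL": 48, "NL": 48},
    "adjusted": {"AL": 44, "NL": 52},   # full 8-win split
    "softened": {"AL": 46, "NL": 50},   # 4-win split
}

def expected_wins(team_war, league, scheme="uniform"):
    """Expected wins: the scheme's replacement level plus team WAR."""
    return REPLACEMENT[scheme][league] + team_war

# The same hypothetical 33-WAR team projects to 81 wins at the
# uniform level, but only 77 in the AL under the full adjustment.
```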
95% is Really Good
Considering so much can happen over the course of a season to change wins into losses, and vice versa, the ability of an analytical tool to predict wins league-wide so accurately is pretty remarkable. There were some significant problems with a few outlying teams (which I’ll get into shortly), but in 2013 WAR seems to have been a very shrewd measure of success not only for individual players but for teams as well. By comparison, the less-maligned Pythagorean win expectation, which uses runs scored and allowed to tell us how many games a team “should” have won, came in at about 96% – roughly one win more accurate.
Critics of the calculation would likely counter that in baseball, a difference of 4-5 games can mean everything. Especially now, in an era when perceptions of team success are largely predicated on whether a team reaches (and wins in) the postseason, tiny variations in regular season records can and often do mean the difference between satisfaction and disappointment. In anticipation of this reaction, I created hypothetical division and league standings based on both fWAR and rWAR totals for comparison against Major League Baseball’s actual final standings. As we know, in “real life” the 2013 playoff teams were Boston, Tampa Bay, Detroit, Cleveland, and Oakland in the American League; and Atlanta, St. Louis, Pittsburgh, Cincinnati, and Los Angeles in the National League. Here’s how fWAR and rWAR did:
– American League: Boston, Tampa, Detroit, Oakland, Texas (both fWAR and rWAR)
– National League: Atlanta, St. Louis, Pittsburgh, Cincinnati, Los Angeles (both fWAR and rWAR)
The only team that WAR “wrongly” predicted would make the playoffs basically did make the playoffs – Texas finished tied for the second American League Wild Card spot with Tampa Bay and was defeated by the Rays in a one-game playoff. The only team left out, the Cleveland Indians, significantly outperformed the WAR expectations of both Fangraphs and Baseball Reference, and was one of the outliers I’ll discuss in the next installment.
Of the 30 major league teams, 11 had actual win-loss records within four wins of what both fWAR and rWAR predicted (based on a 48-win replacement level). If we add the teams that either fWAR or rWAR predicted within a four-win range, the number increases to 19 teams. When an adjusted replacement level (46 wins for American League teams and 50 wins for National League teams, on average the most predictively accurate adjusted replacement level) is used, the total number of teams with records within four wins of one or both WAR calculation(s) ticks up to 21. For simplicity’s sake, and to address concerns readers might have about artificially making WAR appear more accurate, from now until the end of this discussion any reference to “the replacement level” should be assumed to mean 48 wins.
Notably, fWAR and rWAR never took diametrically opposed “positions” on a single team; no team that underperformed its win expectancy by fWAR standards was an extreme (4+ wins) overperformer according to rWAR, and vice versa. Also of interest: for the nine teams that either fWAR or rWAR (but not both) placed within four wins of their actual records, the average difference between total fWAR and total rWAR was only 3.1 wins. In short, because fWAR and rWAR were close to a consensus on nearly all of the teams that one (again, but not both) predicted accurately, it was an easy call to include eight of those nine teams as having generally met WAR expectations. The ninth team – the San Francisco Giants – was one of the teams about which Fangraphs and Baseball Reference were most sharply divided. After some thought, I decided to include the Giants in the group of teams that mostly performed according to their WAR totals. The reason is that based on rWAR alone and absent any other factors, the Giants’ 2013 record should have been 76 wins and 86 losses. The Giants’ actual record? 76-86. Fangraphs liked the Giants’ individual performances more than Baseball Reference did, to the tune of six wins. The difference, discussed later in more detail, was due largely to the two sites’ drastically different evaluations of just two players.
Of the 19 teams that (according to fWAR, rWAR, or both) performed within 4 wins of their combined individual WAR totals, 11 (Arizona, Atlanta, Cincinnati, Dodgers, Miami, Milwaukee, Mets, Pittsburgh, San Diego, San Francisco, and Washington) are National League teams, and 8 (Baltimore, Houston, Kansas City, Minnesota, Seattle, Tampa, Texas, and Toronto) play in the American League. The list represents “good” and “poor” teams fairly evenly, with 11 of the 19 having posted a winning percentage of .500 or better.
Of course, the fact that a team’s cumulative individual Wins Above Replacement wound up looking similar to its actual win-loss record does not mean that no independent factors contributed to wins and losses. For example, the Arizona Diamondbacks finished the season at 81-81. Fangraphs, based on Wins Above Replacement, “predicted” a record of 81-81. Like any other team, the D-Backs won and lost games based on luck, in-game decision-making, and other factors. The fact that the team’s performance relative to a set replacement level was so close to its actual season-long result can tell us a few things, however. First, the Diamondbacks were probably neither extremely lucky nor extremely unlucky. Their players performed (as measured objectively against every other player) to a certain level, and fans can take heart that the team’s results more or less reflected its players’ combined individual output. The Diamondbacks’ fWAR totals also endorse Kirk Gibson’s in-game performance as a manager, insofar as his decisions seem to have had, at worst, a neutral effect on the team’s results.
In the next installment of the piece, I’ll take a look at six teams whose 2013 win-loss records fell short of expectations created by their combined individual WAR totals. While luck and the inexact nature of Wins Above Replacement tell us to anticipate some discrepancies between expectation and performance, I suspected going in that several factors unique to specific teams would contribute to some instances of WAR “not adding up.”