Dear Dr. Math,
I've read that Derek Jeter had a lower batting average than David Justice in both 1995 and 1996, but if you combine the two years together Jeter's average is higher. How is that possible? I don't get it.
No other sport seems to bring out the statistics junkies quite like baseball, what with the ERAs and OBPs and WHIPs.* Where else do you find sports fans casually throwing around ratio approximations computed to three digits of accuracy? I guess it's all a holdover from cricket; we in the U.S. should at least count ourselves lucky that we don't have to learn things like the Duckworth-Lewis method.
So, what you say is true about Jeter and Justice. In 1995, Jeter had a batting average (that's the ratio of hits to at-bats, for the baseball-averse) of .250, and Justice's average was slightly higher at .253. In 1996, Jeter hit a much more respectable .314, but Justice outpaced him again, hitting .321. However, when the results of both years are pooled together (total hits divided by total at-bats), Jeter's combined average is .310, versus Justice's paltry .270. How could this happen, in the most American of sports?
It's a particular case of something called Simpson's paradox, which generally deals with the weird things that can happen when you try to combine averages over different sets of data. See, the source of the confusion is that Jeter and Justice had different numbers of at-bats in each of the two years, so the reported "averages" really aren't measuring their performances by the same standard. In 1995, Jeter had only 48 at-bats, whereas Justice had 411. In the following year, the numbers were almost reversed, with Jeter having a whopping 582 at-bats to Justice's 140. A combined average is dominated by whichever season had more at-bats, so Jeter's combined number is mostly his strong 1996, while Justice's is mostly his weak 1995.
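Here's a minimal Python sketch of that arithmetic, in case you want to check it yourself. The at-bat counts are the ones quoted above; the hit counts are not stated in this post, so they are assumptions inferred by rounding average times at-bats (for example, .250 x 48 = 12). The names seasons, average, and pooled are made up for this illustration.

```python
from fractions import Fraction

# (hits, at_bats) per season. At-bats come from the text; hit counts are
# ASSUMED, inferred by rounding (batting average * at-bats).
seasons = {
    "Jeter":   {1995: (12, 48),   1996: (183, 582)},
    "Justice": {1995: (104, 411), 1996: (45, 140)},
}

def average(hits, at_bats):
    """Batting average as an exact fraction: hits / at-bats."""
    return Fraction(hits, at_bats)

def pooled(player):
    """Combined average: total hits over total at-bats across all seasons."""
    total_hits = sum(h for h, _ in seasons[player].values())
    total_at_bats = sum(ab for _, ab in seasons[player].values())
    return average(total_hits, total_at_bats)

for player, by_year in seasons.items():
    per_year = ", ".join(
        f"{year}: {float(average(*record)):.3f}" for year, record in by_year.items()
    )
    print(f"{player}: {per_year}, combined: {float(pooled(player)):.3f}")
```

With those assumed hit counts, Justice comes out ahead in each individual year and Jeter comes out ahead in the combined line, matching the averages quoted above.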
To see why this matters, consider the following extreme example: Let's say that I got to act out my childhood fantasy and somehow got drafted into Major League Baseball in the year 1995, but I only got 1 at-bat. Imagine that by some miracle I managed to get a hit in that at-bat (let's go ahead and say it was the game-winning home run; it's my fantasy life, OK?). So my batting average for that season would be 1.000, the highest possible. No matter how good Derek Jeter was, he couldn't possibly beat that average. In fact, let's assume he had an otherwise incredible year and got 999 hits out of 1000 at-bats, for an average of .999. Still, though, my average is better. Now, as a result of my awesome performance in 1995, let's say the manager decided to let me have 100 at-bats the following year, but this time, I only managed to get 1 hit. So my average for the second year would be 1/100 = .010, probably the end of my career. Meanwhile, imagine that Jeter got injured for most of that year and only had 1 at-bat, during which he didn't get a hit. Thus, his average for the second season would be .000, the worst possible. Again, my average is higher. So on paper, it would appear that I was better. However, when you combine all the hits and at-bats, I only got 2 hits out of 101 attempts, for an average of .020, whereas Jeter actually had 999 hits out of 1001 at-bats, for an amazing average of .998. My two better seasons were merely the result of creative bookkeeping. The same thing could happen if we split up our averages against right-handed and left-handed pitchers, or at home and away, etc.
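One way to see the bookkeeping trick is to treat the combined average as a weighted average of the seasonal ones, where each season's weight is its share of the player's total at-bats. The sketch below uses only the made-up numbers from the example above; the helper names (season_averages, weights, pooled_average) are invented for illustration.

```python
from fractions import Fraction

# (hits, at_bats) for year 1 and year 2, from the made-up example above.
me = [(1, 1), (1, 100)]
jeter = [(999, 1000), (0, 1)]

def season_averages(record):
    """Batting average for each season, as exact fractions."""
    return [Fraction(hits, at_bats) for hits, at_bats in record]

def weights(record):
    """Each season's share of the player's total at-bats."""
    total = sum(at_bats for _, at_bats in record)
    return [Fraction(at_bats, total) for _, at_bats in record]

def pooled_average(record):
    """Weighted sum of seasonal averages; equals total hits / total at-bats."""
    return sum(w * a for w, a in zip(weights(record), season_averages(record)))

for name, record in [("me", me), ("Jeter", jeter)]:
    shares = ", ".join(f"{float(w):.3f}" for w in weights(record))
    print(f"{name}: season weights {shares}; pooled {float(pooled_average(record)):.3f}")
```

Almost all of my weight lands on my terrible second season, while almost all of Jeter's lands on his spectacular first one, which is why winning both seasons on paper doesn't help me.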
This is part of the reason that baseball record keepers require a minimum number of at-bats in a season for a player's average to "count." Otherwise, some player could start his career with a hit and then promptly "retire" with an average of 1.000.
The phenomenon isn't just limited to sports, either. Simpson's paradox rears its ugly head in all kinds of statistical analyses, like the ones in medical studies, for example. It can happen that treatment 1 appears to be more effective than treatment 2 in each of two subgroups of a population, but when you pool all the results together, treatment 2 is better. (For an example, replace "Major League Baseball" with "a pharmaceutical company," "at-bats" with "patients," "hits" with "successful treatments," "Derek Jeter" with "a rival drug company," and "childhood fantasy" with "adult nightmare" above.) Again, the key is that the sizes of the groups receiving each treatment have to be different from each other in order for the phenomenon to manifest. If the group sizes (or number of at-bats, or whatever) are the same for both treatments within each subgroup, the paradox disappears, because the higher averages then correspond to greater numbers of successes in each subgroup, and therefore to a greater total.
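If you'd rather check that last claim than take my word for it, here is a small brute-force sketch. It assumes "held constant" means both players (or treatments) get the same number of tries as each other within each year (or subgroup), and it only searches small sizes, so it's a sanity check rather than a proof. The function name reversal_exists is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

def reversal_exists(n1, n2):
    """Return True if some pair of records shows a Simpson-style reversal
    when BOTH players have n1 tries in year 1 and n2 tries in year 2."""
    for a1, b1 in product(range(n1 + 1), repeat=2):
        for a2, b2 in product(range(n2 + 1), repeat=2):
            # Player A is strictly better than player B in each year...
            better_each_year = (
                Fraction(a1, n1) > Fraction(b1, n1)
                and Fraction(a2, n2) > Fraction(b2, n2)
            )
            # ...yet no better than B once the two years are pooled.
            not_better_pooled = (
                Fraction(a1 + a2, n1 + n2) <= Fraction(b1 + b2, n1 + n2)
            )
            if better_each_year and not_better_pooled:
                return True
    return False

# Try every combination of year sizes from 1 to 6 tries.
found = any(reversal_exists(n1, n2) for n1 in range(1, 7) for n2 in range(1, 7))
print("reversal found with matched sizes:", found)
```

The search comes up empty, which fits the argument above: with matching denominators, a higher average in each year means more hits in each year, and so more hits in total.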
As I discussed in my previous post about different kinds of averages, an average is supposed to represent the quantity that, if repeated, would have the same overall effect as a set of quantities. However, that doesn't mean the average is the end of the story. Another essential component is how many times that quantity was repeated. If I told you I'd pay you an average of $100 per hour to do some work for me around the house, you'd probably be fairly disappointed if the "work" consisted of 2 seconds of opening a pickle jar (total cost to me: about 5.6¢; look on your face: murderous rage).
*Not to mention RISPs, DICEs, BABIPs, HBPs, ...