Tuesday, April 14, 2009

Nothing is Certain but Death and Logarithms

Dear Dr. Math,
I've heard that if I wanted to, ahem, "creatively adjust" some numbers, I should use numbers that start with the digit 1 more often. Why is that?
Inquiring Re. Statistics

Dear IRS,

How timely of you to bring this up! Indeed, there is a general pattern in the digits typically found in measured quantities, especially those spanning many orders of magnitude, for example: populations of cities, distances between stars, or, say, ADJUSTED GROSS INCOME. The pattern is that the digit 1 occurs more often as the leading digit of the number, approximately 30% of the time, followed by the digit 2 about 18% of the time, and so on. The probability, in fact, of having a leading digit equal to d is equal to log(1+1/d), for any d =1,2,...,9. This rule is called Benford's Law, named (as is often the case) for the second person to discover it, when he noticed that the pages of the library's book of logarithms were much dirtier, hence more used, at the front of the book where the numbers began with 1. In pictures, the distribution of digits looks like this:
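
If you'd rather compute those percentages than squint at a chart, here's a quick sketch in Python (the language is just my choice, nothing official) that simply evaluates log(1+1/d) for each digit:

```python
import math

# Benford's Law: P(leading digit = d) = log10(1 + 1/d)
for d in range(1, 10):
    print(d, round(math.log10(1 + 1 / d) * 100, 1), "%")
```

Digit 1 comes out to about 30.1%, digit 2 to about 17.6%, and so on down to about 4.6% for poor digit 9.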



It seems counterintuitive that any digit should be more likely than any other. After all, if we pick a number "at random," shouldn't it have the same probability of being between 100 and 199 as it does of being between 200 and 299, etc.? If so, the probability of getting a 1 as the first digit would in fact be the same as getting a 2. However, this turns out to be impossible, and it has to do with a very common misconception about "randomness."

The fact of the matter is that there's actually no way to pick a number uniformly at random without further restrictions. So, for example, if I tell you to "pick a random number," it must be the case that you're more likely to select some particular number than some other (which ones, however, are up to you.) Assume this weren't true, so all numbers are equally likely. Just to be clear, let's focus on the positive integers, the numbers 1,2,3,... Now let p be the probability of picking any one of them, say the number 1. Since they're all supposedly equally likely, this means p is also the probability of picking 2, and of picking 3, and so on. So the chance of picking any number between 1 and 10, say, is 10*p. Since probabilities can never exceed 1, this means p ≤ 1/10. OK, well, by the same reasoning, the probability of picking a number between 1 and 1000 is 1000*p, so p ≤ 1/1000. Similarly, p ≤ 1/1,000,000, p ≤ 1/(1 googol), and so on. In fact, it follows that p ≤ 1/N for any N, and the only (non-negative) number that has that property is p = 0. Ergo, the chance of getting any particular integer is 0, from which it follows (for reasons I won't get into here) that the probability of picking an integer at all is 0, a "contradiction." That's math-speak for "whoops." You can only pick an integer uniformly from a finite set of possibilities.

So, what do we mean when we say that a number is "random"? Well, there are ways for things to be random without being uniformly random. For example, if you roll a pair of dice, you might say the outcome is "random," but you know that the sum is more likely to be 7 than it is to be 2. Similarly, if you pick a person (uniformly) randomly from the population of the U.S. (note: the population is finite, so that's OK), you might model his/her IQ as a random quantity with a normal distribution, a.k.a. a "bell curve," centered around 100. The existence of different distributions besides the uniform distribution is the source of a lot of popular misunderstandings about statistics.

None of that explains where Benford's Law comes from, of course, but it's at least an argument why it's plausible that the distribution isn't uniform. To explain the appearance of the particular logarithmic distribution of digits I wrote above, we'd need some kind of model for the quantities we were observing, and it can't just be "the uniform distribution on the positive integers," because we already showed that there's no such thing.

One reasonable idea is that the thing we're measuring might be "scale invariant." That is, if it has a wide range of possible values, it might not matter what size units we use to measure it--we'll get roughly the same distribution of numbers. So if we imagine switching from measuring lengths in feet to measuring them in "half-feet,"* say, then anything that gave us a foot-length starting with 1, say 1.2 feet or 1.8 feet, will now give us a half-foot length starting either with 2 or 3, in this case 2.4 and 3.6 "half-feet." If the two distributions are the same, then the occurrence of a first-digit 1 must be the same as the occurrence of a first-digit 2 or 3, combined. By the same reasoning, any quantity initially beginning with a 5, 6, 7, 8, or 9 would now begin with a 1, when doubled. Similarly, by tripling the scale, measuring in "third-feet" and assuming the same invariance, we'd get a 1 as often as a 3, 4, or 5 put together. And so on. By considering every possible scale, this line of reasoning leads you pretty much straight to Benford's Law. This scale invariance kind of makes sense if we're measuring ADJUSTED GROSS INCOME, since incomes vary by so much (so very, very much), whereas something like height wouldn't exhibit scale invariance, being more tightly distributed around its mean.
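
(A quick sanity check that the logarithmic rule really does have this property: under Benford's Law, the chance of a leading 2 or 3 combined is log(1+1/2) + log(1+1/3) = log(3/2 * 4/3) = log(2), which is exactly log(1+1/1), the chance of a leading 1. Doubling the scale checks out.)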

Another perspective is that when we measure things, we're frequently observing something in the midst of an exponential growth. Exponential growth happens all the time in nature, for example, in the sizes of populations or SECRET OFFSHORE BANK ACCOUNTS with a fixed (compound) interest rate. The key feature of a quantity growing exponentially is that it has a fixed "doubling time." That is, the amount of time it takes to grow by a factor of 2 is independent of how big it is currently. For example, let's assume your illegal bank account (well not yours, but one's) doubles in value every year and starts off with a balance of $1000. At the end of year 1, you'd have $2000, at the end of year 2 you'd have $4000, at the end of year 3 you'd have $8000, and so on. So for the whole first year, your bank balance would start with the digit 1, but during the second year you would have some balances starting with 2 and some with 3. During the third year, you would have balances starting with 4, 5, 6, and 7. If we AUDITED your account at some randomly chosen time, we'd be just as likely to see a balance starting with 1 as a balance starting with 2 or 3, combined, and so on. In other words, we have the same "scale invariance" conditions as before, which lead us back to Benford's Law. The same would be true no matter how quickly the account grew; exponential growth sampled at a random time gives us a logarithmic distribution of digits.

To give you a concrete example, I went through the first 100 powers of 2--1, 2, 4, 8, 16, ...**--and instructed my computer to keep track of just the first digits. The results, as you can see, conform pretty nicely to Benford's Law:
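
In case you'd like to reproduce the tally yourself, here's roughly what I asked the computer to do, written as a short Python sketch (the language is my choice; the powers of 2, thankfully, are not):

```python
from collections import Counter
import math

# Leading digits of the first 100 powers of 2: 2^0, 2^1, ..., 2^99
counts = Counter(int(str(2 ** n)[0]) for n in range(100))

for d in range(1, 10):
    print(f"digit {d}: {counts[d]:2d} out of 100 "
          f"(Benford predicts about {math.log10(1 + 1/d):.0%})")
```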



For whatever reason, it appears that Benford's Law, like TAX LAW, is the law.

-DrM


*Sounds vaguely Tolkienesque, don't you think?

**32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648, 4294967296, 8589934592, 17179869184, 34359738368, 68719476736, 137438953472, 274877906944, 549755813888, 1099511627776, 2199023255552, 4398046511104, 8796093022208, 17592186044416, 35184372088832, 70368744177664, 140737488355328, 281474976710656, 562949953421312, 1125899906842624, 2251799813685248, 4503599627370496, 9007199254740992, 18014398509481984, 36028797018963968, 72057594037927936, 144115188075855872, 288230376151711744, 576460752303423488, 1152921504606846976, 2305843009213693952, 4611686018427387904, 9223372036854775808, 18446744073709551616, 36893488147419103232, 73786976294838206464, 147573952589676412928, 295147905179352825856, 590295810358705651712, 1180591620717411303424, 2361183241434822606848, 4722366482869645213696, 9444732965739290427392, 18889465931478580854784, 37778931862957161709568, 75557863725914323419136, 151115727451828646838272, 302231454903657293676544, 604462909807314587353088, 1208925819614629174706176, 2417851639229258349412352, 4835703278458516698824704, 9671406556917033397649408, 19342813113834066795298816, 38685626227668133590597632, 77371252455336267181195264, 154742504910672534362390528, 309485009821345068724781056, 618970019642690137449562112, 1237940039285380274899124224, 2475880078570760549798248448, 4951760157141521099596496896, 9903520314283042199192993792, 19807040628566084398385987584, 39614081257132168796771975168, 79228162514264337593543950336, 158456325028528675187087900672, 316912650057057350374175801344, 633825300114114700748351602688

Tuesday, March 31, 2009

Elevator Action

Dear Dr. Math,
How do I figure out whether to take the elevator or the stairs? The elevator is faster, when it comes, but sometimes I think I end up waiting longer than it would have taken me to just walk.
Regards,
StairMaster


Excellent question, SM, but we need to clarify what criteria we're using to decide between the two options. Some people might generally prefer the stairs because they enjoy the exercise, or maybe they're worried about the possibility of being stuck alone in a metal box for 41 hours. Some people like the elevator because it's more social, and you can do that thing where you jump at the very end and it feels like you're floating. But from the way you asked, I'm assuming you're just trying to minimize time, and that's all you care about. By the way, what's the big hurry? Take time to enjoy the little things, SM; they're all we have.

To actually answer the question, we need a model for all of the relevant quantities that we care about. To keep it general, I'll use variable names instead of hard numbers, and then we can take a look at some specific examples, and you can apply the theory to your own needs.

First, there's the stair option. Let's use S to denote the time it takes to walk. Since S doesn't really change much from trip to trip, we'll treat it as a constant. If you wanted to get more sophisticated, you could account for things like how many other people were trying to take the stairs, whether you were carrying something heavy, whether you could slide down the banister, etc.

The elevator option is the more interesting one. Let's let e be the shortest possible time the elevator could take, say, if it were already there waiting for you and didn't make any other stops. Similarly, there's a maximum time the elevator could take, if it were the greatest possible number of floors away from you and someone had pushed all the buttons, or something. Denote that time by E. (Upper case for the bigger time; lower case for the smaller one.) Again, we're treating e and E as known quantities and as constants; I encourage you to measure them sometime. If S < e, it's always faster to take the stairs, no matter what. If E < S, it's faster to take the elevator, even in the worst case. The remaining possibility is that e < S < E, so sometimes one is better, sometimes the other. It sounds like that's the situation with your elevator, SM, so we'll take it as a given.

Since we don't know the actual length of time the elevator would take, we have to treat it as a random quantity with a value somewhere between e and E. Let's call the actual time T. Here again, we need to consider what information we might have about T that could tell us what kind of probability distribution is reasonable to associate to it. For example, should we expect it to usually be closer to its minimum possible value, e? That would make sense if not many other people used that elevator and it usually hung out on the correct floor--say, if we were trying to go up from the ground floor to the 7th floor in an apartment building. On the other hand, if we were trying to go down from the top floor of a busy office building, it might be more reasonable to expect T to be closer to its maximum value, E, more of the time.

In the absence of any other information, we'll assume the distribution of T is uniform on the set of times between e and E, meaning it's just as likely to be any value as any other. Another way of saying this is to say that the probability that T is less than any given value, say x , is proportional to the difference between x and e. For x = e, the probability is 0, since e is the smallest possible time; for x = E, the greatest possible time, the probability is 1; for times in the middle, the probability is between 0 and 1. In pictures, the distribution looks like this (a = e, b = E):
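
In symbols, the probability that T is at most some value x is (x - e)/(E - e): it's 0 when x = e and climbs steadily up to 1 when x = E.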



As a result, the expected value, or average, of T is the number halfway in between its minimum and maximum possible values, that is, (e + E)/2. So, in the sense of minimizing expected values, you should take the stairs only if S < (e + E)/2; otherwise, you're better off waiting for the elevator, on average. As in my discussion of the lottery, though, the expected value may not be the only consideration. Since the time to take the stairs is a known quantity, it has a variance of 0, and that security may be worth some trade off in expected value. On the other hand, maybe you like to gamble, SM (I don't know you that well), in which case you might prefer the thrill of betting on the higher-risk, higher-reward elevator, even if the average time is slightly greater. Over many trials, though, you'd save time choosing the option with the smallest expected value.
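
If you like, here's the whole decision rule as a tiny Python sketch, under the uniform assumption above; the example numbers at the bottom are made up, so plug in your own building's measurements:

```python
def faster_on_average(S, e, E):
    """Stairs take a fixed S seconds; the elevator is assumed to take a
    uniformly random time between e and E seconds. Pick the option with
    the smaller expected time."""
    expected_elevator = (e + E) / 2
    return "stairs" if S < expected_elevator else "elevator"

# Hypothetical numbers: stairs take 90 seconds, elevator anywhere from 20 to 200
print(faster_on_average(S=90, e=20, E=200))  # "stairs", since 90 < (20 + 200)/2 = 110
```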

Just to see how this would play out with actual numbers, I'll consider a scenario that I frequently encounter when taking the subway. (And you thought this was just about elevators!) Here, the role of "stairs" will be played by the #1 downtown local train, which makes frequent stops every few blocks, and the "elevator" corresponds to the #2 downtown express train, making stops only every few stations. Let's assume that I'm already on the local train at the 96th Street station, and I'm trying to get to Times Square as quickly as possible. My options are 1) stay on the local train, which will get to Times Square after some fixed amount of time, or 2) gamble on the express train, which might get there earlier or might not.

According to the schedule, it takes the local train about 10 minutes to get from 96th Street to Times Square. And the express train takes 6 minutes but only runs every 12 minutes (in the middle of the day). Therefore, the least possible time it could take is 6 minutes, and the greatest possible time is 18 minutes. Assuming the distribution of times to be uniform between these extremes gives us an average travel time of (6 + 18)/2 = 12 minutes, which is 2 minutes more than the local train. Hence, to minimize expected value, I should always stay on the #1. Here, the randomness of the travel time has more to do with how long I have to wait for the train than how long the trip will actually take, and it's somewhat reasonable to treat this as being uniformly distributed, since the train runs on a regular schedule (every 12 minutes) but I don't know what time it is currently relative to that schedule.

Of course, in practice the situation is more complicated than that. For example, there are actually two possible express trains I can take (the #2 or the #3), just as there might be two possible elevators you could take, and I'll take whichever comes first. If we treated these both as uniformly random variables, independent of each other, then the distribution of the time until the next train would not be uniform and would, in fact, have a smaller expected value.

For the numbers I gave above, it actually works out that the #2/#3 express option has the same expected value, 10 minutes, as the local train. So it's back to personal preference to break the tie. Another interesting variant is to consider how many people you see waiting for the train, elevator, bus, etc. and use that as a way to estimate the amount of time they've already been waiting. For example, if you see 3 people waiting and you know people tend to show up at the rate of 2 per minute, then you could estimate the time already waited at 1.5 minutes, reducing the maximum possible travel time by that same amount and perhaps tipping the balance.
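
Here's a little simulation of that two-express-trains claim, under the same assumptions (each train's wait is uniform between 0 and 12 minutes, independent of the other, plus a 6-minute ride); it's a sketch, not a product of the transit authority:

```python
import random

trials = 100_000
total = 0.0
for _ in range(trials):
    wait_2 = random.uniform(0, 12)    # minutes until the next #2 express
    wait_3 = random.uniform(0, 12)    # minutes until the next #3 express
    total += min(wait_2, wait_3) + 6  # board whichever comes first, then ride 6 minutes

print(total / trials)  # comes out around 10 minutes, tying the local
```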

Hope that helps some, and let me know whether the time you save turns out to be greater than the time spent thinking about the problem. I've got to go now to catch a (local) train.

-DrM

Thursday, March 26, 2009

An Average Screwball

Dear Dr. Math,
I've read that Derek Jeter had a lower batting average than David Justice in both 1995 and 1996, but if you combine the two years together Jeter's average is higher. How is that possible? I don't get it.
SoxFan04

Dear SoxFan,

No other sport seems to bring out the statistics junkies quite like baseball, what with the ERAs and OBPs and WHIPs.* Where else do you find sports fans casually throwing around ratio approximations computed to 3 digits of accuracy? I guess it's all a holdover from cricket; we in the U.S. should at least count ourselves lucky that we don't have to learn things like the Duckworth-Lewis method.

So, what you say is true about Jeter and Justice. In 1995, Jeter had a batting average (that's ratio of hits to at-bats for the baseball-averse) of .250, and Justice's average was slightly higher at .253. In 1996, Jeter hit a much more respectable .314 but Justice out-paced him again, hitting .321. However, when the results of both years are pooled together (total hits and total at-bats), Jeter's combined average is .310, versus Justice's paltry .270. How could this happen, in the most American of sports?

It's a particular case of something called Simpson's paradox, which generally deals with the weird things that can happen when you try to combine averages over different sets of data. See, the source of the confusion is that Jeter and Justice had different numbers of at-bats in each of the two years, so the reported "averages" really aren't measuring their performances by the same standard. In 1995, Jeter only had 48 at-bats, whereas Justice had 411. In the following year, the numbers were almost reversed, with Jeter having a whopping 582 at-bats to Justice's 140.
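
If you want to see the whole paradox in a few lines, here's a Python sketch; the hit totals are the ones implied by the averages and at-bat counts above:

```python
# (hits, at-bats), inferred from the averages and at-bat counts quoted above
jeter   = {"1995": (12, 48),   "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

for year in ("1995", "1996"):
    jh, jab = jeter[year]
    dh, dab = justice[year]
    print(year, round(jh / jab, 3), round(dh / dab, 3))  # Justice wins both years

# Pool the two seasons together and the order flips: roughly .310 versus .270
jh, jab = map(sum, zip(*jeter.values()))
dh, dab = map(sum, zip(*justice.values()))
print("combined", round(jh / jab, 3), round(dh / dab, 3))
```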

To see why this matters, consider the following extreme example: Let's say that I got to act out my childhood fantasy and somehow got drafted into Major League Baseball in the year 1995, but I only got 1 at-bat. Imagine that by some miracle I managed to get a hit in that at-bat (let's go ahead and say it was the game-winning home run; it's my fantasy life, OK?). So my batting average for that season would be 1.000, the highest possible. No matter how good Derek Jeter was, he couldn't possibly beat that average. In fact, let's assume he had an otherwise incredible year and got 999 hits out of 1000 at-bats, for an average of .999. Still, though, my average is better. Now, as a result of my awesome performance in 1995, let's say the manager decided to let me have 100 at-bats the following year, but this time, I only managed to get 1 hit. So my average for the second year would be 1/100 = .010, probably the end of my career. Meanwhile, imagine that Jeter got injured for most of that year and only had 1 at-bat, during which he didn't get a hit. Thus, his average for the second season would be .000, the worst possible. Again, my average is higher. So on paper, it would appear that I was better. However, when you combine all the hits and at-bats, I only got 2 hits out of 101 attempts, for an average of .020, whereas Jeter actually had 999 hits out of 1001 at-bats, for an amazing average of .998. My two better seasons were merely the result of creative bookkeeping. The same thing could happen if we split up our averages against right-handed and left-handed pitchers, or at home and away, etc.

This is part of the reason that baseball record-keepers require a minimum number of at-bats in a season for a player's average to "count." Otherwise, some player could start his career with a hit and then promptly "retire" with an average of 1.000.

The phenomenon isn't just limited to sports, either. Simpson's paradox rears its ugly head in all kinds of statistical analyses, like the ones in medical studies, for example. It can happen that treatment 1 appears to be more effective than treatment 2 in each of two subgroups of a population, but when you pool all the results together, treatment 2 is better. (For an example, replace "Major League Baseball," with "a pharmaceutical company," "at-bats" with "patients," "hits" with "successful treatments," "Derek Jeter" with "a rival drug company," and "childhood fantasy" with "adult nightmare" above.) Again, the key is that the sizes of the groups receiving each treatment have to be different from each other in order for the phenomenon to manifest. If the group sizes (or number of at-bats, or whatever) are held constant, the paradox disappears, because the higher averages would actually correspond to greater numbers of successes in each trial.

As I discussed in my previous post about different kinds of averages, an average is supposed to represent the single quantity that, if repeated, would have the same overall effect as the whole set of quantities. However, that doesn't mean the average is the end of the story. Another essential component is how many times that quantity was repeated. If I told you I'd pay you an average of $100 per hour to do some work for me around the house, you'd probably be fairly disappointed if the "work" consisted of 2 seconds of opening a pickle jar (total cost to me: about 5.6¢; look on your face: murderous rage).

-DrM


*Not to mention RISPs, DICEs, BABIPs, HBPs, ...

Tuesday, March 17, 2009

How Big is That Number? Special "March Madness" Episode 2^63.

Dear Dr. Math,
How many possible NCAA basketball tournament brackets are there? If I just guessed randomly, what are the odds I'd get it exactly right?
BYOBasketball

Ah yes. I love March Madness for many reasons:

1) the wide assortment of team names and mascots,
2) Dick Vitale, and
3) it's the one time of the year when people care about the number 2^63, because that's the number of possible ways of filling out a tournament bracket prediction.

Even then-future-President Barack Obama couldn't resist giving it a try! Let's see how it works, baby!

To begin with, the NCAA officials select 64 teams to play in the tournament (actually 65--there's a traditional preliminary game to determine the last team in the field, but we'll ignore that). They divide the teams up into 4 groups of 16 and then rank the teams within each group, from 1 to 16. Each group holds a single-elimination tournament, with the high-ranking teams pairing off initially against the low-ranking teams, and the 4 winners (the Final Four) go on to play two more rounds of games to determine the national champion. At some point, the players cut down the nets with scissors for some reason. Predicting the way the entire tournament will unfold (that is, the "bracket") has become a popular office tradition; legend has it that no one has ever predicted the entire bracket exactly right, although some websites like Yahoo! have offered prizes of up to $1 million if anyone does it. But if they charged even a penny to play, I wouldn't take that bet. Here's why:

Each game played has 2 possibilities for a winner. Considering just the first round of games, of which there are 32, this means the number of possible ways of predicting the winners is 2^32, or 4,294,967,296. Assuming you had guessed the first round correctly, there would now be 16 games to determine the winner of, so the number of possible predictions for the second round would be 2^16 = 65,536. And so on. In the third round, there would be 2^8 = 256 possibilities; in the fourth round, 2^4 = 16; in the fifth round, 2^2 = 4, and in the final game, you'd still have to guess which of the 2 remaining teams would be the champion.

Altogether then, the number of possible ways to fill out a bracket is 2^32 * 2^16 * 2^8 * 2^4 * 2^2 * 2 = 2^(32+16+8+4+2+1) = 2^63. Another way to arrive at this number is to consider the fact that in total, 63 of the 64 teams have to lose a game over the course of the tournament, meaning there are 63 total games played. So, by the same reasoning as before, the number of outcomes of these games is 2^63 = 9,223,372,036,854,775,808, about 9.2 * 10^18. In words, that's 9.2 quintillion, a.k.a. 9.2 million trillion.

How big is that number?

Well, if you were to write out each bracket prediction on a sheet of standard paper (thickness = .1mm), you would have a stack of paper (9.2 * 10^18 * .1 * .001) = 9.2 * 10^14 meters tall, which is about 6200 times the distance from the Earth to the Sun. If you stole 1 paperclip (mass approx. 1 gram) for each sheet, you'd end up with 9.2 * 10^18 * 1 * .001 = 9.2 * 10^15 kilograms of paperclips, which is the equivalent weight of 46 billion blue whales. Every person on Earth could have 7 blue whales' worth of paperclips. If, instead of working, you filled out 1 bracket per second, 24 hours a day, it would take you (9.2 * 10^18)/(60*60*24*365) = about 290 billion years. You'd be so busy filling out brackets that you wouldn't even notice that the tournament had ended billions of years ago, along with all of human civilization! (Probably.)
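
If you'd like to check that arithmetic yourself, here's a back-of-the-envelope sketch in Python; the physical constants (0.1 mm paper, 1-gram paperclips, a 150-million-kilometer Earth-Sun distance, 200-tonne blue whales) are the same rough round numbers I used above:

```python
brackets = 2 ** 63                       # about 9.2 * 10^18 possible brackets

stack_m = brackets * 0.1e-3              # 0.1 mm of paper per bracket, in meters
print(stack_m / 1.5e11)                  # ~6200 Earth-Sun distances

paperclips_kg = brackets * 1e-3          # 1 gram per paperclip, in kilograms
print(paperclips_kg / 2e5)               # ~46 billion blue whales

years = brackets / (60 * 60 * 24 * 365)  # one bracket per second, around the clock
print(years)                             # ~2.9 * 10^11 years
```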

Now, does that mean your odds of picking a perfect bracket are 9.2 quintillion-to-1? Hold on just a second there, shooter. That would only be correct if you were assuming that all possible brackets were equally likely to occur. However, with a little thought, we can probably improve on that assumption considerably. For example, in the 24 years since the tournament was expanded to include 64 teams, a #1 ranked team has never lost in the first round to a #16 seed. So, even if we take the ranking as the only meaningful predictor, we can say that some brackets, namely those in which the high-ranked teams always win, are more likely than others.

Using data I found on www.hoopstournament.net, I was able to estimate the probabilities of the higher-ranked team winning each game in the tournament based on the results of past years (remind me sometime to talk about estimating bias, will you?), and my guess at the probability of the "no upsets" bracket is 1.64 * 10^-11, so the odds against the occurrence of this bracket are "only" about 6.1 * 10^10, or 61 billion-to-1. Since we're assuming that it's always more likely for the higher-ranked team to win any given game, this bracket is our best hope. Those odds are still pretty long, though. For comparison, let's say your ATM code were reset to a random 4-digit number; you'd be about as likely to win the bracket challenge as you would be to win the Powerball and then guess the new PIN on the first try (presumably to deposit your winnings).

Many of you sports fans out there are no doubt hurling expletives at the screen right about now, because, as you know, there are always some upsets. I mean, who could forget Coppin State over South Carolina in 1997, right? There's a bit of an odd paradox here: even though the "no upsets" tournament prediction is the most likely (based on our assumptions), the number of upsets is more likely to be greater than 0 than otherwise. In fact, the most likely number of upsets over the course of the tournament, according to my predictions, is about 18. (Note: I'm including "upsets" like the #9 seed beating the #8 seed--an event that has happened about as often as it hasn't over the years.)

So, the most likely scenario is that there are no upsets, and yet the most likely number of upsets is 18. How can both statements be true? The answer lies in the fact that we're not being asked to predict just the number of upsets, but the actual sequence of events:

Consider the simpler example of rolling a 6-sided die. Let's assume all faces are equally likely, but let's say we really don't want any sixes, and that's all we care about. We'll label the six-side as "L" and the other 5 sides as "W." So, the chance of winning any particular roll is 5/6. Not bad. Now let's imagine rolling this die 60 times. Since the most likely occurrence in each roll is that we'll win, the most likely sequence of 60 rolls is a sequence of 60 wins, with probability (5/6)^60, about 1.77 * 10^-5, roughly 1 in 57,000. If we put a loss in the sequence in any location (first, second, third, etc.), the probability would go down to (5/6)^59 * (1/6) = 3.54 * 10^-6, which is 5 times less likely. However, according to basic probability, the number of losses would follow what's called a binomial distribution, which has its peak (the most likely number) at the value equal to the number of rolls times the probability of losing each one. In our setup, that's 60*(1/6) = 10. If we repeated this experiment a bunch of times, the most common result would be that we'd lose 10 times out of 60.

What's going on here is that although the chance of any particular sequence of dice-rolls containing a loss is less than the perfect "all wins" sequence, there are so many more possible sequences with a loss--in fact, there are 60, versus only the 1 perfect sequence--that their probabilities accumulate to make the chance of having a loss somewhere greater than not having one anywhere. And continuing in this manner, you can see that the number of sequences containing 2 losses is even greater, despite the fact that each one is even less likely, and so on. Eventually, the two effects balance each other and the combined probability maxes out at 10 losses.
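
Here are those dice numbers worked out in a short Python sketch (nothing fancy, just the formulas above):

```python
from math import comb

p_loss, n = 1 / 6, 60

for k in (0, 1, 10):
    one_sequence = p_loss ** k * (1 - p_loss) ** (n - k)  # any ONE specific sequence with k losses
    any_sequence = comb(n, k) * one_sequence              # probability of exactly k losses total
    print(k, one_sequence, any_sequence)

# k = 0:  the single most likely sequence (~1.77e-5), but there's only one of it
# k = 10: each individual sequence is far rarer, yet there are so many of them
#         that 10 losses is the most likely total count
```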

But that's not what we're being asked to predict! In order to win Yahoo!'s bracket tournament or just your office's measly little pool with a perfect bracket, you'd have to predict the precise sequence of wins and losses. People resist picking the most likely bracket because it doesn't "look random enough,"* but really they're just reducing their chances by several orders of magnitude.

In a related experiment, if a person is presented with a flashing light that can be either red or green but is red twice as often, and the subject is asked to predict the color of the next flash, most people will "mix it up," mostly guessing red but occasionally guessing green, because they feel that their predictions need to match the "pattern" of occasionally being green. They're correct to think that it's likely that the light will sometimes be green; their mistake is assuming that they can improve their odds by guessing any particular realization of that notion. As a result, they generally do worse in the experiment than lab rats, who will, reasonably enough, just keep betting on red and winning 2/3 of the time.

And that's why I'm going with the rat strategy: Louisville, Connecticut, Pittsburgh, and North Carolina in the Final Four, baby!**

-DrM


*Or perhaps they feel that they have more information about a given game than the rankings would indicate. For example, a star player could have been recently injured, or a particular team's playing style might be well-suited to beating an otherwise better team. That being said, I think it's more common for people to just pick the teams they want to win and find some way to justify that prediction.

**Personal note: Once, as a nerdy kid who knew nothing about basketball, I employed this strategy in a middle school bracket challenge, picking all four #1 seeds to advance. And as it happened, I got 3 out of 4 right, more than any of my classmates, because they were all clinging to their inferior human-brained approaches. Take that, millions of years of evolution!

Friday, March 13, 2009

You Can't Handle the Lack of Falsehood!

Short Round is at it again, calling me out to challenge me to a good old-fashioned Math-Off.* He's laid down his best moves, throwing out verbal accusations about his old grandpappy and comparing me to Joaquin Phoenix (?), and now it's time for me to step up and put him in his place. The subject at hand was the word "autological":

Q. What does autological mean?
A. Huh, that's sort of a random question. A word is autological if it has the property it denotes. For example, polysyllabic, being itself polysyllabic, is autological because it can be used to describe itself. Other examples include unhyphenated, pronounceable, abstract, nonpalindromic, and adjectival. The opposite is heterological: the word monosyllabic is polysyllabic and therefore not autological, but/therefore it is heterological. (Most words are going to be heterological, like hairy and well-dressed.) Got it?
Q. Yeah, actually I knew that.
A. What, are you testing me?
Q. No, I have a question about the word and wanted to sort of, you know, set it up first.
A. Is the question whether autological is autological?
Q. Why, yes! That's exactly what I was going to ask.
A. Well, let's take a look. First let's see whether heterological is heterological. That question is basically a restatement of Russell's Paradox and therefore maybe more the domain of Dr. Math, but as my old grandpappy used to say, "F__k Dr. Math!"

We'll see about that!**

So, Shorty, you're right about the connection to Russell's Paradox. In fact, it's this very kind of question that almost brought all of math tumbling down in the early part of the 20th century, although it seems to have mostly recovered now. The real issue at stake here is the idea of sets, and what possible sets you're allowed to construct. When you get down to it, math is really built on a foundation of sets; numbers themselves are just sets in disguise--for example, 0 is actually the empty set, 1 is the set containing the empty set as an element, and so on. But the word "set" is never formally defined, and can't be. If you think about it, this is the only possible way to avoid any circular logic; if "set" were defined in terms of some other words, then those words would themselves need definitions, etc., until we eventually either had an infinite number of words or something was defined in terms of "sets" again. In English this is OK, because we can eventually just point to things and say "I mean that!" but in math we don't have that luxury. You just try pointing to a set sometime.

Many people used to take the point of view that you could always construct a set by specifying the properties of its elements. So, for example, I could say, "Let A be the set of all whole numbers between 1 and 10," or "Let B be the set of all people with moustaches," and those sets would make sense. However, the problem occurs when you start trying to construct sets with self-referential properties; for example, you might like the set itself to be a member of itself. Some sets, like the sets A and B above, are clearly not members of themselves (a set can't grow a moustache), but maybe there are sets that could be, like the "set of all things that aren't made of glass." Unless sets are made of glass (which they aren't), that set would be a member of itself. So far, there's no real paradox, just some weird-looking sets. However, if you now construct "the set of all sets that are not members of themselves," we have an honest-to-Gödel paradox on our hands: Suppose the set exists, and call the set S. The question is, Is S a member of itself? If it is, then it cannot be, because it fails the test for membership; therefore, it's not, so it is. And it goes on like this, ad infinitum.

If all of this seems like abstract nonsense to you (you're probably right, but) consider the following alternate formulations of the paradox and similar paradoxes:

--Imagine that all the librarians in the country are asked to compile lists of the books in their respective libraries. Since these lists are written in book-form and are kept in the library, some librarians include the catalogue among their lists of books; others don't. Now, imagine that all the librarians send the catalogues to a central library in Washington, where the head librarian makes a master catalogue of all those catalogues which don't include themselves. Should the master catalogue include itself?
--For each number, we can write out statements describing the number and find the statement that uses the fewest syllables. Now, let n be the number described by the statement, "the least number not describable using fewer than twenty syllables." Since this description only contains 19 syllables, n is describable using fewer than twenty syllables, which contradicts the very description that defined it.
--The barber of Seville, who is a man and is shaven, shaves all men in town who do not shave themselves, and only those men. Who shaves the barber? It can't be someone else, because then he wouldn't shave himself, so he would have to shave himself, but it can't be him, either.
--Groucho Marx once said he wouldn't want to join any club that would have him as a member. Fine. But what if he joined every club that wouldn't have him? Would he be a member of Texarkana Country Club? If yes, then no; if no, then yes.
--Your question about whether the word "heterological" is heterological, meaning not possessing the quality it denotes, can be rephrased in terms of sets as follows: Let W be the set of all words which do not possess the quality they denote; a word is heterological if it is a member of W. So the question is, then, does W contain the word meaning "is a member of W"? If so, then that word doesn't belong in W, therefore it does belong, etc.

In any event, "Uh-oh" is the correct answer.
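
If you'd rather watch the paradox spin its wheels than read about it, here's a toy Python sketch; the mini-dictionary of "properties" is completely made up for illustration:

```python
# Each word maps to the property it denotes, expressed as a test on a word.
properties = {
    "short": lambda w: len(w) <= 5,   # "short" means: has at most 5 letters
    "long":  lambda w: len(w) > 5,    # "long" means: has more than 5 letters
}

def is_heterological(word):
    """A word is heterological if it does NOT have the property it denotes."""
    return not properties[word](word)

print(is_heterological("short"))  # False: "short" (5 letters) is short, so it's autological
print(is_heterological("long"))   # True:  "long" is not long, so it's heterological

# Now try to add "heterological" itself to the dictionary...
properties["heterological"] = is_heterological
print(is_heterological("heterological"))  # never returns: Python gives up with a RecursionError
```

That last line just keeps calling itself forever, which is about as close as a computer gets to saying "neither true nor false."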

A related paradox is the so-called "liar's paradox," also called the Epimenides paradox, after the famous stand-up comedian who popularized it around 600 B.C. The joke goes like this: A Cretan walks into a bar and says "All Cretans are liars!" Rim-shot. If the guy's telling the truth, then he's lying, and vice versa. A fun variant that I like to use at parties (courtesy of Raymond Smullyan) is this: Put a dollar bill in one of two envelopes. On the first envelope, write

1) "The sentences on both envelopes are false."

On the second envelope, write

2) "The dollar bill is in the other envelope."

The first claim can't be true, by the same reasoning as the liar's paradox (if it were true, then it would be false, etc.). So therefore, it must be false, right? That means that one of the two statements is true, but, again, it can't be the first one, so it must be the second. Thus, the dollar bill is in the first envelope. Imagine your guests' surprise when they open the first envelope and find it empty (because you put the money in the other envelope!)! Oh, they laugh and laugh.

Another related paradox is called "Newcomb's paradox," which goes like this: Suppose you're going on a gameshow, and you'll be presented with 2 boxes, box A and box B. Your choices are:

1) take box A and box B
2) take box B only

You are told that box A contains $1000, and that box B either contains $0 or $1 million. Seems like an obvious choice, right? Either way, you stand to gain by taking both boxes. Here's the catch, though: the host tells you that somehow (either by clairvoyance, psychological observation, or computer simulation) the producers of the show knew ahead of time what decision you were going to make, and changed the contents of box B as a result. If you were going to just take B, they put the million inside; if you were going to take A and B, they put nothing inside B, so you'll just get $1000. So, now what? Assuming you believe their claim, how do you act?

Now, I wouldn't be writing all this if there weren't a way out of this mess. I'm not ready to give up on math just yet. Just to warn you, though, the resolution can be scarier than the paradoxes. Ready, then? Here goes: the essential answer to all of these puzzles is "You CAN'T!"

That is, the mathematical wisdom that we now hold to, according to the hallowed Zermelo-Fraenkel axioms (sort of the owner's manual of math), is that there are some sets you just can't construct. So, for example, the set of all sets which don't contain themselves just isn't a set. It's nothing. Even though it's described in language reminiscent of sets, it does not exist. The librarian can't make a list of all catalogues not referencing themselves; there's no such thing as the least number not describable in fewer than twenty syllables; the barber can't shave everyone who doesn't shave himself; Groucho can't join every club that won't allow him; and it's impossible to consider all words which don't refer to themselves. The producers of The Newcomb's Paradox Show can't (reliably) claim to have predicted your actions, because whatever simulation they had of your brain could not have included the fact that you knew that they had simulated it! Or if it did, then it didn't include the fact that you knew that it knew that you knew... Again, to put things in the language of set theory, the set of things they claim to know about you cannot contain itself.

A consequence of all of this is that there are some logical statements that are neither true nor false. For example, the statement "the word 'heterological' is heterological" is neither true nor false, because assigning it either truth value (True or False) would be inconsistent. Similarly, the statement "the word 'autological' is autological" is neither true nor false, because either truth value would make it consistent with the axioms. If it were true, then it would be true; if it were false, then it would be false. Starting from the axioms of set theory, we could never hope to prove either statement. So in any logical sense, they are meaningless. And this, by the way, resolves the envelope problem as well, since the flaw in our deduction above was the assumption that statement 1 had to be either true or false.

This seems to fly in the face of everything we've been taught about truth and falsehood--namely, that statements are always either true or false. But consider for a moment, What if that statement is neither true nor false? Whoa.

In fact, we're not out of the woods yet, because, as Gödel showed in 1931, the consistency of the axioms themselves is not provable from the axioms. So it might happen that two perfectly logical deductions could lead to inconsistent conclusions--we can't prove that it's impossible. Later, he died of starvation because he was convinced that someone was trying to poison him. So, perhaps the best advice I can give is just not to think about these things too much.

As Tom Cruise said in A Few Good Men, "It doesn't matter what I believe [to be consistent with the Zermelo-Fraenkel axioms], it only matters what I can prove [as a consequence of the Zermelo-Fraenkel axioms]!"

-DrM


*A tradition dating back to ancient Greece, when Plato used to take on all challengers with his legendary taunt, "αγεωμετρητος μηδεις εισιτω" (roughly, "Let no one ignorant of geometry enter").

**Listen, what happened between me and your grandfather (and grandmother) happened a long time ago and is really none of your business.

Sunday, March 8, 2009

Prob(Rain and not(Pour))=0

Dear Doctor Math,
What does 40% chance of precipitation mean?
Dr. Anonymous, Ph.D., J.D.

Dear Dr. Anonymous,

This is a tricky one, and it confuses a lot of people. First, some wrong answers, courtesy of the Internet:

1) "It will be expected to rain, but the significant chance of rain occurs in only 40 percent of the studied geographical area."

2) "It will definitely rain, but only for 40% of the day."

3) "4 out of 10 meteorologists think it will rain."

4) "40% or below means it won't rain; 70% and above means it will."

5) "There is a 40% chance of precipitation somewhere in the forecast area at some point during the day."

Nope; nope; nope; definitely not; almost, but no. Who would have thought there could be so many different incorrect ways to interpret such a simple statement? We can dismiss numbers 3 and 4 right away--meteorologists aren't generally in the business of taking surveys of each other, and they wouldn't bother with the whole percentage thing if they had a definite opinion either way. The others are tempting, but none get it exactly right. Don't fret if you found yourself agreeing with one of them, though; according to this study in the journal Risk Analysis, respondents from five major cities all over the world overwhelmingly preferred numbers 1, 2 and 5 to the correct answer, which, according to The National Weather Service and the Canadian Weather Office, is this:

"The probability of precipitation (POP) is the chance that measurable precipitation (0.2 mm of rain or 0.2 cm of snow) will fall on any point of the forecast region during the forecast period."

In your example, a 40% chance of rain means the forecaster has determined that the probability is 40% that you will get rained on at some point during the forecast period, usually a day in length (some forecasts include hour-to-hour predictions, but the same idea applies--either way, interpretation 2 is out). But even that's kind of a circular definition--a 40% chance means the chance is 40%--and it doesn't explain where the numbers come from in the first place. To unpack things a little more, we should talk about how weather prediction works and why uncertainty is necessarily involved.

To begin with, meteorologists are always collecting data, tons and tons of it, using extremely sensitive instruments both on the ground and in the upper atmosphere, via weather balloons. These instruments measure a whole array of weather conditions including temperature, humidity, wind speed and direction, pressure, dew point (whatever that is), and others. Then, the meteorologists feed all of that information into huge computer simulations, which use a combination of historical data and physical models (for example, equations of fluid dynamics) to predict the course of events over the next few days. Presumably, the models include the movements of butterflies, because, as I understand it, they are the source of basically all weather phenomena. Why the uncertainty, then? With all that super technology, why can't they just predict the future?

Well, as I discussed in my previous post about the Monty Hall problem, there are many complex systems in the world which perhaps could be described exactly by massive systems of equations but which we as humans lack the computational power to fully comprehend. And weather is right up there with the most complex systems around. Part of what makes it so difficult to predict is its susceptibility to non-linear feedback, meaning the evolution of the system depends very sensitively on its current state. As a result, even minuscule differences in initial conditions will, with enough time, accumulate into larger differences and eventually produce radically big differences in the outcome. To give you some perspective, scientists have known since the 1960s that even a difference of 0.000001 in initial measurements could lead to vastly different weather conditions after a period of only a week. After a few weeks, you can basically forget about it--in order to predict the weather in any meaningful way, you'd need a complete understanding of the position of every atom in the universe. In the lingo of the 1990s, weather systems are a classic example of chaos, due to their sensitive dependence on initial conditions. Jeff Goldblum explained it all before almost being eaten by a T-Rex.

What makes this sensitive-dependence business such a bummer is that even the most expensive weather-measurement instruments necessarily have some amount of error. For example, a device that measures humidity might detect the presence of water, maybe even down to the last molecule, but that doesn't mean that other patches of nearby air have that same water content. The instrument is kind of like a pollster, asking the air how it's planning to "vote" in the upcoming "election," but it can't "poll" everyone, because there are something like 10^24 "registered voters" in every cubic foot of air. And let's face it, some of them are just "crazy." Failing to account for even a small number of these leads to enough initial uncertainty that prediction becomes more or less impossible. One of the more disturbing things I came across in the course of researching this post was the advice to meteorologists that they should speak in certainties, because "People need to plan for what they need to do. People do not like to be unsure of what will happen." Well, I'm sorry, people, but that's just life.

The output of all these models, then, is just an estimate of the plausibility of the occurrence of rain, just as the statement "the chance of rolling a fair die and getting a 3 is 1/6" is an estimate of that event's plausibility. With enough identical trials, one could perhaps judge whether the estimate had been correct, since the frequency of occurrence should converge to its probability (according to the Law of Large Numbers). However, that whole idea is kind of inapplicable here, because it's impossible to observe multiple independent instances of the same exact weather conditions for the same place at the same time. What would it even mean to say, "Out of 100 instances when conditions were exactly like this in New York on March 8, 2009, it rained in about 40"?* One of the persistent fallacies regarding weather prediction is that it is frequently "wrong." But how can an estimate of uncertainty be wrong? Even unlikely events do occasionally occur--consider 10 flips of a coin; any 10-flip sequence has probability 1/1024 of occurrence, but one of them has to happen. It doesn't mean the probability estimate was wrong. Bottom line, I'll take the National Weather Service over my uncle's "trick knee" any day.

So, the complexity of weather and the imprecision of measurement is one source of uncertainty, but there's actually another one: the weatherman/weatherwoman doesn't know where you are in his/her forecast area. See, the weather forecast covers a fairly large area (like a whole zipcode), and inside that area there are lots of measurement stations, each of which could detect precipitation during the day. Even if a meteorologist knew for a fact that it was going to rain in exactly 40% of the area (and even knew where, too), he/she still would tell you the chance of it raining on you was 40%, since as far as he/she knows, you could be anywhere in town. In a way, interpretation 1 above is a possibility, although it's not by any means the only one. For example, on the other extreme, it could be that the weatherperson thinks that there's a 40% chance it's going to rain everywhere and a 60% chance it's not going to rain anywhere, in which case it doesn't matter at all where you are--your chance of getting rained on is 40%. I can't climb inside Al Roker's head (unfortunately) nor do I have access to his computer models, but most likely, I think the truth is some mixture of these ideas--the 40% figure accounts for both the probability of rain at all and the distribution of rain if it occurs.

This, by the way, is what's wrong with interpretation 5 above, in case you were wondering; there's a subtle difference between saying "the probability of rain at any given point" versus "the probability of rain at some point," assuming that it's possible that it could rain in some places and not rain in others. Think of it this way: if I randomly chose to send an Applebee's gift card to one of my parents for Purim, the probability of any given parent (either Mom Math or Dad Math) getting the card would be 50%, but the probability of some parent getting the card would be 100% (I'm definitely sending it to someone). Another way to phrase it would be to say that if I picked a parent at random after giving out the card, the chance that I would pick the person with the card would be 50%.
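
To make the distinction concrete, here's a toy simulation of two extreme ways a "40% chance" could come about; both rain models are invented purely for illustration:

```python
import random

trials = 100_000
rained_on_you = [0, 0]     # did you, personally, get wet?
rained_somewhere = [0, 0]  # did anyone in the forecast area get wet?

for _ in range(trials):
    you = random.random()  # your (unknown) position in the area, scaled to [0, 1]

    # Model A: it rains every single day, but only over the western 40% of town
    wet_a = you < 0.4
    rained_on_you[0] += wet_a
    rained_somewhere[0] += 1           # someone always gets rained on

    # Model B: on 40% of days it rains everywhere, on the other 60% nowhere
    wet_b = random.random() < 0.4
    rained_on_you[1] += wet_b
    rained_somewhere[1] += wet_b

print([c / trials for c in rained_on_you])     # both come out around 0.40
print([c / trials for c in rained_somewhere])  # 1.0 for Model A, about 0.40 for Model B
```

Either way, the forecast for a randomly located listener is "40% chance of rain," but the chance of rain somewhere in the area is completely different.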

In the end then, a more precise definition of what it means to say, "There's a 40% chance of rain" would be something like:

"Given all available meteorological/lepidopteric information, subject to measurement error and uncertainty, I estimate the chance of a location selected uniformly at random from the forecast area receiving any measurable precipitation during the day at 40%."

Maybe a little too wordy for morning drive-time, though.

-DrM

*Side note: it did actually rain today, and Dr. Math chose not to bring an umbrella despite a forecast of "80% chance of rain." Sometimes you have to roll the hard 6.

Tuesday, March 3, 2009

Wholly Whexagons, continued

Last time, on Ask Doctor Math:

"... it's true that hexagons do make an appearance in a large variety of different contexts."

"Primarily, I'm referring to regular hexagons--hexagons with 6 equal sides..."

"... by putting a bunch of identical hexagons together as tiles, we can cover an entire plane surface."

"... this way of packing circles has the property of being optimal..."

"... hexagon madness..."


And now, the conclusion!

While regular hexagons have all those great qualities that make them perfect for the needs of bees and frat-boys alike, it happens that irregular (that is, not necessarily equal-sided) hexagons have a somewhat amazing property, too. That is, in some sense they're the average random polygon. What I mean here is that if we generate a random sectioning of a flat surface into a large number of polygonal shapes, the average number of sides per polygon will be about 6. (Side note: this, of course, doesn't mean that there actually are any hexagons; it could be that the shapes are composed of equal parts squares and octagons, say, but in practice, this is unlikely.)

First, we should be clear on some terminology: a "polygon," from the Greek "poly" = "many" and "gon" = "angle," is a collection of points connected together by line segments that enclose an area of the plane. The points are called "vertices" (singular "vertex," not "vertice"!), and the segments are called "sides" of the polygon. Notice that it's always true that polygons have the same number of sides as vertices. When we put some polygons together in the plane, their sides are now called "edges," except if two sides overlap each other, we only count that as one edge. Also, when vertices overlap each other, we only count that as one vertex. For example:


Now, before I demonstrate that hexagons are the average, I'll need to lay out a couple of house rules for generating these random polygons:

Rule 1.
All the vertices of the shapes line up with each other, so no brick-patterns, like this:



This is disallowed because it has a vertex of one brick in the middle of the side of another brick.

Rule 2. Every vertex is the junction of exactly 3 polygons (except for a few on the boundary, which we'll ignore).

Now, why are these reasonable? Well, the most common setting for this kind of random polygon-generation is the formation of cracks in some surface, like mud or peanut brittle:



(Note: we're approximating a crack here as a straight line, which takes some imagination.)

Rule 1 essentially states that no cracks spontaneously form in the middle of already existing sides; instead, they emanate from junctions between existing cracks. This is reasonable because a junction between cracks is likely to be a weaker point in the surface than any point along an existing crack. Another way to think about it is that cracks occasionally split into more than one crack, but when they do, both cracks generally change direction, instead of one continuing on like nothing had happened. In pictures, this:



is much more likely than this:



Similarly, rule 2 states that cracks tend to only split off into pairs. To see why that's reasonable, imagine if a crack were trying to split into 3 cracks. If any one of the three were just the slightest bit late to form, it would end up splitting off from one of the already existing 2 cracks, instead of the original one. In pictures, again, this:


is much more likely to occur naturally than this:


There is actually physics that could back me up here, but for now we'll just take these as givens.

It turns out that these same rules make sense in other settings, as well--for example(s), the formation of soap bubbles:



the shape of storm clouds, like this one on Saturn:



the sections on a tortoise shell:


and even France!



Can you spot the hexagons?

OK, before we get to crack apart soapy French turtles on Saturn, we need to take a little side trip to talk about a fundamental fact about polygons in the plane, called Euler's formula.

Euler's formula says that, subject to the rules above, no matter how we arrange a collection of polygons in the plane, the number of vertices minus the number of edges plus the number of polygons (also called faces) is a constant. For our purposes, the constant is 1.* In equation form,

V - E + F = 1

It actually makes quite a lot of sense if you think about it for a second. Imagine we only had one shape, a lonely little triangle all alone in the plane. So, V, the number of vertices, would be 3, as would the number of edges, E. And F, the number of faces, in this case a frowny face, would be 1:


Hence,

V - E + F = 3 - 3 + 1 = 1

Now, if we tacked on a friend for the triangle, for example a square, the resultant shape would have 5 total vertices, 6 total edges, and 2 (now smiley) faces:


So again, V - E + F = 5 - 6 + 2 = 1. Essentially, the square gobbled up 2 of the triangle's vertices and 1 of its edges, while adding 4 of both edges and vertices and 1 extra face. As a result, the quantity V - E + F stayed constant. By similar reasoning, you could convince yourself that the same would happen no matter what shape we tacked on. And we can now repeat the process by adding a third polygon, and a fourth, and so on, until we have a whole polygon party on our hands.

OK now, at long last, we come back to random shapes. In our random polygonal mix, let's call S the total number of sides the polygons have altogether (different from E because 2 sides overlap to form 1 edge). Polygons individually always have the same number of vertices and sides, and since each vertex is shared by 3 polygons, the net total number of vertices is S/3. Also, each edge is shared by 2 sides (except for a small number of boundary edges), so E, the total number of edges, is the same as S/2. Putting these into Euler's formula gives us:

S/3 - S/2 + F = 1

Equivalently, F = 1 + S/2 - S/3.

We can combine the S/2 - S/3 to get S/6, so we have:

F = 1 + S/6

Multiplying by 6 and dividing by F gives us:

6 = 6/F + S/F

Now if we imagine the number of faces being very large, this tells us that 6/F is very small, so S/F is very close to 6. In other words, the ratio of sides to polygons is about 6, so the average polygon is a hexagon!
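
If you want to see this happen without waiting for mud to dry, here's a sketch using scipy's Voronoi routine (assuming you have numpy and scipy lying around); carving the plane into the Voronoi cells of a bunch of random points obeys our two house rules, give or take the boundary:

```python
import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(0)
vor = Voronoi(rng.random((20000, 2)))  # random "cracks": Voronoi cells of 20,000 random points

# Keep only the bounded cells (ignore the ones that run off to infinity at the boundary)
sides = [len(region) for region in vor.regions if region and -1 not in region]

print(sum(sides) / len(sides))  # comes out close to 6 sides per polygon
```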

Next time you're out, keep your compound eyes peeled for hexagons, regular and otherwise, and someday you too can catch the hexagon madness!



-DrM

*Those in the know, take note: the reason the constant is 1 and not 2 is that I'm not counting the unbounded component as a face. However, it doesn't matter in the end, because the constant gets divided by F, so this same property would be true in a topological space with any Euler characteristic.