Tuesday, April 14, 2009

Nothing is Certain but Death and Logarithms

Dear Dr. Math,
I've heard that if I wanted to, ahem, "creatively adjust" some numbers, I should use numbers that start with the digit 1 more often. Why is that?
Inquiring Re. Statistics

Dear IRS,

How timely of you to bring this up! Indeed, there is a general pattern in the digits typically found in measured quantities, especially those spanning many orders of magnitude, for example: populations of cities, distances between stars, or, say, ADJUSTED GROSS INCOME. The pattern is that the digit 1 occurs as the leading digit of the number more often than any other, approximately 30% of the time, followed by the digit 2 about 18% of the time, and so on. The probability, in fact, of having a leading digit equal to d is log(1 + 1/d), where log means the base-10 logarithm, for any d = 1, 2, ..., 9. This rule is called Benford's Law, named (as is often the case) for the second person to discover it, who noticed that the pages of the library's book of logarithms were much dirtier, hence more heavily used, at the front of the book, where the numbers began with 1. In pictures, the distribution of digits looks like this:

[Figure: bar chart of the leading-digit probabilities log(1 + 1/d) for d = 1 through 9, decreasing from about 30% down to about 4.6%]
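If you'd like to generate those percentages yourself, here's a quick sketch in Python (any language with a base-10 logarithm will do just as well):

import math

# Benford's Law: the chance that the leading digit is d
# is log10(1 + 1/d), for d = 1, 2, ..., 9.
for d in range(1, 10):
    print(f"P(leading digit = {d}) = {math.log10(1 + 1/d):.1%}")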
It seems counterintuitive that any digit should be more likely than any other. After all, if we pick a number "at random," shouldn't it have the same probability of being between 100 and 199 as it does of being between 200 and 299, etc.? If so, the probability of getting a 1 as the first digit would in fact be the same as getting a 2. However, this turns out to be impossible, and it has to do with a very common misconception about "randomness."

The fact of the matter is that there's actually no way to pick a number uniformly at random without further restrictions. So, for example, if I tell you to "pick a random number," it must be the case that you're more likely to select some particular number than some other (which ones, however, are up to you). Suppose this weren't true, so that all numbers were equally likely. Just to be clear, let's focus on the positive integers, the numbers 1, 2, 3, .... Now let p be the probability of picking any one of them, say the number 1. Since they're all supposedly equally likely, p is also the probability of picking 2, and of picking 3, and so on. So the chance of picking some number between 1 and 10, say, is 10*p. Since probabilities can never exceed 1, this means p ≤ 1/10. OK, well, by the same reasoning, the probability of picking a number between 1 and 1000 is 1000*p, so p ≤ 1/1000. Similarly, p ≤ 1/1,000,000, p ≤ 1/(1 googol), and so on. In fact, it follows that p ≤ 1/N for any N, and the only non-negative number with that property is p = 0. Ergo, the chance of getting any particular integer is 0, from which it follows (for reasons I won't get into here) that the probability of picking an integer at all is 0, a "contradiction." That's math-speak for "whoops." You can only pick an integer uniformly from a finite set of possibilities.
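For those who like their arguments compact, here is the same reasoning in symbols (nothing new, just the steps above in LaTeX notation):

\[
P(\{n\}) = p \ \text{for all } n
\quad\Longrightarrow\quad
Np = P(\{1, 2, \ldots, N\}) \le 1
\quad\Longrightarrow\quad
p \le \frac{1}{N} \ \text{for every } N,
\]

so p = 0, and then P({1, 2, 3, ...}) = 0 + 0 + 0 + ... = 0, rather than 1.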

So, what do we mean when we say that a number is "random"? Well, there are ways for things to be random without being uniformly random. For example, if you roll a pair of dice, you might say the outcome is "random," but you know that the sum is more likely to be 7 than it is to be 2. Similarly, if you pick a person (uniformly) randomly from the population of the U.S. (note: the population is finite, so that's OK), you might model his/her IQ as a random quantity with a normal distribution, a.k.a. a "bell curve," centered around 100. The existence of different distributions besides the uniform distribution is the source of a lot of popular misunderstandings about statistics.
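If you don't believe me about the dice, here's a quick simulation, sketched in Python (the exact probabilities are 6/36 for a sum of 7 and 1/36 for a sum of 2):

import random
from collections import Counter

# Roll two dice 100,000 times and tally the sums.
rolls = 100_000
counts = Counter(random.randint(1, 6) + random.randint(1, 6)
                 for _ in range(rolls))

for total in range(2, 13):
    print(f"sum = {total:2}: {counts[total] / rolls:.1%}")

# A 7 comes up about 16.7% of the time; a 2, only about 2.8%.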

None of that explains where Benford's Law comes from, of course, but it's at least an argument why it's plausible that the distribution isn't uniform. To explain the appearance of the particular logarithmic distribution of digits I wrote above, we'd need some kind of model for the quantities we were observing, and it can't just be "the uniform distribution on the positive integers," because we already showed that there's no such thing.

One reasonable idea is that the thing we're measuring might be "scale invariant." That is, if it has a wide range of possible values, it might not matter what size units we use to measure it--we'll get roughly the same distribution of numbers. So if we imagine switching from measuring lengths in feet to measuring them in "half-feet,"* say, then anything that gave us a foot-length starting with 1, say 1.2 feet or 1.8 feet, will now give us a half-foot length starting either with 2 or 3, in this case 2.4 and 3.6 "half-feet." If the two distributions are the same, then the occurrence of a first-digit 1 must be the same as the occurrence of a first-digit 2 or 3, combined. By the same reasoning, any quantity initially beginning with a 5, 6, 7, 8, or 9 would now begin with a 1, when doubled. Similarly, by tripling the scale, measuring in "third-feet" and assuming the same invariance, we'd get a 1 as often as a 3, 4, or 5 put together. And so on. By considering every possible scale, this line of reasoning leads you pretty much straight to Benford's Law. This scale invariance kind of makes sense if we're measuring ADJUSTED GROSS INCOME, since incomes vary by so much (so very, very much), whereas something like height wouldn't exhibit scale invariance, being more tightly distributed around its mean.
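You can check that the logarithmic distribution really does satisfy all of these constraints at once; here's a sketch in Python:

import math

def p(d):
    # Benford probability of leading digit d
    return math.log10(1 + 1/d)

# Doubling the scale: P(1) = P(2) + P(3), and also P(1) = P(5) + ... + P(9).
print(p(1), p(2) + p(3))                      # both about 0.301
print(p(1), sum(p(d) for d in range(5, 10)))  # both about 0.301

# Tripling the scale: P(1) = P(3) + P(4) + P(5).
print(p(1), p(3) + p(4) + p(5))               # both about 0.301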

Another perspective is that when we measure things, we're frequently observing something in the midst of an exponential growth. Exponential growth happens all the time in nature, for example, in the sizes of populations or SECRET OFFSHORE BANK ACCOUNTS with a fixed (compound) interest rate. The key feature of a quantity growing exponentially is that it has a fixed "doubling time." That is, the amount of time it takes to grow by a factor of 2 is independent of how big it is currently. For example, let's assume your illegal bank account (well not yours, but one's) doubles in value every year and starts off with a balance of $1000. At the end of year 1, you'd have $2000, at the end of year 2 you'd have $4000, at the end of year 3 you'd have $8000, and so on. So for the whole first year, your bank balance would start with the digit 1, but during the second year you would have some balances starting with 2 and some with 3. During the third year, you would have balances starting with 4, 5, 6, and 7. If we AUDITED your account at some randomly chosen time, we'd be just as likely to see a balance starting with 1 as a balance starting with 2 or 3, combined, and so on. In other words, we have the same "scale invariance" conditions as before, which lead us back to Benford's Law. The same would be true no matter how quickly the account grew; exponential growth sampled at a random time gives us a logarithmic distribution of digits.
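To see that claim in action, here's a sketch in Python that AUDITS a doubling account at uniformly random times (the starting balance and the 30-year horizon are made-up details, of course):

import random
from collections import Counter

trials = 100_000
counts = Counter()
for _ in range(trials):
    # Sample the account at a uniformly random time in its first 30 years;
    # the balance starts at $1000 and doubles every year.
    t = random.uniform(0, 30)
    balance = 1000 * 2 ** t
    counts[int(str(balance)[0])] += 1  # tally the leading digit

for d in range(1, 10):
    print(f"{d}: {counts[d] / trials:.1%}")  # close to log10(1 + 1/d)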

To give you a concrete example, I went through the first 100 powers of 2--1, 2, 4, 8, 16, ...**--and instructed my computer to keep track of just the first digits. The results, as you can see, conform pretty nicely to Benford's Law:

[Figure: bar chart of leading-digit counts for the first 100 powers of 2, closely matching the Benford distribution]
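In case you'd like to reproduce the experiment, here's a sketch in Python:

from collections import Counter

# Tally the leading digits of the first 100 powers of 2: 2^0 through 2^99.
counts = Counter(int(str(2 ** n)[0]) for n in range(100))

for d in range(1, 10):
    print(f"{d}: {counts[d]} out of 100")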
For whatever reason, it appears that Benford's Law, like TAX LAW, is the law.

-DrM


*Sounds vaguely Tolkienesque, don't you think?

**32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648, 4294967296, 8589934592, 17179869184, 34359738368, 68719476736, 137438953472, 274877906944, 549755813888, 1099511627776, 2199023255552, 4398046511104, 8796093022208, 17592186044416, 35184372088832, 70368744177664, 140737488355328, 281474976710656, 562949953421312, 1125899906842624, 2251799813685248, 4503599627370496, 9007199254740992, 18014398509481984, 36028797018963968, 72057594037927936, 144115188075855872, 288230376151711744, 576460752303423488, 1152921504606846976, 2305843009213693952, 4611686018427387904, 9223372036854775808, 18446744073709551616, 36893488147419103232, 73786976294838206464, 147573952589676412928, 295147905179352825856, 590295810358705651712, 1180591620717411303424, 2361183241434822606848, 4722366482869645213696, 9444732965739290427392, 18889465931478580854784, 37778931862957161709568, 75557863725914323419136, 151115727451828646838272, 302231454903657293676544, 604462909807314587353088, 1208925819614629174706176, 2417851639229258349412352, 4835703278458516698824704, 9671406556917033397649408, 19342813113834066795298816, 38685626227668133590597632, 77371252455336267181195264, 154742504910672534362390528, 309485009821345068724781056, 618970019642690137449562112, 1237940039285380274899124224, 2475880078570760549798248448, 4951760157141521099596496896, 9903520314283042199192993792, 19807040628566084398385987584, 39614081257132168796771975168, 79228162514264337593543950336, 158456325028528675187087900672, 316912650057057350374175801344, 633825300114114700748351602688

11 comments:

Anonymous said...

I must say that in this particular case I like the explanation in the German Wikipedia somewhat better. So for anybody who'd like to read it (and understands German), go ahead and read the article, which is marked as »worth reading«.

Sorry, Dr. Math, I really appreciate reading this blog, but this time I found the article over at Wikipedia to be better. ;)

Greetings,
Drizzt

Richard said...

I enjoyed the discussion of Benford's law presented here by Dr. Math. It has one advantage over the German Wikipedia article for me: it is written in English. I myself have analyzed large lists of mathematical and scientific constants and other piles of data just to rediscover Benford's law for my own entertainment, so the topic is of personal interest.

Most of us appreciate your entertaining exploration of math and math concepts. Please keep up the good work.

Matt said...

The German wiki article may be great (no idea, sadly, since I don't speak German), but the English one wasn't -- too technical for a lay audience, not technical enough for a technical audience (IMO, obviously). The point of this blog is to convey mathematical ideas in simple terms, so I think Dr M's presentation is pretty good.

That said... Dr Math, may I -- as one Dr M to another -- comment on your scale invariance "derivation"? You mention a conversion to half-feet; you could push this example a little further, since it alone gives a good qualitative feel for the distribution. Divide the numbers into four groups: those starting with 1, those starting with 2 or 3, those starting with 4, and the rest. As you showed, the first two are the same size, and both are the same size as the last one. Given how many digits are encompassed by each set (1, 2, and 5, respectively), it's clear that the number of occurrences decreases as the digit increases. This means that the remaining set (numbers starting with 4) is probably fairly small, which means that each of the three equal-sized sets is a little less than a third of the total.

You can even get some bounds on that "little less than a third" estimate. Let x be the size of the three equal-sized sets. Assuming that the distribution is monotonically decreasing, the size of the set of numbers starting with 4 is smaller than the size of the set of numbers starting with 3, which, in turn, is less than x/2 (since the numbers starting with 2 and 3 total x, giving an average of x/2 for each). Similarly, it's also greater than the size of the set of numbers starting with 5, which, in turn, is greater than x/5. So, the size of this "extra" set is somewhere between x/5 and x/2. That seems like a wide range, but... the total then lies between x + x + x/5 + x = (16/5)x and x + x + x/2 + x = (7/2)x. Since the total is fixed, we get that x must be between 2/7 (28.6%) and 5/16 (31.3%) of the total. Not a bad estimate, for such a simple argument, based on a single rescaling!
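Here's a trivial check of those bounds against the true Benford value, sketched in Python:

import math

p1 = math.log10(2)        # the true Benford frequency of leading digit 1
print(2/7 <= p1 <= 5/16)  # True: 28.6% <= 30.1% <= 31.3%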

I just thought that was kinda cool.

Matt said...

Oh, PS: I like the line "named (as is often the case) for the second person to discover it".

When you wrote this, were you thinking of the quip about theorems being named after the first person after Euler to discover them?

(And do you know who said that?)

drmath said...

Thanks for the feedback, everyone!

Drizzt, if some part of the German article is particularly good, maybe you could give us a translation? I had to pass a German reading exam to get my doctorate, but I'm pretty rusty now.

Matt, thanks for that derivation! I couldn't think of a good way to get from the scale-invariance property to the actual distribution without doing some more complicated math. But your reasoning gives a good estimate in only a few steps!

I don't know that quote, but it certainly sounds true. The phenomenon of things being named after the wrong people seems to be very common, to the point of almost being a joke. For example, my all-time favorite theorem, Stokes' Theorem, was named neither for the person who discovered it nor the person who proved it, but for the person who transmitted it from one to the other.

I guess it works out well in some contexts, though, because otherwise half of math would be named after Euler and the other half after Cauchy. :)

Anonymous said...

see also: Stigler's law.

Anonymous said...

@Richard, Matt: I was quite aware that most readers wouldn't be able to read the German article, but thought I'd share it anyway for anybody who does speak German. So my comment wasn't meant to suggest that Dr. Math's explanation isn't good, just that I find the German WP article better this time. »Better« doesn't necessarily imply the other thing is »bad«.

@Matt: In regard to the English article in the WP, I share your sentiments, and would extend that statement to a lot of other articles. I'd say that, generally, the information found in the German WP is more accurate and better presented than in the corresponding English article. The overall quality seems to be better -- a notion which Mr. Wales shares.

@drmath: For now I must pass on the translation, as my time is very limited right now. Especially since I would have to translate more or less the entire article -- it was an overall impression that led to the statement above.

Greetings,
Drizzt

Zyrex said...

dear dr. math,

can you find an equation for this pattern?

this really gives me a hard time figuring it out.

1=0
2=1
3=4
4=15
5=64
6=325
7=1956

there... it's like x=???

x is the number on the right side of the equal sign...

Unknown said...

This comment has been removed by a blog administrator.

Unknown said...

This comment has been removed by a blog administrator.
rooks said...

Do you plan to resume your blog? It used to be one of my favourites, and as I was cleaning up among my RSS feeds, I couldn't bring myself to delete it, in the vague hope of an article showing up again after a longish break.

I guess if you had moved you'd have posted about it. Any chance of having you back?