Dear Dr. Math,
I believe the monkey problem makes the simplifying assumption that all the letters in the text are independent, which they're not in real English (or any other language). How does this affect the results? Real texts are sampled very narrowly from the space of possible letter sequences.
CN
Excellent question, CN; I'm glad you brought it up. In fact, the distribution of the letters in the target text is irrelevant to the problem. The important assumption is that the letters being output by the monkey are equally likely and probabilistically independent of each other. Under this assumption, it doesn't matter whether the text the monkey's trying to match is The Hobbit or the phone book or a sequence of all "7"s--the probability of matching any particular sequence of 360,000 characters is the same, namely (1/N)^360,000, where N is the number of keys on the typewriter. If you need convincing, consider the simpler example of flipping a fair coin 5 times. Any particular sequence of 5 flips, for example HHTHT, has the same probability, (1/2)^5 = 1/32, of coming up. So, whichever flip sequence we were trying to match, we'd have the same chance. The same is true here, just on a much larger scale (and with monkeys).
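To make the coin-flip point concrete, here's a minimal sketch (Python, not part of the original problem; the three targets are just arbitrary examples) that enumerates all 32 possible sequences of 5 flips and checks that any particular target--"random"-looking or not--shows up exactly once, i.e., with probability 1/32:

```python
from itertools import product

# All 2**5 = 32 equally likely sequences of 5 fair-coin flips.
sequences = ["".join(seq) for seq in product("HT", repeat=5)]

for target in ["HHTHT", "HHHHH", "THTHT"]:
    hits = sequences.count(target)              # each target occurs exactly once
    print(target, hits, hits / len(sequences))  # -> 1  0.03125 for every target
```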
Now, many people object to this idea, because they think that a letter sequence like "a;4atg 9hviidp" is somehow more "random" than a sequence like "i heart hanson". Therefore, they reason, the first sequence would be more likely to occur by chance. But actually the two sequences have exactly the same probability of occurrence, under our assumptions. Really, the only difference between the two is what we could infer about the message source (beyond musical tastes) based on receiving such an output. I hope to discuss this in detail someday in the context of elementary hypothesis testing, but if, say, there were some doubt in our minds as to whether the source of these characters was, in fact, uniformly random, the latter message would give us considerable evidence to help support that doubt. The reason is that we could provide an alternative hypothesis that would make the observed data much more likely. For the monkey problem, however, we were assuming that we knew how the characters were being generated, so there's no doubt.
Dear Dr. Math,
Along the lines of your previous questions on large numbers and randomly generating the book The Hobbit, I'd like to ask about randomly generating images. A low res PC display is normally 640 x 480 pixels. If you randomly generated every combination of color pixels, wouldn't you have created every image imaginable at that resolution? That is, one of the screens would be the Mona Lisa, one would be your Ask Doctor Math page, one would be a picture of the Andromeda galaxy from close up, one would be a picture of you!, etc. If you only wanted to look at black & white images, you'd have a much smaller collection, but once again wouldn't you generate every B&W screen image possible?
With feature recognition software getting better all the time, one could "mine" these images for recognizable features. Similar to the way the pharmaceutical companies sequence through millions of plants to find new substances, one could sequence through these images to extract unknown info.
Mike
Dear Mike,
Absolutely, we could apply the same techniques to any form of information that can be reduced to a sequence of numbers or letters, like images, CDs, chess games, DNA sequences, etc. In fact, we needn't generate them randomly, either. As in The Library of Babel example, one could imagine a vast collection of all possible sequences, generated systematically one at a time with no repeats. Unfortunately, for any interesting form of information, the number of possibilities is simply too great to make it practical.
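Here's a toy version of that systematic generation (a hedged sketch in Python; the 2x2 "resolution" is made up purely so the loop finishes)--every possible black-and-white picture comes out exactly once, and the count doubles with each pixel you add:

```python
from itertools import product

# Library-of-Babel style enumeration: every possible black-and-white image
# at a (hypothetical) 2x2 resolution, generated one at a time with no repeats.
WIDTH, HEIGHT = 2, 2

images = list(product([0, 1], repeat=WIDTH * HEIGHT))
print(len(images))   # 16 == 2**4; at 640x480 the exponent becomes 307,200
```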
In your example of 640x480 pixel images, even assuming the images were 1-bit monochrome, there would still be 2 possibilities ("on" or "off") for each of the 640*480 = 307,200 pixels. Therefore, the number of possible images would be 2^307,200, which is about 10^92,476. Remember how big a googol is? Well, this number is about a googol multiplied by itself more than 900 times. So, not even all the crappy low-res monitors in the universe could possibly display them, even at their lousy maximum refresh rate of 60 images/second. And even worse, we'd have no reason to believe any of the images we did see, because they'd be indistinguishable from all the many other conflicting images.
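If you want to check that arithmetic, here's a quick Python sketch (the billion billion monitors and the billion years are invented numbers, thrown in just for scale):

```python
# Count the possible 1-bit 640x480 images and get a feel for the size.
num_images = 2 ** (640 * 480)          # 2 choices for each of the 307,200 pixels

print(len(str(num_images)))            # 92,477 digits, i.e. roughly 10**92,476

# Suppose (generously) a billion billion monitors each showed 60 images per
# second for a billion years -- still a rounding error next to num_images:
seconds_per_year = 60 * 60 * 24 * 365
shown = 10**18 * 60 * seconds_per_year * 10**9   # monitors * rate * time
print(shown)                                     # about 1.9e36 -- nowhere close
```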
Your comparison to pharmaceutical companies is interesting, but remember those companies are starting with a large (but manageable) collection of plants that actually exist, not searching through the space of all possible arrangements of plant cells or something. It's OK to search for a needle in a haystack sometimes, but not when the haystack is larger than the known universe.
Dear Dr. Math,
Unless I misunderstand (and that's quite possible), I think you've introduced a major flaw here... The "second chunk" begins at character number 2, not character number 360,001. There is no reason why these should be considered discrete chunks, so the fact that the first character isn't "I" doesn't prevent the second and subsequent characters from spelling out the work. Thus, your monkeys are producing over 17 million "blocks" a day, not just 48...
A. Nonymous
Well, A, that all depends on how we set up our assumptions. The way I had pictured things, the monkey was typing out a whole manuscript of 360,000 characters at a time and then having someone (perhaps J.R.R. Tolkien himself!) check it over and see if it was exactly the same as The Hobbit. If not, the monkey tries again and, with very high probability, fails.
However, your idea is more interesting and perhaps more "realistic". That is, we could have Prof. Tolkien just watch over the monkey's shoulder as it typed and see if any string of 360,000 consecutive characters were The Hobbit. So, if the monkey started by typing a preamble of gibberish and then typed the correct text in its entirety, we'd still count it as correct. As you say, this means that the possible "chunks" we'd need to consider have a lot of overlap to them--we might find the text in characters 1 through 360,000 or 2 through 360,001, etc. But unfortunately, it's not just the number of chunks being produced we need to reconsider; because of the way they overlap, we've now introduced relationships between the chunks that mean our assumption of independence no longer holds. For example, if we knew the first block of characters was incorrect, we could determine whether it was even possible for the second block to be correct based on the particular way the first block was wrong. In fact, we'd know it was impossible unless the first block was something like "xin a hole in the ground there lived a hobbit...".
Actually, if we thought about things in this way, then CN's question above would be relevant, because the codependency of the overlapping chunks would depend heavily on the particular text we were trying to match. Consider the example of coin-flipping again: assume we were flipping the coin until we got the string TH. There are 3 possible ways we could fail on the first pair of flips, all equally likely: TT, HT, and HH. If we got TT or HT, then we could still succeed in the second block by flipping an H on the third flip. If we started with HH, there's no way the second block could come out TH. The number of ways of succeeding would be 2, out of a possible 6. So the probability of succeeding in the second block given that we failed in the first would be 2/6 = 1/3.
Now, if we were trying to match HH and we knew we failed on the first 2 flips, there would still be 3 equally likely possibilities: TT, TH, or HT. If we started off with TT or HT, we can't possibly win on the third flip. But if we got TH first, we'd have a 1/2 chance of flipping H on the third flip and matching. Thus, our probability of matching in the second block given that we failed in the first would only be 1/6. Here's a chart showing all of the possibilities:

   Matching TH:                        Matching HH:
   failed 1st block   2nd block        failed 1st block   2nd block
   TT                 possible (H)     TT                 impossible
   HT                 possible (H)     TH                 possible (H)
   HH                 impossible       HT                 impossible
The two probabilities are different because the target TH overlaps with more of the possible wrong first blocks (any block ending in T), whereas HH overlaps with only one of them, TH.
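If you'd rather trust brute force than the chart, here's a little sketch (Python, enumerating all three-flip outcomes exhaustively) that recovers those conditional probabilities:

```python
from itertools import product
from fractions import Fraction

def second_block_given_first_failed(target):
    """P(flips 2-3 match `target` | flips 1-2 do not), over 3 fair flips."""
    failed = 0
    matched = 0
    for flips in product("HT", repeat=3):      # 8 equally likely outcomes
        s = "".join(flips)
        if s[0:2] != target:                   # first block failed
            failed += 1
            if s[1:3] == target:               # second (overlapping) block matches
                matched += 1
    return Fraction(matched, failed)

print(second_block_given_first_failed("TH"))   # 1/3
print(second_block_given_first_failed("HH"))   # 1/6
```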
Therefore, our previous strategy of multiplying probabilities, which rested on the assumption of independence, won't work here. In order to explain how long it would take the monkey to produce The Hobbit with high probability under your scheme, I'd have to go into some fairly heavy-duty math involving Markov chains and their transition probabilities. The relevant probabilities can be found by raising a 360,000 x 360,000 matrix to the nth power--not generally an easy thing to do. But it turns out that the expected (i.e., average) number of characters the monkey would have to type before finishing would still be on the order of N^360,000--the reciprocal of the single-match probability--similar to the previous setup.
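To see the Markov-chain machinery on a scale we can actually compute, here's a sketch (Python with numpy; my own toy setup, with states given by the length of the prefix matched so far, which is one standard way to build such a chain) that finds the expected number of fair-coin flips until a pattern first appears. The overlap structure is exactly why TH takes 4 flips on average but HH takes 6, even though each has the same 1/4 chance on any particular pair of flips:

```python
import numpy as np

def expected_flips(target):
    """Expected fair-coin flips until `target` first appears, via the
    prefix-matching Markov chain (states = length of prefix matched so far)."""
    n = len(target)

    def step(state, flip):
        # Longest suffix of (matched prefix + flip) that is a prefix of target.
        s = target[:state] + flip
        for k in range(min(len(s), n), -1, -1):
            if s.endswith(target[:k]):
                return k
        return 0

    # Transient states 0..n-1; Q[i, j] = P(move from i to j without finishing).
    Q = np.zeros((n, n))
    for i in range(n):
        for flip in "HT":
            j = step(i, flip)
            if j < n:
                Q[i, j] += 0.5

    # Expected hitting times t satisfy t = 1 + Q t, i.e. (I - Q) t = 1.
    t = np.linalg.solve(np.eye(n) - Q, np.ones(n))
    return t[0]

print(expected_flips("TH"))   # about 4.0
print(expected_flips("HH"))   # about 6.0
```

The real problem would need a chain with 360,000-odd states instead of 2, which is why I'd rather not carry out that computation here.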
Either way, you and J.R.R. would have probably given up by that point.
-DrM
3 comments:
I believe that a thorough answer to the first question would involve calculating the probability of producing Tolkien's work through random generation of only well-formed English (and Elven) sentences. Might be tricky to calculate their proportion, though...
Interesting point, thecod. I don't know of any systematic way to compute the number of well-formed sentences, especially once you start allowing character names like Bilbo and Smaug.
As a lower bound, we could imagine a monkey librarian just picking an already published book at random. According to one source I found, there have been something like 100 million books ever written. So, by the same reasoning as before, the monkey would have to try an average of 100 million times before finding the right one.
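For what it's worth, here's the back-of-the-envelope version (a sketch; the 200-picks-per-second rate matches the typing speed implied by 48 blocks of 360,000 characters a day):

```python
# Expected tries for the monkey librarian: picking uniformly at random from
# ~100 million existing books, the number of picks until the right one is
# geometric with mean equal to the number of books.
num_books = 100_000_000
picks_per_second = 200              # same rate as the typing monkey

expected_seconds = num_books / picks_per_second
print(expected_seconds / 86_400)    # about 5.8 days, on average
```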
You could do that, and it would take a mere 6 days on average to "produce" The Hobbit at a rate of 200 choices per second. But where's the fun in that? If I had a troupe of monkeys doing my bidding, I would make it so there was the possibility, however remote, that they would produce the greatest novel *not yet* produced by man(monkey?)kind.
Also, I read somewhere that, by reading a book letter by letter to a group of friends and trying to make them guess the next letter, Claude Shannon managed to get an estimate of the entropy of the English language. Could this help calculate the proportion of well-formed sentences in any way?
PS I love all your posts! Your references in particular are always well researched... If only all "vulgarised" articles on the web could be of a comparable quality...