Monday, April 23, 2012

The notepad.exe problem

Quite a long time ago, there was a minor stir over notepad's treatment of text encodings. So many people talked about this that Raymond Chen even got in on the action (twice). As a Linux user, I honestly have no idea whether Microsoft ever got around to fixing this (Raymond's post suggests they consider it impossible to "fix" as such). I'm not really interested in bashing them for using an imperfect algorithm, since, to be honest, you'll never get perfection with this issue. But I do think it provides an interesting framework for a math problem.

Technically, ANSI is Windows-1252 (well, technically the term "ANSI" is a misnomer, but nobody cares), but in practice I think it's most commonly used as plain ASCII, in which case it's byte-for-byte identical to UTF-8. Now, I know there are probably quite a lot of files out there with non-ASCII symbols encoded as 1252, but honestly, if you're trying to detect 1252 automatically, you've already lost: only 5 of its 256 byte values are undefined, which means that if you feed totally random data into an ANSI decoder, it will read, on average, about 50 bytes before it complains about an invalid byte. In other words, it will happily accept short runs of totally random data.
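As a quick sanity check (a toy sketch of the arithmetic, not anything notepad actually does), you can simulate it: feed random bytes to a Windows-1252 decoder and count how many it swallows before hitting one of the five undefined byte values.

    import random

    # The five byte values Windows-1252 leaves undefined; a strict decoder
    # rejects exactly these.
    UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

    def valid_run_length():
        """Count random bytes accepted before the first undefined one."""
        count = 0
        while random.randrange(256) not in UNDEFINED:
            count += 1
        return count

    trials = 100_000
    print(sum(valid_run_length() for _ in range(trials)) / trials)
    # hovers around 251/5 = 50.2 bytes before the decoder has any grounds to object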

One thing I notice is that, while Raymond implies (probably unintentionally) that the string is ANSI, it's also perfectly good UTF-8, and will even produce the same interpretation in UTF-8 as in ANSI. This is, of course, because both are backwards compatible with ASCII, which is what our string really is. Now, it's difficult to work out the probability of random data being valid UTF-8, since UTF-8 is variable-length. But I would say it's rather unlikely, since valid UTF-8 has quite a few constraints on its shape: each leader byte must be followed by exactly the right number of continuation bytes, and overlong sequences are disallowed. Then you look at Unicode itself and find that certain code points (U+FFFF, for instance) are off-limits.
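To make the point concrete (a small illustration with a made-up string, not the one from Raymond's post): pure ASCII decodes identically either way, and the two interpretations only diverge once non-ASCII bytes appear.

    # A pure-ASCII string is indistinguishable: both decoders produce the same text.
    data = b"hello, notepad"
    print(data.decode("cp1252") == data.decode("utf-8"))   # True

    # With non-ASCII bytes the two diverge: 0xE9 is an accented 'e' in
    # Windows-1252, but by itself it's a truncated UTF-8 sequence.
    data = b"caf\xe9"
    print(data.decode("cp1252"))                           # 'café'
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        print("not valid UTF-8")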

UTF-16, on the other hand, is a much denser representation. Most 2-byte sequences are legal, unless they happen to form an invalid surrogate pair or produce an illegal code point (again, U+FFFF and friends). Now, the "illegal code point" problem is common to both encodings, so we can ignore it for our purposes (though technically UTF-8 treats the code points U+D800 through U+DFFF as illegal, and UTF-16 can't even represent them). But how likely is it that you get an invalid surrogate? Well, the likelihood of a random 16-bit value being (say) a leading surrogate is 1/64, since its first 6 bits are fixed (so you get 2^-6); the same goes for a trailing surrogate. What we need is the probability of a leading surrogate followed by something other than a trailing surrogate, or of a trailing surrogate showing up where the decoder expects the start of a new character. Crunch the numbers and you find that it takes about 31 2-byte values, or 62 bytes, before you encounter an invalid surrogate in random data. From a detection standpoint, that's actually worse than ANSI. However, it's not quite a fair comparison, since we've been ignoring the invalid-code-point problem (which works differently in ANSI). But there aren't a whole lot of "defined invalid" characters in the Unicode spec, so this only reduces the count a little (remember, the "reserved" range U+D800 to U+DFFF cannot be represented in UTF-16 at all; that's why it's reserved. Any attempt to do so just produces a (possibly ill-formed) surrogate pair, which we've already covered).
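Here's the number-crunching spelled out (my arithmetic, using only the constraints above): a fresh 16-bit unit kills the decode if it's a lone trailing surrogate, or a leading surrogate whose successor isn't a trailing one, giving a failure probability of 1/64 * 63/64 + 1/64 = 127/4096 per step.

    from fractions import Fraction
    import random

    p_lead = Fraction(1, 64)    # leading surrogate: 0xD800-0xDBFF
    p_trail = Fraction(1, 64)   # trailing surrogate: 0xDC00-0xDFFF

    # A decode step fails on a lone trailing surrogate, or on a leading
    # surrogate that isn't followed by a trailing one.
    p_fail = p_trail + p_lead * (1 - p_trail)
    print(p_fail, float(1 / p_fail))   # 127/4096, i.e. one failure per ~32 steps

    def units_before_error():
        """Count random 16-bit units a UTF-16 decoder accepts before choking."""
        count = 0
        while True:
            u = random.randrange(0x10000)
            if 0xDC00 <= u <= 0xDFFF:              # lone trailing surrogate
                return count
            if 0xD800 <= u <= 0xDBFF:              # leading surrogate...
                if not 0xDC00 <= random.randrange(0x10000) <= 0xDFFF:
                    return count                   # ...with no trailing partner
                count += 2                         # valid surrogate pair
            else:
                count += 1                         # ordinary code unit

    trials = 50_000
    print(sum(units_before_error() for _ in range(trials)) / trials)
    # about 31 units, i.e. roughly 62 bytes of random data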

Earlier I said it's difficult to find the probability for UTF-8. I can, however, find the probability of a single code point being invalid. This is less useful since it's not measured in bytes, but it will at least tell us how many characters your decoder will emit before it yells at you. So an invalid UTF-8 "character" is any of the following:
  1. Begins with 0b10 (begins with a continuation byte, or the previous character has too many continuations). (p=1/4)
  2. Begins with 0b110, but the next byte doesn't begin with 0b10. (p=3/32)
  3. Begins with 0b1110, but either (or both!) of the next two bytes doesn't begin with 0b10. (p=15/256)
  4. Begins with 0b11110, but any of the next 3 bytes isn't a continuation. (p=63/2048)
  5. Begins with 0b11111 (too long). (p=1/32)
  6. Begins with a valid leader byte, but the file ends before the requisite number of continuations have appeared. (p is impossible to calculate without knowing more about the file)
These possibilities are mutually exclusive (except for 6). We've ignored the issue of overlong encodings (apart from case 5; the classic offenders begin with 0xC0 and 0xC1) because they're difficult to account for (they interact awkwardly with the accounting in 2) and many decoders will silently allow them anyway, as they're unambiguous. If we assume the random data is long enough to avoid 6, we get P = 951/2048, or roughly 0.46. So your decoder will emit, on average, about one character before it notices something illegal; this is equivalent to "no more than 4 bytes," since a UTF-8 character is at most 4 bytes. This is why UTF-8 is not supposed to need a BOM: it is very discriminating already, and it's highly unlikely for random data to be accepted as valid UTF-8. If you chart the probability of acceptance against length, it falls off fast: after 3 characters (at most 12 bytes), random data has only about a 15% chance of being accepted, and after 4 it drops below 10%. Of course, real data is seldom random, so applying this to IsTextUnicode is iffy at best. All this tells us is that, ceteris paribus, UTF-8 is much more likely to reject something which isn't UTF-8 than the other encodings.
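To double-check that arithmetic, here's a short sketch that just restates the cases above (a real decoder is stricter still, since it also rejects overlongs and out-of-range code points):

    from fractions import Fraction

    F = Fraction
    cases = {
        "1: stray continuation byte":       F(1, 4),
        "2: 2-byte lead, bad follower":     F(1, 8) * F(3, 4),
        "3: 3-byte lead, bad follower(s)":  F(1, 16) * (1 - F(1, 4) ** 2),
        "4: 4-byte lead, bad follower(s)":  F(1, 32) * (1 - F(1, 4) ** 3),
        "5: lead byte 0b11111xxx":          F(1, 32),
    }

    p = sum(cases.values())
    print(p)                      # 951/2048, roughly 0.46
    print(float((1 - p) / p))     # expected characters emitted before an error: ~1.15
    print(float((1 - p) ** 3))    # chance random data survives three characters: ~0.15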

So where does this leave us? Well, in my opinion, it leaves us with the conclusion that ANSI should be scrapped: it's both undiscriminating and BOM-less, which makes it rather difficult to autodetect, while UTF-8 provides ASCII compatibility and a far larger character repertoire. But Microsoft obviously can't do that. There are a lot of ANSI files floating around, and people would be mildly annoyed if they had to convert them; more seriously, lots of legacy programs use the ANSI APIs, Microsoft hates breaking such programs, and programs (unlike text files) cannot be easily converted. I'm not a Windows programmer, but I don't think the ANSI interface is even deprecated in the first place!

I suppose it also suggests that Microsoft consider UTF-8 before other encodings, but as Raymond Chen said, "[N]o matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose"; there's no general solution to this problem. And as we said before, real data may not follow the probability distributions that random data does.

Sunday, April 22, 2012

Truth in turnstiles

We're often told that "logic" is at the heart of mathematics. We're told that you cannot do math without logic, that mathematics requires it. Personally, I subscribe to the view of formalism, which basically says that math is the manipulation of symbols according to rules. Do we need full-blown propositional logic to manipulate symbols? I hope not.

We need to decide what we mean by "logic." There are a lot of different kinds, believe it or not, and it's not just a matter of expressive power. Alternative logics often entirely disagree with "classical" or typical logic. For instance, paraconsistent logic will tolerate a contradiction without going nuts, and relevance logic takes perhaps a more intuitive approach to material implication, though it's also more complex. When we say "logic," we often mean classical logic, but I honestly cannot see why it should be privileged over other logics. Classical logic is more intuitive than paraconsistent logic, but at the cost of increased complexity (and fragility). On the other hand, it's less intuitive than relevance logic, for the sake of simplicity. Now, maybe this is a happy medium along a spectrum of intuition versus simplicity. But it seems to me that if you're building the foundation of mathematics, you'll favor simplicity over intuition; a happy medium isn't your goal.

The foundation of mathematics, in truth, is entailment, represented by the turnstile. Entailment basically means "given some statements (which look like) X, you can prove another statement (which looks like) Y." Note that this is very different from material implication, which isn't even meaningful at this level (because we haven't even picked a logic yet, let alone laid out its axioms). The actual axioms you're working with can then be defined using the turnstile. Those axioms are often logical in nature, to lay a strong foundation for other mathematical principles (such as ZFC), but there's no requirement that we use logic at this stage. You could, for instance, define the rules of untyped lambda calculus (which are themselves fairly simple) using only the turnstile, and then you would have a Turing-complete system that was defined without logic at all!
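As a rough sketch of what I mean (my own notation, not a standard presentation; M[x := N] is capture-avoiding substitution), here are the reduction rules of the untyped lambda calculus written purely as entailments:

    \[
    \begin{aligned}
      &\vdash (\lambda x.\,M)\,N \to M[x := N]            &&\text{(beta)} \\
      M \to M' &\vdash M\,N \to M'\,N                     &&\text{(reduce the function)} \\
      N \to N' &\vdash M\,N \to M\,N'                     &&\text{(reduce the argument)} \\
      M \to M' &\vdash \lambda x.\,M \to \lambda x.\,M'   &&\text{(reduce under the binder)}
    \end{aligned}
    \]

Nothing in these rules mentions truth, negation, or implication; each one is just a license to rewrite one string of symbols as another.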

So where does this leave us? Different logics have different uses, and we can pick which one we want or need on a case-by-case basis. The turnstile is used to formalize this detachment, making it a framework for choosing a logical system. But we should remember that, technically, we don't need to use a logical system at all. It just happens to be quite convenient to do so.

Tuesday, April 10, 2012

Sherlock Holmes was wrong

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. --Sherlock Holmes, A Scandal in Bohemia
To the average person, the above quote seems eminently reasonable. If you don't know what you're talking about, you probably shouldn't guess. But, truth be told, he's dead wrong.

One of the less obvious fallacies we have to contend with is the Texas sharpshooter fallacy. It happens when you theorize after getting your data, when you twist theories to suit facts. But why is this a fallacy? Well, suppose that I hand you a long list of numbers and ask you to theorize about it. You spend hours carefully examining the data, feeding it through complex and abstruse equations, until eventually you come up with a baroque formula that exactly reproduces every number in the list.

But then I give you some more numbers, generated by the same process, and your formula gives totally wrong answers! How can this be? Well, the numbers were randomly generated all along; there never was a pattern. Now, this is a rather extreme case, but humans are wired to find patterns in everything.
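This is essentially overfitting, and it's easy to reproduce with a toy example (made-up numbers, nothing to do with any real data set): a degree-9 polynomial passes through 10 random points almost exactly, then says nothing useful about the next points from the same generator.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(10)
    y = rng.uniform(0, 100, size=10)    # the "data": pure noise

    # A degree-9 polynomial through 10 points reproduces them (up to rounding)...
    coeffs = np.polyfit(x, y, deg=9)
    print(np.abs(np.polyval(coeffs, x) - y).max())   # tiny residual

    # ...but it tells us nothing about further numbers from the same source.
    x_new = np.arange(10, 15)
    y_new = rng.uniform(0, 100, size=5)
    print(np.polyval(coeffs, x_new))    # wildly off (the polynomial explodes)
    print(y_new)                        # still just uniform noise in [0, 100)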

Here's another example, one which actually happens from time to time. Suppose you have a large sample from a huge population, and you start taking cross-sections of that data. In other words, you start pulling out and examining chunks of data at a time. You take a lot of cross-sections, and eventually, one of them shows some unusual pattern. If you assume that pattern also holds in the larger population, you are committing the fallacy. You're already working from a sample; taking samples of that sample is iffy at best. You've already assumed the larger sample to be representative, and you're now adding the assumption that the smaller sample is also good. Worse, you've specifically selected for the particular cross-section that shows an unusual result! If any of your cross-sections shows an unusual result, purely by chance, you'll find it sooner or later.
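Here's a toy simulation of that effect (arbitrary numbers and an arbitrary notion of "unusual," purely for illustration): slice a structureless sample into a hundred cross-sections and the most extreme slice will look like a finding.

    import numpy as np

    rng = np.random.default_rng(1)
    sample = rng.normal(0, 1, size=10_000)   # no structure at all: mean 0, sd 1

    # Slice it into 100 cross-sections of 100 points and hunt for an "unusual" one.
    means = sample.reshape(100, 100).mean(axis=1)
    worst = np.argmax(np.abs(means))
    print(worst, means[worst])

    # Each slice mean has standard error 1/sqrt(100) = 0.1, so among 100 slices
    # an excursion of 2.5-3 standard errors is all but guaranteed by chance.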

So if we shouldn't twist theories to suit facts, does that force us to do the reverse? Certainly not! The proper way to do this sort of reasoning is to theorize in advance, and be prepared to discard your theory as soon as it looks wrong. This way, you're not twisting either to suit the other: you're scrapping bad theories and replacing them with (hopefully) good theories. Of course, you will let your previous theories influence your new theory, so in a sense it could be likened to twisting your theories, but really, the important thing here is that you have a viable theory before you receive new data, so that you can test your theory on data it wasn't based on. In a way, basing your theory on the data it's supposed to explain is cheating, since you're not really testing anything; you're just finding a pattern.