Monday, April 23, 2012

The notepad.exe problem

Quite a long time ago, there was a minor stir over notepad's treatment of text encodings. So many people talked about this that Raymond Chen even got in on the action (twice). As a Linux user, I honestly have no idea whether Microsoft ever got around to fixing this (Raymond's post suggests they consider it impossible to "fix" as such). I'm not really interested in bashing them for using an imperfect algorithm, since, to be honest, you'll never get perfection with this issue. But I do think it provides an interesting framework for a math problem.

Technically, ANSI is Windows-1252 (strictly speaking, the term "ANSI" is wrong, but nobody cares), but in practice I think it's most commonly used for plain ASCII, in which case it's byte-for-byte identical to UTF-8. Now, I know that there are probably quite a lot of files out there with non-ASCII symbols encoded as 1252, but honestly, if you're trying to detect 1252 automatically, you've already lost: only 5 of the 256 byte values (0x81, 0x8D, 0x8F, 0x90, and 0x9D) are unassigned, so each random byte has just a 5/256 chance of being rejected, and on average a decoder will chew through about 50 bytes of totally random data before it complains of an invalid code point. This means it's very likely to accept short runs of totally random data.
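
To make that concrete, here's a quick sketch (Python is my choice here, not anything from the original discussion) that simulates feeding random bytes to a strict Windows-1252 decoder; the five unassigned byte values are the only thing such a decoder can object to.

    import random

    # The five byte values left unassigned in Windows-1252; a strict decoder
    # rejects these and accepts everything else.
    UNASSIGNED_1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

    def bytes_until_rejection() -> int:
        """Count how many random bytes pass before an unassigned value shows up."""
        count = 0
        while random.randrange(256) not in UNASSIGNED_1252:
            count += 1
        return count

    trials = 100_000
    average = sum(bytes_until_rejection() for _ in range(trials)) / trials
    print(f"simulated:   {average:.1f} bytes before rejection")  # roughly 50
    print(f"closed form: {(256 - 5) / 5:.1f}")                   # (1 - p) / p with p = 5/256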

One thing I notice is that, while Raymond says (implies, probably unintentionally) that the string is ANSI, it's also perfectly good UTF-8, and will even produce the same interpretation in UTF-8 as in ANSI. This is, of course, because they're both backwards compatible with ASCII, which is what our string really is. Now, it's difficult to figure out the probability of random data being valid UTF-8, since UTF-8 is variable length. But I would say it's rather unlikely, since valid UTF-8 puts quite a few constraints on the byte stream: each leader byte must be followed by exactly the right number of continuation bytes, and overlong sequences are also disallowed. Then you look at Unicode itself and find that certain code points (U+FFFF, for instance) are illegal.
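
Just to illustrate (the post doesn't quote the actual string, so the literal below is only a stand-in): an ASCII-only byte string decodes identically under both interpretations, while arbitrary high bytes will usually sail through a 1252 decoder but trip a UTF-8 one.

    import random

    # Hypothetical stand-in for the ASCII string under discussion.
    ascii_bytes = b"this is plain ASCII text"
    assert ascii_bytes.decode("cp1252") == ascii_bytes.decode("utf-8")

    # Random high bytes are nearly always structurally invalid UTF-8.
    junk = bytes(random.randrange(0x80, 0x100) for _ in range(8))
    try:
        junk.decode("utf-8")
        print("accepted as UTF-8 (unlikely)")
    except UnicodeDecodeError as err:
        print("rejected as UTF-8:", err.reason)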

UTF-16, on the other hand, is a much denser representation. Most 2-byte sequences are legal, unless they happen to form an invalid surrogate pair or land on an illegal code point (again, U+FFFF and friends). Now, the "illegal code point" problem is common to both encodings, so we can ignore it for our purposes (though technically, I suppose UTF-8 additionally treats the code points U+D800 to U+DFFF as illegal, which UTF-16 can't even represent). But how likely is it that you get an invalid surrogate? Well, the likelihood of a random 16-bit value being (say) a leading surrogate is 1/64, since its first 6 bits are constrained (so you get 2^-6), and the same goes for a trailing surrogate. Now we need the probability of a leading surrogate followed by something which is not a trailing surrogate, or a trailing surrogate showing up on its own where a new character should start. Crunching the numbers, the chance that any given 16-bit value starts an invalid sequence is 1/64 + (1/64)(63/64) = 127/4096, which works out to about 31 2-byte values, or 62 bytes, of random data before you encounter an invalid surrogate. That's actually worse than ANSI. However, it's not an entirely fair comparison, since we've been ignoring the invalid-code-point problem (which takes a different form in ANSI). But there aren't a whole lot of "defined invalid" characters in the Unicode spec, so accounting for them only reduces the count a little. (Remember, the "reserved" range U+D800 to U+DFFF can't be represented in UTF-16 at all; that's why it's reserved. Any attempt to encode it just produces a (possibly illegal) surrogate pair, which we've already covered.)
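
Here's that number crunching spelled out; this is a sketch of mine (not anything from the post), pairing random 16-bit code units the way a UTF-16 decoder would, and ignoring noncharacters just as the analysis above does.

    import random

    LEAD  = range(0xD800, 0xDC00)   # leading (high) surrogates
    TRAIL = range(0xDC00, 0xE000)   # trailing (low) surrogates

    def units_until_invalid() -> int:
        """Count valid 16-bit units before the first broken surrogate."""
        count = 0
        while True:
            unit = random.getrandbits(16)
            if unit in TRAIL:               # stray trailing surrogate: error
                return count
            if unit in LEAD:                # must be followed by a trailing surrogate
                if random.getrandbits(16) not in TRAIL:
                    return count
                count += 2                  # well-formed surrogate pair
            else:
                count += 1                  # ordinary BMP code unit

    trials = 100_000
    avg = sum(units_until_invalid() for _ in range(trials)) / trials
    p = 1/64 + (1/64) * (63/64)             # chance a unit starts an invalid sequence: 127/4096
    print(f"simulated:   {avg:.1f} units (~{2 * avg:.0f} bytes)")
    print(f"closed form: {(1 - p) / p:.2f} units")   # about 31.25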

Earlier I said it's difficult to find this probability for UTF-8. I can, however, find the probability of a single code point being invalid. This is less useful since it's not measured in bytes, but it will at least tell us how many characters your decoder will emit before it yells at you. So an invalid UTF-8 "character" is any of the following:
  1. Begins with 0b10 (begins with a continuation byte, or the previous character has too many continuations). (p=1/4)
  2. Begins with 0b110, but the next byte doesn't begin with 0b10. (p=3/32)
  3. Begins with 0b1110, but either (or both!) of the next two bytes doesn't begin with 0b10. (p=15/256)
  4. Begins with 0b11110, but any of the next 3 bytes isn't a continuation. (p=63/2048)
  5. Begins with 0b11111 (too long). (p=1/32)
  6. Begins with a valid leader byte, but the file ends before the requisite number of continuations have appeared. (p is impossible to calculate without knowing more about the file)
These possibilities are mutually exclusive (except for 6). We've ignored the issue of overlong encodings, other than case 5: sequences beginning with 0xC0 or 0xC1 are difficult to account for (they unpleasantly overlap with 2), and many decoders will silently allow them anyway, as they're unambiguous. If we assume the random data is long enough to avoid 6, we get P = 951/2048, or roughly 46%. So your decoder will emit, on average, about one character before it notices something illegal; this is equivalent to "no more than 4 bytes", since there are at most 4 bytes in a UTF-8 character. This is why UTF-8 is not supposed to need a BOM: it is very discriminating already, and it's highly unlikely for random data to be accepted as valid UTF-8. Just look at the PDF chart (or carry the math a bit further): after 3 characters (at most 12 bytes), random data has only about a 15% chance of being accepted, and after 4 it drops below 10%. Of course, real data is seldom random, so applying this to IsTextUnicode is iffy at best. All this tells us is that, ceteris paribus, UTF-8 is much more likely to reject something which isn't UTF-8 than the other encodings are.
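
As a sanity check on that arithmetic, here is a short sketch (again mine, not the post's) that sums the listed cases exactly and then throws random bytes at a real UTF-8 decoder; the real decoder accepts somewhat less than the formula predicts, because it also enforces the rules we ignored (overlongs, surrogates, code points past U+10FFFF, and truncated trailing sequences, i.e. case 6).

    import random
    from fractions import Fraction as F

    # Per-character failure probabilities from the list above
    # (overlongs and noncharacters ignored, as in the post).
    cases = [
        F(1, 4),       # 1. starts with a continuation byte (0b10xxxxxx)
        F(3, 32),      # 2. 0b110xxxxx leader, next byte not a continuation
        F(15, 256),    # 3. 0b1110xxxx leader, next two bytes not both continuations
        F(63, 2048),   # 4. 0b11110xxx leader, next three bytes not all continuations
        F(1, 32),      # 5. 0b11111xxx leader (sequence too long)
    ]
    p = sum(cases)
    print(p, float(p))           # 951/2048, about 0.46
    print(float((1 - p) / p))    # about 1.15 characters accepted before an error

    # Cross-check against an actual decoder.
    def accepted(n_bytes: int) -> bool:
        blob = bytes(random.randrange(256) for _ in range(n_bytes))
        try:
            blob.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    trials = 20_000
    for n in (4, 8, 12):
        rate = sum(accepted(n) for _ in range(trials)) / trials
        print(f"{n:2d} random bytes accepted as UTF-8: {rate:.1%}")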

So where does this leave us? Well, in my opinion, it leaves us with the conclusion that ANSI should be scrapped: it's both undiscriminating and BOM-less, which makes it rather difficult to autodetect, while UTF-8 provides ASCII compatibility and a much larger character repertoire. But Microsoft obviously can't do that. There are a lot of ANSI files floating around, and people would be mildly annoyed if they had to convert them; more seriously, lots of legacy programs use ANSI, those programs (unlike text files) can't easily be converted, and Microsoft hates breaking them. Besides, I'm not a Windows programmer, but I don't think the ANSI interface is even deprecated in the first place!

I suppose it also suggests that Microsoft consider UTF-8 before other encodings, but as Raymond Chen said, "[N]o matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose"; there's no general solution to this problem. And as we said before, real data may not follow the probability distributions that random data does.
