Wednesday, August 8, 2012

An exercise

Sometimes, I feel like stretching my programming muscles.  A while ago, I read about this problem, or one like it.  It's not an especially hard problem, but I'd like to go over it anyway, because low-level data types can be unintuitive.

Suppose you have two integers a and b.  You want to add them, but you also want to be sure they don't overflow.  You're working in a language where overflow is undefined, such as C.  You can't use arbitrary precision variables or anything fancy like that, nor can you make use of low-level things like CPU overflow detection.  You need to guard against overflow mathematically, without the use of any of those systems.  Maybe you're working on a reduced architecture that doesn't provide overflow detection.  Maybe your language or library is deficient and lacks an arbitrary precision integer type.  It doesn't really matter.  The point is, taking away all those "outs", we're left with a somewhat interesting problem.

Here are some assumptions you may make:
  1. You have a constant called INT_MAX which is equal to the largest integer which can be represented on your system.  You also have another called INT_MIN, which serves a similar purpose for the smallest (most negative) integer.
  2. You're working in two's complement, but overflow is still undefined.
  3. You do not have the exact number of bits available (but you could figure it out from INT_MAX, so that's not much of a restriction).
  4. You may throw an exception to indicate an overflow condition.
Actually think about this problem for a few minutes.  I'll put the solution after the break.  I really think most people who are likely to read this blog can solve this on their own, so please actually try this.
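
(Spoilers follow.) The trick is to rearrange the comparison so the dangerous addition is never performed: test a against INT_MAX - b (or INT_MIN - b) before adding. Here's a sketch in C, using abort() as a stand-in for "throw an exception":

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Add a and b, aborting if the sum would overflow.  The guard
 * comparisons themselves can't overflow: INT_MAX - b is only
 * evaluated when b > 0, and INT_MIN - b only when b < 0. */
int checked_add(int a, int b)
{
    if (b > 0 && a > INT_MAX - b) {
        fprintf(stderr, "overflow\n");
        abort();
    }
    if (b < 0 && a < INT_MIN - b) {
        fprintf(stderr, "overflow\n");
        abort();
    }
    return a + b;  /* now known to be in range */
}
```

Note that when b is zero, neither guard fires and the addition is trivially safe.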

Sunday, July 22, 2012

Why I turned off Firefox's inline autocomplete

Firefox's location bar inline autocomplete is, for me, the single most annoying aspect of Firefox 14. Here's a list of why I'm annoyed:
  1. Pressing tab navigates to the first suggestion in the dropdown. It has no relation whatsoever to the inline suggestion, contrary to appearances, and doesn't let me edit the URL before navigating to it.
  2. Pressing backspace removes the inline suggestion, but doesn't erase an actual character that I've typed. This throws off touch typing.
  3. When I'm typing a Google search, if Firefox thinks I made a typo, it adds ">>" followed by the "correct" search terms.  I've had false positives here, and pressing enter searches the whole thing, which I find confusing.  Maybe there's a "right" way to use this feature, but if that's the case, then in my opinion it's poorly afforded.  Or maybe Mozilla really thinks people want to search for both correct and incorrect terms.  I don't know.
  4. Pressing enter goes to the autocompleted URL or search terms, instead of whatever I actually typed.  Since I often type rather quickly, this frequently results in searches for the wrong keywords.
  5. If I want to edit the autocompleted URL (e.g. to add additional fragments to it), I have to press the right arrow.  This takes me off the home row.  It is intuitive to use tab for this purpose, but that does something entirely different (see point 1).
  6. Last I checked, I wasn't able to find any addons on AMO to address any of these issues.  It's as if the new autocomplete fell out of the sky one day.  The only addons I was able to find were all about adding inline autocomplete to older versions of Firefox.
I like Firefox, and use it as my primary browser.   But I see this as a major blemish, and I went to the trouble of disabling it completely.  Here's how:
  1. Go to about:config in the location bar (Blogger refuses to link to it, unfortunately).
  2. If Firefox warns you, click through the warning.
  3. Type browser.urlbar.autoFill in the Search box.
  4. Double click the first result.
  5. You're done!  No need to restart the browser or anything.
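
Alternatively, if you keep your settings in a user.js file in your profile directory, the same preference can be set there (this assumes the pref name is unchanged in your version):

```
user_pref("browser.urlbar.autoFill", false);
```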

Saturday, July 21, 2012

SpaceChem optimization

SpaceChem is a programming video game disguised as a chemistry video game.  A static description cannot really do it justice, so here's a trailer:

Honestly, the trailer doesn't do it justice either, so maybe you should go grab the demo from Steam.  Anyway, SpaceChem has two types of puzzles: Research and Pipeline puzzles.  Research puzzles are fairly straightforward to work with because an input/output operation always takes exactly one cycle, whereas in a Pipeline puzzle, the pipes can be clogged or empty, making your I/O operations block.  This means that performance problems with one reactor can affect others.  But optimization is not an all-or-nothing affair.  You need to know where to concentrate your work.  That's what this post is about.

Thursday, July 19, 2012

Do we even need copyright?

As we established last time, copyright is an indirect subsidy.  The purpose of this is to incentivize creativity, the idea being that art for its own sake is important, but won't pay the bills on its own.  But art has a nice property: people like it, especially if it's original and well-executed.  As there's both a supply and a demand for original, creative expression, I must wonder whether the market can connect the two without the use of a subsidy.

Saturday, July 7, 2012

The worst-case scenario

A lot of the lessons from Fukushima have been obvious for a while. Nuclear safety is a global challenge, and every country has to learn from the best practices of others. These best practices include retrofitting passive safety features wherever possible, and continuing to update safety measures in response to our changing understanding of the plant's environment. -- Ars Technica (emphasis added)
This advice certainly sounds right to me.  But when I apply consequentialism, I find that the words "wherever possible" give me pause.  On the one hand, it seems that our only question should be "What effect does a given safety measure have on actual safety?"  But proper consequentialism requires a two-sided analysis, so we must also ask "How much will this cost in relation to the expected number of lives saved?  Is it worth it?"

I don't like this conclusion.

It certainly seems logical enough: if we're to make a change, we must do a proper cost-benefit analysis.  But imagine this hypothetical: you're provided with an investing opportunity.  The expected value is better than any other investment you're likely to find, and you're sure that it is legitimate.  However, it's a risky investment, with a chance of returning nothing at all, not even the money you put in.  I don't think any sane person would invest all of their money in such a scheme, and I don't think this falls under irrational risk aversion either.  Yet that is exactly what basic economics (which is very similar to utilitarianism) says we ought to do, perhaps leaving some money out for basic living expenses, but certainly without a safety net.  If someone did invest all their money like that, and ended up bankrupt, they would have only themself to blame.

Of course, most real financial analysts would never do this, instead preferring to diversify.  But I can't understand how you get from "throw all your money at the best expected value" to "diversify", and I think it's just a kludge to make the system work right.

What can we distil from this?  I think it's clear that we need to consider the worst-case scenario in any evaluation.  But which one?  A situation can always be worse.  So, to keep this reasonable, we need to limit ourselves to outcomes which are related to the proposed action: those outcomes whose probabilities are significantly affected by our proposed action.  Of those outcomes, we should focus on the very worst.  We need to ask ourselves "Is this outcome tolerable?  What effect does our decision have on its likelihood?"

But what of utilitarianism?  This doesn't seem compatible with it.  I feel that we may revise utilitarianism if we find it necessary; I may explain why in a later post.

Monday, May 14, 2012

Is Fanfiction Legal?

If you're expecting practical advice, please look elsewhere. This is intended as a high-level discussion of copyright law in general, not a practical advisory. If you need legal advice (for example, because someone sent you a DMCA takedown notice), I recommend contacting the Electronic Frontier Foundation, as they sometimes offer pro bono assistance; even if they won't take your case, they may be able to refer you to someone who will. Please don't act on anything in this post without consulting a lawyer.

This is based on US law and may be inaccurate elsewhere.

Fanfiction is fiction involving the characters, setting, or other aspects of existing fiction, and written without the original author's permission.  Many authors approve of this practice, but quite a few others are strongly opposed to it. Worse, authors may selectively enforce their copyrights as much as they please; if, say, Alice writes something, and Bob and Carol each make some fanfiction of it, Alice has every right to ignore Bob and only sue Carol, perhaps because she likes Bob's writing better or she thinks Carol is a hack. This is the case even if Alice previously said she likes fanfiction and doesn't mind people making it.

For the sake of argument, let's just stick with this Alice/Bob/Carol example. So suppose Alice actually does sue Carol. What happens? Well, it depends. Basically, Alice will need to allege one of the following, depending on the circumstances:
  • Unauthorized distribution of the original (if the fanfiction is quite similar to the original)
  • Unauthorized preparation of a derivative work of the original (if it is not).
The former is not very likely to work, unless the fanfiction contains large amounts of Alice's original text. Still, it could happen, if perhaps Carol quoted a lot of Alice's text for some reason. But it's not really relevant to our discussion, since most fanfiction doesn't have that.

Derivative works are works "based upon one or more pre-existing works." The "based upon" language suggests, to me, that a derivative work would not exist but for the original. You'll note that the statute does not say "incorporating" or "equivalent to". In fact, if the supposed derivative is "equivalent" to the pre-existing work, then it's not really a derivative at all; it's basically a copy of the original. Furthermore, Carol's derivative need not include any of Alice's text to be a derivative work.

So what are some of the defenses Carol might raise? Well, here are a few that I thought of:
  • Fair use, which is complicated and will be discussed below.
  • Estoppel, if, for instance, Alice said something like "everyone should feel free to create and distribute fanfiction of my work." This seems like a long shot to me; I'm not aware of any examples of this applied to copyright law, but Carol would probably assert it anyway since it can't hurt her. Furthermore, I don't think this would work unless Alice's statement was worded like a real copyright license, as opposed to a general statement of support such as "I think fanfiction is a good thing," or even "I'm flattered when people make fanfiction of my work."
  • Innocent infringement, if Alice's publisher somehow forgot to include a copyright notice. This would likely force Carol to take down her fanfiction, but protect her from damages. I don't think any publishers are likely to forget the copyright notice any time soon, so this is largely hypothetical.
  • Bare non-infringement, predicated on the idea-expression divide. Carol may admit that she copied certain ideas from Alice's work, but insist that she never copied any protected expressions of those ideas. I have no idea whether this would work, since to the best of my knowledge it's never been tried. On the one hand, this is often associated with attempts to copyright bare facts, as in Feist v. Rural, which Alice is not doing. But on the other hand, Baker v. Selden had little to do with facts and more to do with patentable ideas versus copyrightable expressions. The terms idea and expression don't have terribly clear definitions, in my opinion.
On to fair use. There are four components:
  1. The purpose and character of the use. Carol would probably win this since she's being creative and (presumably) isn't making money from it. But if Carol's work is too similar in story structure or wording to Alice's work, Carol might well lose this one.
  2. The nature of the copyrighted work. Alice would likely win this part, since her work is (presumably) fictional and creative rather than factual and academic.
  3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole. This is very difficult to assess since Carol didn't copy explicit passages from Alice's work. Note that this is explicitly not "just a word count"; the "in relation" language ensures that courts consider the importance of the copied portion, rather than its absolute size.
  4. The effect of the use upon the potential market for or value of the copyrighted work. This is going to be negligible or positive, so Carol will probably win this one.
The above are not evenly weighted, and there is no simple algorithm for figuring out who won, even if you know who won each part. But in general, the fourth part is the most important, although the other parts are still quite relevant. The first part is also fairly important, especially for derivative work cases.

Personally, I'd really prefer to see Carol win, and perhaps that's distorted the outcome above. So I'm not going to say who wins, especially since a court might disagree with me about the four outcomes above. But, to be honest, I think that if Carol won parts 1 and 4, she'd probably win the whole case.

Finally, I never got an opportunity to link it, but Wikipedia has a nice article about legal issues with fanfiction, which also covers trademark law (and a few other things besides) that I never got around to discussing. In short, "I do not own" is potentially effective for trademarks, though not for copyright.

Sunday, May 6, 2012

Copyright is a subsidy

The Congress shall have Power... To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries. -- Copyright Clause, United States Constitution
As you can see, the rationale for copyright, in the US at least, is about promoting progress. It has nothing to do with notional "rights" of authors, and is explicitly not a property right. But the Constitution was a long time ago; maybe laws have changed focus since then. Well, no. The Copyright Act of 1790 bills itself as "An Act for the encouragement of learning." It was amended in 1831, primarily to extend it, with very little discussion of its purpose. A major revamping was undertaken in 1909, again with very little discussion of purpose. Finally, more changes were made in 1976, which is still good law, and once more, it failed to specify a purpose for copyright. The only purpose of copyright is its original purpose; no new purpose has been specified, so we must conclude that Congress was satisfied with the Constitution and the original act.

Now, of course, everything the government does is supposed to be for our benefit, in theory, though perhaps not in practice. So naturally the question arises as to whether copyright, in the above context, really qualifies as a "subsidy" per se. I think it does. A subsidy, in traditional terms, is money paid (usually by the government) to a business or industry to encourage that business model, theoretically due to positive externalities. However, in practice, the term may also be used to describe the government consistently favoring one company or industry over another in e.g. government contracts. From this, we see a broader principle: a subsidy is an action, typically by the government, which encourages a particular behavior via an economic incentive, at the cost of a reduction in competition, for the benefit of the public at large.

Well, let's look at copyright through that lens. Copyright is defined as a limited monopoly granted by the government, so it's clearly an economic incentive which reduces competition. Theoretically, it's supposed to encourage the authorship and publication of creative works. This is supposed to benefit the public because of the mere existence of those works, and because, theoretically, they'll eventually enter the public domain for all to enjoy; this is a positive externality. That certainly seems to meet the criteria I specified above. Perhaps this definition is too liberal, but the principle is still there: copyright exists to benefit the public, not the author.

Now, I'm not saying that all copyright is inherently evil; I've never said that subsidies are necessarily bad. Indeed, it's thanks to copyright that we have many high-quality works. But I do feel that the reduction in competition means that there is a balance: there must be some point above which a particular subsidy does more harm than good, as the loss in competition eventually outweighs any public benefit. Of course, someone espousing a command economy would probably disagree with me on this point, but I feel the difference in opinion there is rather beyond the scope of this post. This principle applies quite nicely to copyright. Currently, copyright lasts for 70 years after the death of the author; essentially two and a half generations. It is not clear to me that this effectively "promote[s] the Progress" of anything. If it confers no benefit to the public as a whole, why are we doing it?

Monday, April 23, 2012

The notepad.exe problem

Quite a long time ago, there was a minor stir over notepad's treatment of text encodings. So many people talked about this that Raymond Chen even got in on the action (twice). As a Linux user, I honestly have no idea whether Microsoft ever got around to fixing this (Raymond's post suggests they consider it impossible to "fix" as such). I'm not really interested in bashing them for using an imperfect algorithm, since, to be honest, you'll never get perfection with this issue. But I do think it provides an interesting framework for a math problem.

Strictly speaking, ANSI is Windows-1252 (the term "ANSI" is itself a misnomer, but nobody cares), but in practice I think it's most commonly used as ASCII, in which case it's really equivalent to UTF-8. Now, I know that there are probably quite a lot of files out there with non-ASCII symbols encoded as 1252, but honestly, if you're trying to detect 1252 automatically, you've already lost, since there are only 5 invalid code points; this means that if you feed totally random data into an ANSI decoder, it will take, on average, about 50 bytes before the decoder complains of an invalid code point. This means that it's very likely to accept short runs of totally random data.
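
For what it's worth, the arithmetic can be spelled out (the five bytes Windows-1252 leaves unassigned are 0x81, 0x8D, 0x8F, 0x90, and 0x9D):

```c
/* 1 if b is an assigned Windows-1252 byte, 0 for the five holes. */
int cp1252_valid(unsigned char b)
{
    return b != 0x81 && b != 0x8D && b != 0x8F && b != 0x90 && b != 0x9D;
}

/* Hitting one of the five holes in random data is a geometric
 * process with p = 5/256 per byte, so the expected number of
 * bytes before the decoder complains is 1/p = 256/5 = 51.2 --
 * the "about 50 bytes" figure above. */
double expected_bytes_before_invalid(void)
{
    return 256.0 / 5.0;
}
```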

One thing I notice is that, while Raymond says (implies, probably unintentionally) the string is ANSI, it's also perfectly good UTF-8, and will even produce the same interpretation in UTF-8 as in ANSI. This is, of course, because they're both backwards compatible with ASCII, which is what our string really is. Now, it's difficult to figure out the probability of random data being valid UTF-8, since UTF-8 is variable length. But I would say that it's rather unlikely, since valid UTF-8 has quite a few constraints on its behavior: each leader byte must be followed by exactly the right number of continuation bytes, and overlong sequences are also disallowed. Then you look at Unicode itself and find that certain code points (U+FFFF, for instance) are illegal.

UTF-16, on the other hand, is a much denser representation. Most 2 byte sequences are legal, unless they happen to form an invalid surrogate pair or result in an illegal code point (again, U+FFFF and friends). Now, the "illegal code point" problem is common to both encodings, so we can ignore it for our purposes (though technically, I suppose UTF-8 has the code points U+D800 to U+DFFF illegal, which UTF-16 can't even represent). But how likely is it that you get an invalid surrogate? Well, the likelihood of a random 16-bit value being (say) a leading surrogate is 1/64, since the first 6 bits are constrained (so you get 2^-6). This is the same for a trailing surrogate. Now we need the probability of a leading surrogate followed by something which is not a trailing surrogate, or a trailing surrogate followed by anything. Well, if we crunch the numbers, we find that it takes about 31 2-byte values, or 62 bytes, before you encounter an invalid surrogate pair in random data. This is actually worse than ANSI. However, that's not a valid comparison since we've been ignoring the invalid code points problem (which is different in ANSI). But there aren't a whole lot of "defined invalid" characters in the Unicode spec, so this only reduces the count a little (remember, the "reserved" U+D800 to U+DFFF cannot be safely represented in UTF-16; that's why they're reserved. Any attempt to do so will produce a (possibly illegal) surrogate pair, which we've already covered).
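
To make the pairing rules concrete, here's a sketch of such a scan over raw 16-bit code units (the function name and interface are my own invention, not any standard API):

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the code-unit sequence pairs its surrogates
 * correctly: every leading surrogate (0xD800-0xDBFF) must be
 * immediately followed by a trailing surrogate (0xDC00-0xDFFF),
 * and no trailing surrogate may appear on its own. */
int utf16_surrogates_ok(const uint16_t *units, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (units[i] >= 0xD800 && units[i] <= 0xDBFF) {
            /* leading surrogate: demand a trailing one next */
            if (i + 1 >= n || units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF)
                return 0;
            i++;  /* consume the pair */
        } else if (units[i] >= 0xDC00 && units[i] <= 0xDFFF) {
            return 0;  /* stray trailing surrogate */
        }
    }
    return 1;
}
```

A lone 0xD800, or a 0xDC00 anywhere except right after a leading surrogate, is rejected immediately; everything else passes this particular check.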

Earlier I said it's difficult to find the probability for UTF-8. I can, however, find the probability of a single code point being invalid. This is less useful since it's not measured in bytes, but it will at least tell us how many characters your decoder will emit before it yells at you. So an invalid UTF-8 "character" is any of the following:
  1. Begins with 0b10 (begins with a continuation byte, or the previous character has too many continuations). (p=1/4)
  2. Begins with 0b110, but the next byte doesn't begin with 0b10. (p=3/32)
  3. Begins with 0b1110, but either (or both!) of the next two bytes doesn't begin with 0b10. (p=15/256)
  4. Begins with 0b11110, but any of the next 3 bytes isn't a continuation. (p=63/2048)
  5. Begins with 0b11111 (too long). (p=1/32)
  6. Begins with a valid leader byte, but the file ends before the requisite number of continuations have appeared. (p is impossible to calculate without knowing more about the file)
These possibilities are mutually exclusive (except for 6). We've ignored the issue of overlong encodings (other than 5; things beginning with 0xC0 and 0xC1) because they're difficult to account for (they unpleasantly overlap with 2) and many decoders will silently allow them anyway, as they're unambiguous. If we assume the random data is long enough to avoid 6, we get P=951/2048, or about 0.46. So your decoder will emit, on average, one character before it notices something illegal; this is equivalent to "no more than 4 bytes" since there's at most 4 bytes in a UTF-8 character. This is why UTF-8 is not supposed to need a BOM: it is very discriminating already, and it's highly unlikely for random data to be accepted as valid UTF-8. Just look at the PDF chart: after 4 characters (at most 16 bytes), random data has less than a 10% chance of being accepted. Of course, real data is seldom random, so applying this to IsTextUnicode is iffy at best. All this tells us is that, ceteris paribus, UTF-8 is much more likely to reject something which isn't UTF-8 than the other encodings.
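
As a sanity check on the arithmetic, the sum of cases 1 through 5 can be spelled out directly (case 6 excluded, as above):

```c
/* Probability that a "character" starting at a random byte is
 * invalid, summing cases 1-5.  A random byte is a continuation
 * byte (0b10xxxxxx) with probability 1/4. */
double p_invalid_utf8_char(void)
{
    double p1 = 1.0 / 4;                       /* case 1: 512/2048 */
    double p2 = (1.0 / 8)  * (3.0 / 4);        /* case 2: 192/2048 */
    double p3 = (1.0 / 16) * (1 - 1.0 / 16);   /* case 3: 120/2048 */
    double p4 = (1.0 / 32) * (1 - 1.0 / 64);   /* case 4:  63/2048 */
    double p5 = 1.0 / 32;                      /* case 5:  64/2048 */
    return p1 + p2 + p3 + p4 + p5;             /* 951/2048 ~ 0.464 */
}
```

With p = 951/2048, the expected number of valid characters before the first failure is (1 - p)/p, or about 1.15, which is where "one character, on average" comes from.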

So where does this leave us? Well, in my opinion, it leaves us with the conclusion that ANSI should be scrapped, since it's both undiscriminating and BOM-less, making it rather difficult to autodetect; furthermore, UTF-8 provides ASCII compatibility and a much more diverse character set. But Microsoft obviously can't do that: lots of legacy programs use ANSI, and Microsoft hates breaking such programs, which (unlike text files) cannot be easily converted. For that matter, I'm not a Windows programmer, but I don't think the ANSI interface is even deprecated in the first place!

I suppose it also suggests that Microsoft consider UTF-8 before other encodings, but as Raymond Chen said, "[N]o matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose"; there's no general solution to this problem. And as we said before, real data may not follow the probability distributions that random data does.

Sunday, April 22, 2012

Truth in turnstiles

We're often told that "logic" is at the heart of mathematics. We're told that you cannot do math without logic, that mathematics requires it. Personally, I subscribe to the view of formalism, which basically says that math is the manipulation of symbols according to rules. Do we need full-blown propositional logic to manipulate symbols? I hope not.

We need to decide what we mean by "logic." There are a lot of different kinds, believe it or not, and it's not just a matter of expressive power. Alternative logics often entirely disagree with "classical" or typical logic. For instance, paraconsistent logic will tolerate a contradiction without going nuts, and relevance logic takes perhaps a more intuitive approach to material implication, though it's also more complex. When we say "logic," we often mean classical logic, but I honestly cannot see why it should be privileged over other logics. Classical logic is more intuitive than paraconsistent logic, but at the cost of increased complexity (and fragility). On the other hand, it's less intuitive than relevance logic, for the sake of simplicity. Now, maybe this is a happy medium along a spectrum of intuition versus simplicity. But it seems to me that if you're building the foundation of mathematics, you'll favor simplicity over intuition; a happy medium isn't your goal.

The foundation of mathematics, in truth, is entailment, represented by the turnstile. Entailment basically means "given some statements (which look like) X, you can prove another statement (which looks like) Y." Note that this is very different from material implication, which isn't even meaningful at this level (because we haven't even picked a logic yet, let alone laid out its axioms). The actual axioms you're working with can then be defined using the turnstile. Those axioms are often logical in nature, to lay a strong foundation for other mathematical principles (such as ZFC), but there's no requirement that we use logic at this stage. You could, for instance, define the rules of untyped lambda calculus (which are themselves fairly simple) using only the turnstile, and then you would have a Turing-complete system that was defined without logic at all!
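
For instance, beta reduction can be stated as a bare entailment, with no logical connectives anywhere (a sketch; M[x := N] denotes capture-avoiding substitution):

```latex
% Beta reduction as pure symbol manipulation: from any redex you
% may derive its contractum.  No connectives, no truth values.
(\lambda x.\, M)\, N \;\vdash\; M[x := N]
```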

So where does this leave us? Different logics have different uses, and we can pick which one we want or need on a case-by-case basis. The turnstile is used to formalize this detachment, making it a framework for choosing a logical system. But we should remember that, technically, we don't need to use a logical system at all. It just happens to be quite convenient to do so.

Tuesday, April 10, 2012

Sherlock Holmes was wrong

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. --Sherlock Holmes, A Scandal in Bohemia
To the average person, the above quote seems eminently reasonable. If you don't know what you're talking about, you probably shouldn't guess. But, truth be told, he's dead wrong.

One of the less obvious fallacies we have to contend with is the Texas sharpshooter fallacy. It happens when you theorize after getting your data, when you twist theories to suit facts. But why is this a fallacy? Well, suppose that I hand you a long list of numbers and ask you to theorize about it. You spend hours carefully examining the data, feeding it through complex and abstruse equations, until eventually you come up with a baroque formula that exactly reproduces every number in the list.

But then I give you some more numbers, generated by the same process, and your formula gives totally wrong answers! How can this be? Well, the numbers were randomly generated all along; there never was a pattern. Now, this is a rather extreme case, but humans are wired to find patterns in everything.

Here's another example, one which actually happens from time to time. Suppose you have a large sample from a huge population, and you start taking cross-sections of that data. In other words, you start pulling out and examining chunks of data at a time. You take a lot of cross-sections, and eventually, one of them shows some unusual pattern. If you assume that pattern also holds in the larger population, you are committing the fallacy. You're already working from a sample; taking samples of that sample is iffy at best. You've already assumed the larger sample to be representative, and you're now adding the assumption that the smaller sample is also good. Worse, you've specifically selected for the particular cross-section that shows an unusual result! If any of your cross-sections shows an unusual result, purely by chance, you'll find it sooner or later.

So if we shouldn't twist theories to suit facts, does that force us to do the reverse? Certainly not! The proper way to do this sort of reasoning is to theorize in advance, and be prepared to discard your theory as soon as it looks wrong. This way, you're not twisting either to suit the other: you're scrapping bad theories and replacing them with (hopefully) good theories. Of course, you will let your previous theories influence your new theory, so in a sense it could be likened to twisting your theories, but really, the important thing here is that you have a viable theory before you receive new data, so that you can test your theory on data it wasn't based on. In a way, basing your theory on the data it's supposed to explain is cheating, since you're not really testing anything; you're just finding a pattern.