Tuesday, April 10, 2012

Sherlock Holmes was wrong

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. --Sherlock Holmes, A Scandal in Bohemia

To the average person, the above quote seems eminently reasonable. If you don't know what you're talking about, you probably shouldn't guess. But, truth be told, Holmes is dead wrong.

One of the less obvious fallacies we have to contend with is the Texas sharpshooter fallacy. It happens when you theorize after getting your data, when you twist theories to suit facts. But why is this a fallacy? Well, suppose that I hand you a long list of numbers and ask you to theorize about it. You spend hours carefully examining the data, feeding it through complex and abstruse equations, until eventually you come up with a baroque formula that exactly reproduces every number in the list.

But then I give you some more numbers, generated by the same process, and your formula gives totally wrong answers! How can this be? Well, the numbers were randomly generated all along; there never was a pattern. Now, this is a rather extreme case, but humans are wired to find patterns in everything, so milder versions of the same mistake are easy to make.
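To make the thought experiment concrete, here is a minimal sketch in Python. It assumes the "baroque formula" is a Lagrange interpolating polynomial, which is just one convenient way to hit every point exactly. The function name `lagrange_predict`, the seed, and the list sizes are all made up for illustration.

```python
import random
from fractions import Fraction

def lagrange_predict(values, x):
    """Evaluate at x the unique polynomial passing through (i, values[i])."""
    total = Fraction(0)
    n = len(values)
    for i, y in enumerate(values):
        term = Fraction(y)
        for j in range(n):
            if j != i:
                term *= Fraction(x - j, i - j)
        total += term
    return total

rng = random.Random(2012)  # arbitrary seed, chosen only for repeatability
seen = [rng.randint(0, 99) for _ in range(8)]    # the list I hand you
unseen = [rng.randint(0, 99) for _ in range(4)]  # more numbers, same process

# The formula reproduces every number it was built from...
fit = [lagrange_predict(seen, i) for i in range(len(seen))]
print(fit == seen)  # True by construction

# ...but it has nothing to say about numbers it has not seen.
for k, actual in enumerate(unseen, start=len(seen)):
    print(k, float(lagrange_predict(seen, k)), actual)
```

The perfect fit is guaranteed by the construction, so it tells you nothing about whether a pattern exists. Only the comparison against the unseen numbers is an actual test.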

Here's another example, one which actually happens from time to time. Suppose you have a large sample from a huge population, and you start taking cross-sections of that data. In other words, you pull out and examine one chunk of the data at a time. You take a lot of cross-sections, and eventually, one of them shows some unusual pattern. If you assume that pattern also holds in the larger population, you are committing the fallacy. You're already working from a sample; taking samples of that sample is iffy at best. You've already assumed the larger sample to be representative, and you're now adding the assumption that the smaller sample is also representative. Worse, you've specifically selected for the particular cross-section that shows an unusual result! If any of your cross-sections shows an unusual result, purely by chance, you'll find it sooner or later.
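The cross-section hunt can be sketched the same way. This is a rough illustration that assumes the "population" is nothing but fair coin flips, so any striking chunk is pure chance. The helper `most_extreme_slice`, the seed, and the chunk size are hypothetical choices, not anything from a real study.

```python
import random

def most_extreme_slice(sample, size):
    """Return (start_index, mean) of the non-overlapping chunk whose mean
    is farthest from the mean of the whole sample."""
    overall = sum(sample) / len(sample)
    best = None
    for start in range(0, len(sample) - size + 1, size):
        chunk = sample[start:start + size]
        mean = sum(chunk) / size
        if best is None or abs(mean - overall) > abs(best[1] - overall):
            best = (start, mean)
    return best

rng = random.Random(12)  # arbitrary seed
sample = [rng.randint(0, 1) for _ in range(1000)]  # 1000 fair coin flips

overall = sum(sample) / len(sample)
start, mean = most_extreme_slice(sample, 20)  # go looking through 50 chunks
print("whole sample:", overall)
print("most unusual chunk starts at", start, "with mean", mean)
```

With fifty chunks to choose from, the most unusual one will usually sit noticeably further from one half than the whole sample does, even though every flip came from the same fair coin.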

So if we shouldn't twist theories to suit facts, does that force us to do the reverse? Certainly not! The proper way to do this sort of reasoning is to theorize in advance, and be prepared to discard your theory as soon as it looks wrong. This way, you're not twisting either to suit the other: you're scrapping bad theories and replacing them with (hopefully) good theories. Of course, you will let your previous theories influence your new theory, so in a sense it could be likened to twisting your theories. But the important thing is that you have a viable theory before you receive new data, so that you can test your theory on data it wasn't based on. In a way, basing your theory on the data it's supposed to explain is cheating, since you're not really testing anything; you're just finding a pattern.
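Here is a small sketch of what testing a theory on data it wasn't based on might look like. It assumes a toy setup of fifty fair coins, and the helper `heads_per_coin` and the seed are invented for the example. The theory ("this coin is biased toward heads") is fixed after the first batch of flips and then checked against a second batch it has never seen.

```python
import random

def heads_per_coin(rng, coins, flips):
    """Flip each of `coins` fair coins `flips` times; return the heads counts."""
    return [sum(rng.randint(0, 1) for _ in range(flips)) for _ in range(coins)]

rng = random.Random(4)  # arbitrary seed
first = heads_per_coin(rng, coins=50, flips=20)

# Theory, stated before any new data arrives:
# "the coin with the most heads so far is biased toward heads."
suspect = max(range(len(first)), key=first.__getitem__)

# Now test the theory on flips it was not based on.
second = heads_per_coin(rng, coins=50, flips=20)
print("suspect coin:", suspect)
print("heads in the batch that suggested the theory:", first[suspect])
print("heads in the fresh batch:", second[suspect])
```

The first count is the largest of fifty by selection, so it cannot count as evidence. Only the second count, which the theory had no hand in choosing, says anything about whether the coin is really biased.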