Mining

Many researchers have only very limited data to work with and are not necessarily required to know the perils of data mining. With abundant data, however, the perils must be fully recognized. The most popular paper in modern mid-ocean-ridge geochemistry and geophysics reported a correlation between sodium content (adjusted to a constant magnesium content) and the thickness of oceanic crust. When the chief author of that awe-inspiring paper was asked how one could even think of such a correlation, she said that with the large global data set they had just collected, they simply plotted anything against anything else (what a smart way not to miss any possible correlation, I thought at the time), and the above-mentioned correlation appeared. They came up with a grandiose explanation for it. I was never able to fully appreciate that paper, as most of the correlation comes from one outlier location (Iceland), and on the local scale (meaning each location studied along mid-ocean ridges) the correlation either disappears or is reversed. In addition, my test showed that the adjustment to a constant magnesium content by itself produces the same correlation.
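The outlier effect can be illustrated with a small sketch on purely synthetic numbers. Nothing here is the geochemical data from that paper: the function names, the ten "ordinary" locations, and the single displaced location are all hypothetical choices made for illustration. Within every location the two variables are independent noise, yet pooling the locations produces a strong correlation because one location sits far from the rest.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def make_locations(n_ordinary=10, n_points=30, seed=1):
    """Synthetic (x, y) samples per location; x and y are independent noise."""
    rng = random.Random(seed)

    def cloud(center):
        # A cloud of points scattered around (center, center).
        return [(rng.gauss(center, 1.0), rng.gauss(center, 1.0))
                for _ in range(n_points)]

    locations = [cloud(0.0) for _ in range(n_ordinary)]
    # One hypothetical outlier location, displaced along both axes.
    locations.append(cloud(8.0))
    return locations


def pooled_and_local(locations):
    """Correlation over all points pooled, and separately within each location."""
    pooled = [p for loc in locations for p in loc]
    pooled_r = pearson([x for x, _ in pooled], [y for _, y in pooled])
    local_r = [pearson([x for x, _ in loc], [y for _, y in loc])
               for loc in locations]
    return pooled_r, local_r


if __name__ == "__main__":
    pooled_r, local_r = pooled_and_local(make_locations())
    print(f"pooled r = {pooled_r:.2f}")
    print("local r  =", [round(r, 2) for r in local_r])
```

Comparing the pooled coefficient with the per-location coefficients is the same check described above: if the relationship were real, it should also show up within locations, not only across them.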

To apply the plot-anything-against-anything-else approach to trading, I tested trading methods based on all combinations of a very large number of entry and exit criteria. Needless to say, a large number of trading methods that produced daily returns of over one percent appeared. Once I tested these methods further on outside trading data, however, all their grandiosity disappeared. The sodium correlation above has not been similarly tested.
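A minimal sketch of why this happens, under loudly stated assumptions: the "market" below is pure random noise, and each "trading method" is just a random in-or-out signal standing in for a combination of entry and exit criteria. None of this reproduces my actual tests, and the names and parameters are hypothetical. The point is only that when enough methods are tried, the best one looks good on the data used to select it, while its result on outside data is no better than chance.

```python
import random


def mean_return(signal, returns):
    """Average daily return of a method that holds a position when signal is 1."""
    return sum(s * r for s, r in zip(signal, returns)) / len(returns)


def mine_methods(n_methods=2000, n_days=250, seed=0):
    """Pick the best of many random methods in-sample, then score it on outside data."""
    rng = random.Random(seed)
    # Synthetic daily returns: zero-mean noise, so no method has a real edge.
    in_sample = [rng.gauss(0.0, 0.02) for _ in range(n_days)]
    outside = [rng.gauss(0.0, 0.02) for _ in range(n_days)]

    results = []
    for _ in range(n_methods):
        # A random 0/1 signal per day stands in for one entry/exit combination.
        signal = [rng.randint(0, 1) for _ in range(2 * n_days)]
        results.append((
            mean_return(signal[:n_days], in_sample),
            mean_return(signal[n_days:], outside),
        ))

    # Select the method purely on its in-sample performance.
    best_in, best_outside = max(results, key=lambda pair: pair[0])
    return best_in, best_outside


if __name__ == "__main__":
    best_in, best_outside = mine_methods()
    print(f"best method, in-sample mean daily return: {best_in:.4%}")
    print(f"same method, outside-data mean return:    {best_outside:.4%}")
```

Raising `n_methods` makes the in-sample winner look better still, without improving what it does on the outside data.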



Modern science must have created many emperors that have no clothes. I found one when I was a graduate student (Geology; March 2001; v. 29; no. 3). An important technique in petrology is called crystal size distribution (CSD): after the crystals in a rock are counted and their sizes measured, the plot of crystal numbers vs. crystal sizes is explained in terms of important and hard-to-constrain parameters such as crystal growth rate, nucleation density, and residence time. Fortunately, my hard-won learning in CSD did not prevent me from realizing the simple truth, learned in elementary school, that it is total crystal volume that fully determines crystal numbers and sizes, and nothing else: not growth rate, not nucleation density, not residence time.

Unfortunately, my career in science came to an end after I shouted that the emperor has no clothes. At the time I did not realize that I deserved that fate, as I had made so many professors miserable, and that I really have a speculator's heart.

By the way, crystal number in CSD is actually called crystal population density, which is still a crystal number, but measured per unit volume of the system and per unit crystal size. Confused? This is why CSD practitioners have published about one thousand papers and have spent a sizable amount of funding.

