Those who choose not to read good books have no advantages over those who cannot read. (Attributed to Mark Twain.) A similar thing applies to research and data: those who do not collect (and scrutinize) their own data have no advantages over those who get their ideas and data from journalists or poor data suppliers. I would venture an educated guess that most managed investment money is handled by managers who get their information from journalists. Gentlemen, that's your competition. Go forth and prosper.

In quantitative analysis (irrespective of whether its data origins are financial statements or market prices) the guy with the best data has a definite advantage. Conversely, the best analytical mind coupled with poor quality data is at a disadvantage. Let me first deal with the problems of price data.

In equities, back-testing requires using dead (delisted) stocks to eliminate survivor bias. That means that to test the Russell 3000 over, say, 15 years, you need data on maybe 8,000 stocks. You cannot collect those by symbol, because symbols get recycled. So you try SEDOL or CUSIP numbers, but even those have problems. Even the holy of holies, CRSP, has problems. And you cannot simply toss out the missing stocks without introducing bias. Also note that the number of constituents in the R3k shrinks month by month, and the index is reconstituted annually.
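One way to cope with recycled symbols is to resolve every (symbol, date) pair to a permanent identifier before any prices are joined. The sketch below is only an illustration: the `SYMBOL_HISTORY` table, the `resolve` helper and the `ID000x` identifiers are all invented, and in practice the identifier source itself (SEDOL, CUSIP, CRSP) would need the same scrutiny.

```python
from datetime import date

# Hypothetical symbol history: (symbol, first_day, last_day, permanent_id).
# Every row here is invented for illustration.
SYMBOL_HISTORY = [
    ("ABC", date(2001, 1, 2), date(2005, 6, 30), "ID0001"),  # delisted company
    ("ABC", date(2008, 3, 3), date.max, "ID0002"),           # symbol reused later
    ("XYZ", date(1999, 5, 3), date.max, "ID0003"),
]


def resolve(symbol, on):
    """Return the permanent id that `symbol` referred to on date `on`."""
    matches = [
        pid
        for sym, first, last, pid in SYMBOL_HISTORY
        if sym == symbol and first <= on <= last
    ]
    if len(matches) != 1:
        # Zero or several matches both mean the mapping needs a human look.
        raise LookupError(f"{symbol} on {on}: {len(matches)} candidates")
    return matches[0]


print(resolve("ABC", date(2003, 1, 15)))  # the delisted company
print(resolve("ABC", date(2010, 1, 15)))  # the later company using the same symbol
```

Raising an error on zero or multiple candidates, rather than guessing, keeps the missing stocks visible instead of silently dropping them.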

Obviously you have to adjust for dividends because you want to compare total returns. That introduces the dividend adjustment problem: do you use multiplicative adjustment or subtraction? Either one is problematic: multiplying destroys round numbers in the price history, and subtracting can push old prices below zero.
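A minimal sketch of the two back-adjustment choices follows. The prices and dividends are made up, the function names are mine, and the multiplicative factor (1 minus the dividend divided by the prior close) is one common convention that I am assuming here, not necessarily the one any particular vendor uses.

```python
def adjust_multiplicative(closes, dividends):
    """Scale every close before an ex-date by (1 - dividend / prior close)."""
    adjusted = list(closes)
    for i in range(1, len(closes)):
        if dividends[i] > 0:
            factor = 1.0 - dividends[i] / closes[i - 1]
            for j in range(i):
                adjusted[j] *= factor
    return adjusted


def adjust_subtractive(closes, dividends):
    """Subtract each dividend from every close before its ex-date."""
    adjusted = list(closes)
    for i in range(1, len(closes)):
        if dividends[i] > 0:
            for j in range(i):
                adjusted[j] -= dividends[i]
    return adjusted


# Hypothetical history: a stock that was cheap early on and paid dividends later.
closes = [2.00, 10.00, 9.00, 20.00, 18.50]
dividends = [0.0, 0.0, 1.00, 0.0, 1.50]  # dividends[i] goes ex on day i

mult = adjust_multiplicative(closes, dividends)
sub = adjust_subtractive(closes, dividends)
print([round(p, 3) for p in mult])  # the round 2.00 becomes a fractional price
print([round(p, 3) for p in sub])   # the earliest price goes negative
```

With these numbers the multiplicative series stays positive but the original 2.00 print is no longer recognizable, while the subtractive series keeps dollar differences intact and produces a negative first price.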

However, the problems with data create tremendous opportunities for those who mine it. You know you are on to something when:

1. A major data provider has all of the dates of certain data off by one day (a systematic error). You call to ask why that is, and they don't have any idea what you are talking about. "How can you possibly know a priori that their data is wrong?" So you quickly reverse yourself and apologize for being mistaken. Everyone who uses that data has the error. They are counting things that are impossible.

2. You circumvent data suppliers and go directly to the exchange (or government website) because intermediaries screw it up (an idiosyncratic error). Hey, you cannot expect data replication to be perfect.

3. You disregard seasonally adjusted data in favor of raw data, and do your own seasonal adjustment. You cannot do this for every dataset, but certainly for the important ones.

4. A free provider (e.g. government or an exchange) provides detailed instructions on how to data mine their site. But the instructions are wrong. You call and the service people don't know what you are talking about. You eventually get to speak to the geeks and somehow learn the right way to get access. They confirm that no one had those problems before. WHY? Because no one else is looking at the data. He shoots; he scores!
These examples are like lifting back the bride's burqa, thinking that she might have a beard, and being surprised that she is absolutely beautiful.
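The one-day offset in item 1 is the kind of systematic error that can be tested mechanically once you hold a copy from the original source. The sketch below is an assumption-laden illustration: `best_day_shift` is a name I made up, the data is invented, and it shifts by calendar days where a real check would probably shift by trading days.

```python
from datetime import date, timedelta


def best_day_shift(vendor, source, max_shift=3):
    """Find the day shift that best lines vendor dates up with source dates.

    vendor and source map date -> value. A result of (-1, n) means that moving
    every vendor date back one day reproduces n source observations.
    Returns (0, 0) if no shift reproduces any observation.
    """
    best_shift, best_hits = 0, 0
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(
            1
            for day, value in vendor.items()
            if source.get(day + timedelta(days=shift)) == value
        )
        if hits > best_hits:
            best_shift, best_hits = shift, hits
    return best_shift, best_hits


# Hypothetical data: the vendor copy carries every observation one day late.
source = {date(2020, 1, 6) + timedelta(days=i): 100 + i for i in range(5)}
vendor = {day + timedelta(days=1): value for day, value in source.items()}

shift, hits = best_day_shift(vendor, source)
print(shift, hits)
```

If the best shift is anything other than zero and it explains nearly every observation, the error is systematic rather than idiosyncratic.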


a. When at all possible, go directly to the source. That may mean the exchanges or the government agency itself rather than your data supplier, and may appear unnecessary on the surface. But if you want to find the mistakes that most cannot find, you have to look in different places.

b. Look for site or download counters and check them out. Come back to them and recheck the numbers later to see the average daily hit rate. I was absolutely delighted to learn that I was one of only four downloaders of certain data.

c. Further check that data (with the counter) to see if it is available on Bloomberg or another major source.

d. Look for alternatives to the data you seek. The alternatives might not be the exact data, but they may be good surrogates. Real numbers for something close to what you want are better than bullsh*t numbers from a poorly conducted survey.

e. I cannot overemphasize the importance of checking the data, and checking that your data mining routine has collected it properly. Errors (either systematic or idiosyncratic) regularly occur. As renowned data cruncher John Tukey said, "There is no substitute for looking at the data." (Exploratory Data Analysis)
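Point e can be partly automated, although it is no replacement for actually looking at the numbers. The following is a rough sketch of a post-collection check; the `check_series` name and the four categories are my own choices, and the weekday-gap test will also flag exchange holidays, which need to be screened out by hand or with a calendar.

```python
from datetime import date, timedelta


def check_series(rows):
    """rows: list of (date, price) in the order collected. Returns problems found."""
    problems = {
        "duplicates": [],
        "out_of_order": [],
        "non_positive": [],
        "weekday_gaps": [],
    }
    seen = set()
    prev = None
    for day, price in rows:
        if day in seen:
            problems["duplicates"].append(day)
        seen.add(day)
        if prev is not None and day < prev:
            problems["out_of_order"].append(day)
        if price <= 0:
            problems["non_positive"].append(day)
        prev = day
    if seen:
        # Walk the calendar between the first and last date, flagging missing weekdays.
        day, last = min(seen), max(seen)
        while day <= last:
            if day.weekday() < 5 and day not in seen:
                problems["weekday_gaps"].append(day)
            day += timedelta(days=1)
    return problems


# Hypothetical download with a repeated row, a bad price and a missing Wednesday.
rows = [
    (date(2021, 3, 1), 10.0),
    (date(2021, 3, 2), 10.5),
    (date(2021, 3, 2), 10.5),
    (date(2021, 3, 4), -1.0),
    (date(2021, 3, 5), 10.7),
]
report = check_series(rows)
for kind, days in report.items():
    print(kind, days)
```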

Typical problems you have to avoid:

- Look-ahead bias and survivor bias

- Lack of statistical significance - engineers typically require 30-50 observations, but market traders (such as technical analysts) frequently consider one event as significant. Don't do that!

- Testing on a sample of data that may not include the pattern. The solution to that of course is to always use the population, rather than a sample.

- Frequently you may test something on an index as a precursor to testing on thousands of individual stocks (or worse, options). But indices do not necessarily behave like individual stocks. ETFs might be a solution, but they are in themselves just smaller indices.

Those who challenge the validity of data mining (and also market timing) tend to cite as their proof that the first-order daily changes in stock prices are random. We can concede that point, but there are many more relationships to be studied than daily changes.

Data mining can be successful for any number of reasons, but the juiciest fruit is to be found in the following ways:

- Analysis of data that is unknown or unseen by most people or, better yet, subject to systematic error (~finding buried treasure)

- Better analysis of existing data. (~having a better brain) Note that some of this will be serendipitous. Exploration by definition will lead you to discovering things you did not expect.

- Incredible persistence (hard work).

Should you seek to do fundamental analysis you will find different and more exasperating problems. We know many of them first hand.

The first problem is that the data is not easily accessed. It tends to cost quite a lot of money, and much of it has systematic errors. We have not found a commercial data supplier that did not have systematic errors.

The cost can be prohibitive. The major high-end quote provider places limits on the amount of data one can retrieve in a given period. We also know that this provider uses a lot of humans in the process and has a lot of errors. Looking further afield, we found three fundamental data vendors. One replied quickly with a quote of $30,000 for the back data and a 1-year subscription going forward. Another came back a month later and wanted $72,000 for the same, and the third never came back to us. When I informed the higher-priced service that they were above their competitors, they asked if their "pricing committee" could know what we were quoted by their competitors. That does not make one comfortable: what kind of business does not know what its competitors charge, particularly one that makes a point of having a pricing committee?

There is an alternative to buying fundamental data – getting it yourself. In theory this should be straightforward: the S.E.C. has all of the relevant files online. But if you are looking to get data on say the R3k for 15 years, you will have to collect it from approximately a half-million 10K and 10Q files.

Uniformity is generally not the rule, and you need some uniformity when doing computer mining. For example, sometimes a 10Q will be labeled "Ten Q" which has to be planned for. Unless you have access to a lot of people from the sub-continent, you want to do this automatically, which will also enable you to avoid things like transposition errors committed by humans. But some things are easier for humans than for computers. For example, most data constituting a company's total assets are listed as "Total Assets". Sometimes that is misspelled, and sometimes the number appears with a double underline, and other times without. Usually the next line starts with "Liabilities", but not always. It's laughable, but not fun.
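To give a flavor of what planning for non-uniformity looks like, here is a small sketch that normalizes form labels and finds a "Total Assets" line despite misspellings. It is not our production code: the function names, the 0.85 similarity threshold and the sample lines are all assumptions made up for the example, and it takes the last number on the line, which may be the wrong column in a multi-period statement.

```python
import re
from difflib import SequenceMatcher

FORM_LABEL = re.compile(r"\b(10|ten)[\s-]*([kq])\b", re.IGNORECASE)
# Last number on the line, optionally wrapped in "$" or parentheses.
TRAILING_NUMBER = re.compile(r"\(?\$?\s*([\d,]+(?:\.\d+)?)\)?\s*$")


def normalize_form(label):
    """Map variants such as '10Q', '10-Q' or 'Ten Q' to '10-Q' (same for K)."""
    m = FORM_LABEL.search(label)
    return f"10-{m.group(2).upper()}" if m else None


def find_total_assets(lines, threshold=0.85):
    """Return the number on the first line whose label resembles 'total assets'."""
    for line in lines:
        m = TRAILING_NUMBER.search(line)
        if not m:
            continue  # underline rows and headings carry no number
        label = re.sub(r"[^a-z ]", "", line[: m.start()].lower())
        label = re.sub(r"\s+", " ", label).strip()
        if SequenceMatcher(None, label, "total assets").ratio() >= threshold:
            return float(m.group(1).replace(",", ""))
    return None


# Hypothetical balance-sheet fragment with a misspelled label and a double underline.
sample = [
    "Total current assets          500",
    "Total Asets ........ $ 1,234,567",
    "====================",
    "Liabilities",
]
print(normalize_form("Quarterly report on Form Ten Q"))
print(find_total_assets(sample))
```

Fuzzy matching on the label tolerates a dropped letter while still rejecting "Total current assets", but the threshold is a judgment call and every miss still ends up in the manual pile.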

We cut our teeth on a subset of the universe, REITs. The 172 that we found interesting had approximately 8,000 10K and 10Q files. After a lot of work we managed to get data cleanly from all but about 50. We consider that a major success, but even that low failure rate means we will have to go through about 3,000 files manually for the entire universe of a half-million files.
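The extrapolation above is simple arithmetic, shown here so the figures can be rerun with different inputs. It assumes, as the paragraph does, that the failure rate seen in the REIT subset carries over to the whole universe, which is by no means guaranteed.

```python
# Figures from the REIT pilot described above.
files_parsed = 8_000
files_failed = 50
universe_files = 500_000  # the "half-million" estimate for the full universe

failure_rate = files_failed / files_parsed
manual_files = round(failure_rate * universe_files)

print(f"failure rate: {failure_rate:.3%}")
print(f"files to review by hand: {manual_files}")
```

That works out to a failure rate of roughly 0.6% and a little over 3,000 files to handle manually.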

The good news is that having unrestricted access to such data provides a lot of opportunities. We are making a leap of faith that the data and our analysis will improve our existing results. Of course there isn't a guarantee, but that's the way to bet.

Phil McDonnell writes: 

Thanks to Bill for his excellent survey of data collection techniques and especially the pitfalls. There is little to add to his survey except one thing: retroactive changes to data. To handle that case, one needs to timestamp the data with the time it was received. This caution applies to both fundamental data and price data, which can be 'adjusted' a day or two later.

The worst example of this was Enron. The Enron data that showed the fraud was only released several years after the bankruptcy.
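A minimal sketch of the timestamping idea: store every value together with the time it was received, never overwrite, and answer queries "as of" a given moment so that a back-test only sees what was knowable then. The `PointInTimeStore` class, the key layout and the numbers are hypothetical and are not real figures for any company.

```python
from datetime import datetime


class PointInTimeStore:
    """Append-only store that remembers when each value was received."""

    def __init__(self):
        self._rows = {}  # key -> list of (received_at, value)

    def record(self, key, value, received_at):
        self._rows.setdefault(key, []).append((received_at, value))

    def as_of(self, key, when):
        """Latest value received on or before `when`, or None if none existed yet."""
        known = [(t, v) for t, v in self._rows.get(key, []) if t <= when]
        return max(known, key=lambda tv: tv[0])[1] if known else None


store = PointInTimeStore()
key = ("XYZ", "2001Q4", "net_income")  # hypothetical company, period and field

store.record(key, 100.0, datetime(2002, 2, 15))  # as originally reported
store.record(key, 40.0, datetime(2004, 6, 1))    # restated years later

print(store.as_of(key, datetime(2003, 1, 1)))  # what a trader knew in 2003
print(store.as_of(key, datetime(2005, 1, 1)))  # what is known after the restatement
```

Testing against the restated figure for a 2003 decision would be a form of look-ahead bias, which is exactly what the timestamp prevents.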

