Ramblings About Quantitative Research, from Bill Rafter
March 25, 2013
Those who choose not to read good books have no advantage over those who cannot read. (Attributed to Mark Twain.) A similar thing applies to research and data: those who do not collect (and scrutinize) their own data have no advantage over those who get their ideas and data from journalists or poor data suppliers. I would venture an educated guess that most managed investment money is handled by managers getting their information from journalists. Gentlemen, that's your competition. Go forth and prosper.
In quantitative analysis (irrespective of whether its data origins are financial statements or market prices) the guy with the best data has a definite advantage. Conversely, the best analytical mind coupled with poor quality data is at a disadvantage. Let me first deal with the problems of price data.
In equities, back-testing requires including deceased stocks to eliminate survivorship bias. That means that to test the Russell 3000 over, say, 15 years, you need data on perhaps 8,000 stocks. You cannot collect those by symbol, because symbols get recycled. So you try SEDOL or CUSIP numbers, but even those have problems. The holy of holies, CRSP, has problems. And you cannot simply toss out the missing stocks without introducing bias. Also note that the constituents of the R3k decrease monthly and are refreshed annually.
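One way to keep delisted names in the test set is a point-in-time membership table keyed by a permanent identifier rather than a recyclable ticker. A minimal sketch, with made-up IDs and dates (real work would use something like CRSP PERMNOs):

```python
from datetime import date

# Hypothetical point-in-time membership table: permanent ID -> (first, last)
# date the stock was an index member. Delisted names stay in the table so
# that backtests over past dates still see them (no survivorship bias).
MEMBERSHIP = {
    "PERM001": (date(2000, 1, 3), date(2005, 6, 30)),  # delisted in 2005
    "PERM002": (date(2000, 1, 3), None),               # still listed
    "PERM003": (date(2010, 7, 1), None),               # added at a rebalance
}

def universe_on(as_of: date) -> set:
    """Return the IDs that were index members on `as_of`."""
    members = set()
    for perm_id, (start, end) in MEMBERSHIP.items():
        if start <= as_of and (end is None or as_of <= end):
            members.add(perm_id)
    return members
```

Querying `universe_on(date(2003, 1, 1))` returns the dead PERM001 alongside PERM002, which is exactly what a symbol-based lookup would silently lose.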
Obviously you have to adjust for dividends because you want to compare total returns. That introduces the dividend adjustment problem: do you use multiplicative adjustment or subtraction? Either one is problematic: multiplication destroys round-number price levels, while subtraction can create negative prices early in a long history.
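Both adjustment schemes are easy to state in code. A hedged sketch of the two back-adjustment methods, using a toy price series and an assumed $2 dividend:

```python
def adjust_multiplicative(prices, ex_index, dividend):
    """Back-adjust by scaling every price before the ex-dividend date.
    Preserves percentage returns, but round-number levels disappear."""
    prior_close = prices[ex_index - 1]
    factor = (prior_close - dividend) / prior_close
    return [p * factor if i < ex_index else p for i, p in enumerate(prices)]

def adjust_subtractive(prices, ex_index, dividend):
    """Back-adjust by subtracting the dividend from every price before
    the ex-dividend date. Preserves point moves, but a long dividend
    history can drive early prices negative."""
    return [p - dividend if i < ex_index else p for i, p in enumerate(prices)]

# Toy example: close at 100, then the stock goes ex a $2 dividend.
prices = [100.0, 100.0, 98.0]
```

On this toy series both methods give the same answer; they diverge as the price level and the dividend history grow.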
However the problems with data create tremendous opportunities to those who mine it. You know you are on to something when:
1. A major data provider has all of the dates of certain data off by one day (a systematic error). You call to ask why, and they have no idea what you are talking about: "How can you possibly know a priori that their data is wrong?" So you quickly reverse yourself and apologize for being mistaken. Everyone who uses that data has the error; they are counting things that are impossible.
2. You circumvent data suppliers and go directly to the exchange (or government website) because intermediaries screw it up. Hey, you cannot expect data replication to be perfect. (An idiosyncratic error.)
3. You disregard seasonally adjusted data in favor of raw data, and do your own seasonal adjustment. You cannot do this for every dataset, but certainly for the important ones.
4. A free provider (e.g. government or an exchange) provides detailed instructions on how to data mine their site. But the instructions are wrong. You call and the service people don't know what you are talking about. You eventually get to speak to the geeks and somehow learn the right way to get access. They confirm that no one had those problems before. WHY? Because no one else is looking at the data. He shoots; he scores!
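For point 3, a bare-bones sketch of doing your own seasonal adjustment with month-of-year indices. This is only the skeleton of the classical method; serious work would use something like X-13ARIMA-SEATS:

```python
def seasonal_factors(monthly_values):
    """Seasonal indices for monthly data: the average level of each
    calendar month relative to the overall average. Input is a list
    starting in January whose length is a multiple of 12."""
    overall = sum(monthly_values) / len(monthly_values)
    factors = []
    for m in range(12):
        month_vals = monthly_values[m::12]
        factors.append((sum(month_vals) / len(month_vals)) / overall)
    return factors

def seasonally_adjust(monthly_values):
    """Divide each raw value by its calendar month's seasonal index."""
    factors = seasonal_factors(monthly_values)
    return [v / factors[i % 12] for i, v in enumerate(monthly_values)]
```

Feeding in a series whose only structure is a recurring December bump returns a flat adjusted series, which is the sanity check to run before trusting the routine on real data.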
These examples are like lifting back the bride's burqa, thinking that she might have a beard, and being surprised that she is absolutely beautiful.
Ideas:
a. When at all possible, go directly to the source. That may mean the exchanges or the government agency itself rather than your data supplier, and may appear unnecessary on the surface. But if you want to find the mistakes that most cannot find, you have to look in different places.
b. Look for site or download counters and check them out. Come back to them and recheck the numbers later to see the average daily hit rate. I was absolutely delighted to learn that I was one of only four downloaders of certain data.
c. Further check that data (with the counter) to see if it is available on Bloomberg or another major source.
d. Look for alternatives to the data you seek. The alternatives might not be the exact data, but they may be good surrogates. Real numbers for something close to what you want are better than bullsh*t numbers from a poorly conducted survey.
e. I cannot overemphasize the importance of checking the data, and checking that your data mining routine has collected it properly. Errors (either systematic or idiosyncratic) regularly occur. As renowned data cruncher John Tukey said, "There is no substitute for looking at the data." (Exploratory Data Analysis)
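In the spirit of Tukey's advice, even a few mechanical checks will catch errors like the off-by-one dates and botched replication described above. A minimal sketch; the specific tolerances (a five-day gap, for instance) are arbitrary choices for illustration:

```python
from datetime import date

def audit_daily_series(rows):
    """Cheap sanity checks on (date, value) rows of a daily market
    series. Returns human-readable problems; empty list means no flags."""
    problems = []
    seen = set()
    prev = None
    for d, value in rows:
        if d in seen:
            problems.append(f"duplicate date {d}")
        seen.add(d)
        if d.weekday() >= 5:  # impossible for most market data
            problems.append(f"weekend date {d} (possible off-by-one shift)")
        if prev is not None and d <= prev:
            problems.append(f"dates out of order at {d}")
        if prev is not None and (d - prev).days > 5:
            problems.append(f"gap of {(d - prev).days} days before {d}")
        if value <= 0:
            problems.append(f"non-positive value {value} on {d}")
        prev = d
    return problems
```

A provider whose dates are all shifted by one day will light up the weekend check immediately, which is how that kind of systematic error gets caught.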
Typical problems you have to avoid:
- Look-ahead bias and survivor bias
- Lack of statistical significance - engineers typically require 30-50 observations, but market traders (such as technical analysts) frequently consider one event as significant. Don't do that!
- Testing on a sample of data that may not include the pattern. The solution, of course, is to use the population rather than a sample.
- Frequently you may test something on an index as a precursor to testing on thousands of individual stocks (or worse, options). But indices do not necessarily behave like individual stocks. ETFs might be a solution, but they are in themselves just smaller indices.
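The significance point above can be made concrete: with a single observation there is no estimate of dispersion at all, so no significance is even computable. A sketch of the standard one-sample t-statistic on a set of event returns:

```python
import math

def t_statistic(returns):
    """t-statistic for the mean return differing from zero. With one
    observation there is no standard error, which is why a single
    event can never establish significance."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least 2 observations to estimate dispersion")
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

Whether 30-50 observations suffice depends on the dispersion of the returns, but one observation never does.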
Those who challenge the validity of data mining (and also market timing) tend to cite as their proof that the first order daily changes in stock prices are random. We can concede that point, but there are lots more relationships to be studied than daily changes.
Data mining can be successful for any number of reasons but the juiciest fruit is to be found in the following ways:
- Analysis of data that is unknown or unseen by most people or, better yet, subject to systematic error (~finding buried treasure)
- Better analysis of existing data. (~having a better brain) Note that some of this will be serendipitous. Exploration by definition will lead you to discovering things you did not expect.
- Incredible persistence (hard work).
Should you seek to do fundamental analysis you will find different and more exasperating problems. We know many of them first hand.
The first problem is that the data is not easily accessed. It tends to cost quite a lot of money, and much of it has systematic errors. We have not found a commercial data supplier that did not have systematic errors.
The cost can be prohibitive. The major high-end quote provider places limits on the amount of data one can retrieve in a given period. We also know that provider uses a lot of humans in the process and has a lot of errors. Looking further afield, we found three fundamental data vendors. One replied quickly with a quote of $30,000 for the back data plus a one-year subscription going forward. Another came back a month later and wanted $72,000 for the same, and the third never came back to us. When I informed the higher-priced service that they were above their competitors, they asked if their "pricing committee" could know what we had been quoted by their competitors. That does not tend to make one comfortable: what kind of business does not know what its competitors charge? Particularly one that makes a point of having a pricing committee.
There is an alternative to buying fundamental data – getting it yourself. In theory this should be straightforward: the S.E.C. has all of the relevant files online. But if you are looking to get data on say the R3k for 15 years, you will have to collect it from approximately a half-million 10K and 10Q files.
Uniformity is generally not the rule, and you need some uniformity when doing computer mining. For example, sometimes a 10Q will be labeled "Ten Q" which has to be planned for. Unless you have access to a lot of people from the sub-continent, you want to do this automatically, which will also enable you to avoid things like transposition errors committed by humans. But some things are easier for humans than for computers. For example, most data constituting a company's total assets are listed as "Total Assets". Sometimes that is misspelled, and sometimes the number appears with a double underline, and other times without. Usually the next line starts with "Liabilities", but not always. It's laughable, but not fun.
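The label-variant problem lends itself to tolerant pattern matching. A toy sketch; the alias list for real filings would be far longer, and real extraction would work from structured XBRL tags where available:

```python
import re

# Tolerant pattern for hypothetical label variants: case differences,
# a common misspelling ("asets"), optional colon, optional dollar sign.
TOTAL_ASSETS_RE = re.compile(
    r"total\s+ass?ets?\s*[:\s]\s*\$?\s*([\d,]+)", re.IGNORECASE
)

def extract_total_assets(filing_text):
    """Pull the first 'Total Assets' figure out of raw filing text,
    returning it as an int with thousands commas stripped, or None."""
    m = TOTAL_ASSETS_RE.search(filing_text)
    if m is None:
        return None
    return int(m.group(1).replace(",", ""))
```

Each new misspelling encountered in the corpus gets folded into the pattern; the files that still fail are the ones that end up in the manual pile.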
We cut our teeth on a subset of the universe, REITs. The 172 that we found interesting had approximately 8,000 10K and 10Q files. After a lot of work we managed to get data cleanly from all but about 50. We consider that a major success, but even that low failure rate means we will have to go through about 3,000 files manually for the entire universe of a half-million files.
The good news is that having unrestricted access to such data provides a lot of opportunities. We are making a leap of faith that the data and our analysis will improve our existing results. Of course there isn't a guarantee, but that's the way to bet.
Phil McDonnell writes:
Thanks to Bill for his excellent survey of data collection techniques, and especially the pitfalls. There is little to add except one thing: retroactive changes to data. To handle that case one needs to time-stamp the data with the time it was received. This caution applies to fundamental data as well as to price data, which can be "adjusted" a day or two later.
The worst example of this was Enron. The Enron data which showed the fraud was only released several years after the bankruptcy.
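One way to implement Phil's time-stamping suggestion is bitemporal storage: index each value by both the date it describes and the date it was received, so a backtest can ask what was believed on a given day even after a vendor revision. A minimal sketch with hypothetical keys:

```python
from datetime import date

class BitemporalStore:
    """Minimal point-in-time store. Every value carries the date it
    describes (observed) and the date we received it (received), so
    retroactive revisions never contaminate historical backtests."""

    def __init__(self):
        self._rows = []  # (key, observed_date, received_date, value)

    def record(self, key, observed, received, value):
        self._rows.append((key, observed, received, value))

    def as_known_on(self, key, observed, as_of):
        """Latest value for (key, observed) among versions received by as_of."""
        versions = [
            (received, value)
            for k, obs, received, value in self._rows
            if k == key and obs == observed and received <= as_of
        ]
        return max(versions)[1] if versions else None
```

A backtest run "as of" the day before a revision arrived sees the original figure; a run after it sees the revised one, and neither leaks into the other.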