I heard there is a new open source Python library, 'PySEC', that allows easy access to all of the SEC's filings.

This is interesting primarily because we are in our 11th week of programming to do essentially what this guy says he has done. Our goal is to glean the data from all of the SEC submissions without human intervention. Many of the commercial data suppliers use the "thousand scribes" method, in which they hire a thousand people in a developing nation to manually record and categorize data. And those commercial suppliers charge huge fees for that suspect data.

Does the Python programmer really have something? Have our 11 weeks (to date) been a fruitless exercise?

Prior to 2010 the SEC required that quarterly and annual reports be submitted in a form that could be posted on the web. However, there are all manner of idiosyncratic ways in which that information can be posted. Most of the submissions can be mined by a computer, but the fact that we are still programming after 11 weeks suggests it isn't simple.

The vast majority of files are text files. However, that does not make mining easy, as the labeling of the data is not consistent. Many data items within a given 10-Q may be labeled "total assets", perhaps one for each subsidiary. Total liabilities are frequently called something else, or not labeled at all. Then in 2010 it was required that the files be submitted in HTML. That requirement was later changed to XML, but HTML appears to have survived. Within submissions we occasionally see an extraneous dingbat dropped into a label, which screws up the mining operation. Only one submission has completely stymied us: the company presented its financial results as an attached GIF file.
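To make the labeling problem concrete, here is a minimal sketch of the kind of cleanup a miner might apply before matching labels. It is not our production code and not anything from PySEC; the names `SYNONYMS`, `normalize_label` and `canonical_item` are hypothetical, and the synonym table is a made-up illustration rather than a list of labels actually seen in filings.

```python
import re
import unicodedata

# Hypothetical synonym table: canonical item -> label variants (illustrative only).
SYNONYMS = {
    "total_assets": {"total assets"},
    "total_liabilities": {"total liabilities", "liabilities total"},
}


def normalize_label(raw):
    """Lowercase a label and replace symbols and punctuation with spaces."""
    text = unicodedata.normalize("NFKC", raw)
    # Keep letters, digits and whitespace; anything else (dingbats,
    # bullets, punctuation) is turned into a space.
    kept = "".join(ch if (ch.isalnum() or ch.isspace()) else " " for ch in text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", kept).strip().lower()


def canonical_item(raw):
    """Return the canonical item name for a raw label, or None if unknown."""
    label = normalize_label(raw)
    for item, variants in SYNONYMS.items():
        if label in variants:
            return item
    return None
```

With this sketch, `canonical_item("Total ✦ Assets:")` resolves to `"total_assets"`, while an unrecognized label returns `None` so it can be logged for a human to look at. It does nothing about the harder problem of several "total assets" lines in one filing, which needs context from the surrounding table.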

We are highly suspicious of data that is difficult to mine. Maybe extraneous dingbats have been put there deliberately to foil such a search, or maybe the person responsible is merely trying to impress a boss. Either way, it is enough for us to log the difficulties and research the subsequent performance of those problematic submitters. We will provide those results to the list, but we will most likely have to abstain from providing a list of the miscreants. We would be happy to hear from any lawyers on the list about that one.

The Python program appears to have made some progress in mining the XML submissions from 2010, but it is a tedious one-by-one search. And now that many of the submissions are back in HTML, the miner has much more work to do for the same result. So we certainly aren't going to give up our work and pay homage to the Python program.
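For readers wondering why the XML submissions are the easier case, a minimal sketch follows. It assumes a tiny made-up document; the `fin:` namespace, the element names and the function `tagged_values` are hypothetical and are neither the SEC's actual schema nor PySEC's API. The point is only that when each value carries its own tag, a few lines of standard-library Python can pull it out, which is not true of free-form HTML or text.

```python
import xml.etree.ElementTree as ET

# Made-up sample, not a real filing or the real SEC schema.
SAMPLE = """<report xmlns:fin="http://example.com/fin">
  <fin:Assets period="2010-Q1">1500</fin:Assets>
  <fin:Liabilities period="2010-Q1">900</fin:Liabilities>
</report>"""


def tagged_values(xml_text):
    """Map each leaf element's local tag name to its text content."""
    root = ET.fromstring(xml_text)
    values = {}
    for element in root.iter():
        # ElementTree writes namespaced tags as "{uri}Name"; keep only "Name".
        local_name = element.tag.rsplit("}", 1)[-1]
        text = (element.text or "").strip()
        # Only record leaf elements that actually contain text.
        if text and len(element) == 0:
            values[local_name] = text
    return values
```

Calling `tagged_values(SAMPLE)` yields a dictionary with the entries `Assets` and `Liabilities`. Real filings repeat the same tag for different periods and entities, so a real miner would have to key on those attributes as well.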

