Textual Financial Data Repository for Machine Learning, Artificial Intelligence, and Textual Analyses: Major Sections from 10-K, 10-Q, and Financial Statement Notes Extracted Using Shared Python Code
Mauricio Mello Codesso et al.
What the paper says
Financial reports, including 10-K and 10-Q filings, are a primary source of textual data in business disciplines. However, extracting specific sections from these lengthy documents remains a challenge. Custom code development by each research team to parse these files leads to redundancy, inefficiency, and inconsistencies and is especially challenging for teams lacking technical expertise. We address this by offering raw textual data from MD&A, risk factors, and business description sections, and financial statement notes, for all firms from 2008 onward. We share Python code to facilitate download and parsing. We also provide pre-calculated textual metrics, such as word counts, readability measures, and several bag-of-words metrics, including negative sentiment, forward-looking statements, and R&D. Additionally, we contribute two new word lists, COVID-19 and human capital, developed using a novel approach based on disclosure shocks. Our goal is to streamline research processes, ensure consistency, and enable further advances in the field.
Data Availability: Data are available for download at http://www.analytext.com/. Code is available for download at https://github.com/mmcodesso/edgar-metrics-parser.
JEL Classifications: C55; C88; M4; M48.
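To make the bag-of-words metrics mentioned in the abstract concrete, here is a minimal Python sketch of how a word count and a word-list hit count could be computed for one section of text. It is an illustration under stated assumptions, not the authors' code: the tokenizer, the function names, and the four-word `EXAMPLE_NEGATIVE_WORDS` list are hypothetical and are not taken from the paper or its repository.

```python
import re
from collections import Counter

# Hypothetical word list for illustration only; not one of the paper's lists.
EXAMPLE_NEGATIVE_WORDS = {"loss", "decline", "adverse", "impairment"}


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words_metrics(text, word_list):
    """Return total word count, word-list hits, and hits as a share of words."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur, so missing words are safe.
    hits = sum(counts[word] for word in word_list)
    total = len(tokens)
    share = hits / total if total else 0.0
    return {"word_count": total, "list_hits": hits, "list_share": share}


sample = (
    "The company reported a loss and expects adverse conditions. "
    "The loss was material."
)
metrics = bag_of_words_metrics(sample, EXAMPLE_NEGATIVE_WORDS)
print(metrics)
```

Normalizing the hit count by the total word count keeps the measure comparable across sections of very different lengths, which matters for documents as long as 10-K filings.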
Evidence weight
Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40
| Component | Score × weight = contribution |
| --- | --- |
| F · citation impact | 0.50 × 0.40 = 0.200 |
| M · momentum | 0.50 × 0.15 = 0.075 |
| V · venue signal | 0.50 × 0.05 = 0.025 |
| R · text relevance † | 0.50 × 0.40 = 0.200 |
† Text relevance is estimated at 0.50 on the detail page. For your query's actual relevance score, open this paper from a search result.
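The arithmetic behind the rows above can be checked with a short Python sketch. The page does not state how the four contributions are combined, so summing them into one overall score is an assumption here, and the names `WEIGHTS`, `scores`, and `weighted_score` are hypothetical.

```python
# Balanced-mode weights as listed on the page.
WEIGHTS = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}

# Component scores shown for this paper (R is the 0.50 detail-page estimate).
scores = {"F": 0.50, "M": 0.50, "V": 0.50, "R": 0.50}


def weighted_score(component_scores, weights):
    """Sum score × weight over all components (assumed combination rule)."""
    return sum(weights[key] * component_scores[key] for key in weights)


# Unrounded per-component contributions.
contributions = {key: WEIGHTS[key] * scores[key] for key in WEIGHTS}
total = weighted_score(scores, WEIGHTS)

for key, value in contributions.items():
    print(f"{key}: {value:.3f}")
print(f"total: {total:.3f}")
```

Because the weights sum to 1.00 and every component score is 0.50, the assumed overall score is also 0.50.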