Textual Financial Data Repository for Machine Learning, Artificial Intelligence, and Textual Analyses: Major Sections from 10-K, 10-Q, and Financial Statement Notes Extracted Using Shared Python Code
Financial reports, including 10-K and 10-Q filings, are a primary source of textual data in business disciplines. However, extracting specific sections from these lengthy documents remains a challenge. When each research team develops its own code to parse these files, the result is redundancy, inefficiency, and inconsistency, and the task is especially difficult for teams lacking technical expertise. We address this problem by offering raw textual data from the MD&A, risk factors, and business description sections, as well as financial statement notes, for all firms from 2008 onward. We share Python code to facilitate download and parsing. We also provide pre-calculated textual metrics, such as word counts, readability measures, and several bag-of-words metrics, including negative sentiment, forward-looking statements, and R&D. Additionally, we contribute two new word lists, COVID-19 and human capital, developed using a novel approach based on disclosure shocks. Our goal is to streamline research processes, ensure consistency, and enable further advances in the field.
Data Availability: Data are available for download at http://www.analytext.com/. Code is available for download at https://github.com/mmcodesso/edgar-metrics-parser.
JEL Classifications: C55; C88; M4; M48.
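To make the idea of a bag-of-words metric concrete, the following is a minimal Python sketch of how a word count and a word-list share could be computed for an extracted section. It is an illustration only and is not taken from the shared repository: the function names `tokenize` and `bag_of_words_metrics`, the tiny `NEGATIVE_WORDS` set, and the sample sentence are all hypothetical, and the tokenization rule (lowercased alphabetic runs) is an assumption that may differ from the one used to build the published metrics.

```python
import re
from collections import Counter

# Hypothetical mini word list, for illustration only (not the paper's list).
NEGATIVE_WORDS = {"loss", "decline", "adverse", "impairment", "litigation"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words_metrics(text, word_list):
    """Return the total word count and how often words from word_list occur."""
    tokens = tokenize(text)
    total = len(tokens)
    # Count only the tokens that appear in the supplied word list.
    matched = Counter(t for t in tokens if t in word_list)
    hits = sum(matched.values())
    return {
        "word_count": total,
        "list_hits": hits,
        # Share of all words that belong to the list; 0.0 for empty text.
        "list_share": hits / total if total else 0.0,
        "matched": dict(matched),
    }


# Hypothetical sample text standing in for an extracted section.
sample = (
    "The company reported a net loss and an adverse decline in revenue. "
    "Litigation risk remains."
)
metrics = bag_of_words_metrics(sample, NEGATIVE_WORDS)
print(metrics)
```

Scaling the hit count by the total word count, as in `list_share`, is one common way to keep such a measure comparable across sections of very different lengths; whether the pre-calculated metrics are scaled this way is not stated here.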