Textual Financial Data Repository for Machine Learning, Artificial Intelligence, and Textual Analyses: Major Sections from 10-K, 10-Q, and Financial Statement Notes Extracted Using Shared Python Code

Mauricio Mello Codesso et al.

Journal of Information Systems, 2026
https://doi.org/10.2308/isys-2024-084
Article · AJG 1 · ABDC A
Weight: 0.50

Abstract

Financial reports, including 10-K and 10-Q filings, are a primary source of textual data in business disciplines. However, extracting specific sections from these lengthy documents remains a challenge. Custom code development by each research team to parse these files leads to redundancy, inefficiency, and inconsistencies and is especially challenging for teams lacking technical expertise. We address this by offering raw textual data from MD&A, risk factors, and business description sections, and financial statement notes, for all firms from 2008 onward. We share Python code to facilitate download and parsing. We also provide pre-calculated textual metrics, such as word counts, readability measures, and several bag-of-words metrics, including negative sentiment, forward-looking statements, and R&D. Additionally, we contribute two new word lists, COVID-19 and human capital, developed using a novel approach based on disclosure shocks. Our goal is to streamline research processes, ensure consistency, and enable further advances in the field.

Data Availability: Data are available for download at http://www.analytext.com/. Code is available for download at https://github.com/mmcodesso/edgar-metrics-parser.

JEL Classifications: C55; C88; M4; M48.
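As a rough illustration of the kind of bag-of-words metric the abstract describes, the sketch below counts tokens in a passage and measures the share that appears in a word list. The function names, the tokenization rule, and the four-word `NEGATIVE_WORDS` list are all hypothetical and chosen only for this example; they are not taken from the authors' repository or from any published word list.

```python
import re
from collections import Counter

# Hypothetical mini word list, for illustration only.
NEGATIVE_WORDS = {"loss", "decline", "impairment", "adverse"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_list_metrics(text, word_list):
    """Return the total word count, word-list hits, and hit share."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    hits = sum(counts[word] for word in word_list)
    total = len(tokens)
    share = hits / total if total else 0.0
    return {"word_count": total, "hits": hits, "share": share}


sample = (
    "The company reported a loss and an impairment charge. "
    "Revenue did not decline."
)
metrics = word_list_metrics(sample, NEGATIVE_WORDS)
print(metrics["word_count"], metrics["hits"])  # prints: 13 3
```

Note that a plain count like this treats "did not decline" as a negative hit, which is one reason results depend heavily on the word list and tokenization used.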


Cite this paper

https://doi.org/10.2308/isys-2024-084

Or copy a formatted citation

@article{mauricio2026,
  title        = {{Textual Financial Data Repository for Machine Learning, Artificial Intelligence, and Textual Analyses: Major Sections from 10-K, 10-Q, and Financial Statement Notes Extracted Using Shared Python Code}},
  author       = {Codesso, Mauricio Mello and others},
  journal      = {Journal of Information Systems},
  year         = {2026},
  doi          = {10.2308/isys-2024-084},
}




Evidence weight

0.50

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact: 0.50 × 0.40 = 0.20
M · momentum: 0.50 × 0.15 = 0.07
V · venue signal: 0.50 × 0.05 = 0.03
R · text relevance †: 0.50 × 0.40 = 0.20

† Text relevance is estimated at 0.50 on the detail page. For your query's actual relevance score, open this paper from a search result.
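The evidence weight shown above appears to be a weighted sum of the four component scores. A minimal sketch of that arithmetic follows, assuming the Balanced-mode weights listed on this page; the names `WEIGHTS`, `scores`, and `evidence_weight` are hypothetical and not part of any documented API. The unrounded contributions for M and V are 0.075 and 0.025, which the page displays as 0.07 and 0.03.

```python
# Balanced-mode weights as listed on this page.
WEIGHTS = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}

# Component scores shown for this paper (all 0.50 on the detail page).
scores = {"F": 0.50, "M": 0.50, "V": 0.50, "R": 0.50}


def evidence_weight(scores, weights):
    """Weighted sum of component scores (assumed aggregation rule)."""
    return sum(scores[key] * weights[key] for key in weights)


total = evidence_weight(scores, WEIGHTS)
print(f"{total:.2f}")  # prints: 0.50
```

Because the weights sum to 1, the total stays on the same 0 to 1 scale as the component scores.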