Assessing the Quality of Large Language Models and Human Inputs to a Decision: A Proposed Framework and Two Benchmark Case Studies

Ali E. Abbas et al.

Decision Analysis · 2026 · article · https://doi.org/10.1287/deca.2025.0467
AJG 1 · ABDC A
Weight
0.50

Abstract

This paper proposes a general framework for assessing the quality of inputs to a decision analysis provided by large language models (LLMs). The paper provides two benchmark case studies that focus on alternatives, preferences, and uncertainties related to a decision and are used to illustrate the proposed framework using ChatGPT version 4.0. The analysis uses the proposed framework and the data obtained to compare the efficacy of decision inputs provided by crowdsourcing from a group and those obtained from LLMs, with the relevance of inputs determined independently by a panel. The results show that (i) panel judgements about the relevance of inputs were highly correlated with one another; (ii) human groups performed better on generating alternatives, with higher rates of relevant alternatives; (iii) LLMs performed better on generating uncertainties, with higher rates of relevant uncertainties; and (iv) human groups and LLMs performed similarly on generating preferences. These findings held across both subsets of data. When participants were asked directly which input source they preferred, artificial intelligence inputs received a slight edge. Although the benchmark case studies used ChatGPT version 4.0, the general framework applies to any LLM.


Cite this paper

https://doi.org/10.1287/deca.2025.0467


@article{abbas2026,
  title        = {{Assessing the Quality of Large Language Models and Human Inputs to a Decision: A Proposed Framework and Two Benchmark Case Studies}},
  author       = {Abbas, Ali E. and others},
  journal      = {Decision Analysis},
  year         = {2026},
  doi          = {10.1287/deca.2025.0467},
}




Evidence weight

0.50

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact · 0.50 × 0.40 = 0.20
M · momentum · 0.50 × 0.15 ≈ 0.07
V · venue signal · 0.50 × 0.05 ≈ 0.03
R · text relevance † · 0.50 × 0.40 = 0.20

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
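
For readers who want to check the arithmetic, the sketch below reproduces the balanced-mode weighted sum shown above: each 0.50 component score is multiplied by its mode weight and the products are summed (0.20 + 0.075 + 0.025 + 0.20 = 0.50), matching the evidence weight at the top of this panel. The component scores and mode weights come from this page; the function name, dictionary layout, and two-decimal rounding are illustrative assumptions, not Arbiter's actual implementation.

# Illustrative sketch only: reproduces the balanced-mode weighted sum shown above.
# Mode weights and component scores are taken from this page; everything else
# (names, rounding to two decimals) is an assumption, not Arbiter's real code.

MODE_WEIGHTS = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}  # balanced mode

def evidence_weight(scores, weights=MODE_WEIGHTS):
    """Combine per-component scores into one evidence weight via a weighted sum."""
    return sum(weights[k] * scores[k] for k in weights)

# All four components sit at 0.50 on this detail page.
scores = {"F": 0.50, "M": 0.50, "V": 0.50, "R": 0.50}

for k in MODE_WEIGHTS:
    print(f"{k}: {scores[k]:.2f} x {MODE_WEIGHTS[k]:.2f} = {scores[k] * MODE_WEIGHTS[k]:.2f}")
print(f"Evidence weight: {evidence_weight(scores):.2f}")  # 0.50, as shown above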