Assessing the Quality of Large Language Models and Human Inputs to a Decision: A Proposed Framework and Two Benchmark Case Studies
Ali E. Abbas et al.
Abstract
This paper proposes a general framework for assessing the quality of inputs to a decision analysis provided by large language models (LLMs). The paper presents two benchmark case studies, covering the alternatives, preferences, and uncertainties of a decision, that illustrate the proposed framework using ChatGPT version 4.0. The analysis applies the framework to the data obtained to compare the efficacy of decision inputs crowdsourced from a human group with those obtained from LLMs, with the relevance of the inputs judged independently by a panel. The results show that (i) panel judgements about the relevance of inputs were highly correlated with one another; (ii) human groups performed better at generating alternatives, with higher rates of relevant alternatives; (iii) LLMs performed better at generating uncertainties, with higher rates of relevant uncertainties; and (iv) human groups and LLMs performed similarly at generating preferences. These findings held across both subsets of the data. When participants were asked directly which input source they preferred, artificial intelligence inputs held a slight edge. Although the benchmark case studies used ChatGPT version 4.0, the general framework applies to any LLM.
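The abstract summarizes the comparison in terms of relevance rates per input source and the correlation of panel judgements. The sketch below, which is not the authors' code and uses hypothetical variable names and illustrative data, shows one way such quantities could be computed from binary panel ratings.

```python
# Minimal sketch (not the authors' implementation) of the comparison described above:
# relevance rates for each input source and pairwise correlation of panel judgements.
# All names and the example data are hypothetical placeholders.

from itertools import combinations
from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical panel judgements: 1 = relevant, 0 = not relevant,
# one list per panelist, one entry per generated item (e.g., an alternative).
panel_judgements = {
    "panelist_a": [1, 1, 0, 1, 0, 1],
    "panelist_b": [1, 1, 0, 1, 1, 1],
    "panelist_c": [1, 0, 0, 1, 0, 1],
}

# Hypothetical source label for each item: crowdsourced human group vs. LLM.
item_source = ["human", "llm", "llm", "human", "llm", "human"]

def relevance_rate(source: str) -> float:
    """Fraction of items from `source` judged relevant by a majority of panelists."""
    flags = []
    for i, src in enumerate(item_source):
        if src != source:
            continue
        votes = [ratings[i] for ratings in panel_judgements.values()]
        flags.append(1.0 if sum(votes) > len(votes) / 2 else 0.0)
    return sum(flags) / len(flags) if flags else float("nan")

def mean_pairwise_correlation() -> float:
    """Average Pearson correlation across every pair of panelists' judgements."""
    corrs = [correlation(a, b)
             for a, b in combinations(panel_judgements.values(), 2)]
    return sum(corrs) / len(corrs)

if __name__ == "__main__":
    print("human relevance rate:", relevance_rate("human"))
    print("LLM relevance rate:  ", relevance_rate("llm"))
    print("mean panel agreement:", round(mean_pairwise_correlation(), 3))
```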