Pose-Aware Image Captioning for Ergonomic Problem and Solution Identification

Gunwoo Yong et al.

Journal of Construction Engineering and Management · 2026 · https://doi.org/10.1061/jcemd4.coeng-17775 · article

AJG 2 · ABDC A*

Weight

0.50

What the paper says

Construction workers face a persistent risk of work-related musculoskeletal disorders (WMSDs) because of physically demanding tasks. Identifying ergonomic problems and their solutions is necessary to address WMSDs proactively. However, manual identification by ergonomic experts is difficult in construction because of ever-changing site environments, a transient workforce, and a shortage of these professionals. Image captioning, a technique that generates text from an image, could automate ergonomic problem and solution identification, given its scene understanding capability and its ability to express that understanding in text. However, identifying prevalent pose-related problems and their solutions in construction remains challenging because conventional image captioning does not explicitly incorporate specific worker poses. To this end, we propose a pose-aware image captioning approach. Specifically, we developed a pose-awareness module that enables pose instruction tuning, which guides an image captioning model to interpret images in relation to workers’ poses. We tested our model on 322 site images using five evaluation metrics: Bilingual Evaluation Understudy (BLEU) and Consensus-based Image Description Evaluation (CIDEr), to measure how closely the generated captions matched the information in the ground-truth captions; accuracy, based on human evaluation of the semantic correctness of the identified problems and solutions; posture precision, to assess how accurately the model identified postures; and posture recall, to assess how many postures were correctly captured. Our pose-aware model achieved a BLEU-4 score of 0.8887, a CIDEr score of 0.6973, an accuracy of 0.8509, a posture precision of 0.9055, and a posture recall of 0.9283, outperforming two general models without pose awareness: InstructBLIP, our backbone architecture, and GPT-4.1, a leading off-the-shelf model with strong generalization capabilities.
These findings highlight the potential applicability of pose-aware image captioning in identifying ergonomic problems and solutions in construction. Our approach can contribute to enabling ergonomic problem and solution identification in an accessible manner for dynamic sites and limited ergonomic expertise.
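
To make the posture precision and posture recall metrics named in the abstract concrete, the following is a minimal sketch in Python. It assumes, hypothetically, that each image has a set of predicted posture labels and a set of ground-truth posture labels, and that counts are pooled over all images (micro-averaging). The abstract does not state how the authors extract or aggregate postures, so the function name, label strings, and averaging scheme here are illustrative assumptions, not the paper's procedure.

```python
from typing import Sequence, Set, Tuple


def posture_precision_recall(
    predicted: Sequence[Set[str]],
    reference: Sequence[Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-image posture label sets.

    Assumption: postures are compared as exact label matches per image.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must cover the same images")

    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        tp += len(pred & ref)  # postures named by the model and present in the ground truth
        fp += len(pred - ref)  # postures named by the model but absent from the ground truth
        fn += len(ref - pred)  # ground-truth postures the model missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for two images (not taken from the paper's dataset).
predicted = [{"bending", "kneeling"}, {"overhead reaching"}]
reference = [{"bending"}, {"overhead reaching", "squatting"}]
precision, recall = posture_precision_recall(predicted, reference)
```

In this toy example the model names three postures, two of which are correct, and misses one of the three ground-truth postures, so both precision and recall come out to 2/3.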


Evidence weight

0.50

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact	0.50 × 0.4 = 0.20
M · momentum	0.50 × 0.15 = 0.07
V · venue signal	0.50 × 0.05 = 0.03
R · text relevance †	0.50 × 0.4 = 0.20

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
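
The breakdown above reads as a weighted sum of four component scores. The sketch below reproduces that arithmetic in Python, assuming the evidence weight is exactly the sum of the four products shown; the function name and dictionary layout are hypothetical and do not come from the page.

```python
# Balanced-mode component weights as listed on the page.
BALANCED_WEIGHTS = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}


def evidence_weight(scores: dict, weights: dict = BALANCED_WEIGHTS) -> float:
    """Weighted sum of component scores (assumed aggregation rule)."""
    return sum(weights[key] * scores[key] for key in weights)


# Every component is 0.50 on this page, including the estimated text relevance.
scores = {"F": 0.50, "M": 0.50, "V": 0.50, "R": 0.50}
contributions = {key: BALANCED_WEIGHTS[key] * scores[key] for key in BALANCED_WEIGHTS}
total = evidence_weight(scores)
```

The exact products for M and V are 0.075 and 0.025; the page shows them to two decimals as 0.07 and 0.03. Because the four weights sum to 1.00 and every component score is 0.50, the total is 0.50, matching the evidence weight shown.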