Image captioning for automated bridge inspection: a feasibility study
Mo Li et al.
Abstract
Purpose: This study investigates the application of image captioning technology in automated bridge inspection. Given the scarcity of research in this domain, it evaluates the feasibility, effectiveness and practical implications of using transformer-based models to generate natural language descriptions of bridge damage from visual data.
Design/methodology/approach: A triangulated research methodology was used, comprising a systematic literature review to assess the current state of image captioning in bridge inspection; a feasibility study using an encoder–decoder architecture (EfficientNet–transformers) trained on a structural damage dataset; and an interview-based transferability study with seven industry professionals to evaluate practical adoption challenges.
Findings: The systematic review identified only four relevant studies, underscoring the nascent state of research in this field. The feasibility study demonstrated promising results, with EfficientNet–transformers achieving high bilingual evaluation understudy (BLEU) scores (BLEU-1: 0.944, BLEU-4: 0.904) in structural damage description tasks. Industry feedback highlighted potential benefits in inspection efficiency but emphasized challenges in workflow integration and model reliability.
Originality/value: To the best of the authors' knowledge, this study represents one of the first comprehensive explorations of image captioning for bridge inspection, contributing both methodological and practical insights. It identifies key research gaps, including the need for domain-specific datasets and standardized evaluation frameworks, while proposing actionable directions for future AI applications in infrastructure maintenance. The findings provide a foundation for advancing automated inspection technologies toward safer and more efficient infrastructure management.
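For readers unfamiliar with the BLEU-1 and BLEU-4 figures reported above, the following is a minimal sketch of how a sentence-level BLEU-n score can be computed against a single reference caption. The abstract does not state which BLEU implementation, tokenization or smoothing the authors used, so the function names (`ngrams`, `bleu`), the uniform n-gram weights, the absence of smoothing and the two toy captions are all assumptions made for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n):
    """Sentence-level BLEU up to order max_n, one reference, uniform weights, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clipped counts: a candidate n-gram is credited at most as many
        # times as it appears in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical captions, not taken from the paper's dataset.
reference = "crack with exposed rebar on the concrete girder".split()
candidate = "crack with exposed rebar on the concrete deck".split()

bleu_1 = bleu(candidate, reference, 1)
bleu_4 = bleu(candidate, reference, 4)
print(f"BLEU-1: {bleu_1:.3f}, BLEU-4: {bleu_4:.3f}")
```

In this toy case a single wrong word lowers BLEU-4 more than BLEU-1, because the error breaks every 4-gram that contains it. That is why a BLEU-4 score close to BLEU-1, as reported in the Findings, suggests generated captions that match the reference phrasing over longer word sequences and not only word by word.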