Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3-D Object Detection
Yifan Wang et al.
Abstract
Monocular 3D object detection has received considerable attention for its simplicity and low cost. Existing methods typically follow conventional 2D detection paradigms: they first locate object centers and then predict 3D attributes from neighboring features. However, these approaches focus mainly on local information, which can limit the model's global context awareness and lead to missed detections, since global context provides the semantic and spatial dependencies essential for detecting small objects in cluttered or occluded environments. In addition, because object scale varies widely across scenes and depths, inaccurate receptive fields often introduce background noise and degrade feature representations. To address these issues, we introduce MonoASRH, a novel monocular 3D detection framework composed of an Efficient Hybrid Feature Aggregation Module (EH-FAM) and an Adaptive Scale-Aware 3D Regression Head (ASRH). Specifically, EH-FAM employs multi-head attention with a global receptive field to extract semantic features and leverages lightweight convolutional modules to efficiently aggregate visual features across scales, enhancing small-object detection. ASRH encodes 2D bounding-box dimensions and fuses the resulting scale features with the semantic features aggregated by EH-FAM through a scale-semantic feature fusion module, which guides ASRH in learning dynamic receptive-field offsets and incorporates scale information into 3D position prediction for better scale awareness. Extensive experiments on the KITTI and Waymo datasets demonstrate that MonoASRH achieves state-of-the-art performance. The code and model are released at https://github.com/WYFDUT/MonoASRH.
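To make the two-module design concrete, the sketch below shows one plausible PyTorch reading of the abstract: a hybrid aggregation module that mixes global multi-head self-attention with lightweight depthwise-separable convolutions, and a regression head that encodes 2D box dimensions and fuses them with semantic features before regressing 3D quantities. All class names, channel sizes, and the feature-modulation stand-in for dynamic receptive-field offsets are illustrative assumptions, not the authors' released implementation (see the GitHub link above for that).

```python
# Illustrative sketch only; module names, channel sizes, and the fusion
# scheme are assumptions based on the abstract, not the official code.
import torch
import torch.nn as nn


class EHFAMSketch(nn.Module):
    """Efficient Hybrid Feature Aggregation Module (illustrative).

    Global semantic context comes from multi-head self-attention over the
    flattened feature map; lightweight depthwise-separable convolutions
    aggregate local visual features.
    """

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Lightweight local aggregation: depthwise + pointwise convolution.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        semantic, _ = self.attn(tokens, tokens, tokens)  # global receptive field
        semantic = semantic.transpose(1, 2).reshape(b, c, h, w)
        return semantic + self.local(x)                  # fuse global + local


class ASRHSketch(nn.Module):
    """Adaptive Scale-Aware 3D Regression Head (illustrative).

    Encodes 2D box dimensions, fuses them with semantic features via a 1x1
    convolution, and regresses 3D quantities from the fused map. The fusion
    here is a simple stand-in for the paper's learned receptive-field offsets.
    """

    def __init__(self, channels: int = 64, out_dim: int = 3):
        super().__init__()
        self.scale_enc = nn.Sequential(nn.Linear(2, channels), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # scale-semantic fusion
        self.regress = nn.Conv2d(channels, out_dim, 1)    # e.g. 3D position terms

    def forward(self, feat: torch.Tensor, box_wh: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        scale = self.scale_enc(box_wh)                    # (B, C) from 2D (w, h)
        scale = scale.view(b, c, 1, 1).expand(b, c, h, w)
        fused = self.fuse(torch.cat([feat, scale], dim=1))
        return self.regress(fused)


if __name__ == "__main__":
    x = torch.randn(2, 64, 24, 80)   # backbone feature map
    wh = torch.rand(2, 2)            # normalized 2D box width/height
    feats = EHFAMSketch()(x)
    preds = ASRHSketch()(feats, wh)
    print(preds.shape)               # torch.Size([2, 3, 24, 80])
```

One design point the abstract emphasizes is that scale information enters the 3D regression path explicitly rather than being inferred from features alone; in the sketch this is the `box_wh` input to `ASRHSketch`, a hypothetical interface chosen to make that dependency visible.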