MSA-MVSNet: A Cross-Scale Collaborative Attention-Based Multi-View Reconstruction Network for Orchard Tree 3D Reconstruction with Instance Segmentation for Fruit Counting
Abstract
To address the detail loss and matching difficulties in fruit-tree 3D reconstruction caused by complex branch–leaf morphology, fruit occlusion, and illumination variations, this paper proposes an end-to-end cross-scale collaborative attention multi-view stereo network, termed MSA-MVSNet, for high-quality 3D reconstruction of orchard trees, together with instance segmentation for fruit counting. A multi-scale feature enhancement module adaptively fuses deep semantic features with shallow fine-grained details through a spatial–channel collaborative attention mechanism, strengthening the network's representation of multi-scale structures such as trunks, branches, and leaves. Multi-branch dilated convolutions enlarge the receptive field, and deformable convolutions adaptively capture the irregular geometry of fruits, improving modeling robustness. In addition, a feature matching transformer strengthens long-range global contextual correlations within and across images via intra-attention and inter-attention mechanisms, improving matching stability in low-texture and repetitive-texture regions. To validate the proposed method, experiments are conducted on a self-collected real orchard dataset and on public benchmark datasets. The results demonstrate that MSA-MVSNet outperforms baseline models in 3D reconstruction quality by 8.2%. Finally, by combining depth filtering with the instance segmentation results of YOLOv11-Seg, a segmentation-guided fruit reconstruction and counting framework is constructed. This framework achieves an overall counting F1-score of 92.8% on the self-collected dataset across scenes of varying sparsity and 93.5% on the public Fuji-SfM dataset, demonstrating its effectiveness and generalization capability.
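The abstract gives no implementation details, but the spatial–channel collaborative attention fusion it describes can be illustrated with a rough PyTorch sketch combining squeeze-and-excitation-style channel gating with a convolutional spatial gate. All names here (CollaborativeAttentionFusion, channel_gate, spatial_gate) are hypothetical, not the authors' code, and the sketch assumes both feature streams already share a channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaborativeAttentionFusion(nn.Module):
    """Fuse deep semantic features with shallow fine-grained features via
    channel and spatial attention (illustrative sketch, hypothetical names)."""
    def __init__(self, channels: int):
        super().__init__()
        # Channel attention: squeeze-and-excitation-style gating
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over avg/max pooled descriptors
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        # Upsample deep semantic features to the shallow (fine) resolution;
        # assumes both streams have `channels` channels (lateral convs omitted)
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                             align_corners=False)
        fused = deep + shallow
        fused = fused * self.channel_gate(fused)           # channel reweighting
        avg = fused.mean(dim=1, keepdim=True)              # spatial descriptors
        mx, _ = fused.max(dim=1, keepdim=True)
        fused = fused * self.spatial_gate(torch.cat([avg, mx], dim=1))
        return fused
```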
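The multi-branch dilated convolutions and the deformable branch could plausibly be combined as below, using torchvision's DeformConv2d. The dilation rates, branch count, and offset predictor are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DilatedDeformableBlock(nn.Module):
    """Parallel dilated branches enlarge the receptive field; a deformable
    branch adapts its sampling grid to irregular fruit shapes (sketch only)."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        # DeformConv2d consumes per-position sampling offsets (2*k*k channels)
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)
        self.project = nn.Conv2d(channels * (len(dilations) + 1), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [branch(x) for branch in self.branches]
        outs.append(self.deform(x, self.offset(x)))       # geometry-adaptive branch
        return self.project(torch.cat(outs, dim=1))        # fuse all branches
```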
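The intra-/inter-attention scheme of the feature matching transformer resembles the alternating self- and cross-attention used by LoFTR-style matchers over flattened feature maps. A minimal sketch, assuming (B, N, C) token sequences from a reference and a source view; this is not the authors' architecture.

```python
import torch
import torch.nn as nn

class MatchingTransformerLayer(nn.Module):
    """Intra-attention (self) within each view, then inter-attention (cross)
    between views, to propagate long-range context (illustrative sketch)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, ref: torch.Tensor, src: torch.Tensor):
        # ref, src: (B, N, C) token sequences of flattened feature maps
        ref = self.norm1(ref + self.self_attn(ref, ref, ref)[0])   # intra-attention
        src = self.norm1(src + self.self_attn(src, src, src)[0])
        ref = self.norm2(ref + self.cross_attn(ref, src, src)[0])  # inter-attention
        src = self.norm2(src + self.cross_attn(src, ref, ref)[0])
        return ref, src
```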
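The segmentation-guided counting pipeline combines YOLOv11-Seg masks with depth filtering. A minimal single-view sketch, assuming Ultralytics pretrained weights ("yolo11n-seg.pt"), masks at depth-map resolution, a pinhole intrinsic matrix K, and DBSCAN clustering of back-projected fruit points; the depth range and eps are placeholder values, and the paper's multi-view aggregation is omitted.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from ultralytics import YOLO  # YOLOv11 segmentation model

def count_fruits(image, depth, K, depth_range=(0.2, 5.0), eps=0.05):
    """Back-project depth-filtered fruit pixels to 3D and count clusters.
    Weight file, thresholds, and eps are illustrative assumptions."""
    model = YOLO("yolo11n-seg.pt")            # assumed pretrained weights
    masks = model(image)[0].masks             # instance masks, or None
    fruit = np.zeros(depth.shape, dtype=bool)
    if masks is not None:
        for m in masks.data.cpu().numpy():    # assumes masks match depth size
            fruit |= m.astype(bool)           # union of all fruit instances
    # Depth filtering: drop pixels outside a plausible working range
    valid = fruit & (depth > depth_range[0]) & (depth < depth_range[1])
    v, u = np.nonzero(valid)
    if v.size == 0:
        return 0
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]           # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    labels = DBSCAN(eps=eps, min_samples=20).fit_predict(pts)
    return int(labels.max()) + 1              # clusters 0..k-1; -1 is noise
```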