MSA-MVSNet: A Cross-Scale Collaborative Attention-Based Multi-View Reconstruction Network for Orchard Tree 3D Reconstruction with Instance Segmentation for Fruit Counting
Abstract
To address the detail loss and matching difficulties in fruit-tree 3D reconstruction caused by complex branch–leaf morphology, fruit occlusion, and illumination variations, this paper proposes an end-to-end cross-scale collaborative attention multi-view stereo network, termed MSA-MVSNet, for high-quality 3D reconstruction of orchard trees, together with instance segmentation for fruit counting. A multi-scale feature enhancement module adaptively fuses deep semantic features with shallow fine-grained details through a spatial–channel collaborative attention mechanism, strengthening the network's representation of multi-scale structures such as trunks, branches, and leaves. Multi-branch dilated convolutions enlarge the receptive field, and deformable convolutions adaptively capture the irregular geometry of fruits, improving modeling robustness. In addition, a feature matching transformer strengthens long-range global contextual correlations within and across images via intra-attention and inter-attention mechanisms, improving matching stability in low-texture and repetitive-texture regions. To validate the proposed method, experiments are conducted on a self-collected real orchard dataset and on public benchmark datasets. The results demonstrate that MSA-MVSNet outperforms baseline models in 3D reconstruction quality by 8.2%. Finally, by combining depth filtering with the instance segmentation results of YOLOv11-Seg, a segmentation-guided fruit reconstruction and counting framework is constructed. This framework achieves an overall counting F1-score of 92.8% on the self-collected dataset across scenes of varying sparsity and 93.5% on the public Fuji-SfM dataset, demonstrating its effectiveness and generalization capability.
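The abstract gives no implementation details, but the spatial–channel collaborative attention fusion it describes can be illustrated with a rough PyTorch sketch combining squeeze-and-excitation-style channel gating with a convolutional spatial gate. All names here (CollaborativeAttentionFusion, channel_gate, spatial_gate) are hypothetical, not the authors' code, and the sketch assumes both feature streams already share a channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaborativeAttentionFusion(nn.Module):
    """Fuse deep semantic features with shallow fine-grained features via
    channel and spatial attention (illustrative sketch, hypothetical names)."""
    def __init__(self, channels: int):
        super().__init__()
        # Channel attention: squeeze-and-excitation-style gating
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over avg/max pooled descriptors
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        # Upsample deep semantic features to the shallow (fine) resolution;
        # assumes both streams have `channels` channels (lateral convs omitted)
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                             align_corners=False)
        fused = deep + shallow
        fused = fused * self.channel_gate(fused)           # channel reweighting
        avg = fused.mean(dim=1, keepdim=True)              # spatial descriptors
        mx, _ = fused.max(dim=1, keepdim=True)
        fused = fused * self.spatial_gate(torch.cat([avg, mx], dim=1))
        return fused
```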
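The multi-branch dilated convolutions and the deformable branch could plausibly be combined as below, using torchvision's DeformConv2d. The dilation rates, branch count, and offset predictor are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DilatedDeformableBlock(nn.Module):
    """Parallel dilated branches enlarge the receptive field; a deformable
    branch adapts its sampling grid to irregular fruit shapes (sketch only)."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        # DeformConv2d consumes per-position sampling offsets (2*k*k channels)
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)
        self.project = nn.Conv2d(channels * (len(dilations) + 1), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [branch(x) for branch in self.branches]
        outs.append(self.deform(x, self.offset(x)))       # geometry-adaptive branch
        return self.project(torch.cat(outs, dim=1))        # fuse all branches
```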
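The intra-/inter-attention scheme of the feature matching transformer resembles the alternating self- and cross-attention used by LoFTR-style matchers over flattened feature maps. A minimal sketch, assuming (B, N, C) token sequences from a reference and a source view; this is not the authors' architecture.

```python
import torch
import torch.nn as nn

class MatchingTransformerLayer(nn.Module):
    """Intra-attention (self) within each view, then inter-attention (cross)
    between views, to propagate long-range context (illustrative sketch)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, ref: torch.Tensor, src: torch.Tensor):
        # ref, src: (B, N, C) token sequences of flattened feature maps
        ref = self.norm1(ref + self.self_attn(ref, ref, ref)[0])   # intra-attention
        src = self.norm1(src + self.self_attn(src, src, src)[0])
        ref = self.norm2(ref + self.cross_attn(ref, src, src)[0])  # inter-attention
        src = self.norm2(src + self.cross_attn(src, ref, ref)[0])
        return ref, src
```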
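The segmentation-guided counting pipeline combines YOLOv11-Seg masks with depth filtering. A minimal single-view sketch, assuming Ultralytics pretrained weights ("yolo11n-seg.pt"), masks at depth-map resolution, a pinhole intrinsic matrix K, and DBSCAN clustering of back-projected fruit points; the depth range and eps are placeholder values, and the paper's multi-view aggregation is omitted.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from ultralytics import YOLO  # YOLOv11 segmentation model

def count_fruits(image, depth, K, depth_range=(0.2, 5.0), eps=0.05):
    """Back-project depth-filtered fruit pixels to 3D and count clusters.
    Weight file, thresholds, and eps are illustrative assumptions."""
    model = YOLO("yolo11n-seg.pt")            # assumed pretrained weights
    masks = model(image)[0].masks             # instance masks, or None
    fruit = np.zeros(depth.shape, dtype=bool)
    if masks is not None:
        for m in masks.data.cpu().numpy():    # assumes masks match depth size
            fruit |= m.astype(bool)           # union of all fruit instances
    # Depth filtering: drop pixels outside a plausible working range
    valid = fruit & (depth > depth_range[0]) & (depth < depth_range[1])
    v, u = np.nonzero(valid)
    if v.size == 0:
        return 0
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]           # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    labels = DBSCAN(eps=eps, min_samples=20).fit_predict(pts)
    return int(labels.max()) + 1              # clusters 0..k-1; -1 is noise
```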