Citation: Chao Qi, Zhao Yandong, Liu Shengbo. Multi-modal-fusion-based 3D semantic segmentation algorithm[J]. Infrared and Laser Engineering, 2024, 53(5): 20240026. DOI: 10.3788/IRLA20240026

Multi-modal-fusion-based 3D semantic segmentation algorithm

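  • Summary: Efficiently extracting densely perceived image features and true 3D point cloud features, and exploiting the strengths of each so that they complement one another, is the key to improving 3D object recognition. This paper proposes a multi-modal framework that fuses images and point clouds for the 3D semantic segmentation task. The image and point cloud feature extraction branches are independent of each other. A depth estimation fusion network is designed for the image branch; it fuses dense image semantic information with depth features that are explicitly supervised by ground truth, compensating for the disorder and sparsity of point clouds. The voxel feature extraction method is also improved to reduce the information loss caused by point cloud voxelization. After the image and point cloud branches extract multi-scale features, a dynamic feature fusion module improves the network's ability to extract key features and obtains global features more effectively. In addition, a point-level multi-modal fusion data augmentation strategy is proposed, which increases sample diversity while effectively alleviating sample imbalance. In comparative experiments on the public Pandaset dataset, the proposed multi-modal fusion framework shows better performance and stronger robustness, with especially clear gains on small-sample classes and small targets.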

     

    Abstract:
      Objective  In computer vision, cameras and LiDAR have complementary strengths. Cameras provide dense perception and RGB information and therefore capture rich semantic information. LiDAR provides more accurate ranging and therefore more accurate spatial information. Exploiting the strengths of both sensors so that their information is complementary is the key to improving 3D object recognition. Single-modal LiDAR point cloud recognition networks, whether point-based or voxel-based, suffer from either long processing times or the information loss caused by point cloud voxelization. Existing multi-modal networks that fuse images rely too heavily on the point cloud input yet do not reduce the information loss caused by voxelization; this weakens the high-dimensional semantic information provided by images and leaves the complementary information between point clouds and images underused. To address these issues, this paper improves the feature generation network and the multi-modal fusion strategy, and proposes a point-level multi-modal data augmentation strategy to further improve model performance.
      Methods  The multi-modal network framework uses independent image and point cloud branches to extract multi-scale features, which are fused at the feature level (Fig.1). The image branch uses a depth estimation fusion network to fuse dense image semantic information with depth features supervised by ground truth (Fig.2), compensating for the disorder and sparsity of point clouds. In the point cloud branch, the voxel feature extraction method is improved (Fig.3): instead of using only the voxel center point features, it fuses vector features, standard deviation features, and extremum features. Features are fused with the dynamic feature fusion module (Fig.4), which improves the network's ability to extract key features and obtains global features more effectively. A point-level multi-modal fusion data augmentation strategy is also proposed; it increases sample diversity and alleviates sample imbalance to a certain extent, improving model performance.
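
      The voxel feature idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `voxel_features`, the dictionary layout, and the interpretation of the "vector feature" as the offset of the point mean from the voxel center are all assumptions made for illustration. It only shows how several per-voxel statistics, rather than a single center point, might be computed before fusion.

```python
from collections import defaultdict
from math import floor, sqrt


def voxel_features(points, voxel_size):
    """Group 3D points into voxels and compute simple per-voxel statistics.

    Returns {voxel_index: feature_dict}; every feature is an (x, y, z) tuple.
    Hypothetical helper for illustration, not the paper's method.
    """
    buckets = defaultdict(list)
    for p in points:
        # Integer voxel index along each axis.
        key = tuple(floor(c / voxel_size) for c in p)
        buckets[key].append(p)

    features = {}
    for key, pts in buckets.items():
        n = len(pts)
        axes = list(zip(*pts))  # three tuples: all x, all y, all z
        mean = tuple(sum(a) / n for a in axes)
        center = tuple((k + 0.5) * voxel_size for k in key)
        features[key] = {
            # Assumed "vector" feature: point mean relative to the voxel center.
            "offset": tuple(m - c for m, c in zip(mean, center)),
            # Population standard deviation along each axis.
            "std": tuple(
                sqrt(sum((v - m) ** 2 for v in a) / n)
                for a, m in zip(axes, mean)
            ),
            # Extremum features: per-axis minimum and maximum.
            "min": tuple(min(a) for a in axes),
            "max": tuple(max(a) for a in axes),
        }
    return features


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.5, 0.2, 0.2)]
    for index, feats in sorted(voxel_features(cloud, voxel_size=1.0).items()):
        print(index, feats)
```

      In a real network these statistics would typically be concatenated and passed through learned layers; the sketch stops at the statistics themselves.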
      Results and Discussions  Experiments are conducted on Pandaset, an open-source dataset for L5 autonomous driving, with IoU as the evaluation metric for semantic segmentation performance. The point-level multi-modal fusion data augmentation strategy is first visualized on Pandaset; it outperforms previous methods in visual quality and in the realism of the augmented samples (Fig.5-6). Comparative experiments are then conducted on this dataset against mainstream 3D semantic segmentation algorithms, covering both single-modal point cloud methods and multi-modal image and point cloud fusion methods. The proposed algorithm improves performance on most labels and on mIoU (Tab.1), and the improvement is more significant on distant or small targets. Ablation experiments verify the contribution of each proposed module to model performance (Tab.2). Additional comparative experiments on data augmentation strategies show that the point-level data augmentation strategy proposed in this paper also outperforms previous data augmentation methods in object detection tasks (Tab.3).
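
      For readers unfamiliar with the metric, the following is a minimal sketch of how per-class IoU and mIoU are commonly computed from per-point labels. The function names and the choice to skip classes that never occur are assumptions for illustration; the evaluation protocol actually used in the paper may differ (for example in how ignored labels are handled).

```python
def per_class_iou(pred, target, num_classes):
    """Return a list of IoU values, one per class (None if the class is absent).

    IoU for class c = |pred == c and target == c| / |pred == c or target == c|.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Correct point: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong point: false positive for p, false negative for t.
            union[p] += 1
            union[t] += 1
    return [
        intersection[c] / union[c] if union[c] else None
        for c in range(num_classes)
    ]


def mean_iou(ious):
    """Average IoU over the classes that appear in predictions or labels."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else None


if __name__ == "__main__":
    predicted = [0, 0, 1, 1, 2]
    labels = [0, 1, 1, 1, 2]
    ious = per_class_iou(predicted, labels, num_classes=4)
    print(ious, mean_iou(ious))
```

      Because mIoU averages over classes rather than points, rare classes weigh as much as frequent ones, which is why gains on small-sample classes are visible in this metric.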
      Conclusions  This paper improves the image and point cloud feature extraction networks and designs a multi-modal network framework for image and point cloud fusion, combining the strengths of densely perceived images and true 3D point clouds so that their information is complementary. The framework improves 3D object recognition performance, with more significant gains on small-sample classes and small targets. Comparative and ablation experiments on the open-source dataset Pandaset demonstrate the effectiveness of the proposed algorithm.

     
