Abstract:
Objective: Infrared and visible image fusion aims to integrate the complementary information of infrared radiation and visible-light textures into a single image, enhancing both perceptual quality and downstream task performance. However, most existing encoder-decoder architectures struggle to extract modality-shared and modality-specific features simultaneously, leading to suboptimal fusion results. To address these limitations, this paper proposes MBAFuse, a novel end-to-end multi-branch autoencoder fusion network tailored for infrared and visible image fusion. The proposed method is designed to enhance the extraction of complementary features and to improve the quality and robustness of the fused image.
Methods: MBAFuse adopts a three-branch encoder architecture to separately extract shared and modality-specific features from infrared and visible images. A DenseBlock module extracts modality-invariant shared features, while an Invertible Neural Network (INN) and an Outlook Attention module capture modality-specific low-frequency and high-frequency details, respectively. To better disentangle the shared and unique components, a modality correlation loss is introduced. The decoder employs a Restormer module, which leverages multi-head self-attention to enhance reconstruction fidelity and detail retention. Furthermore, a composite loss function is formulated by combining structural similarity (SSIM) loss, mean squared error (MSE) loss, gradient loss, and correlation loss. The network is optimized with a two-stage training strategy that progressively improves feature learning and fusion quality.
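To make the composite objective concrete, the following PyTorch-style sketch shows one way the SSIM, MSE, gradient, and correlation terms could be combined. It is an illustrative assumption rather than the authors' implementation: the loss weights, the windowless (global) SSIM approximation, the Sobel-based gradient operator, the max-of-sources gradient target, and the specific correlation formulation are all placeholders; inputs are assumed to be single-channel images in [0, 1].

```python
# Minimal sketch (not the authors' code) of a composite fusion loss:
# SSIM + MSE + gradient + modality-correlation terms.
import torch
import torch.nn.functional as F


def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified global SSIM per image (no sliding window), for illustration only."""
    mu_x, mu_y = x.mean(dim=(2, 3)), y.mean(dim=(2, 3))
    var_x, var_y = x.var(dim=(2, 3)), y.var(dim=(2, 3))
    cov = ((x - mu_x[..., None, None]) * (y - mu_y[..., None, None])).mean(dim=(2, 3))
    s = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return s.mean()


def sobel_gradients(x):
    """Per-channel Sobel gradient magnitude (one possible gradient operator)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device, dtype=x.dtype).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = x.shape[1]
    gx = F.conv2d(x, kx.expand(c, 1, 3, 3), padding=1, groups=c)
    gy = F.conv2d(x, ky.expand(c, 1, 3, 3), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)


def correlation(a, b):
    """Pearson correlation between two flattened feature maps, per sample."""
    a = a.flatten(1) - a.flatten(1).mean(dim=1, keepdim=True)
    b = b.flatten(1) - b.flatten(1).mean(dim=1, keepdim=True)
    return (a * b).sum(dim=1) / (a.norm(dim=1) * b.norm(dim=1) + 1e-6)


def fusion_loss(fused, ir, vis, shared_ir, shared_vis, spec_ir, spec_vis,
                w_ssim=1.0, w_mse=1.0, w_grad=10.0, w_corr=1.0):
    """Composite objective: structure + intensity + gradient + modality correlation."""
    l_ssim = 0.5 * ((1 - ssim_global(fused, ir)) + (1 - ssim_global(fused, vis)))
    l_mse = 0.5 * (F.mse_loss(fused, ir) + F.mse_loss(fused, vis))
    # Encourage the fused image to follow the stronger gradient of either source.
    l_grad = F.l1_loss(sobel_gradients(fused),
                       torch.maximum(sobel_gradients(ir), sobel_gradients(vis)))
    # Shared features should correlate across modalities; modality-specific
    # features should be decorrelated from each other.
    l_corr = (1 - correlation(shared_ir, shared_vis)).mean() + correlation(spec_ir, spec_vis).abs().mean()
    return w_ssim * l_ssim + w_mse * l_mse + w_grad * l_grad + w_corr * l_corr
```

In a two-stage scheme of the kind described above, such a loss would typically be applied with reconstruction-oriented weights in the first stage and fusion-oriented weights in the second; the exact schedule used by MBAFuse is not specified here.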
Results and Discussions: Extensive experiments are conducted on three widely used public datasets: TNO, MSRS, and RoadScene. MBAFuse demonstrates superior performance in both qualitative and quantitative evaluations, outperforming seven state-of-the-art fusion methods. Specifically, MBAFuse achieves average improvements of 10.2%, 29.7%, and 7.5% in entropy (EN), standard deviation (SD), and visual information fidelity (VIF), respectively, across the three datasets. Beyond infrared-visible fusion, the proposed method is also validated on medical image fusion (e.g., MRI-CT) and RGB image fusion tasks, exhibiting strong generalization and robustness across domains. While MBAFuse delivers strong fusion performance, the complexity of its multi-branch architecture introduces computational overhead, which poses challenges for real-time applications and dynamic environments. Future work will focus on lightweight model design and efficiency optimization to enable real-time fusion for video streams and mobile scenarios.
Conclusions: This paper proposes MBAFuse, a novel multi-branch autoencoder fusion network for infrared and visible image fusion. The proposed method effectively combines shared and modality-specific feature extraction through a three-branch encoder architecture incorporating DenseBlocks, an Invertible Neural Network (INN), and an Outlook Attention module. By introducing a comprehensive loss function and a two-stage training strategy, MBAFuse significantly enhances both the visual quality and structural consistency of the fused images. Experimental results on the TNO, MSRS, and RoadScene datasets demonstrate that MBAFuse outperforms several state-of-the-art fusion methods in both objective metrics and subjective evaluations. Moreover, the method exhibits strong generalization across medical and RGB image fusion tasks. Despite the increased complexity introduced by the multi-branch structure, MBAFuse offers a robust and effective solution for multi-modal image fusion. Future work will optimize the network architecture for real-time performance and extend the approach to dynamic and video-based fusion scenarios.