Visible-infrared image translation algorithm based on conditional generative adversarial networks

  • Abstract: To address the scarcity of infrared image datasets for downstream tasks such as aerial target detection, together with the texture-detail distortion and limited radiometric fidelity that affect visible-to-infrared cross-modal image translation, this paper proposes a visible-to-infrared image translation algorithm based on an improved conditional generative adversarial network. A hybrid encoder that combines ConvNeXt modules with Swin-Transformer modules jointly optimizes local feature extraction and global context modeling; a decoder built on dual residual structures, together with a content-guided feature attention module, resolves gradient dissipation in cross-level feature fusion; and a multi-scale discriminator with a composite loss function (L1 content loss, edge loss, and adversarial loss) markedly improves the physical fidelity of the generated images. Experiments on the VEDAI and AVIID-3 datasets show that the method reaches PSNR values of 32.098 and 26.553, improvements of 10.6% and 17.8% over the baseline model, with SSIM gains of 4.7% and 34.0% and LPIPS improvements of 32.7% and 32.9%. Ablation studies confirm the contribution of the architectural improvements, and subjective evaluation verifies the algorithm's superiority in complex scenes, such as reconstructing the radiation characteristics of buildings, vegetation, and water bodies and preserving the edges of low-contrast targets. The results provide an efficient way to overcome the bottleneck of scarce infrared samples and have significant application value for military reconnaissance and security surveillance tasks.
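To make the cross-level fusion step concrete, the sketch below shows one plausible PyTorch form of a content-guided feature attention block that merges shallow encoder features with deep decoder features through parallel channel and spatial attention, combined by broadcast addition and gated after a group convolution, as outlined in the Methods section below. The layer sizes, group count, and residual arrangement are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class ContentGuidedFusion(nn.Module):
    """Sketch of a content-guided feature attention block (CGFAM-style).

    A shallow encoder feature and a deep decoder feature of the same shape
    are summed, then re-weighted by the combination of a channel-attention
    branch (global pooling + 1x1 bottleneck) and a spatial-attention branch
    (group convolution); the two branches merge by broadcast addition.
    All layer sizes here are illustrative assumptions.
    """

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.channel_att = nn.Sequential(              # (B, C, 1, 1) response
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
        )
        self.spatial_att = nn.Conv2d(                  # (B, C, H, W) response
            channels, channels, kernel_size=7, padding=3, groups=groups
        )
        self.gate = nn.Sigmoid()

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        fused = shallow + deep                         # cross-level aggregation
        # Broadcast addition merges the per-channel and per-pixel responses.
        attention = self.gate(self.channel_att(fused) + self.spatial_att(fused))
        # Residual connection to the deep path helps limit gradient dissipation.
        return fused * attention + deep


if __name__ == "__main__":
    block = ContentGuidedFusion(channels=256)
    x = torch.randn(2, 256, 64, 64)
    print(block(x, x).shape)                           # torch.Size([2, 256, 64, 64])
```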

     

    Abstract:
    Objective The scarcity of high-quality infrared (IR) datasets poses a critical bottleneck for downstream tasks such as aerial target detection and tracking, primarily due to the prohibitive costs of IR imaging equipment, stringent data acquisition conditions, and restricted access to sensitive target samples. Traditional physics-based simulation methods suffer from inherent limitations, including high computational complexity, low efficiency, and domain gaps caused by insufficient modeling of deep feature correlations in dynamic environments. Meanwhile, existing deep learning approaches for visible-to-IR image translation struggle with texture distortion and insufficient radiation fidelity, leading to performance degradation in practical applications. This study proposes an advanced conditional generative adversarial network (cGAN) framework to address these challenges, aiming to generate high-fidelity IR images with enhanced structural integrity and thermal radiation consistency for military reconnaissance, aviation monitoring, and security surveillance applications.
    Methods The proposed framework employs an enhanced conditional generative adversarial network (cGAN) to address visible-to-infrared (IR) image translation challenges. The generator synthesizes high-fidelity IR images through a hybrid encoder-decoder architecture (Fig.1). The encoder integrates ConvNeXt modules for hierarchical local feature extraction using 7×7 depthwise separable convolutions, constructing a four-stage feature pyramid. To model global spatial dependencies, Swin-Transformer blocks are embedded in the deep layers, leveraging window-based self-attention to capture long-range contextual relationships. The decoder employs dual residual connections with transposed-convolution units, combining 1×1 convolutions for channel adjustment and 3×3 deconvolutions for spatial upsampling. A content-guided feature attention module (CGFAM) is introduced to mitigate gradient dissipation during cross-level feature fusion; it dynamically fuses shallow and deep features through parallel channel-spatial attention, using broadcast addition and group convolution to achieve cross-dimensional interaction. A multi-scale discriminator extracts hierarchical features at resolutions of 256×256, 128×128, and 64×64, aggregating its outputs by summation to enhance global semantic modeling. Adversarial training is driven by a composite loss function with three components: (1) an L1 content loss that ensures pixel-level consistency between generated and real IR images; (2) an edge-preserving loss that integrates Sobel (first-order gradient) and Laplacian (second-order gradient) operators to enforce structural alignment; and (3) an adversarial loss that guides realistic texture generation through dynamic competition between the generator and discriminator. Training uses the Adam optimizer with a two-phase learning-rate schedule, maintaining stability during initial convergence and enabling fine-tuning in later stages; the loss weights are configured empirically to balance detail preservation and training stability.
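As a rough illustration of the composite objective described above, the following PyTorch sketch combines an L1 content term, a Sobel/Laplacian edge term, and a standard adversarial term. The kernel definitions follow the usual operator forms; the weighting values and the single-channel-image assumption are placeholders, not the configuration used in the paper.

```python
import torch
import torch.nn.functional as F

# First-order (Sobel) and second-order (Laplacian) gradient kernels,
# shaped (out_channels, in_channels, H, W) for single-channel IR images.
SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)
LAPLACE = torch.tensor([[0.,  1., 0.],
                        [1., -4., 1.],
                        [0.,  1., 0.]]).view(1, 1, 3, 3)


def edge_loss(fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    """L1 distance between gradient maps of generated and real IR images."""
    loss = fake.new_zeros(())
    for kernel in (SOBEL_X, SOBEL_Y, LAPLACE):
        kernel = kernel.to(fake.device, fake.dtype)
        loss = loss + F.l1_loss(F.conv2d(fake, kernel, padding=1),
                                F.conv2d(real, kernel, padding=1))
    return loss


def generator_loss(fake, real, d_fake_logits,
                   w_l1: float = 100.0, w_edge: float = 10.0) -> torch.Tensor:
    """Composite objective: adversarial + weighted L1 content + edge terms.

    d_fake_logits are the discriminator's raw outputs on generated images;
    w_l1 and w_edge are placeholder weights, not the paper's tuned values.
    """
    adversarial = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    content = F.l1_loss(fake, real)
    return adversarial + w_l1 * content + w_edge * edge_loss(fake, real)
```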
    Results and Discussions The algorithm was evaluated on the VEDAI and AVIID-3 datasets and showed significant improvements over state-of-the-art methods (Pix2Pix, IRGAN, etc.). On VEDAI, it achieved a PSNR of 32.098 (+10.6%) and an SSIM of 0.945 (+4.7%), with LPIPS improved by 32.7%. Tests on AVIID-3 confirmed robustness in complex aerial scenes, yielding gains of 17.8% in PSNR and 34.0% in SSIM. Ablation studies highlighted the critical contributions: the hybrid generator architecture (ConvNeXt + Swin-Transformer) and the multi-scale discriminator accounted for 82.4% of the performance gains, while the content-guided attention module reduced LPIPS by 5.0%, enhancing cross-layer fusion and structural fidelity. Subjective evaluations (Figs.11-12) demonstrated geometric accuracy, eliminating artifacts in water-body radiation reconstruction (e.g., grayscale uniformity) and preserving the edges of low-altitude targets (e.g., vehicle contours). These results validate the efficacy of the hybrid encoder-decoder design and the multi-scale discriminator framework for visible-to-infrared translation, supporting applications in aerial surveillance and target detection.
    Conclusions This work presents a novel cGAN-based framework for visible-to-IR image translation, addressing critical challenges in texture fidelity and radiation consistency. Key innovations include a hybrid encoder-decoder architecture for local-global feature synergy, a multi-scale discriminator for enhanced semantic modeling, and a composite loss function for edge-aware optimization. Experimental results confirm the method’s superiority over state-of-the-art models, with significant improvements in both objective metrics and subjective visual quality. The framework’s ability to generate physically realistic IR images unlocks potential applications in defense and surveillance systems, particularly in scenarios with limited real-world IR data. Future research will focus on lightweight network design for real-time deployment.

     
