<span id="fpn9h"><noframes id="fpn9h"><span id="fpn9h"></span>
<span id="fpn9h"><noframes id="fpn9h">
<th id="fpn9h"></th>
<strike id="fpn9h"><noframes id="fpn9h"><strike id="fpn9h"></strike>
<th id="fpn9h"><noframes id="fpn9h">
<span id="fpn9h"><video id="fpn9h"></video></span>
<ruby id="fpn9h"></ruby>
<strike id="fpn9h"><noframes id="fpn9h"><span id="fpn9h"></span>

A Text-Guided Multimodal Image Fusion Method Based on Dual-Branch Heterogeneous Encoders

Abstract: To meet the fusion efficiency and perception performance requirements of infrared and visible image fusion on resource-constrained UAV platforms, this paper proposes a text-guided multimodal image fusion method based on dual-branch heterogeneous encoders. A lightweight dual-branch heterogeneous encoding network is designed to exploit the complementary information of the infrared and visible modalities: the infrared branch emphasizes thermal targets and edge responses, while the visible branch focuses on texture and detail modeling, thereby avoiding the feature redundancy and performance bottlenecks caused by homogeneous encoders. A lightweight cross-modal feature fusion module is introduced to strengthen the complementarity and expressive capacity across modalities. In addition, semantic text features extracted by a pre-trained vision-language model guide and regulate the fusion process, improving the semantic consistency and environmental adaptability of the fused images. Extensive comparative experiments against seven representative fusion algorithms on three public multimodal image datasets show that the proposed network outperforms them on multiple mainstream metrics, including mutual information (MI), structural similarity (SSIM), and peak signal-to-noise ratio (PSNR), and effectively improves the detail clarity and structural consistency of the fused images. It also reduces inference time by approximately 50% with negligible performance loss, demonstrating a strong efficiency advantage and deployment potential.
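
The abstract describes the pipeline only at a high level; the minimal PyTorch sketch below illustrates how such a dual-branch heterogeneous design with text-guided fusion could be wired together. All names (InfraredBranch, VisibleBranch, TextGuidedFusion), layer depths, channel widths, the FiLM-style text modulation, and the 512-dimensional text embedding (a stand-in for a frozen CLIP-like text feature) are illustrative assumptions, not the authors' actual network.

```python
# Minimal sketch of a dual-branch heterogeneous encoder with text-guided fusion.
# Assumptions: module names, channel widths, and FiLM-style text modulation are
# illustrative; they are not taken from the paper's implementation.
import torch
import torch.nn as nn


class InfraredBranch(nn.Module):
    """Infrared encoder: a shallow stack intended to preserve the strong
    intensity and edge responses of thermal targets."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)


class VisibleBranch(nn.Module):
    """Visible-light encoder: a deeper stack with a dilated layer aimed at
    texture and detail modeling over a larger receptive field."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)


class TextGuidedFusion(nn.Module):
    """Cross-modal fusion: concatenate the two branch features, modulate them
    with a text embedding (FiLM-style per-channel scale/shift), then decode."""
    def __init__(self, channels=32, text_dim=512):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, 1)
        self.film = nn.Linear(text_dim, 2 * channels)  # -> (gamma, beta) per channel
        self.decode = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, f_ir, f_vis, text_emb):
        f = self.mix(torch.cat([f_ir, f_vis], dim=1))
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)
        f = f * gamma[..., None, None] + beta[..., None, None]
        return self.decode(f)


if __name__ == "__main__":
    ir = torch.rand(1, 1, 256, 256)    # infrared image
    vis = torch.rand(1, 1, 256, 256)   # visible image (grayscale for simplicity)
    text_emb = torch.randn(1, 512)     # stand-in for a frozen text feature (e.g. CLIP)
    fused = TextGuidedFusion()(InfraredBranch()(ir), VisibleBranch()(vis), text_emb)
    print(fused.shape)                 # torch.Size([1, 1, 256, 256])
```

The asymmetry between the two branches (a shallow stack versus a deeper, dilated stack) reflects the complementary roles described in the abstract: the infrared path preserves thermal and edge responses while the visible path captures finer texture; the authors' concrete modules may differ.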

     

<span id="fpn9h"><noframes id="fpn9h"><span id="fpn9h"></span>
<span id="fpn9h"><noframes id="fpn9h">
<th id="fpn9h"></th>
<strike id="fpn9h"><noframes id="fpn9h"><strike id="fpn9h"></strike>
<th id="fpn9h"><noframes id="fpn9h">
<span id="fpn9h"><video id="fpn9h"></video></span>
<ruby id="fpn9h"></ruby>
<strike id="fpn9h"><noframes id="fpn9h"><span id="fpn9h"></span>
www.77susu.com