Funding: This work was supported by the High-Tech Industry Science and Technology Innovation Leading Plan Project of Hunan Province under Grant 2020GK2026, author B.Y, http://kjt.hunan.gov.cn/.
Abstract: Regular inspection of bridge cracks is crucial to bridge maintenance and repair. Traditional manual crack detection methods are time-consuming, dangerous and subjective, and the existing mainstream vision-based automatic crack detection algorithms struggle to detect fine cracks and to balance detection accuracy and speed. Therefore, this paper proposes a new bridge crack segmentation method based on a parallel attention mechanism and multi-scale feature fusion, built on top of the DeeplabV3+ network framework. First, an improved lightweight MobileNetv2 network and dilated separable convolutions are integrated into the original DeeplabV3+ network to replace the original Xception backbone and to improve the atrous spatial pyramid pooling (ASPP) module, respectively, dramatically reducing the number of parameters in the network and accelerating the training and prediction speed of the model. Moreover, we introduce the parallel attention mechanism into the encoding and decoding stages, which enhances attention to the crack regions along both the channel and spatial dimensions and significantly suppresses the interference of various noises. Finally, we further improve the detection performance of the model for fine cracks by introducing a multi-scale feature fusion module. Our results are validated on a self-made dataset. The experiments show that our method is more accurate than other methods: its intersection over union (IoU) and F1-score (F1) increase to 77.96% and 87.57%, respectively. In addition, the number of parameters is only 4.10M, much smaller than the original network, and the frames per second (FPS) increases to 15 frames/s. The results prove that the proposed method fits the requirements of rapid and accurate bridge crack detection well and is superior to other methods.
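To make the parallel attention idea concrete, here is a minimal PyTorch sketch of a block that re-weights features along the channel and spatial dimensions in parallel. The class name ParallelAttention, the reduction ratio, the 7x7 spatial kernel, and the additive merge of the two branches are illustrative assumptions, not the paper's published implementation.

```python
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    """Re-weights features along channel and spatial dimensions in parallel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel branch: global average pool, then a bottleneck MLP (as 1x1
        # convs) produces one sigmoid weight per channel.
        self.channel_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial branch: a wide conv collapses channels into one sigmoid
        # weight per pixel.
        self.spatial_branch = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ca = self.channel_branch(x)   # (B, C, 1, 1)
        sa = self.spatial_branch(x)   # (B, 1, H, W)
        # Both branches see the unmodified input; their outputs are summed.
        return x * ca + x * sa

x = torch.randn(2, 64, 32, 32)
print(ParallelAttention(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```

Running the two branches in parallel, rather than in series as in CBAM-style blocks, lets each branch see the unmodified input, which is one plausible reading of the abstract's "channel and spatial parts".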
Funding: The support of this research by the Hubei Provincial Natural Science Foundation (2022CFB449) and the Science Research Foundation of the Education Department of Hubei Province (B2020061) is gratefully acknowledged.
Abstract: The task of food image recognition, a nuanced subset of fine-grained image recognition, grapples with substantial intra-class variation and minimal inter-class differences. These challenges are compounded by the irregular and multi-scale nature of food images. Addressing these complexities, our study introduces an advanced model that leverages multiple attention mechanisms and multi-stage local fusion, grounded in the ConvNeXt architecture. Our model employs hybrid attention (HA) mechanisms to pinpoint critical discriminative regions within images, substantially mitigating the influence of background noise. It also introduces a multi-stage local fusion (MSLF) module, fostering long-distance dependencies between feature maps at varying stages. This approach facilitates the assimilation of complementary features across scales, significantly bolstering the model's capacity for feature extraction. Furthermore, we constructed a dataset named Roushi60, which consists of 60 categories of common meat dishes. Empirical evaluation on the ETH Food-101, ChineseFoodNet, and Roushi60 datasets reveals that our model achieves recognition accuracies of 91.12%, 82.86%, and 92.50%, respectively. These figures not only mark improvements of 1.04%, 3.42%, and 1.36% over the foundational ConvNeXt network but also surpass the performance of most contemporary food image recognition methods. Such advancements underscore the efficacy of the proposed model in navigating the intricate landscape of food image recognition, setting a new benchmark for the field.
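A multi-stage fusion of this kind could look like the following sketch, which projects feature maps from several backbone stages to a common width and resolution before mixing them. The name MultiStageFusion, the channel widths, and the bilinear resizing are assumptions chosen for illustration, since the abstract does not give the MSLF internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStageFusion(nn.Module):
    """Fuses feature maps from several backbone stages at a common scale."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # 1x1 projections bring every stage to the same channel width.
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        self.mix = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # Fuse at the deepest (smallest) stage's resolution.
        target = feats[-1].shape[-2:]
        fused = 0
        for f, p in zip(feats, self.proj):
            # Resizing lets spatially distant, cross-stage features interact.
            fused = fused + F.interpolate(p(f), size=target, mode="bilinear",
                                          align_corners=False)
        return self.mix(fused)

# Three stages with ConvNeXt-like channel widths (illustrative values).
feats = [torch.randn(1, 192, 28, 28), torch.randn(1, 384, 14, 14),
         torch.randn(1, 768, 7, 7)]
print(MultiStageFusion([192, 384, 768])(feats).shape)  # (1, 256, 7, 7)
```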
Funding: This work was supported in part by the National Natural Science Foundation of China under Grant 61772561, author J.Q, http://www.nsfc.gov.cn/; in part by the Science Research Projects of Hunan Provincial Education Department under Grant 18A174, author X.X, and under Grant 19B584, author Y.T, http://kxjsc.gov.hnedu.cn/; in part by the Natural Science Foundation of Hunan Province under Grant 2020JJ4140, author Y.T, and under Grant 2020JJ4141, author X.X, http://kjt.hunan.gov.cn/; in part by the Key Research and Development Plan of Hunan Province under Grant 2019SK2022, author Y.T, and under Grant CX20200730, author G.H, http://kjt.hunan.gov.cn/; and in part by the Graduate Science and Technology Innovation Fund Project of Central South University of Forestry and Technology under Grant CX20202038, author G.H, http://jwc.csuft.edu.cn/.
Abstract: Medical image segmentation is an important application of computer vision in medical image processing. Because different organs in medical images are close together and highly similar, current segmentation algorithms suffer from mis-segmentation and poor edge segmentation. To address these challenges, we propose a medical image segmentation network (AF-Net) based on attention mechanisms and feature fusion, which can effectively capture global information while focusing the network on the object area. In this approach, we add dual attention blocks (DA-block), comprising parallel channel and spatial attention branches, to the backbone network to adaptively calibrate and weight features. Secondly, a multi-scale feature fusion block (MFF-block) is proposed to obtain feature maps with different receptive fields and gather multi-scale information at low computational cost. Finally, to restore the locations and shapes of organs, we adopt global feature fusion blocks (GFF-block) to fuse high-level and low-level information, yielding accurate pixel positioning. We evaluate our method on multiple datasets (the aorta and lung datasets); the experimental results reach 94.0% mIoU and 96.3% DICE, showing that our approach performs better than U-Net and other state-of-the-art methods.
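The multi-scale feature fusion block might be sketched as parallel dilated 3x3 convolutions whose outputs are concatenated and merged with a cheap 1x1 convolution. The dilation rates (1, 2, 4) and the class name below are hypothetical, chosen only to illustrate how different receptive fields can be obtained at modest cost.

```python
import torch
import torch.nn as nn

class MultiScaleFusionBlock(nn.Module):
    """Parallel dilated 3x3 convs give branches different receptive fields."""
    def __init__(self, channels, rates=(1, 2, 4)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        # A 1x1 conv merges the concatenated multi-scale responses cheaply.
        self.merge = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):
        return self.merge(torch.cat([b(x) for b in self.branches], dim=1))

print(MultiScaleFusionBlock(64)(torch.randn(1, 64, 56, 56)).shape)
```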
Funding: This work was supported in part by the Fund of the National Sensor Network Engineering Technology Research Center (No. NSNC202103), the Natural Science Research Project in Colleges and Universities of Anhui Province (No. 2022AH040155), and the Undergraduate Teaching Quality and Teaching Reform Engineering Project of Chuzhou University (No. 2022ldtd03).
Abstract: In this paper, a feature interactive bi-temporal change detection network (FIBTNet) is designed to solve the problem of pseudo change in remote sensing image building change detection. The network improves the accuracy of change detection through bi-temporal feature interaction. FIBTNet comprises a bi-temporal feature exchange architecture (EXA) and a bi-temporal difference extraction architecture (DFA). EXA improves the feature exchange ability of the encoding process through multiple spatial, channel or hybrid feature exchange methods, while DFA uses a change residual (CR) module to improve the decoding process's ability to extract difference features at multiple scales. Additionally, at the junction of the encoder and decoder, channel exchange is combined with the CR module to achieve adaptive channel exchange, which further improves the decision-making performance of model feature fusion. Experimental results on the LEVIR-CD and S2Looking datasets demonstrate that FIBTNet achieves superior F1 scores, Intersection over Union (IoU), and Recall compared with mainstream building change detection models, confirming its effectiveness and superiority in remote sensing image change detection.
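Bi-temporal channel exchange admits a very compact implementation: a fixed mask swaps a subset of channels between the two temporal feature maps. The sketch below assumes a periodic mask (every second channel), which is one common variant; the paper's exact exchange schedule may differ.

```python
import torch

def channel_exchange(f1, f2, period=2):
    """Swaps every `period`-th channel between two bi-temporal feature maps."""
    c = f1.shape[1]
    mask = (torch.arange(c, device=f1.device) % period == 0).view(1, -1, 1, 1)
    out1 = torch.where(mask, f2, f1)  # f1 receives f2's masked channels
    out2 = torch.where(mask, f1, f2)  # and vice versa
    return out1, out2

a, b = torch.randn(1, 8, 4, 4), torch.randn(1, 8, 4, 4)
x, y = channel_exchange(a, b)
# Channel 0 was swapped, channel 1 was kept.
assert torch.equal(x[:, 0], b[:, 0]) and torch.equal(x[:, 1], a[:, 1])
```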
Abstract: Visual Question Answering (VQA) has sparked widespread interest as a crucial task in integrating vision and language. VQA primarily relies on attention mechanisms to associate relevant visual regions with input questions and answer them effectively. Detection-based features extracted by an object detection network capture the visual attention distribution over predetermined detection boxes and provide object-level insights, answering questions about foreground objects more effectively. However, they cannot answer questions about background regions that lack detection boxes, because they miss fine-grained details, which is the advantage of grid-based features. In this paper, we propose a Dual-Level Feature Embedding (DLFE) network, which integrates grid-based and detection-based image features in a unified architecture to realize the complementary advantages of both. Specifically, in DLFE, a novel Dual-Level Self-Attention (DLSA) module is first proposed to mine the intrinsic properties of the two feature types, with Positional Relation Attention (PRA) designed to model position information. Then, we propose Feature Fusion Attention (FFA) to address the semantic noise caused by fusing the two features and construct an alignment graph to enhance and align the grid and detection features. Finally, we use co-attention to learn interactive features of the image and question and answer questions more accurately. Our method significantly improves on the baseline, increasing accuracy from 66.01% to 70.63% on the test-std set of VQA 1.0 and from 66.24% to 70.91% on the test-std set of VQA 2.0.
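One way to fuse grid-based and detection-based features in a unified architecture is cross-attention, where flattened grid features attend to the detection (region) features. The sketch below uses PyTorch's nn.MultiheadAttention with a residual connection; the class name and dimensions are illustrative, not the DLFE modules themselves.

```python
import torch
import torch.nn as nn

class GridDetectionFusion(nn.Module):
    """Cross-attention: grid features attend to detection (region) features."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, grid, det):
        # grid: (B, H*W, D) flattened grid features; det: (B, K, D) box features.
        fused, _ = self.attn(query=grid, key=det, value=det)
        # The residual keeps the fine-grained grid detail that detection
        # features lack for background regions.
        return self.norm(grid + fused)

grid, det = torch.randn(2, 49, 512), torch.randn(2, 36, 512)
print(GridDetectionFusion()(grid, det).shape)  # torch.Size([2, 49, 512])
```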
Funding: This work was supported by the project of CSG Electric Power Research Institute (Grant No. SEPRI-K22B100).
Abstract: Current fusion methods for infrared and visible images tend to extract features at a single scale, which results in insufficient detail and incomplete feature preservation. To address these issues, we propose an infrared and visible image fusion network based on multiscale feature learning and an attention mechanism (MsAFusion). A multiscale dilated convolution framework is employed to capture image features across various scales and broaden the perceptual scope. Furthermore, an attention network is introduced to enhance the focus on salient targets in infrared images and detailed textures in visible images. To compensate for information loss during convolution, skip connections are utilized during the image reconstruction phase. The fusion process uses a combined loss function consisting of a pixel loss and a gradient loss for unsupervised fusion of infrared and visible images. Extensive experiments on a dataset of electricity facilities demonstrate that our proposed method outperforms nine state-of-the-art methods in terms of visual perception and four objective evaluation metrics.
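The combined pixel-plus-gradient objective can be written in a few lines. The sketch below uses an L1 pixel term toward the element-wise maximum of the two sources and an L1 term on Sobel gradient magnitudes; both targets are common choices in unsupervised infrared/visible fusion and are assumptions here, since the abstract does not specify the exact terms.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(img):
    """Per-channel Sobel gradient magnitude of a (B, C, H, W) image."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(-1, -2)
    c = img.shape[1]
    gx = F.conv2d(img, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return gx.abs() + gy.abs()

def fusion_loss(fused, ir, vis, alpha=1.0):
    # Pixel term pulls the fused image toward the brighter source pixel;
    # gradient term preserves the stronger edge of the two sources.
    pixel = F.l1_loss(fused, torch.maximum(ir, vis))
    grad = F.l1_loss(sobel_gradient(fused),
                     torch.maximum(sobel_gradient(ir), sobel_gradient(vis)))
    return pixel + alpha * grad

ir, vis = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(fusion_loss((ir + vis) / 2, ir, vis).item())
```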
Funding: National Youth Natural Science Foundation of China (No. 61806006); Innovation Program for Graduates of Jiangsu Province (No. KYLX160-781); Jiangsu University Superior Discipline Construction Project.
Abstract: To solve the difficult detection of distant and hard objects caused by the sparseness and insufficient semantic information of LiDAR point clouds, a 3D object detection network with multi-modal data adaptive fusion is proposed, which exploits multi-neighborhood information of voxels together with image information. First, we design an improved ResNet that maintains the structural information of distant and hard objects in low-resolution feature maps, making it more suitable for the detection task; meanwhile, the semantics of each image feature map are enhanced by semantic information from all subsequent feature maps. Second, we extract multi-neighborhood context information with different receptive field sizes to compensate for the sparseness of the point cloud, which improves the ability of voxel features to represent the spatial structure and semantic information of objects. Finally, we propose a multi-modal feature adaptive fusion strategy that uses learnable weights to express the contribution of different modal features to the detection task, and voxel attention further enhances the fused feature representation of valid target objects. Experimental results on the KITTI benchmark show that this method outperforms VoxelNet by remarkable margins, increasing the AP by 8.78% and 5.49% on the medium and hard difficulty levels, respectively. Meanwhile, our method achieves greater detection performance than many mainstream multi-modal methods, outperforming the AP of MVX-Net by 1% on the medium and hard difficulty levels.
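The learnable-weight fusion of voxel and image features reduces to a small gating network: concatenate the two per-voxel feature vectors, predict two softmax weights, and blend. Everything in the sketch, including the class name and the assumption that image features have already been sampled at the voxel projections, is illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveModalFusion(nn.Module):
    """Fuses voxel and image features with learned per-voxel weights."""
    def __init__(self, dim):
        super().__init__()
        # Predicts two softmax weights per voxel from both modalities.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, voxel, image):
        # voxel, image: (N, D) features for N voxels; image features are
        # assumed to be gathered at the voxel projections upstream.
        w = torch.softmax(self.gate(torch.cat([voxel, image], dim=-1)), dim=-1)
        return w[:, :1] * voxel + w[:, 1:] * image

v, i = torch.randn(100, 128), torch.randn(100, 128)
print(AdaptiveModalFusion(128)(v, i).shape)  # torch.Size([100, 128])
```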
Funding: This work was supported by the National Natural Science Foundation of China (No. 61872231), the National Key Research and Development Program of China (No. 2021YFC2801000), and the Major Research Plan of the National Social Science Foundation of China (No. 20&ZD130).
Abstract: The performance of Video Question and Answer (VQA) systems relies on capturing the key information of both visual images and natural language in context to generate relevant answers. However, traditional linear combinations of multimodal features capture only shallow feature interactions, falling far short of the deep feature fusion that is needed. Attention mechanisms have been used to perform deep fusion, but most of them can only assign weights within single-modal information, leading to attention imbalance across modalities. To address these problems, we propose a novel VQA model based on Triple Multimodal feature Cyclic Fusion (TMCF) and a Self-Adaptive Multimodal Balancing Mechanism (SAMB). Our model is designed to enhance complex feature interactions among multimodal features with cross-modal information balancing. In addition, TMCF and SAMB can be used as an extensible plug-in for exploring new feature combinations in the visual image domain. Extensive experiments were conducted on the MSVD-QA and MSRVTT-QA datasets. The results confirm the advantages of our approach in handling multimodal tasks. We also provide ablation studies to verify the effectiveness of each proposed component.
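The abstract does not detail TMCF, but a cyclic fusion of three modal feature sets can be illustrated with pairwise cross-attention arranged in a cycle (a attends to b, b to c, c to a). The sketch below is a hedged guess at the general pattern, not the authors' mechanism; the shared attention module and the modality names are assumptions.

```python
import torch
import torch.nn as nn

class CyclicFusion(nn.Module):
    """Fuses three modal features pairwise in a cycle: a<-b, b<-c, c<-a."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def cross(self, q, kv):
        out, _ = self.attn(q, kv, kv)
        return q + out  # residual cross-modal update

    def forward(self, a, b, c):
        # Each modality is enriched by the next one in the cycle.
        return self.cross(a, b), self.cross(b, c), self.cross(c, a)

appearance = torch.randn(2, 20, 256)   # frame-level features
motion = torch.randn(2, 20, 256)       # clip/motion features
question = torch.randn(2, 12, 256)     # question token features
outs = CyclicFusion(256)(appearance, motion, question)
print([o.shape for o in outs])
```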
Abstract: Point clouds of real scenes carry not only spatial geometric information but also the color information of three-dimensional objects, and existing networks cannot effectively exploit the local features and spatial geometric features of real scenes. We therefore propose DCFNet (dual-channel feature fusion of real scene for point cloud semantic segmentation), a dual-channel feature fusion method for point cloud semantic segmentation that can be applied to indoor and outdoor scenes. More specifically, to address the insufficient extraction of color information from real-scene point clouds, the method uses an upper and a lower input channel with identical feature extraction structures: the upper channel takes the complete RGB colors and point coordinates as input and focuses mainly on the features of complex object scenes, while the lower channel takes only the point coordinates and focuses mainly on the spatial geometric features of the point cloud. Within each channel, an inter-layer fusion module and a Transformer channel feature expansion module are introduced to better extract local and global information and improve network performance. Meanwhile, because existing 3D point cloud semantic segmentation methods pay too little attention to the connection between local and global features and therefore segment complex scenes poorly, the features extracted by the two channels are fused through a DCFFS (dual-channel feature fusion segmentation) module, and the real scene is segmented semantically. Experiments on benchmarks for complex indoor scenes and large-scale indoor and outdoor scenes show that the proposed DCFNet reaches a mean intersection over union (mIoU) of 71.18% on the S3DIS Area5 indoor dataset and 48.87% on the STPLS3D outdoor dataset, with a mean accuracy (mAcc) of 77.01% and an overall accuracy (OAcc) of 86.91%, achieving high-precision point cloud semantic segmentation of real scenes.
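The dual-channel design maps naturally onto two per-point MLP branches with the same structure but different inputs, one seeing XYZ plus RGB and one seeing XYZ only, followed by a fusion layer standing in for DCFFS. Layer sizes and names in the sketch are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class DualChannelFusion(nn.Module):
    """Two point-feature branches: one sees XYZ+RGB, one sees XYZ only."""
    def __init__(self, dim=64):
        super().__init__()
        # Structurally identical per-point MLPs with different input widths.
        self.upper = nn.Sequential(nn.Linear(6, dim), nn.ReLU(),
                                   nn.Linear(dim, dim))
        self.lower = nn.Sequential(nn.Linear(3, dim), nn.ReLU(),
                                   nn.Linear(dim, dim))
        self.fuse = nn.Linear(2 * dim, dim)  # stands in for DCFFS

    def forward(self, xyz, rgb):
        up = self.upper(torch.cat([xyz, rgb], dim=-1))  # appearance-aware
        low = self.lower(xyz)                           # geometry-only
        return self.fuse(torch.cat([up, low], dim=-1))  # per-point fusion

pts, colors = torch.randn(1, 2048, 3), torch.rand(1, 2048, 3)
print(DualChannelFusion()(pts, colors).shape)  # torch.Size([1, 2048, 64])
```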