Abstract
Deep video inpainting uses deep learning to fill in missing regions of a video or to remove specific target objects. It can also be exploited to synthesize tampered videos that are difficult to distinguish from authentic ones with the naked eye, and maliciously inpainted videos that spread on social media can easily provoke negative public opinion. Passive detection of deep video inpainting started relatively late; although it has received some attention, research on it still lacks depth and breadth. This paper therefore proposes a passive forensics technique based on cascaded ConvGRUs and eight-direction local attention that localizes inpainted regions in tampered videos from a spatiotemporal perspective. First, to extract more features of the inpainted region, RGB frames and error level analysis (ELA) frames are fed into the encoder in parallel, and channel-wise feature-level fusion produces multimodal features at different scales. Second, in the decoder, the multi-scale features generated by the encoder are fused at the channel level with cascaded ConvGRUs to capture temporal discontinuities between video frames. Finally, an eight-direction local attention module is introduced after the last-level RGB feature of the encoder; it attends to each pixel's neighborhood along eight directions and captures anomalies between pixels in the inpainted region. In the experiments, four recent deep video inpainting methods (VI, OP, DSTT, and FGVC) are used, and the proposed method is compared with the existing deep video inpainting detection methods HPF and VIDNet. It outperforms HPF and achieves performance comparable to VIDNet with an encoder that has only one-fifth of VIDNet's parameters. The results show that the proposed method achieves accurate pixel-level localization of tampered regions by using multi-scale bimodal features and the eight-direction local attention module to attend to correlations between pixels, and ConvGRUs to capture temporal anomalies.
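The abstract does not specify how the eight-direction local attention module is computed. As a rough illustration only, the sketch below shows what attending to a pixel's neighborhood along eight directions can look like on a single-channel feature map. The names `DIRECTIONS`, `eight_neighbors`, and `local_attention`, the border clamping, and the softmax over negative absolute differences are all assumptions made for this example; they are not the authors' module.

```python
import math

# Offsets (row, col) for the eight directions around a pixel:
# N, NE, E, SE, S, SW, W, NW. The ordering is an arbitrary choice.
DIRECTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
              (1, 0), (1, -1), (0, -1), (-1, -1)]


def eight_neighbors(img, r, c):
    """Return the eight neighbor values of img[r][c].

    Coordinates that fall outside the map are clamped to the nearest
    valid pixel (an assumption; other padding schemes are possible).
    """
    h, w = len(img), len(img[0])
    values = []
    for dr, dc in DIRECTIONS:
        rr = min(max(r + dr, 0), h - 1)
        cc = min(max(c + dc, 0), w - 1)
        values.append(img[rr][cc])
    return values


def local_attention(img, r, c):
    """Toy local attention over the eight neighbors of one pixel.

    Scores are negative absolute differences to the center value, so
    similar neighbors get larger softmax weights. Returns the weights
    and the weighted sum of the neighbor values.
    """
    center = img[r][c]
    neighbors = eight_neighbors(img, r, c)
    scores = [-abs(center - v) for v in neighbors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    attended = sum(w * v for w, v in zip(weights, neighbors))
    return weights, attended
```

In a real network the scores would come from learned projections of multi-channel features rather than raw intensity differences; the sketch only makes the eight-direction neighborhood concrete.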
Authors
熊义毛
丁湘陵
谷庆
杨高波
赵险峰
XIONG Yimao; DING Xiangling; GU Qing; YANG Gaobo; ZHAO Xianfeng (School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan 411201, China; College of Computer and Communication, Hunan University, Changsha 410082, China; State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093, China; Zhengzhou Xinda Institute of Advanced Technology, Zhengzhou 450000, China)
Source
《信息安全学报》
CSCD
2024, Issue 4, pp. 125-138 (14 pages)
Journal of Cyber Security
Funding
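National Natural Science Foundation of China (No. 62272160)
Open Project of the State Key Laboratory of Information Security (No. 2021-ZD-07)
Open Project Fund of the Henan Key Laboratory of Cyberspace Situation Awareness (No. HNTS2022025)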