Abstract
In software engineering, code clone detection based on semantic similarity can reduce software maintenance costs and help prevent system vulnerabilities. The Abstract Syntax Tree (AST), a typical abstract representation of code, has been applied successfully to clone detection across many programming languages. However, existing work mainly extracts code semantics from the original AST and does not fully exploit the deeper semantic and structural information it contains. To address this problem, a code clone detection method based on the Dependency Enhanced Hierarchical Abstract Syntax Tree (DEHAST) is proposed. First, the AST is layered, dividing it into different semantic levels. Second, corresponding dependency-enhancement edges are added to the different levels to construct the DEHAST, turning a plain AST into a heterogeneous graph with richer program semantics. Finally, a Graph Matching Network (GMN) model measures the similarity of the heterogeneous graphs to perform code clone detection. Experimental results on two datasets, BigCloneBench and Google Code Jam, show that DEHAST detects 100% of Type-1 and Type-2 code clones, 99% of Type-3 code clones, and 97% of Type-4 code clones. Compared with the tree-based method ASTNN (AST-based Neural Network), the F1 score increases by 4 percentage points in each case, indicating that DEHAST performs semantic code clone detection effectively.
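The graph-construction step described in the abstract can be illustrated with a rough sketch. The abstract does not specify the layering criteria or the exact dependency-enhancement edge types, so everything below is an assumption made for illustration: Python's standard `ast` module stands in for the parser, node depth stands in for a "layer", and the hypothetical `next_sibling` and `next_use` edge types stand in for the added edges. The function name `build_enhanced_graph` is likewise hypothetical, and the GMN similarity step is not shown.

```python
import ast


def build_enhanced_graph(source: str):
    """Return (labels, edges) for a toy heterogeneous graph over a Python AST.

    labels[i] is (node_type_name, depth); edges maps an edge type to a list
    of (src_index, dst_index) pairs. All edge types here are illustrative
    assumptions, not the edge set defined in the paper.
    """
    labels = []
    edges = {"child": [], "next_sibling": [], "next_use": []}
    last_use = {}  # variable name -> index of its most recent Name node

    def visit(node, depth):
        idx = len(labels)
        labels.append((type(node).__name__, depth))
        if isinstance(node, ast.Name):
            # Assumed dependency edge: link consecutive mentions of a variable.
            if node.id in last_use:
                edges["next_use"].append((last_use[node.id], idx))
            last_use[node.id] = idx
        prev = None
        for child in ast.iter_child_nodes(node):
            child_idx = visit(child, depth + 1)
            edges["child"].append((idx, child_idx))  # original AST edge
            if prev is not None:
                # Assumed structural edge between adjacent siblings.
                edges["next_sibling"].append((prev, child_idx))
            prev = child_idx
        return idx

    visit(ast.parse(source), 0)
    return labels, edges


if __name__ == "__main__":
    labels, edges = build_enhanced_graph("x = 1\ny = x + 1\n")
    for kind, pairs in edges.items():
        print(kind, len(pairs))
```

In a pipeline of the kind the abstract outlines, two such typed-edge graphs (one per code fragment) would then be passed to a graph matching model that outputs a similarity score.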
Authors
WAN Zexuan (万泽轩)
XIE Chunli (谢春丽)
LYU Quanrun (吕泉润)
LIANG Yao (梁瑶)
School of Computer Science and Technology, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal (北大核心)
2024, Issue 4, pp. 1259-1268 (10 pages)
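Funding
National Natural Science Foundation of China (62276119)
Postgraduate Research and Practice Innovation Program of Jiangsu Normal University (2022XKT1530)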
Keywords
code clone detection
semantic clone
Abstract Syntax Tree (AST)
deep learning
Graph Matching Network (GMN)