Abstract
The term "vulnerability" has accompanied the computer software field for several decades. Since the first software vulnerability was publicly disclosed, software security researchers and engineers have been exploring methods for discovering and analyzing vulnerabilities. Static analysis of source code vulnerabilities is a technique that can be applied throughout the software development life cycle to help developers find vulnerabilities early, and it is widely used in industry. However, as software grows larger and more complex, how to represent and model source code remains a difficult problem. In addition, recent research tends to combine static analysis of source code vulnerabilities with machine learning, aiming to improve the accuracy of vulnerability discovery by introducing machine learning models; how to select and build a suitable model is a core issue in this research direction. This paper focuses on static analysis of source code vulnerabilities (hereinafter referred to as static analysis) and, based on a review of related work, divides the research into two directions: traditional static analysis and learning-based static analysis. Traditional static analysis mainly uses software analysis techniques such as data flow analysis and taint analysis to model and analyze source code. Learning-based static analysis represents source code in numerical form and feeds it to a learning model, which is used to mine deep representation features of the source code and the correlations within it. The paper first introduces the basic concepts of software vulnerability analysis and compares the advantages and disadvantages of static and dynamic analysis, and then explains methods for representing source code. It then summarizes the general steps of traditional and learning-based static analysis, systematically reviews typical research results in both directions, summarizes their technical characteristics and workflows, identifies open problems in current static analysis techniques, and discusses future research directions.
Authors
LIU Jiayong (刘嘉勇)
HAN Jiaxuan (韩家璇)
HUANG Cheng (黄诚)
School of Cyber Science and Engineering, Sichuan University, Chengdu 610207, China
Source
Journal of Cyber Security (《信息安全学报》)
CSCD
2022, Issue 4, pp. 100-113 (14 pages)
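Funding
National Natural Science Foundation of China (No. 61902265)
Key Research and Development Program of the Science and Technology Department of Sichuan Province (No. 2020YFG0047, No. 2020YFG0076)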
Keywords
source code vulnerability
static analysis
dataflow analysis
taint analysis
machine learning