Abstract
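The enormous business opportunities latent in big data have set off a wave of big data industrialization, and Internet data, with its huge volume and ease of acquisition, has become the primary target of analysis. Thanks to the development of Internet big data, investigative techniques in the security field have moved from traditional after-the-fact investigation and targeted monitoring to preventive analysis, which can to some extent keep harm from occurring. Industrial-scale mining of Internet data faces two basic problems: parsing, cleaning, and integrating multi-source data; and entity resolution of Internet identities. In the context of a concrete security service, this paper presents a general-purpose MapReduce-based statistical method for deduplicating and denoising Internet big data, which can greatly reduce data storage space, and on that basis builds a streamlined model for identifying Internet virtual identities. The model quantifies the reliability and the association stability of Internet user identity relationships, and a visualization is provided using the R language.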
The tremendous opportunities held in big data have produced a wave of big data industrialization. Internet big data has become the primary object of analysis because of its great volume and ease of acquisition. Benefiting from it, investigators in the security field can now analyze in advance, before a crime is committed, and so prevent crime to some degree. Industrial-scale data mining of Internet big data faces two basic difficulties. The first is the pressure on storage devices and the low value density of the data, both attributable to its unstructured and redundant nature. The second is the difficulty of data mining that results from concerns about information security. This paper puts forward a service-oriented statistical method based on MapReduce that can significantly reduce data volume. In addition, it elaborates a streamlined Internet entity resolution model that quantifies the correlation between entities and their attributes, as well as the stability of that correlation. A visualization produced in R is provided as a supplement.
Source
《软件导刊》
2016, No. 2, pp. 170-174 (5 pages)
Software Guide