Abstract
Gene information selection involves very large data volumes, so traditional single-threaded classification and query methods cannot meet its real-time and extraction-accuracy requirements. To address this, a two-stage parallel computing model is designed on the Hadoop framework. The first stage selects candidate gene subsets in parallel, and the second stage performs parallel K-nearest-neighbor gene information selection, so that the entire process runs in parallel. To further reduce the algorithm's computational complexity, a data screening index is defined for gene microarray data and used to sample it, which reduces the amount of data to be processed and eliminates redundancy. Experimental results show that the proposed algorithm runs efficiently, inherits the scalability of the Hadoop programming model, and is highly portable.
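The two-stage structure described in the abstract can be illustrated with a minimal sketch in plain Python that imitates the map and reduce steps in a single process. It is not the paper's implementation and does not use Hadoop: the function names, the use of per-gene variance as the screening index, squared Euclidean distance, and majority voting are all assumptions made for illustration, since the abstract does not specify them.

```python
import heapq
from collections import Counter
from statistics import pvariance


def split(items, n_parts):
    """Split a list into n_parts chunks (a stand-in for distributed input splits)."""
    return [items[i::n_parts] for i in range(n_parts)]


# Stage 1: candidate gene subset selection.
def map_score_genes(gene_chunk, samples):
    """Mapper: score each gene in the chunk by its variance across samples.
    Variance is an assumed screening index, not the one defined in the paper."""
    return [(pvariance([s[g] for s in samples]), g) for g in gene_chunk]


def reduce_top_genes(scored_lists, m):
    """Reducer: merge mapper outputs and keep the m highest-scoring genes."""
    merged = [pair for scored in scored_lists for pair in scored]
    return sorted(g for _, g in heapq.nlargest(m, merged))


# Stage 2: K-nearest-neighbor selection over the candidate genes.
def map_local_knn(sample_chunk, query, genes, k):
    """Mapper: find the k nearest labelled samples within one chunk."""
    dists = [
        (sum((x[g] - query[g]) ** 2 for g in genes), label)
        for x, label in sample_chunk
    ]
    return heapq.nsmallest(k, dists)


def reduce_vote(local_lists, k):
    """Reducer: merge local neighbors, keep the global k nearest, majority vote."""
    merged = [pair for local in local_lists for pair in local]
    nearest = heapq.nsmallest(k, merged)
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def classify(train, query, m=2, k=3, n_parts=2):
    """Run both stages in sequence; each list comprehension stands in for a parallel map."""
    samples = [x for x, _ in train]
    gene_ids = list(range(len(samples[0])))
    scored = [map_score_genes(c, samples) for c in split(gene_ids, n_parts)]
    genes = reduce_top_genes(scored, m)
    local = [map_local_knn(c, query, genes, k) for c in split(train, n_parts)]
    return genes, reduce_vote(local, k)
```

Here `classify(train, query)` expects `train` as a list of `(expression_vector, label)` pairs and returns the selected gene indices together with the predicted label. Because each mapper returns only its local top-k neighbors, the reducer merges at most `k * n_parts` candidates, which is what makes the KNN stage amenable to partitioning.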
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journal (北大核心)
2016, Issue 5, pp. 54-59 (6 pages)
Funding
Liaoning Provincial Department of Education Fund Project (L2012113)
Keywords
Hadoop framework
parallel computing
microarray sampling
big data
K nearest neighbor
gene information