Abstract
To study the relationship between the p53 gene and its downstream/target genes, and thereby understand the p53 gene expression regulatory network, a text data mining approach was used. A self-written Perl 5.10 program mined the p53-related literature in the PubMed database together with the human Gene Ontology database, and a p53 gene expression regulatory network map was constructed by linkage clustering. The results show that the frequency distribution of the target genes is correlated to some extent with the frequency distribution of all gene ontology terms in the text, and that the proportion of low-frequency genes recovered by text mining is markedly lower than that of high-frequency genes. This indicates that the distribution of genes in the p53 gene expression regulatory network is strongly related to gene frequency, and that the amount of text data has an important influence on the accuracy of text data mining.
Source
Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) (《重庆邮电大学学报(自然科学版)》)
PKU Core Journal (北大核心)
2012, No. 6, pp. 798-803 (6 pages)
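Funding
Natural Science Foundation of Chongqing (CSTC 2009BB5419)
Doctoral Start-up Fund of Chongqing University of Posts and Telecommunications (A2007-40, A2009-63)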