Abstract
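RNA-Seq (RNA-sequencing), based on high-throughput sequencing, is a new technique for transcriptome research. To address difficulties that arise when this technique is used for transcript expression analysis, such as multi-mapping of reads and non-uniform read distribution, an improved transcript expression method, LDASeqII (Improvement of latent Dirichlet allocation for sequencing data), is proposed. The model uses splice isoform structure information to constrain its parameters and normalizes exon read counts, which resolves the multi-mapping problem under non-uniform read distribution. "Pseudo-exons" and "pseudo-transcripts" are introduced to handle junction reads and noise reads, respectively. The model is applied to real datasets and compared with the original LDASeq (Latent Dirichlet allocation for sequencing data) model and with the currently popular Cufflinks and RSEM (RNA-Seq by expectation maximization) methods. The results show that the improved method yields more accurate transcript and gene expression estimates.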
RNA-Seq (RNA-sequencing), based on high-throughput sequencing, is a new technique for transcriptome research. Considering the difficulties in analyzing transcript expression from RNA-Seq data, an improved method, improvement of latent Dirichlet allocation for sequencing data (LDASeqII), is proposed to calculate transcript expression. To deal with multi-mapping between reads and isoforms and the non-uniform distribution of reads along the reference, LDASeqII uses the known gene-isoform annotation to constrain the hyperparameters and normalizes the read counts of each individual exon by exon length. By introducing a "pseudo-exon" and a "pseudo-transcript", junction reads and noise reads are handled properly. LDASeqII is validated on two real datasets for gene and transcript expression calculation and compared with latent Dirichlet allocation for sequencing data (LDASeq) and two other popular methods, Cufflinks and RNA-Seq by expectation maximization (RSEM). The results show that LDASeqII obtains more accurate transcript and gene expression measurements than the other approaches.
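The exon-length normalization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula used by LDASeqII, so the function name `normalize_exon_counts`, the `scale` parameter, and the toy exon data below are all hypothetical; the code only shows the general idea of dividing each exon's read count by its length so that long exons do not dominate.

```python
def normalize_exon_counts(read_counts, exon_lengths, scale=1000.0):
    """Return reads per `scale` bases for each exon.

    Hypothetical helper: an illustration of length normalization,
    not the formula from the LDASeqII paper.
    """
    if read_counts.keys() != exon_lengths.keys():
        raise ValueError("read_counts and exon_lengths must cover the same exons")
    normalized = {}
    for exon, count in read_counts.items():
        length = exon_lengths[exon]
        if length <= 0:
            raise ValueError(f"exon {exon!r} has non-positive length")
        # Read density: reads per `scale` bases of exon sequence.
        normalized[exon] = scale * count / length
    return normalized


# Toy data (made up for illustration only).
counts = {"exon1": 200, "exon2": 50, "exon3": 300}
lengths = {"exon1": 1000, "exon2": 250, "exon3": 3000}
densities = normalize_exon_counts(counts, lengths)
```

In this toy input, `exon3` has the most raw reads but the lowest density after normalization, which is the kind of length bias the normalization step is meant to remove.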
Source
《数据采集与处理》
CSCD
PKU Core Journal (北大核心)
2015, No. 5, pp. 1028-1035 (8 pages)
Journal of Data Acquisition and Processing
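Funding
Supported by the National Natural Science Foundation of China (61170152)
Supported by the Fundamental Research Funds for the Central Universities (CXZZ11_0217)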