Abstract
Genotype imputation (GI) is a technique that uses existing genotype information to infer untyped or incomplete genotypes. This study explores efficient imputation methods for handling incomplete data in soybean genome sequencing, with the goal of improving data processing speed and efficiency. Real soybean reference panel genotype data were used, and entries were masked completely at random at rates of 2%, 5%, 10%, and 25%. A GPU-accelerated random forest machine learning algorithm was used to build imputation models and fill in the missing genotypes at each missingness rate. The accuracy and performance obtained on different processors were also compared. The results show that the GPU-accelerated random forest algorithm achieves high imputation accuracy on the soybean genome. Compared with mainstream genotype imputation software, the method is at least four times faster in computation time. The GPU-accelerated genotype imputation strategy can therefore be applied to large-scale genotype data processing, improving the speed and efficiency of soybean genotype data processing while reducing computation time and resource consumption.
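The masking-and-scoring protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the genotype panel is a simulated toy matrix (0/1/2 dosage coding is assumed), the `MISSING` sentinel and all function names are hypothetical, and a simple per-site mode imputer stands in for the GPU-accelerated random forest, which the abstract does not specify in detail.

```python
import random
from collections import Counter

MISSING = -1  # hypothetical sentinel for a masked genotype call


def simulate_panel(n_samples, n_sites, seed=0):
    """Toy genotype matrix coded 0/1/2; a stand-in for a real reference panel."""
    rng = random.Random(seed)
    major = [rng.choice((0, 2)) for _ in range(n_sites)]
    return [
        [m if rng.random() < 0.8 else rng.choice((0, 1, 2)) for m in major]
        for _ in range(n_samples)
    ]


def mask_mcar(panel, rate, seed=0):
    """Hide round(rate * cells) entries chosen uniformly at random (MCAR)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(panel)) for j in range(len(panel[0]))]
    hidden = rng.sample(cells, round(rate * len(cells)))
    masked = [row[:] for row in panel]
    for i, j in hidden:
        masked[i][j] = MISSING
    return masked, hidden


def impute_site_mode(masked):
    """Stand-in imputer: fill each gap with the most common observed call at that site."""
    filled = [row[:] for row in masked]
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] != MISSING]
        mode = Counter(observed).most_common(1)[0][0] if observed else 0
        for row in filled:
            if row[j] == MISSING:
                row[j] = mode
    return filled


def concordance(truth, filled, hidden):
    """Fraction of masked cells whose imputed call equals the true call."""
    hits = sum(truth[i][j] == filled[i][j] for i, j in hidden)
    return hits / len(hidden)


def evaluate(panel, rates=(0.02, 0.05, 0.10, 0.25), seed=0):
    """Score the imputer at each missingness rate named in the abstract."""
    results = {}
    for rate in rates:
        masked, hidden = mask_mcar(panel, rate, seed)
        results[rate] = concordance(panel, impute_site_mode(masked), hidden)
    return results


if __name__ == "__main__":
    panel = simulate_panel(n_samples=50, n_sites=40, seed=1)
    for rate, acc in evaluate(panel).items():
        print(f"missing rate {rate:.0%}: concordance {acc:.3f}")
```

To approximate the study's setup, `impute_site_mode` would be replaced by a per-site classifier trained on the calls at other sites, with wall-clock time recorded alongside concordance for each processor.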
Authors
李明亮
李卓
黄斌
于军
辛鹏
张继成
唐友
LI Mingliang; LI Zhuo; HUANG Bin; YU Jun; XIN Peng; ZHANG Jicheng; TANG You (School of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 132022, China; Electrical and Information Engineering College, Jilin Agricultural Science and Technology University, Jilin 132101, China; College of Electronic and Information, Northeast Agricultural University, Harbin 150030, China)
Source
《大豆科学》
CAS
CSCD
Peking University Core Journals
2023, Issue 6, pp. 742-748 (7 pages)
Soybean Science
Funding
Jilin Province Science and Technology Development Plan Project (YDZJ202201ZYTS692).
Keywords
soybean genotype imputation
random forest algorithm
GPU acceleration
data processing