Abstract
As the size of pre-trained artificial intelligence models grows dramatically each year, training such models requires massive computing and memory capabilities. To this end, this work trains an unprecedentedly large-scale pre-trained model with 174 trillion parameters on an entire new-generation Chinese high-performance computer, a parameter count that rivals the number of synapses in the human brain. This paper focuses on several key system challenges encountered in training such an extremely large model: how to choose an efficient parallel strategy, how to store data efficiently, how to select an appropriate data precision, and how to achieve dynamic load balancing. Solutions to these challenges are then summarized.
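The parameter count alone implies an unusual memory footprint. As a rough back-of-the-envelope illustration (only the 174-trillion figure comes from the abstract; the byte widths are standard precision sizes, and the script is our own sketch, not the authors' actual storage scheme):

    # Back-of-the-envelope memory footprint of a 174-trillion-parameter
    # model under common numeric precisions. The parameter count is from
    # the abstract; everything else is a generic illustration.
    PARAMS = 174e12  # 174 trillion parameters

    BYTES_PER_PARAM = {
        "FP32": 4,       # single precision
        "FP16/BF16": 2,  # half precision
        "INT8": 1,       # 8-bit quantization
    }

    for precision, nbytes in BYTES_PER_PARAM.items():
        petabytes = PARAMS * nbytes / 1e15
        print(f"{precision:>10}: {petabytes:.3f} PB for the weights alone")

Even at half precision the weights alone occupy roughly 0.35 PB, which is why the abstract treats efficient data storage and precision selection as first-class system challenges.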
Authors
MA Zixuan;ZHAI Jidong;HAN Wentao;CHEN Wenguang;ZHENG Weimin(Tsinghua University,Beijing 100083,China)
Source
ZTE Technology Journal (《中兴通讯技术》)
2022, No. 2, pp. 51-58 (8 pages)
Keywords
artificial intelligence
supercomputer
mixture of experts
heterogeneous architecture