Abstract
As the number of users and the complexity of the business grow, the external data service capability of the data warehouse urgently needs to improve. The data distribution system, which acts as the unified interface for distributing and managing data, inevitably faces communication congestion caused by concurrent multi-user data access. This paper uses the open-source Kettle tool to build a data distribution application and applies parallel computing to improve the efficiency of the serial algorithm. As part of the parallelization, the traditional data distribution and collection parallel I/O scheme is described in detail, and a time estimation equation is constructed for it. After analyzing and summarizing the bottlenecks of that scheme, the paper draws on the ideas of the Google File System to propose an improved, metadata-based parallel I/O scheme. Experiments show that, regardless of the number of parallel processes (computing units), the metadata-based parallel I/O scheme outperforms the data distribution and collection scheme, with shorter data import and export times.
Authors
XIAO Zhao-di (肖招娣); HUANGFU Han-cong (皇甫汉聪); YU Yong-zhong (余永忠); LV Shun-feng (吕顺锋)
(Foshan Power Supply Bureau, Guangdong Power Grid Co., Ltd., Foshan 528000, China; Guangdong Zhuo Wei Network Co., Ltd., Foshan 528000, China)
Source
Techniques of Automation and Applications (《自动化技术与应用》), 2018, No. 10, pp. 38-42 (5 pages)