Graph processing is a vital component of many AI and big data applications.However,due to its poor locality and complex data access patterns,graph processing is also a known performance killer of AI and big data appli...Graph processing is a vital component of many AI and big data applications.However,due to its poor locality and complex data access patterns,graph processing is also a known performance killer of AI and big data applications.In this work,we propose to enhance graph processing applications by leveraging fine-grained memory access patterns with a dual-path architecture on top of existing software-based graph optimizations.We first identify that memory accesses to the offset,edge,and state array have distinct locality and impact on performance.We then introduce the Skyway architecture,which consists of two primary components:1)a dedicated direct data path between the core and memory to transfer state array elements efficiently,and 2)a data-type aware fine-grained memory-side row buffer hardware for both the newly designed direct data path and the regular memory hierarchy data path.The proposed Skyway architecture is able to improve the overall performance by reducing the memory access interference and improving data access efficiency with a minimal overhead.We evaluate Skyway on a set of diverse algorithms using large real-world graphs.On a simulated fourcore system,Skyway improves the performance by 23%on average over the best-performing graph-specialized hardware optimizations.展开更多
With the surge of big data applications and the worsening of the memory-wall problem,the memory system,instead of the computing unit,becomes the commonly recognized major concern of computing.However,this“memorycent...With the surge of big data applications and the worsening of the memory-wall problem,the memory system,instead of the computing unit,becomes the commonly recognized major concern of computing.However,this“memorycentric”common understanding has a humble beginning.More than three decades ago,the memory-bounded speedup model is the first model recognizing memory as the bound of computing and provided a general bound of speedup and a computing-memory trade-off formulation.The memory-bounded model was well received even by then.It was immediately introduced in several advanced computer architecture and parallel computing textbooks in the 1990’s as a must-know for scalable computing.These include Prof.Kai Hwang’s book“Scalable Parallel Computing”in which he introduced the memory-bounded speedup model as the Sun-Ni’s Law,parallel with the Amdahl’s Law and the Gustafson’s Law.Through the years,the impacts of this model have grown far beyond parallel processing and into the fundamental of computing.In this article,we revisit the memory-bounded speedup model and discuss its progress and impacts in depth to make a unique contribution to this special issue,to stimulate new solutions for big data applications,and to promote data-centric thinking and rethinking.展开更多
Modern High-Performance Computing(HPC)systems are adding extra layers to the memory and storage hierarchy,named deep memory and storage hierarchy(DMSH),to increase I/O performance.New hardware technologies,such as NVM...Modern High-Performance Computing(HPC)systems are adding extra layers to the memory and storage hierarchy,named deep memory and storage hierarchy(DMSH),to increase I/O performance.New hardware technologies,such as NVMe and SSD,have been introduced in burst buffer installations to reduce the pressure for external storage and boost the burstiness of modern I/O systems.The DMSH has demonstrated its strength and potential in practice.However,each layer of DMSH is an independent heterogeneous system and data movement among more layers is significantly more complex even without considering heterogeneity.How to efficiently utilize the DMSH is a subject of research facing the HPC community.Further,accessing data with a high-throughput and low-latency is more imperative than ever.Data prefetching is a well-known technique for hiding read latency by requesting data before it is needed to move it from a high-latency medium(e.g.,disk)to a low-latency one(e.g.,main memory).However,existing solutions do not consider the new deep memory and storage hierarchy and also suffer from under-utilization of prefetching resources and unnecessary evictions.Additionally,existing approaches implement a client-pull model where understanding the application's I/O behavior drives prefetching decisions.Moving towards exascale,where machines run multiple applications concurrently by accessing files in a workflow,a more data-centric approach resolves challenges such as cache pollution and redundancy.In this paper,we present the design and implementation of Hermes:a new,heterogeneous-aware,multi-tiered,dynamic,and distributed I/O buffering system.Hermes enables,manages,supervises,and,in some sense,extends I/O buffering to fully integrate into the DMSH.We introduce three novel data placement policies to efficiently utilize all layers and we present three novel techniques to perform memory,metadata,and communication management in hierarchical buffering systems.Additionally,we demonstrate the benefits of a truly hierarchical data prefetcher that adopts a server-push approach to data prefetching.Our evaluation shows that,in addition to automatic data movement through the hierarchy,Hermes can significantly accelerate I/O and outperforms by more than 2x state-of-the-art buffering platforms.Lastly,results show 10%-35%performance gains over existing prefetchers and over 50%when compared to systems with no prefetching.展开更多
Accesses Per Cycle(APC),Concurrent Average Memory Access Time(C-AMAT),and Layered Performance Matching(LPM)are three memory performance models that consider both data locality and memory assess concurrency.The APC mod...Accesses Per Cycle(APC),Concurrent Average Memory Access Time(C-AMAT),and Layered Performance Matching(LPM)are three memory performance models that consider both data locality and memory assess concurrency.The APC model measures the throughput of a memory architecture and therefore reflects the quality of service(QoS)of a memory system.The C-AMAT model provides a recursive expression for the memory access delay and therefore can be used for identifying the potential bottlenecks in a memory hierarchy.The LPM method transforms a global memory system optimization into localized optimizations at each memory layer by matching the data access demands of the applications with the underlying memory system design.These three models have been proposed separately through prior efforts.This paper reexamines the three models under one coherent mathematical framework.More specifically,we present a new memorycentric view of data accesses.We divide the memory cycles at each memory layer into four distinct categories and use them to recursively define the memory access latency and concurrency along the memory hierarchy.This new perspective offers new insights with a clear formulation of the memory performance considering both locality and concurrency.Consequently,the performance model can be easily understood and applied in engineering practices.As such,the memory-centric approach helps establish a unified mathematical foundation for model-driven performance analysis and optimization of contemporary and future memory systems.展开更多
It is our great pleasure to announce the publication of this special section in JCST,Selected I/O Technologies for High-Performance Computing and Data Analytics.With the explosive grow th of colossal data from various...It is our great pleasure to announce the publication of this special section in JCST,Selected I/O Technologies for High-Performance Computing and Data Analytics.With the explosive grow th of colossal data from various academic and industrial sectors,many High-Performance Computing(HPC)and data analytics systems have been developed to meet the needs of data collection,processing and analysis.Accordingly,many research groups around the world have explored unconventional and cut ting-edge ideas for the management of storage and I/O.展开更多
It is our great pleasure to announce the publication of this special section in Journal of Computer Science and Technology(JCST),Memory-Centric System Research for High-Performance Computing(HPC).The growing disparity...It is our great pleasure to announce the publication of this special section in Journal of Computer Science and Technology(JCST),Memory-Centric System Research for High-Performance Computing(HPC).The growing disparity between CPU speed and memory speed,known as the memory-wall problem,has been a long-standing challenge in the computing industry.Several memory technologies and architectures including 3Dstacked memory,non-volatile random-access memory(NVRAM),memristor,hybrid software and hardware caches,etc.have been introduced in recent years to address the infamous memory-wall problem.展开更多
基金supported in part by the U.S.National Science Foundation under Grant Nos.CCF-2008907 and CCF-2029014the Chinese Academy of Sciences Project for Young Scientists in Basic Research under Grant No.YSBR-029the Chinese Academy of Sciences Project for Youth Innovation Promotion Association.
文摘Graph processing is a vital component of many AI and big data applications.However,due to its poor locality and complex data access patterns,graph processing is also a known performance killer of AI and big data applications.In this work,we propose to enhance graph processing applications by leveraging fine-grained memory access patterns with a dual-path architecture on top of existing software-based graph optimizations.We first identify that memory accesses to the offset,edge,and state array have distinct locality and impact on performance.We then introduce the Skyway architecture,which consists of two primary components:1)a dedicated direct data path between the core and memory to transfer state array elements efficiently,and 2)a data-type aware fine-grained memory-side row buffer hardware for both the newly designed direct data path and the regular memory hierarchy data path.The proposed Skyway architecture is able to improve the overall performance by reducing the memory access interference and improving data access efficiency with a minimal overhead.We evaluate Skyway on a set of diverse algorithms using large real-world graphs.On a simulated fourcore system,Skyway improves the performance by 23%on average over the best-performing graph-specialized hardware optimizations.
基金supported in part by the U.S.National Science Foundation under Grant Nos.CCF-2029014 and CCF-2008907.
文摘With the surge of big data applications and the worsening of the memory-wall problem,the memory system,instead of the computing unit,becomes the commonly recognized major concern of computing.However,this“memorycentric”common understanding has a humble beginning.More than three decades ago,the memory-bounded speedup model is the first model recognizing memory as the bound of computing and provided a general bound of speedup and a computing-memory trade-off formulation.The memory-bounded model was well received even by then.It was immediately introduced in several advanced computer architecture and parallel computing textbooks in the 1990’s as a must-know for scalable computing.These include Prof.Kai Hwang’s book“Scalable Parallel Computing”in which he introduced the memory-bounded speedup model as the Sun-Ni’s Law,parallel with the Amdahl’s Law and the Gustafson’s Law.Through the years,the impacts of this model have grown far beyond parallel processing and into the fundamental of computing.In this article,we revisit the memory-bounded speedup model and discuss its progress and impacts in depth to make a unique contribution to this special issue,to stimulate new solutions for big data applications,and to promote data-centric thinking and rethinking.
基金This work is funded by the National Science Foundation of USA under Grants Nos.OCI-1835764 and CSR-1814872。
文摘Modern High-Performance Computing(HPC)systems are adding extra layers to the memory and storage hierarchy,named deep memory and storage hierarchy(DMSH),to increase I/O performance.New hardware technologies,such as NVMe and SSD,have been introduced in burst buffer installations to reduce the pressure for external storage and boost the burstiness of modern I/O systems.The DMSH has demonstrated its strength and potential in practice.However,each layer of DMSH is an independent heterogeneous system and data movement among more layers is significantly more complex even without considering heterogeneity.How to efficiently utilize the DMSH is a subject of research facing the HPC community.Further,accessing data with a high-throughput and low-latency is more imperative than ever.Data prefetching is a well-known technique for hiding read latency by requesting data before it is needed to move it from a high-latency medium(e.g.,disk)to a low-latency one(e.g.,main memory).However,existing solutions do not consider the new deep memory and storage hierarchy and also suffer from under-utilization of prefetching resources and unnecessary evictions.Additionally,existing approaches implement a client-pull model where understanding the application's I/O behavior drives prefetching decisions.Moving towards exascale,where machines run multiple applications concurrently by accessing files in a workflow,a more data-centric approach resolves challenges such as cache pollution and redundancy.In this paper,we present the design and implementation of Hermes:a new,heterogeneous-aware,multi-tiered,dynamic,and distributed I/O buffering system.Hermes enables,manages,supervises,and,in some sense,extends I/O buffering to fully integrate into the DMSH.We introduce three novel data placement policies to efficiently utilize all layers and we present three novel techniques to perform memory,metadata,and communication management in hierarchical buffering systems.Additionally,we demonstrate the benefits of a truly hierarchical data prefetcher that adopts a server-push approach to data prefetching.Our evaluation shows that,in addition to automatic data movement through the hierarchy,Hermes can significantly accelerate I/O and outperforms by more than 2x state-of-the-art buffering platforms.Lastly,results show 10%-35%performance gains over existing prefetchers and over 50%when compared to systems with no prefetching.
基金supported in part by the U.S.National Science Foundation under Grant Nos.CCF-2008000,CNS-1730488,and CCF-2008907the U.S.Department of Homeland Security under Grant No.2017-ST-062-000002.
文摘Accesses Per Cycle(APC),Concurrent Average Memory Access Time(C-AMAT),and Layered Performance Matching(LPM)are three memory performance models that consider both data locality and memory assess concurrency.The APC model measures the throughput of a memory architecture and therefore reflects the quality of service(QoS)of a memory system.The C-AMAT model provides a recursive expression for the memory access delay and therefore can be used for identifying the potential bottlenecks in a memory hierarchy.The LPM method transforms a global memory system optimization into localized optimizations at each memory layer by matching the data access demands of the applications with the underlying memory system design.These three models have been proposed separately through prior efforts.This paper reexamines the three models under one coherent mathematical framework.More specifically,we present a new memorycentric view of data accesses.We divide the memory cycles at each memory layer into four distinct categories and use them to recursively define the memory access latency and concurrency along the memory hierarchy.This new perspective offers new insights with a clear formulation of the memory performance considering both locality and concurrency.Consequently,the performance model can be easily understood and applied in engineering practices.As such,the memory-centric approach helps establish a unified mathematical foundation for model-driven performance analysis and optimization of contemporary and future memory systems.
文摘It is our great pleasure to announce the publication of this special section in JCST,Selected I/O Technologies for High-Performance Computing and Data Analytics.With the explosive grow th of colossal data from various academic and industrial sectors,many High-Performance Computing(HPC)and data analytics systems have been developed to meet the needs of data collection,processing and analysis.Accordingly,many research groups around the world have explored unconventional and cut ting-edge ideas for the management of storage and I/O.
文摘It is our great pleasure to announce the publication of this special section in Journal of Computer Science and Technology(JCST),Memory-Centric System Research for High-Performance Computing(HPC).The growing disparity between CPU speed and memory speed,known as the memory-wall problem,has been a long-standing challenge in the computing industry.Several memory technologies and architectures including 3Dstacked memory,non-volatile random-access memory(NVRAM),memristor,hybrid software and hardware caches,etc.have been introduced in recent years to address the infamous memory-wall problem.