MicroMagnetic.jl is an open-source Julia package for micromagnetic and atomistic simulations. Using the features of the Julia programming language, MicroMagnetic.jl supports CPU and various GPU platforms, including NVIDIA, AMD, Intel, and Apple GPUs. Moreover, MicroMagnetic.jl supports Monte Carlo simulations for atomistic models and implements the nudged-elastic-band method for energy barrier computations. With built-in support for double and single precision modes and a design allowing easy extensibility to add new features, MicroMagnetic.jl provides a versatile toolset for researchers in micromagnetics and atomistic simulations.
To address the bottleneck in simulating electromagnetic scattering for extremely large-scale sea and ship scenes, a high-frequency method using graphics processing unit (GPU) parallel acceleration is proposed. To implement the physical optics (PO), shooting and bouncing ray (SBR), and physical theory of diffraction (PTD) methods, a CPU-GPU parallel computing scheme is realized to balance computing tasks. Finally, a multi-GPU framework is proposed to overcome the computational difficulty caused by the massive number of ray tubes in the ray tracing process. Using the established simulation platform, signals of ships in different sea conditions are simulated and their images are obtained. The results show that higher sea states degrade the averaged peak signal-to-noise ratio (PSNR) of the radar image.
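The PSNR metric mentioned above can be illustrated with a minimal sketch. It uses the standard definition PSNR = 10 log10(peak² / MSE); the function names, the list-of-rows image layout, and the use of a plain arithmetic mean for the "averaged" PSNR are assumptions made for illustration, since the abstract does not say how the averaging was done.

```python
import math


def psnr(reference, test, peak):
    """PSNR in dB between two equally sized images given as lists of rows."""
    squared_error = 0.0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            squared_error += (r - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def averaged_psnr(reference, tests, peak):
    """Arithmetic mean of PSNR over several test images (assumed averaging rule)."""
    values = [psnr(reference, t, peak) for t in tests]
    return sum(values) / len(values)
```

A lower value means the test image deviates more from the reference, which is the sense in which a rougher sea would degrade the metric.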
A personal desktop platform with teraflops peak performance and thousands of cores can be realized at the price of a conventional workstation using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. The implementation of CUDA kernels, the double-layered thread hierarchy, and the varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated on a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved on a single-GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to substantially accelerate computational fluid dynamics (CFD) simulations.
Fluid-structure interaction (FSI) problems in microchannels play a prominent role in many engineering applications. The present study simulates flow in a microchannel with FSI taken into account. The bottom boundary of the microchannel is modeled with size-dependent beam elements for the finite element method (FEM) based on a modified couple stress theory. The lattice Boltzmann method (LBM) using the D2Q13 LB model is coupled to the FEM to solve the fluid part of the FSI problem. Because the LBM generally needs only nearest-neighbor information, the algorithm is an ideal candidate for parallel computing. The simulations are carried out on graphics processing units (GPUs) using the compute unified device architecture (CUDA). The governing equations are non-dimensionalized, and the resulting set of dimensionless groups is presented to show their effects on micro-beam displacement. The numerical results show that the displacements of the micro-beam predicted by the size-dependent beam element are smaller than those predicted by the classical beam element.
Evolutionary algorithms (EAs) have been used in high utility itemset mining (HUIM) to address the problem of discovering high utility itemsets (HUIs) in an exponential search space. EAs have good running and mining performance, but they still require huge computational resources and may miss many HUIs. Because EAs combine well with the graphics processing unit (GPU), we propose a parallel genetic algorithm (GA) for HUIM based on the GPU platform (PHUI-GA). The improved evolution steps are performed on the central processing unit (CPU), and the CPU-intensive steps are sent to the GPU to be evaluated by its multi-threaded processors. Experiments show that the mining performance of PHUI-GA exceeds that of the existing EAs. When mining 90% of the HUIs, PHUI-GA is up to 188 times better than the existing EAs and up to 36 times better than the CPU parallel approach.
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine-velocity (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computational efficiency of the LBM in numerical simulations of turbulent flows, the massively parallel computing power of a graphics processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with results from other studies, with a 76-fold increase in computational efficiency. It appears that the higher the Reynolds number is, the smaller the Smagorinsky constant should be if the lattice number is fixed. Also, for a selected high Reynolds number and a properly selected Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
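The relationship between Reynolds number, lattice number, and Smagorinsky eddy viscosity can be made concrete with a small sketch. It relies only on textbook relations: the Smagorinsky model ν_t = (C_s Δ)² |S| with |S| = sqrt(2 S_ij S_ij), the lattice viscosity ν = U N / Re, and the D2Q9 relation τ = 3ν + 1/2 in lattice units. All function and parameter names are hypothetical, and this is not the MRT-LBE-LES implementation described in the abstract.

```python
import math


def strain_rate_magnitude(dudx, dudy, dvdx, dvdy):
    """|S| = sqrt(2 S_ij S_ij) for a 2-D velocity gradient."""
    s_xx = dudx
    s_yy = dvdy
    s_xy = 0.5 * (dudy + dvdx)
    return math.sqrt(2.0 * (s_xx ** 2 + s_yy ** 2 + 2.0 * s_xy ** 2))


def smagorinsky_viscosity(smagorinsky_constant, filter_width, strain_mag):
    """Eddy viscosity nu_t = (C_s * Delta)^2 * |S|."""
    return (smagorinsky_constant * filter_width) ** 2 * strain_mag


def lattice_viscosity(lid_velocity, lattice_number, reynolds):
    """Molecular viscosity in lattice units from Re = U * N / nu."""
    return lid_velocity * lattice_number / reynolds


def relaxation_time(viscosity):
    """tau = 3 * nu + 1/2 for D2Q9 with dx = dt = 1 (c_s^2 = 1/3)."""
    return 3.0 * viscosity + 0.5


# Example: at high Re the molecular part of tau is barely above 1/2,
# so the eddy-viscosity term dominates the total relaxation time.
nu0 = lattice_viscosity(0.1, 256, 1.0e7)
nu_t = smagorinsky_viscosity(0.1, 1.0, strain_rate_magnitude(0.0, 0.01, 0.0, 0.0))
tau_total = relaxation_time(nu0 + nu_t)
```

For a fixed lid velocity and Reynolds number, increasing the lattice number raises the molecular viscosity in lattice units, which changes the balance between the two contributions to the relaxation time.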
In this study, insights into the effect of interfacial anisotropy on complex hexagonal close-packed (hcp) dendritic growth during alloy solidification were gained by graphics processing unit (GPU)-accelerated three-dimensional (3D) phase-field simulations, as demonstrated for a Mg-Gd alloy. An anisotropic phase-field model with finite interface dissipation was developed by incorporating the contribution of the anisotropy of interfacial energy into the total free energy functional. The modified spherical harmonic anisotropy function was then chosen for the hcp crystal. The GPU parallel computing algorithm was implemented in the present phase-field model, and a corresponding code was developed on the compute unified device architecture parallel computing platform. Benchmark tests indicated that the calculation efficiency of a single TESLA V100 GPU could be ~80 times that of open multi-processing (OpenMP) with eight central processing unit cores. By coupling the phase-field model with reliable thermodynamic and interfacial energy descriptions, the 3D phase-field simulation of α-Mg dendritic growth in the Mg-6Gd (in wt%) alloy during solidification was performed. Various two-dimensional dendrite morphologies were revealed by cutting the simulated 3D dendrite along different crystallographic planes. Typical sixfold equiaxed and butterfly-like microstructures observed in experiments were well reproduced.
High-resolution cameras and multi-camera systems are being used in video surveillance areas such as security of public places, traffic monitoring, and military and satellite imaging. This creates a demand for computational algorithms that process high-resolution video in real time. Motion detection and background separation play a vital role in capturing the object of interest in surveillance videos, but as cameras move to higher resolutions, the time complexity of the algorithm increases and it can no longer be part of a real-time system. Parallel architectures provide a superior platform for working efficiently with complex algorithmic solutions. In this work, a method is proposed for accurately identifying moving objects in videos using adaptive background modeling, motion detection, and object estimation. The pre-processing part includes an adaptive block-based background model and a dynamically adaptive thresholding technique to estimate the moving objects. The post-processing part includes an efficient parallel connected component labelling algorithm to accurately estimate the objects of interest. New parallel processing strategies are developed at each stage of the algorithm to reduce the time complexity of the system. On the GPU, the algorithm achieved an average speedup over CPU processing of 12.26 times for lower-resolution video frames (320×240, 720×480, 1024×768) and 7.30 times for higher-resolution video frames (1360×768, 1920×1080, 2560×1440). The algorithm was also tested with different numbers of threads per thread block, and the minimum execution time was achieved with a 16×16 thread block. In addition, the algorithm was tested on a night sequence in which the amount of light in the scene is very low, and it still delivered a significant speedup and accurate object detection.
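The background-update and thresholding steps described above can be sketched generically. The abstract does not give the actual block background model or threshold rule, so this sketch substitutes two common techniques and names them plainly: an exponential running-average background and a per-frame threshold of mean plus k standard deviations of the absolute difference. All names and the parameters `alpha` and `k` are hypothetical.

```python
def update_background(background, frame, alpha):
    """Running-average background: B <- (1 - alpha) * B + alpha * F."""
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def adaptive_threshold(diff, k):
    """Threshold = mean + k * std of the absolute difference image."""
    values = [d for row in diff for d in row]
    mean = sum(values) / len(values)
    variance = sum((d - mean) ** 2 for d in values) / len(values)
    return mean + k * variance ** 0.5


def detect_motion(background, frame, k=2.0):
    """Binary foreground mask: 1 where |frame - background| exceeds the threshold."""
    diff = [
        [abs(f - b) for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]
    threshold = adaptive_threshold(diff, k)
    return [[1 if d > threshold else 0 for d in row] for row in diff]
```

Every pixel is processed independently in both the update and the masking step, which is why this stage maps naturally onto one GPU thread per pixel; the connected component labelling that follows needs communication between neighbouring pixels and is the harder part to parallelize.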
Large deformation contact problems generally involve highly nonlinear behaviors, which are very time-consuming to solve and may lead to convergence issues. The finite particle method (FPM) effectively separates pure deformation from total motion in large deformation problems. In addition, the decoupled procedures of the FPM make it suitable for parallel computing, which may provide a way to reduce the computational cost. In this study, a graphics processing unit (GPU)-based parallel algorithm is proposed for two-dimensional large deformation contact problems. The fundamentals of the FPM for planar solids are first briefly introduced, including the equations of motion of particles and the internal forces of quadrilateral elements. Subsequently, a linked-list data structure suitable for parallel processing is built, and parallel global and local search algorithms are presented for contact detection. The contact forces are then derived and directly exerted on particles. The proposed method is implemented with the main solution procedures executed in parallel on a GPU. Two verification problems comprising large deformation frictional contacts are presented, and the accuracy of the proposed algorithm is validated. Furthermore, the algorithm's performance is investigated via a large-scale contact problem, and the maximum speedups of total computational time and contact calculation reach 28.5 and 77.4, respectively, relative to the commercial finite element software Abaqus/Explicit running on a single-core central processing unit (CPU). The contact calculation accounts for only 18% of the total calculation time with the FPM, much less than the 50% with Abaqus/Explicit, demonstrating the efficiency of the proposed method.
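A linked-list structure for contact search can be sketched as a cell linked list: a coarse "global" pass bins points into grid cells, and a "local" pass tests exact distances only against points in neighbouring cells. This is a simplified point-to-point version written serially for clarity; the paper's particle-to-element contact search and its GPU layout are not reproduced, and all names here are hypothetical.

```python
def build_cell_list(points, cell_size):
    """Bin 2-D points into grid cells using head/next linked lists."""
    head = {}                      # cell -> index of first point in that cell
    next_index = [-1] * len(points)  # point -> next point in the same cell
    for i, (x, y) in enumerate(points):
        cell = (int(x // cell_size), int(y // cell_size))
        next_index[i] = head.get(cell, -1)
        head[cell] = i
    return head, next_index


def find_close_pairs(points, cutoff):
    """Return sorted index pairs (i, j), i < j, closer than or equal to cutoff."""
    head, next_index = build_cell_list(points, cutoff)
    pairs = []
    for i, (x, y) in enumerate(points):
        cx, cy = int(x // cutoff), int(y // cutoff)
        # Global search: only the 3 x 3 block of cells around point i.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                j = head.get((cx + dx, cy + dy), -1)
                while j != -1:
                    # Local search: exact distance test, each pair counted once.
                    if j > i:
                        px, py = points[j]
                        if (px - x) ** 2 + (py - y) ** 2 <= cutoff ** 2:
                            pairs.append((i, j))
                    j = next_index[j]
    return sorted(pairs)
```

With the cell size equal to the cutoff, any pair within the cutoff must lie in the same or an adjacent cell, so the search cost grows roughly linearly with the number of points instead of quadratically. The outer loop over `i` has no dependencies between iterations, which is what makes this pattern attractive for one-thread-per-point execution.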
Energy efficiency has become one of the top design criteria for current computing systems. Dynamic Voltage and Frequency Scaling (DVFS) has been widely adopted by laptop computers, servers, and mobile devices to conserve energy, while GPU DVFS is still at an early stage. This paper explores the impact of GPU DVFS on application performance and power consumption and, furthermore, on energy conservation. We survey the state-of-the-art GPU DVFS characterizations and then summarize recent research on GPU power and performance models. We also conduct real GPU DVFS experiments on NVIDIA Fermi and Maxwell GPUs. According to our experimental results, GPU DVFS has significant potential for energy saving. The effect of scaling core voltage/frequency and memory voltage/frequency depends not only on the GPU architectures but also on the characteristics of the GPU applications.
Empirical potential structure refinement (EPSR) is a neutron scattering data analysis algorithm and a software package. It was developed by the disordered materials group at the British spallation neutron source (ISIS) in the 1980s and aims to construct the most probable atomic structures of disordered materials in the field of chemical physics. It has been used extensively over the past decades and has generated reliable results. However, it implements a shared-memory architecture with open multi-processing (OpenMP). With the extensive construction of supercomputer clusters and the widespread use of graphics processing unit (GPU) acceleration technology, it is now possible to rebuild EPSR with these techniques to improve its calculation speed. In this study, an open-source framework, NeuDATool, is proposed. It is programmed in the object-oriented language C++, can be parallelized across nodes within a computer cluster, and supports GPU acceleration. The performance of NeuDATool has been tested with water and amorphous silica neutron scattering data. The tests show that the software can reconstruct the correct microstructure of the samples, and that GPU acceleration can increase the calculation speed by more than 400 times compared with the serial CPU algorithm for a simulation box of about 100 thousand atoms. NeuDATool provides another choice for implementing simulations in the (neutron) diffraction community, especially for experts who are familiar with C++ programming and want to define specific algorithms for their analysis.
The most popular hardware for parallel depth migration is the PC cluster, but its application is limited by its large space requirements and high power consumption. In this paper, we introduce a new hardware architecture in which finite difference (FD) wavefield-continuation depth migration is performed using the graphics processing unit (GPU) as a CPU coprocessor. We describe the program module and three key optimization steps for implementing FD depth migration (memory, thread structure, and instruction optimizations) and consider methods for evaluating the degree of optimization. 2D and 3D models are used to test depth migration on the GPU. The test results show that the general-purpose GPU greatly increases the computational efficiency of depth migration, by at least 25 times compared with a 2.5 GHz AMD CPU.
This paper proposes a new graphics processing unit (GPU)-accelerated storage format to speed up sparse matrix-vector products (SMVPs) for finite element method (FEM) analysis of electromagnetic problems. The new format, called the Modified Compile Time Optimization (MCTO) format, is designed to reduce execution time and speed up the iterative solution of FEM equations, especially when rows have uneven lengths. The MCTO-applied FEM is about 10 times faster than conventional FEM on a CPU and faster than other row-major ordering formats on a GPU. Numerical results show that the proposed GPU-accelerated storage format is an excellent accelerator.
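To show why uneven row lengths matter for a sparse matrix-vector product, here is a minimal sketch of the standard compressed sparse row (CSR) format, one of the common row-major formats. It is a baseline for comparison only, not the MCTO format, whose layout the abstract does not describe; the function names are hypothetical.

```python
def dense_to_csr(matrix):
    """Convert a dense list-of-rows matrix to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row in values/col_idx
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x, one output entry per row."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        # The inner loop length equals the number of nonzeros in row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

If each row is assigned to one GPU thread, the inner loop length differs from row to row, so threads handling short rows sit idle while a long row finishes. That load imbalance is the general problem that storage formats targeting uneven rows try to reduce.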
Particle breakage greatly influences the mechanical behavior of granular materials (GMs), such as ballast, rockfill, and sand, under external loads. The discrete element method (DEM) is one of the most popular methods for simulating GMs because each particle is represented individually. To study the mechanism of particle breakage, a cohesive contact model is developed based on the GPU-accelerated DEM code Blaze-DEM. A database of 3D geometry models of rock blocks is established using 3D scanning. An agglomerate that describes a rock block with a series of non-overlapping spherical particles is used to build the DEM numerical model of a railway ballast sample, which is then subjected to a DEM oedometric test to study the breakage characteristics of the sample's particles under external load. Furthermore, to obtain the meso-mechanical parameters used in the DEM, a back-analysis method is applied based on laboratory tests of the rock sample. Based on the DEM numerical tests, the particle breakage process and mechanisms of the railway ballast are studied. The results show that the developed code is well suited to large-scale simulation of particle breakage in granular materials.
We demonstrate real-time three-dimensional (3D) color video using a color electroholographic system with a cluster of multiple graphics processing units (multi-GPU) and three spatial light modulators (SLMs), corresponding respectively to red, green, and blue (RGB) reconstructing lights. The multi-GPU cluster has a computer-generated hologram (CGH) display node containing a GPU, for displaying calculated CGHs on the SLMs, and four CGH calculation nodes using 12 GPUs. The GPUs in the CGH calculation nodes generate CGHs corresponding to the RGB reconstructing lights in a 3D color video using pipeline processing. Real-time color electroholography was realized for a 3D color object comprising approximately 21,000 points per color.
Computationally, the calculation of computer-generated holograms is extremely expensive, and the image quality deteriorates when reconstructing three-dimensional (3D) holographic video from a point-cloud model comprising a huge number of object points. To solve these problems, we implement a spatiotemporal division multiplexing method on a cluster system with 13 GPUs connected by a gigabit Ethernet network. A performance evaluation indicates that the proposed method can realize a real-time holographic video of a 3D object comprising ~1,200,000 object points. These results demonstrate a clear 3D holographic video at 32.7 frames per second reconstructed from a 3D object comprising 1,064,462 object points.
Funding (MicroMagnetic.jl abstract): Supported by the National Key R&D Program of China (Grant No. 2022YFA1403603), the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDB33030100), the National Natural Science Fund for Distinguished Young Scholar (Grant No. 52325105), the National Natural Science Foundation of China (Grant Nos. 12374098, 11974021, and 12241406), and the CAS Project for Young Scientists in Basic Research (Grant No. YSBR-084).
Funding (electromagnetic scattering abstract): Supported by the Opening Foundation of the Agile and Intelligence Computing Key Laboratory of Sichuan Province under Grant No. H23004 and the Chengdu Municipal Science and Technology Bureau Technological Innovation R&D Project (Key Project) under Grant No. 2024-YF08-00106-GX.
Funding (Euler/Navier-Stokes solver abstract): Supported by the National Natural Science Foundation of China (No. 11172134) and the Funding of Jiangsu Innovation Program for Graduate Education (No. CXLX13_132).
Funding (PHUI-GA abstract): Supported by the National Natural Science Foundation of China (62073155, 62002137, 62106088, 62206113), the High-End Foreign Expert Recruitment Plan (G2023144007L), and the Fundamental Research Funds for the Central Universities (JUSRP221028).
Funding (MRT-LBE-LES abstract): Supported by the College of William and Mary, Virginia Institute of Marine Science, which provided the study environment.
Funding (phase-field simulation abstract): Supported by the Natural Science Foundation of Hunan Province for Distinguished Young Scholars (No. 2021JJ10062), the National Key Research and Development Program of China (No. 2016YFB0301101), the Science and Technology Program of Guangxi province, China (No. AB21220028), the Fundamental Research Funds for the Central Universities of Central South University (No. 2019zzts050), and the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX20190106).
Funding (finite particle method abstract): Supported by the National Key Research and Development Program of China (Grant No. 2016YFC0800200), the National Natural Science Foundation of China (Grant Nos. 51778568, 51908492, and 52008366), and the Zhejiang Provincial Natural Science Foundation of China (Grant Nos. LQ21E080019 and LY21E080022). This work was also supported by the Key Laboratory of Space Structures of Zhejiang Province (Zhejiang University) and the Center for Balance Architecture of Zhejiang University.
文摘Large deformation contact problems generally involve highly nonlinear behaviors,which are very time-consuming and may lead to convergence issues.The finite particle method(FPM)effectively separates pure deformation from total motion in large deformation problems.In addition,the decoupled procedures of the FPM make it suitable for parallel computing,which may provide an approach to solve time-consuming issues.In this study,a graphics processing unit(GPU)-based parallel algorithm is proposed for two-dimensional large deformation contact problems.The fundamentals of the FPM for planar solids are first briefly introduced,including the equations of motion of particles and the internal forces of quadrilateral elements.Subsequently,a linked-list data structure suitable for parallel processing is built,and parallel global and local search algorithms are presented for contact detection.The contact forces are then derived and directly exerted on particles.The proposed method is implemented with main solution procedures executed in parallel on a GPU.Two verification problems comprising large deformation frictional contacts are presented,and the accuracy of the proposed algorithm is validated.Furthermore,the algorithm’s performance is investigated via a large-scale contact problem,and the maximum speedups of total computational time and contact calculation reach 28.5 and 77.4,respectively,relative to commercial finite element software Abaqus/Explicit running on a single-core central processing unit(CPU).The contact calculation time percentage of the total calculation time is only 18%with the FPM,much smaller than that(50%)with Abaqus/Explicit,demonstrating the efficiency of the proposed method.
Abstract: Energy efficiency has become one of the top design criteria for current computing systems. Dynamic Voltage and Frequency Scaling (DVFS) has been widely adopted by laptop computers, servers, and mobile devices to conserve energy, while GPU DVFS is still at an early stage. This paper aims at exploring the impact of GPU DVFS on application performance and power consumption and, furthermore, on energy conservation. We survey the state-of-the-art GPU DVFS characterizations, and then summarize recent research works on GPU power and performance models. We also conduct real GPU DVFS experiments on NVIDIA Fermi and Maxwell GPUs. According to our experimental results, GPU DVFS has significant potential for energy saving. The effect of scaling core voltage/frequency and memory voltage/frequency depends not only on the GPU architectures, but also on the characteristics of GPU applications.
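To make the trade-off between voltage, frequency, and energy concrete, the sketch below uses the textbook CMOS dynamic-power relation (power proportional to capacitance times voltage squared times frequency). This is an assumption for illustration, not one of the power models surveyed in the paper; the function names, the `static_power` term, and the purely compute-bound runtime model (runtime equals cycles divided by frequency) are hypothetical simplifications.

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """Textbook dynamic power: effective capacitance * V^2 * f (watts)."""
    return c_eff * voltage ** 2 * freq_hz


def task_energy(cycles, c_eff, voltage, freq_hz, static_power=0.0):
    """Energy (joules) for a compute-bound task of a fixed cycle count.

    Assumes runtime = cycles / freq_hz, which ignores memory-bound phases.
    """
    runtime_s = cycles / freq_hz
    total_power = dynamic_power(c_eff, voltage, freq_hz) + static_power
    return total_power * runtime_s
```

Under these assumptions the dynamic part of the energy is independent of frequency and scales with the square of the voltage, while the static part grows as the frequency drops. Real GPU applications deviate from this, which is consistent with the abstract's point that the effect depends on the architecture and the application.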
Funding: Supported by the National Key Research and Development Program of China (No. 2017YFA-0403703) and the National Natural Science Foundation of China (No. U1830205, No. 21674020).
Abstract: Empirical potential structure refinement (EPSR) is a neutron scattering data analysis algorithm and a software package. It was developed by the disordered materials group at the British spallation neutron source (ISIS) in the 1980s, and aims to construct the most probable atomic structures of disordered materials in the field of chemical physics. It has been extensively used during the past decades, and has generated reliable results. However, it implements a shared-memory architecture with open multi-processing (OpenMP). With the extensive construction of supercomputer clusters and the widespread use of graphics processing unit (GPU) acceleration technology, it is now possible to rebuild EPSR with these techniques in an effort to improve its calculation speed. In this study, an open-source framework, NeuDATool, is proposed. It is programmed in the object-oriented language C++, can be parallelized across nodes within a computer cluster, and supports GPU acceleration. The performance of NeuDATool has been tested with water and amorphous silica neutron scattering data. The tests show that the software can reconstruct the correct microstructure of the samples, and that the calculation speed with GPU acceleration can increase by more than 400 times compared with the serial CPU algorithm for a simulation box of about 100 thousand atoms. NeuDATool provides another choice for implementing simulations in the (neutron) diffraction community, especially for experts who are familiar with C++ programming and want to define specific algorithms for their analysis.
Funding: Supported by the National Natural Science Foundation of China (Nos. 41104083 and 40804024) and the Fundamental Research Funds for the Central Universities (No. 2011YYL022).
Abstract: The most popular hardware used for parallel depth migration is the PC cluster, but its application is limited by its large space occupation and high power consumption. In this paper, we introduce a new hardware architecture on which finite difference (FD) wavefield-continuation depth migration can be conducted using the Graphics Processing Unit (GPU) as a CPU coprocessor. We describe the program module and three key optimization steps for implementing FD depth migration (memory, thread structure, and instruction optimizations) and consider methods for evaluating the amount of optimization. 2D and 3D models are used to test depth migration on the GPU. The test results show that the general-purpose GPU greatly increases the computational efficiency of depth migration, by at least 25 times compared to an AMD 2.5 GHz CPU.
Funding: Supported by the National Science Foundation of China (Nos. 61272097, 71203064, 71103077), the Natural Science Foundation of Shanghai (No. 12ZR1443000), the Funding Research and Innovation Project of Shanghai Municipal Education Commission (No. 12ZZ182), the Fundamental Research Funds for the Central Universities, the Local Colleges and Universities "1025" Connotation Construction Project of Shanghai (No. nhky-2012-10), and the Foundation of Shanghai University of Engineering Science (No. A-0501-13-012).
Abstract: This paper proposes a new Graphics Processing Unit (GPU)-accelerated storage format to speed up Sparse Matrix Vector Products (SMVPs) for Finite Element Method (FEM) analysis of electromagnetic problems. The new format, called the Modified Compile Time Optimization (MCTO) format, is designed to reduce execution time and to speed up the iterative solution of FEM equations, especially when rows have uneven lengths. The MCTO-applied FEM is about 10 times faster than conventional FEM on a CPU, and faster than other row-major ordering formats on a GPU. Numerical results show that the proposed GPU-accelerated storage format is an effective accelerator.
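For context on why uneven row lengths matter, the following is a minimal sketch of a sparse matrix-vector product in the standard compressed sparse row (CSR) layout, one common row-major format. It is not the MCTO format, whose layout the abstract does not describe; the function name `csr_matvec` is hypothetical, and the code is a serial pure-Python illustration.

```python
def csr_matvec(data, indices, indptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    data    : nonzero values, row by row
    indices : column index of each nonzero
    indptr  : indptr[r]..indptr[r+1] delimits the nonzeros of row r
    """
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # The trip count of this loop differs per row when rows are uneven.
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y
```

If each row were assigned to one GPU thread, the inner loop length would vary between threads, which is the load imbalance that storage formats for uneven rows try to reduce.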
Funding: Supported by the Natural Science Foundation of China (Nos. 5187914, 51679123, 51479095).
Abstract: Particle breakage greatly influences the mechanical behavior of granular materials (GM) such as ballast, rockfill, and sand under external loads. The discrete element method (DEM) is one of the most popular methods for simulating GM, as each particle is represented individually. To study the mechanism of particle breakage, a cohesive contact mode is developed based on the GPU-accelerated DEM code Blaze-DEM. A database of 3D geometry models of rock blocks is established using 3D scanning. An agglomerate that describes each rock block with a series of non-overlapping spherical particles is used to build the DEM numerical model of a railway ballast sample, which is then used in a DEM oedometric test to study the particle breakage characteristics of the sample under external load. Furthermore, to obtain the meso-mechanical parameters used in the DEM, a back-analysis method is applied based on laboratory tests of the rock sample. Based on the DEM numerical tests, the particle breakage process and mechanisms of the railway ballast are studied. The results show that the developed code is well suited to large-scale simulation of particle breakage in granular materials.
Funding: Partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI (Nos. 18K11399 and 19H01097) and the Telecommunications Advancement Foundation.
Abstract: We demonstrate real-time three-dimensional (3D) color video using a color electroholographic system with a cluster of multiple graphics processing units (multi-GPU) and three spatial light modulators (SLMs) corresponding respectively to red, green, and blue (RGB)-colored reconstructing lights. The multi-GPU cluster has a computer-generated hologram (CGH) display node containing a GPU, for displaying calculated CGHs on SLMs, and four CGH calculation nodes using 12 GPUs. The GPUs in the CGH calculation nodes generate CGHs corresponding to RGB reconstructing lights in a 3D color video using pipeline processing. Real-time color electroholography was realized for a 3D color object comprising approximately 21,000 points per color.
Funding: Partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI (Nos. 18K11399 and 19H01097) and the Telecommunications Advancement Foundation.
Abstract: Computationally, the calculation of computer-generated holograms is extremely expensive, and the image quality deteriorates when reconstructing three-dimensional (3D) holographic video from a point-cloud model comprising a huge number of object points. To solve these problems, we implement herein a spatiotemporal division multiplexing method on a cluster system with 13 GPUs connected by a gigabit Ethernet network. A performance evaluation indicates that the proposed method can realize a real-time holographic video of a 3D object comprising ~1,200,000 object points. These results demonstrate a clear 3D holographic video at 32.7 frames per second reconstructed from a 3D object comprising 1,064,462 object points.
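The cost noted in the abstract above comes from every hologram pixel summing a contribution from every object point. The sketch below shows this with a commonly used Fresnel-approximation formula for point-cloud holograms; the abstract does not state which formula the authors use, so the formula, the function name `cgh_intensity`, and the argument layout are assumptions for illustration only.

```python
import math


def cgh_intensity(pixels, points, wavelength):
    """Sum point-source contributions at each hologram pixel.

    pixels     : list of (x, y) hologram coordinates in meters
    points     : list of (x, y, z, amplitude) object points, z > 0 in meters
    wavelength : wavelength of the reconstructing light in meters

    Cost is len(pixels) * len(points) cosine evaluations.
    """
    result = []
    for xa, ya in pixels:
        total = 0.0
        for xj, yj, zj, amplitude in points:
            # Fresnel-approximation phase of a point source at depth zj.
            phase = math.pi / (wavelength * zj) * ((xa - xj) ** 2 + (ya - yj) ** 2)
            total += amplitude * math.cos(phase)
        result.append(total)
    return result
```

Because each pixel's sum is independent of the others, the outer loop parallelizes naturally across GPU threads, and the object points can be divided among several GPUs.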