Funding: Supported by University of Colorado School of Medicine and Cancer Center startup funds to J.H.W.; Cancer League of Colorado grants R21-CA184707, R21-AI110777, R01-CA166325, R21-AI133110, and R01-CA229174 to J.H.W.; and a fund from the American Cancer Society (ACS IRG #16-184-56) to Z.C. X.W. was supported by an AAI Careers in Immunology Fellowship, an NIH F31 fellowship (F31DE027854), and an NIH T32 fellowship (T32 AI007405).
Abstract: Cancer cells can evade immune recognition by losing major histocompatibility complex (MHC) class I. Hence, MHC class I-negative cancers represent some of the most challenging cancers to treat. Chemotherapeutic drugs not only directly kill tumors but also modulate the tumor immune microenvironment. However, it remains unknown whether chemotherapy-treated cancer cells can activate CD8 T cells independent of tumor-derived MHC class I and whether such MHC class I-independent CD8 T-cell activation can be exploited for cancer immunotherapy. Here, we showed that chemotherapy-treated cancer cells directly activated CD8 T cells in an MHC class I-independent manner and that these activated CD8 T cells exhibited virtual memory (VM) phenotypes. Consistently, in vivo chemotherapeutic treatment preferentially increased tumor-infiltrating VM CD8 T cells. Mechanistically, MHC class I-independent activation of CD8 T cells requires cell-cell contact and activation of the PI3K pathway. VM CD8 T cells contributed to a superior therapeutic effect on MHC class I-deficient tumors. Using humanized mouse models or primary human CD8 T cells, we also demonstrated that chemotherapy-treated human lymphomas activated VM CD8 T cells independent of tumor-derived MHC class I. In conclusion, CD8 T cells can be directly activated in an MHC class I-independent manner by chemotherapy-treated cancers, and these activated CD8 T cells may be exploited to develop new strategies for treating MHC class I-deficient cancers.
Funding: Supported by the National Key Research and Development Program of China (No. 2017YFC0212100) and the National High-tech R&D Program of China (No. 2015AA015308).
Abstract: With the rapid development of big data and artificial intelligence (AI), cloud platform architectures are constantly being developed, optimized, and improved. New applications, such as deep computing and high-performance computing, require enhanced computing power. To meet this requirement, a non-uniform memory access (NUMA) configuration method is proposed for cloud computing systems according to the affinity, adaptability, and availability of the NUMA architecture processor platform. The proposed method is verified in a test environment based on a domestic central processing unit (CPU).
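The abstract does not disclose the configuration method itself, so the following is only a toy sketch of what affinity- and availability-driven NUMA node selection can look like; the class names, fields, and fallback rule are invented for illustration.

```python
from dataclasses import dataclass

# Toy model of NUMA node selection (illustrative only; not the paper's method).
@dataclass
class NumaNode:
    node_id: int
    free_mem_mb: int   # availability: free local memory on this node
    local_cpus: set    # CPUs whose accesses to this node's memory are local

def pick_node(nodes, preferred_cpu, mem_needed_mb):
    """Prefer the node local to preferred_cpu (affinity); otherwise fall back
    to the candidate node with the most free memory (availability)."""
    candidates = [n for n in nodes if n.free_mem_mb >= mem_needed_mb]
    for n in candidates:
        if preferred_cpu in n.local_cpus:
            return n.node_id
    return max(candidates, key=lambda n: n.free_mem_mb).node_id

nodes = [NumaNode(0, 2048, {0, 1}), NumaNode(1, 8192, {2, 3})]
print(pick_node(nodes, preferred_cpu=1, mem_needed_mb=1024))  # → 0 (local node wins)
print(pick_node(nodes, preferred_cpu=1, mem_needed_mb=4096))  # → 1 (availability fallback)
```

In a real system this decision would be delegated to the kernel or to tools such as numactl; the sketch only shows the affinity-before-availability ordering.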
Abstract: This paper explores the potential for the RAMpage memory hierarchy to use a microkernel with a small memory footprint, held in specialized cache-speed static RAM (tightly-coupled memory, TCM). Dreamy memory is DRAM kept in low-power mode unless referenced. Simulations show that a small microkernel suits RAMpage well, in that it achieves significantly better speed and energy gains than a standard hierarchy from adding TCM. RAMpage, in its best 128KB L2 case, gained 11% in speed using TCM and reduced energy by 14%; equivalent conventional-hierarchy gains were under 1%. While a 1MB L2 was significantly faster than the lower-energy cases for the smaller L2, the larger SRAM's energy does not justify the speed gain. Using a 128KB L2 cache in a conventional architecture resulted in a best-case overall run time of 2.58s, compared with the best dreamy-mode run time (RAMpage without context switches on misses) of 3.34s, a speed penalty of 29%. Energy in the fastest 128KB L2 case was 2.18J vs. 1.50J, a reduction of 31%. The same RAMpage configuration without dreamy mode took 2.83s as simulated and used 2.39J, an acceptable trade-off (penalty under 10%) for being able to switch easily to a lower-energy mode.
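The quoted percentages follow directly from the run-time and energy figures in the abstract; a quick arithmetic check (values taken verbatim from the text):

```python
# Recomputing the ratios reported in the RAMpage abstract.
best_conventional = 2.58    # s, best 128KB L2, conventional hierarchy
best_dreamy = 3.34          # s, RAMpage dreamy mode, no context switch on miss
rampage_no_dreamy = 2.83    # s, same RAMpage configuration without dreamy mode

speed_penalty = (best_dreamy / best_conventional - 1) * 100
print(round(speed_penalty))  # → 29  (the stated 29% speed penalty)

energy_fast, energy_dreamy = 2.18, 1.50   # J
energy_reduction = (1 - energy_dreamy / energy_fast) * 100
print(round(energy_reduction))  # → 31  (the stated 31% reduction)

no_dreamy_penalty = (rampage_no_dreamy / best_conventional - 1) * 100
print(round(no_dreamy_penalty, 1))  # → 9.7  (the "penalty under 10%")
```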
Abstract: Based on the characteristics of distributed systems and the behavior of parallel programs, this paper presents fixed and randomized competitive memory coherence algorithms for distributed shared virtual memory. These algorithms exploit parallel programs' locality of reference and exhibit good competitive properties. Our simulations show that the fixed and randomized algorithms achieve better performance and higher stability than other strategies such as write-invalidate and write-update.
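The abstract gives no algorithmic details, so the following is only a sketch of the classic fixed competitive idea such coherence algorithms build on: keep accepting remote updates for a cached copy until their accumulated cost matches one coherence-miss cost, then invalidate; the cost constants and event encoding here are invented for the sketch.

```python
# Toy fixed competitive update-vs-invalidate policy (illustrative only).
MISS_COST = 8      # cost of re-fetching a copy after invalidation
UPDATE_COST = 1    # cost of propagating one remote write to the copy

def fixed_policy(events):
    """events: 'W' = remote write, 'R' = local read. Returns total coherence cost.
    The copy is invalidated once update costs since the last local read reach
    MISS_COST, bounding total cost at ~2x the offline optimum."""
    cost, since_read, valid = 0, 0, True
    for e in events:
        if e == 'W':
            if valid:
                since_read += 1
                cost += UPDATE_COST
                if since_read * UPDATE_COST >= MISS_COST:
                    valid = False          # stop paying for updates
        else:                              # local read
            if not valid:
                cost += MISS_COST          # coherence miss: re-fetch
                valid = True
            since_read = 0
    return cost

print(fixed_policy('WWR'))            # → 2  (two cheap updates, then a hit)
print(fixed_policy('W' * 20 + 'R'))   # → 16 (8 updates, invalidate, 1 miss)
```

A randomized variant would draw the invalidation threshold from a distribution rather than fixing it, which improves the expected competitive ratio against oblivious adversaries.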
Funding: Supported by the National Natural Science Foundation of China (61170026).
Abstract: Malicious software usually bypasses anti-virus detection by hiding among apparently legitimate programs. In this work, we propose Windows Virtual Machine Introspection (WVMI) to accurately detect such hidden processes by analyzing memory data. WVMI first dumps the in-memory data of the target Windows operating system from the hypervisor and retrieves the addresses of the EPROCESS structures in the process linked list, then generates a Data Type Confidence Table (DTCT). Next, it traverses memory and uses the DTCT to identify similarities between the nodes in the process linked list and the corresponding memory segments. Finally, it locates the Windows EPROCESS segments and identifies hidden processes by further comparison. Extensive experiments show that WVMI detects hidden processes with a high identification rate and is independent of the Windows operating system version.
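The core cross-view idea, comparing the walkable process list against process records found by scanning memory, can be shown with a toy model; the structure and field names below are simplified stand-ins, not real EPROCESS layouts, and the scan stands in for WVMI's DTCT-guided search.

```python
# Toy cross-view hidden-process detection (illustrative only).
class Proc:
    def __init__(self, pid, name):
        self.pid, self.name, self.next = pid, name, None

# Build a process linked list plus a flat "memory" pool holding every record.
a, b, c = Proc(4, "System"), Proc(120, "svchost"), Proc(666, "rootkit")
a.next, b.next = b, c
memory_pool = [a, b, c]

# DKOM-style hiding: unlink `c` from the list (its record remains in memory).
b.next = None

def walk(head):
    """Walk the OS-visible process list, as Task Manager would see it."""
    seen = set()
    while head:
        seen.add(head.pid)
        head = head.next
    return seen

visible = walk(a)
# Records found by scanning memory but absent from the list are hidden.
hidden = [p.name for p in memory_pool if p.pid not in visible]
print(hidden)  # → ['rootkit']
```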
Funding: Supported by the National Natural Science Foundation of China (61373169), the National High Technology Research and Development Program of China (863 Program) (2015AA016004), and the Foundation of Science and Technology on Information Assurance Laboratory (KJ-14-110, KJ-14-101).
Abstract: With the popularity and commercialization of cloud computing platforms, the security of virtualization technology must be guaranteed. This paper studies the protection of memory privacy on virtual platforms to enhance system security. Based on monitoring of foreign mappings for Dom0, a memory privacy protection scheme is designed and implemented to prevent process memory pages in DomU from being mapped illegally, which might otherwise leak secret data.
Funding: This work was supported in part by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010901, the National Natural Science Foundation of China under Grant No. 61229201, and the China Postdoctoral Science Foundation under Grant No. 2012M521250.
Abstract: Pipeline parallelism is a popular parallel programming pattern for emerging applications. However, programming pipelines directly on conventional multithreaded shared memory is difficult and error-prone. We present DStream, a C library that provides high-level abstractions of deterministic threads and streams for simply representing pipeline stage workers and their communications. The deterministic stream is established atop our proposed single-producer/multi-consumer (SPMC) virtual memory, which integrates synchronization with the virtual memory model to enforce determinism on shared-memory accesses. We investigate various strategies for efficiently implementing DStream atop the SPMC memory, so that an infinite sequence of data items can be asynchronously published (fixed) and asynchronously consumed in order among adjacent stage workers. We have successfully transformed two representative pipeline applications, ferret and dedup, using DStream, and we summarize the conversion rules. An empirical evaluation shows that the converted ferret performed on par with its Pthreads and TBB counterparts in terms of running time, while the converted dedup is close to 2.56X and 7.05X faster than the Pthreads counterpart and 1.06X and 3.9X faster than the TBB counterpart on 16 and 32 CPUs, respectively.
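The SPMC contract described above, items are published once and every consumer observes the whole sequence in publication order, can be sketched without DStream's virtual-memory machinery; the class below is a hedged Python stand-in using a condition variable and per-consumer cursors, not the paper's C implementation.

```python
import threading

# Minimal single-producer/multi-consumer stream sketch (illustrative only).
class SPMCStream:
    def __init__(self):
        self._items, self._closed = [], False
        self._cv = threading.Condition()

    def publish(self, item):
        with self._cv:
            self._items.append(item)   # items are immutable once published
            self._cv.notify_all()

    def close(self):
        with self._cv:
            self._closed = True
            self._cv.notify_all()

    def read(self):
        """Each consumer gets its own cursor, so all see items in order."""
        pos = 0
        while True:
            with self._cv:
                while pos >= len(self._items) and not self._closed:
                    self._cv.wait()
                if pos >= len(self._items):
                    return
                item = self._items[pos]
            yield item
            pos += 1

stream, results = SPMCStream(), {}

def consumer(name):
    results[name] = list(stream.read())

threads = [threading.Thread(target=consumer, args=(f"c{i}",)) for i in range(2)]
for t in threads: t.start()
for x in range(5): stream.publish(x)
stream.close()
for t in threads: t.join()
print(results)  # both consumers see [0, 1, 2, 3, 4], in order
```

DStream replaces the lock with page-protection tricks so that published pages become read-only shared memory, which is what removes the determinism hazards of plain Pthreads.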
Abstract: Stream processing is a special form of the dataflow execution model that offers extensive opportunities for optimization and automatic parallelism. A streaming application is represented by a graph of computation stages that communicate with each other via FIFO channels. In a shared-memory environment, a FIFO channel is classically a common, fixed-size synchronized buffer shared between the producer and the consumer. As the number of concurrent stage workers increases, synchronization overheads, such as contention and waiting times, rise sharply and severely impair application performance. In this paper, we present a novel multithreaded model that isolates memory between threads by default and provides a higher-level abstraction for scalable unicast or multicast communication between threads -- CHAUS (Channel for Unbounded Streaming). The CHAUS model hides the underlying synchronization details but requires the user to declare the producer-consumer relationships of a channel in advance. It is the duty of the runtime system to ensure reliable data transmission at data-item granularity as declared. To achieve an unbounded buffer for streaming and reduce synchronization overheads, we propose a virtual-memory-based solution to implement a scalable CHAUS channel. We check the programmability of CHAUS by successfully porting dedup and ferret from PARSEC as well as implementing a MapReduce library with a Phoenix-like API. The experimental results show that workloads built with CHAUS run faster than those with Pthreads, and that CHAUS has the best scalability compared with two Pthreads versions. Three workloads' CHAUS versions spend no more than 0.17x the runtime of Pthreads on both 16 and 32 cores.
基金This work is supported by the National .'863' High-Tech Programme under grant! No.863-306-02-04-1the National Natural Scienc
Abstract: POTENTIAL is a virtual database machine based on general computing platforms, especially parallel computing platforms. It provides a complete solution for high-performance database systems through a 'virtual processor + virtual data bus + virtual memory' architecture. Virtual processors manage all CPU resources in the system, on which various operations run. The virtual data bus is responsible for managing data transmission between associated operations and forms the hinge of the entire system. Virtual memory provides efficient data storage and buffering mechanisms that conform to data-reference behaviors in database systems. The architecture of POTENTIAL is very clear and has many good features, including high efficiency, scalability, extensibility, and portability.
Funding: This paper was supported by the National Natural Science Foundation of China (Grant No. 61972407).
Abstract: GPUs are widely used in modern high-performance computing systems. To reduce the burden on GPU programmers, the operating system and GPU hardware provide strong support for shared virtual memory, which enables the GPU and CPU to share the same virtual address space. Unfortunately, the current SIMT execution model of GPUs poses great challenges for virtual-to-physical address translation on the GPU side, mainly due to the huge number of virtual addresses generated simultaneously and the poor locality of these addresses. The resulting excessive TLB accesses increase the TLB miss ratio. As an attractive solution, the Page Walk Cache (PWC) has received wide attention for its ability to reduce the memory accesses caused by TLB misses. However, the current PWC mechanism suffers from heavy redundancy, which significantly limits its efficiency. In this paper, we first investigate the causes of this issue by evaluating the performance of PWC with typical GPU benchmarks. We find that the repeated L4 and L3 indices of virtual addresses increase the redundancy in PWC, and that the low locality of L2 indices causes a low hit ratio in PWC. Based on these observations, we propose a new PWC structure, namely the Compressed Page Walk Cache (CPWC), to resolve the redundancy burden of the current PWC. CPWC can be organized in either direct-mapped or set-associative mode. Experimental results show that CPWC holds 3 times more page table entries than PWC, improves the L2 index hit ratio by 38.3% over PWC, and reduces the memory accesses of page tables by 26.9%. The average number of memory accesses caused by each TLB miss is reduced to 1.13. Overall, the average IPC improves by 25.3%.
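The L4/L3/L2/L1 indices discussed above come from standard x86-64 4-level paging: a 48-bit virtual address splits into four 9-bit indices plus a 12-bit page offset. The sketch below (the example address is arbitrary) shows why nearby addresses repeat their upper-level indices, which is the redundancy CPWC compresses away:

```python
# Split a 48-bit virtual address into 4-level page-table indices.
def split_indices(va):
    offset = va & 0xFFF            # 12-bit page offset
    l1 = (va >> 12) & 0x1FF        # page table index
    l2 = (va >> 21) & 0x1FF        # page directory index
    l3 = (va >> 30) & 0x1FF        # PDPT index
    l4 = (va >> 39) & 0x1FF        # PML4 index
    return l4, l3, l2, l1, offset

base = 0x00007F3A40000000          # arbitrary example address
# Neighboring pages share L4/L3 (and usually L2) indices, so an
# uncompressed per-level cache stores the same upper indices repeatedly.
for delta in (0, 0x1000, 0x200000):
    print(split_indices(base + delta))
# → (254, 233, 0, 0, 0)
# → (254, 233, 0, 1, 0)   only L1 changed
# → (254, 233, 1, 0, 0)   only L2 changed
```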