Abstract: Background. Gesture is a basic interaction channel that humans use frequently to communicate in daily life. In this paper, we explore gesture-based approaches to target acquisition in virtual and augmented reality. A typical gesture-based target acquisition process runs as follows: when a user intends to acquire a target, she performs a gesture with her hands, head, or other parts of the body; the computer senses and recognizes the gesture and infers the most likely target. Methods. We build a mental model and a behavior model of the user to study two key parts of the interaction process. The mental model describes how the user thinks up a gesture for acquiring a target, and can be seen as the intuitive mapping between gestures and targets. The behavior model describes how the user moves body parts to perform the gestures, and the relationship between the gesture the user intends to perform and the signals the computer senses. Results. We present and discuss three pieces of research that focus on the mental model and behavior model of gesture-based target acquisition in VR and AR. Conclusions. We show that, by leveraging these two models, interaction experience and performance can be improved in VR and AR environments.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62220106003) and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.
Abstract: The cross-view matching of local image features is a fundamental task in visual localization and 3D reconstruction. This study proposes FilterGNN, a transformer-based graph neural network (GNN) that aims to improve the efficiency and accuracy of visual descriptor matching. Exploiting the high sparseness of matches and coarse-to-fine covisible area detection, FilterGNN uses cascaded optimal graph-matching filter modules to dynamically reject outlier matches. Moreover, we adapted linear attention in FilterGNN with post-instance normalization support, which reduces the complexity of complete graph learning from O(N^2) to O(N). Experiments show that, at a large-scale input size, FilterGNN requires only 6% of the time cost and 33.3% of the memory cost of SuperGlue, and achieves competitive performance in various tasks, such as pose estimation, visual localization, and sparse 3D reconstruction.
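The O(N^2) to O(N) reduction mentioned above can be illustrated with a generic kernelized linear attention. This is only a minimal sketch, not FilterGNN's implementation: the abstract does not say which feature map is used, so the `elu(x) + 1` map, the function names, and the omission of post-instance normalization are all assumptions made for illustration.

```python
import math

def feature_map(x):
    # Assumed positive feature map elu(x) + 1 (not confirmed by the abstract).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """O(N) attention: summarize keys and values once, then reuse per query."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d_k                     # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = feature_map(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d_k))
        out.append([sum(fq[a] * kv[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out

def quadratic_attention(queries, keys, values):
    """Reference O(N^2) form with the same kernel, for comparison only."""
    d_v = len(values[0])
    out = []
    for q in queries:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in keys]
        total = sum(weights)
        out.append([sum(w * v[b] for w, v in zip(weights, values)) / total
                    for b in range(d_v)])
    return out
```

Both functions compute the same result; the linear form avoids building the N-by-N weight matrix by accumulating the key-value summary once, so its cost grows linearly with the number of descriptors.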
Funding: Supported by the National Basic Research 973 Program of China under Grant No. 2013CB329304, the National Natural Science Foundation of China under Grant No. 61370023, and the Major Project of the National Social Science Foundation of China under Grant No. 13&ZD189; partially supported by the General Research Fund of the Hong Kong SAR Government under Project No. 415511 and the CUHK Teaching Development Grant.
Abstract: Computer-aided pronunciation training (CAPT) technologies use automatic speech recognition to detect mispronunciations in the speech of second language (L2) learners. To further facilitate learning, we aim to develop a principle-based method for grading the severity of mispronunciations. This paper presents an approach to gradation that is motivated by auditory perception. We have developed a computational method for generating a perceptual distance (PD) between two spoken phonemes, which is used to compute the auditory confusion of the native language (L1). PD is found to correlate well with the mispronunciations detected in a CAPT system for Chinese learners of English, i.e., L1 being Chinese (Mandarin and Cantonese) and L2 being US English. The results show that auditory confusion is indicative of pronunciation confusions in L2 learning. PD can also be used to grade the severity of errors (i.e., mispronunciations that confuse more distant phonemes are more severe) and accordingly to prioritize the order of corrective feedback generated for learners.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 60433030) and the IBM International Cooperation Foundation.
Abstract: With the rapid development of computer, communication, and sensing technology, our living space has been transformed from physical space into a space shared by physical space and cyberspace. In light of this fact, and based on an analysis of the characteristics of physical space and cyberspace, this paper proposes that there are dual relations between physical space and cyberspace. These dual relations are established through two processes: the extraction, analysis, and structuring of information from physical space into cyberspace, and the provision of information services from cyberspace to physical space by inferring the intention, state, and demands of users. HCI (Human Cyberspace Interaction) in dual space means establishing these dual relations, which embodies human-centered HCI, i.e., interaction carried out in the way users are accustomed to and without distracting their attention.
Funding: Supported by the National Natural Science Foundation of China (Project No. 61902210), a Research Grant of Beijing Higher Institution Engineering Research Center, and the Tsinghua–Tencent Joint Laboratory for Internet Innovation Technology.
Abstract: Human–object interaction (HOI) detection, which aims to infer (human, action, object) triplets within an image, is crucial for human-centric image understanding. Recent studies often exploit visual features and the spatial configuration of a human–object pair to learn the action linking the human and the object in the pair. We argue that this paradigm of pairwise feature extraction and action inference can be applied not only at the level of whole human and object instances, but also at the part level, where a body part interacts with an object, and at the semantic level, where the semantic label of an object is considered along with human appearance and human–object spatial configuration to infer the action. We thus propose a multi-level pairwise feature network (PFNet) for detecting human–object interactions. The network consists of three parallel streams that characterize HOI using pairwise features at these three levels; the three streams are finally fused to give the action prediction. Extensive experiments show that PFNet outperforms other state-of-the-art methods on the VCOCO dataset and achieves results comparable to the state of the art on the HICO-DET dataset.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61672017 and 61272232) and the Key-Area Research and Development Program of Guangdong Province (No. 2019B010137005).
Abstract: The accurate segmentation of medical images is crucial to medical care and research; however, many efficient supervised image segmentation methods require sufficient pixel-level labels. This requirement is difficult to meet in practice and even impossible in some cases, e.g., rare Pathoma images. Inspired by traditional unsupervised methods, we propose a novel Chan–Vese model based on the Markov chain for unsupervised medical image segmentation. It combines the local information provided by superpixels with the global difference between the target tissue and the background. Based on the Chan–Vese model, we use weight maps generated by the Markov chain to model the segmentation problem and solve it iteratively with the min-cut algorithm at the superpixel level. Our method exploits abundant boundary and local region information in segmentation and can thus handle images with intensity inhomogeneity and object sparsity. Users can fine-tune the parameters to achieve satisfactory results for each segmentation, whereas the results of deep-learning-based methods are rigid. The performance of our method is assessed on four computerized tomography (CT) datasets. Experimental results show that the proposed method outperforms traditional unsupervised segmentation techniques.
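For orientation, the classical Chan–Vese energy that such models build on can be written as below. This is the standard textbook formulation, given here as background; the abstract does not state how the Markov-chain weight maps enter the energy, so no weighting term is shown.

```latex
E(c_1, c_2, C) = \mu \,\mathrm{Length}(C) + \nu \,\mathrm{Area}(\mathrm{inside}(C))
  + \lambda_1 \int_{\mathrm{inside}(C)} |I(x) - c_1|^2 \, dx
  + \lambda_2 \int_{\mathrm{outside}(C)} |I(x) - c_2|^2 \, dx
```

Here I is the image, C is the segmentation contour, c_1 and c_2 are the mean intensities inside and outside C, and \mu, \nu, \lambda_1, \lambda_2 are non-negative weights.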
Funding: Supported by the National Natural Science Foundation of China (No. 60273005).
Abstract: An algorithm is presented for estimating the direction and strength of a point light source together with the strength of ambient illumination. Existing approaches evaluate these illumination parameters directly in the high-dimensional image space, whereas we estimate them in two steps: first by projecting the image onto an orthogonal linear subspace based on spherical harmonic basis functions, and then by calculating the parameters in the low-dimensional subspace. Test results on the CMU PIE database and Yale Database B show the stability and effectiveness of the method. The resulting illumination information can be used to synthesize more realistic relighting images and to recognize objects under variable illumination.
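The two-step idea (project onto a low-dimensional harmonic subspace, then read off the lighting parameters) can be sketched in a heavily simplified setting. The sketch below is not the paper's algorithm: it assumes known unit surface normals, a Lambertian surface with unit albedo, no attached shadows, and only the constant and first-order terms, whose basis is proportional to [1, nx, ny, nz]. The function names are hypothetical.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def estimate_lighting(normals, intensities):
    """Least-squares fit of I ~ ambient + strength * (n . l).

    Step 1 projects the intensities onto the basis [1, nx, ny, nz];
    step 2 reads the lighting parameters from the four coefficients.
    """
    basis = [[1.0, nx, ny, nz] for nx, ny, nz in normals]
    # Normal equations: (B^T B) c = B^T I
    btb = [[sum(row[i] * row[j] for row in basis) for j in range(4)]
           for i in range(4)]
    bti = [sum(row[i] * val for row, val in zip(basis, intensities))
           for i in range(4)]
    c = solve_linear_system(btb, bti)
    strength = math.sqrt(c[1] ** 2 + c[2] ** 2 + c[3] ** 2)
    direction = [c[1] / strength, c[2] / strength, c[3] / strength]
    return c[0], strength, direction
```

Under these assumptions the constant coefficient gives the ambient strength, and the norm and direction of the linear coefficients give the point light's strength and direction. Handling shadows, albedo, and higher-order harmonics would need more than this sketch provides.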