Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61672195 and 61872372, the Open Foundation of the State Key Laboratory of Cryptology under Grant No. MMKFKT201617, and the National University of Defense Technology under Grant No. ZK19-38.
Abstract: Nowadays, the Internet of Things (IoT) is widely deployed and brings great opportunities to change people's daily lives. To realize more effective human-computer interaction in IoT applications, the Question Answering (QA) systems embedded in IoT services must improve their ability to understand natural language. Distributed representations of words, which carry rich semantic and syntactic information, therefore play an increasingly important role in QA systems. However, learning high-quality distributed word vectors requires substantial storage and computing resources, so the training cannot run on resource-constrained IoT devices. Outsourcing the data and computation to cloud servers is a natural choice, yet directly uploading private data to an untrusted cloud poses privacy risks. Realizing word vector learning on untrusted cloud servers without privacy leakage is therefore an urgent and challenging task. In this paper, we present a novel, efficient word vector learning scheme over encrypted data. We first design a series of arithmetic computation protocols, and then use two non-colluding cloud servers to learn high-quality word vectors over encrypted data. The proposed scheme allows word vectors to be trained on remote cloud servers while preserving privacy. Security analysis and experiments on real data sets demonstrate that our scheme is more secure and efficient than existing privacy-preserving word vector learning schemes.
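As a concrete illustration of the kind of arithmetic computation protocol such a two-server scheme rests on, the following Python sketch simulates additive secret sharing with Beaver-triple multiplication. The field size, helper names, and trusted-dealer triple generation are illustrative assumptions, not the paper's actual protocol.

```python
# A minimal sketch of two-party additive secret sharing over a prime field.
import secrets

P = 2**61 - 1  # prime modulus for the arithmetic shares (illustrative choice)

def share(x):
    """Split secret x into two additive shares, one per cloud server."""
    r = secrets.randbelow(P)
    return r, (x - r) % P

def reconstruct(s0, s1):
    """Recombine the two shares to recover the secret."""
    return (s0 + s1) % P

def add_shares(a, b):
    """Secure addition: each server adds its shares locally."""
    return (a[0] + b[0]) % P, (a[1] + b[1]) % P

def beaver_triple():
    """Trusted dealer handing out a multiplication triple (illustrative)."""
    u, v = secrets.randbelow(P), secrets.randbelow(P)
    return share(u), share(v), share((u * v) % P)

def mul_shares(a, b, triple):
    """Secure multiplication with a precomputed Beaver triple (u, v, w=u*v)."""
    (u0, u1), (v0, v1), (w0, w1) = triple
    # The servers jointly open d = x - u and e = y - v; these leak nothing
    # because u and v are uniformly random one-time masks.
    d = (a[0] - u0 + a[1] - u1) % P
    e = (b[0] - v0 + b[1] - v1) % P
    # x*y = w + d*v + e*u + d*e; the public d*e term is added by one server.
    z0 = (w0 + d * v0 + e * u0 + d * e) % P
    z1 = (w1 + d * v1 + e * u1) % P
    return z0, z1

x, y = 42, 17
z = mul_shares(share(x), share(y), beaver_triple())
assert reconstruct(*z) == x * y
```

Addition and multiplication of shared values are enough to evaluate the dot products and gradient updates that dominate word vector training, which is why protocols of this shape are the usual building blocks.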
Funding: This work is supported by the National Natural Science Foundation of China (61872231, 61701297).
Abstract: Neural Machine Translation (NMT) is an important technology for translation applications, but it still leaves plenty of room for improvement. In standard NMT, traditional word vectors cannot distinguish occurrences of the same word under different parts of speech (POS). To alleviate this problem, this paper proposes a new word vector training method based on POS features: adding POS features to the word vector training process efficiently improves translation quality. We conducted extensive experiments to evaluate the method, and the results show that it improves the quality of translation from English into Chinese.
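One common way to inject a POS feature into word vector training is to append the tag to each token, so that homographs such as "book" (noun) and "book" (verb) receive distinct vectors. The sketch below illustrates this idea with gensim on a tiny pre-tagged toy corpus; it mirrors the abstract's intent but is an assumption, not the paper's exact method.

```python
# A hedged sketch: train word2vec on word_POS tokens so each POS-specific
# word sense gets its own embedding row.
from gensim.models import Word2Vec

# Tiny pre-tagged corpus; in practice a POS tagger would supply the tags.
tagged_sentences = [
    [("i", "PRP"), ("want", "VB"), ("to", "TO"), ("book", "VB"),
     ("a", "DT"), ("flight", "NN")],
    [("she", "PRP"), ("read", "VBD"), ("an", "DT"),
     ("interesting", "JJ"), ("book", "NN")],
]

# Rewrite each token as "word_POS" before training.
corpus = [[f"{w}_{pos}" for w, pos in sent] for sent in tagged_sentences]
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, epochs=100)

# The two senses of "book" now occupy separate rows of the embedding matrix.
print(model.wv["book_VB"][:5])  # verb sense
print(model.wv["book_NN"][:5])  # noun sense
```

The NMT encoder can then look up these POS-disambiguated embeddings instead of plain word embeddings, at the cost of a slightly larger vocabulary.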
Funding: The authors would like to thank all anonymous reviewers for their suggestions and feedback. This work is supported by the National Natural Science Foundation of China (Nos. 61379052 and 61379103), the National Key Research and Development Program (2016YFB1000101), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. 14JJ1026), and the Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20124307110015).
Abstract: Document processing in natural language includes retrieval, sentiment analysis, theme extraction, and similar tasks. Classical methods for these tasks are based on probability models, semantic models, and machine learning networks. The probability model essentially loses semantic information, which hurts processing accuracy. Machine learning approaches may be supervised, unsupervised, or semi-supervised; semantic models and supervised learning both require labeled corpora. Reliably labeled corpora are built manually, which is costly and time-consuming because annotators must read and label every document. The continuous CBOW model efficiently learns high-quality distributed vector representations and captures a large number of precise syntactic and semantic word relationships; it can easily be extended to learn paragraph vectors, but the resulting vectors are not precise. To address these problems, this paper develops a new model for learning paragraph vectors that combines the CBOW model and CNNs into a new deep learning model. Experimental results show that paragraph vectors generated by the new model outperform those generated by the CBOW model in semantic relatedness and accuracy.
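To make the CBOW-plus-CNN combination concrete, the PyTorch sketch below keeps the CBOW objective (predict the centre word from its context) but lets a 1-D convolution, rather than a plain average, compose the context embeddings into a paragraph-level vector. All dimensions and layer choices are illustrative assumptions, not the paper's architecture.

```python
# A hedged sketch of a CBOW objective with a CNN context composer.
import torch
import torch.nn as nn

class CnnCbow(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_filters=128, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Convolving across context positions captures local word order,
        # which plain CBOW averaging throws away.
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=kernel,
                              padding=kernel // 2)
        self.out = nn.Linear(num_filters, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_len)
        e = self.embed(context_ids).transpose(1, 2)  # (batch, dim, len)
        h = torch.relu(self.conv(e))                 # (batch, filters, len)
        para_vec = h.max(dim=2).values               # pooled paragraph vector
        return self.out(para_vec), para_vec          # logits over centre word

model = CnnCbow(vocab_size=5000)
ctx = torch.randint(0, 5000, (8, 6))  # batch of 6-word contexts
logits, para = model(ctx)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5000, (8,)))
loss.backward()
```

After training, `para_vec` serves as the learned paragraph representation for downstream tasks such as similarity or classification.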
Funding: Supported by the Sichuan Science and Technology Program (2021YFQ0003).
Abstract: With the development of Internet technology, the explosive growth of online information has made it difficult to filter out effective information. Finding a high-accuracy model for text classification has become a critical problem for text filtering, especially for Chinese texts. This paper studies manually calibrated comment data from the Douban movie website. First, a text filtering model based on a BP neural network is built. Second, a text word-frequency vector and a text semantic vector are obtained from the Term Frequency-Inverse Document Frequency (TF-IDF) vector space model and the doc2vec method, respectively, and the word-frequency vector is linearly reduced in dimension with Principal Component Analysis (PCA). Third, the dimensionality-reduced word-frequency vector, the semantic vector, and the text value degree are combined into a composite text vector. Experiments show that the model combining the reduced word-frequency vector, the semantic vector, and the text value degree reaches the highest accuracy, 84.67%.
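The feature pipeline the abstract describes can be sketched end to end in a few lines with scikit-learn and gensim. The toy corpus, reduced dimensions, and value-degree scores below are illustrative assumptions; only the three-part structure (PCA-reduced TF-IDF, doc2vec vector, value degree) follows the abstract.

```python
# A hedged sketch of the composite text vector construction.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "great film with a moving story",
    "dull plot and poor acting",
    "wonderful acting and a great script",
    "boring story not worth watching",
]
value_degree = np.array([[0.9], [0.2], [0.8], [0.1]])  # assumed scalar scores

# 1. TF-IDF word-frequency vectors, linearly reduced with PCA.
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
tfidf_reduced = PCA(n_components=3).fit_transform(tfidf)

# 2. doc2vec semantic vectors.
tagged = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=8, min_count=1, epochs=100)
semantic = np.array([d2v.dv[i] for i in range(len(docs))])

# 3. Concatenate the three parts into the composite text vector
#    that would be fed to the BP neural network classifier.
features = np.hstack([tfidf_reduced, semantic, value_degree])
print(features.shape)  # (4, 3 + 8 + 1)
```

The resulting matrix is what a BP (feed-forward) neural network would consume as input, one composite vector per document.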
Abstract: One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of text representation techniques that solve the so-called curse of dimensionality, a problem which plagues NLP in general because the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. Much of the research and development in NLP over this period has therefore gone into finding and optimizing solutions to this problem; effectively, into feature selection for NLP. This paper traces the development of these techniques, which leverage a variety of statistical methods resting on linguistic theories advanced in the middle of the last century, notably the distributional hypothesis: words that are found in similar contexts generally have similar meanings. In this survey paper we examine some of the most popular of these techniques from a mathematical as well as a data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants, typically referred to as word embeddings. In reviewing algorithms such as Word2Vec, GloVe, ELMo, and BERT, we explore the idea of semantic spaces more generally, beyond their applicability to NLP.
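The distributional hypothesis the survey builds on can be demonstrated in a few lines: count word co-occurrences in a corpus, factor the matrix with a truncated SVD (the core of Latent Semantic Analysis), and measure similarity as cosine distance in the resulting low-dimensional semantic space. The corpus, window size, and rank below are toy assumptions.

```python
# A hedged illustration of LSA-style semantic spaces on a toy corpus.
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
tokens = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(tokens)}

# Symmetric co-occurrence counts within a +/-2 word window.
M = np.zeros((len(tokens), len(tokens)))
for sent in corpus:
    ws = sent.split()
    for i, w in enumerate(ws):
        for j in range(max(0, i - 2), min(len(ws), i + 3)):
            if j != i:
                M[idx[w], idx[ws[j]]] += 1

# Rank-k SVD gives each word a dense vector in a low-dimensional space.
U, S, _ = np.linalg.svd(M)
vecs = U[:, :4] * S[:4]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "cat" and "dog" share contexts, so they tend to land near each other.
print(cosine(vecs[idx["cat"]], vecs[idx["dog"]]))
print(cosine(vecs[idx["cat"]], vecs[idx["on"]]))
```

Word2Vec, GloVe, and the contextual models the survey covers can all be read as increasingly sophisticated ways of producing such a space without materializing the full co-occurrence matrix.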