Funding: Supported by the ABT SHIELD (Anti-Bot and Trolls Shield) project at the Systems Research Institute, Polish Academy of Sciences, in cooperation with EDGE NPD, RPMA.01.02.00-14-B448/18-00, funded by the Regional Development Fund for the development of Mazovia.
Abstract: Distinguishing between web traffic generated by bots and humans is an important task in the evaluation of online marketing campaigns. One of the main challenges is the only partial availability of performance metrics: although some users can be unambiguously classified as bots, the correct label is uncertain in many cases. This calls for classifiers capable of explaining their decisions. This paper demonstrates two such mechanisms based on features carefully engineered from web logs. The first is a hand-crafted rule-based system. The second is a hierarchical model that first performs clustering and then classification using human-centred, interpretable methods. The stability of the proposed methods is analyzed, and a minimal set of features that convey the class-discriminating information is selected. The proposed data processing and analysis methodology is successfully applied to real-world data sets from online publishers.
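A rule-based bot detector of the kind described above can be sketched as follows. The feature names and thresholds are illustrative assumptions, not the paper's actual rules; the point is that every decision comes with explicit reasons, and sessions matched by no rule stay labelled as uncertain.

```python
def classify_session(features):
    """Classify a web-log session as 'bot', 'human', or 'uncertain'.

    Each rule is explicit, so every decision can be explained.
    Feature names and thresholds are illustrative only.
    """
    reasons = []
    if features.get("robots_txt_hit"):                 # typically only crawlers fetch robots.txt
        reasons.append("requested robots.txt")
    if features.get("requests_per_second", 0) > 10:
        reasons.append("request rate above 10/s")
    if features.get("pages_without_assets", 0) > 20:
        reasons.append("many pages fetched without CSS/JS/images")
    if reasons:
        return "bot", reasons
    if features.get("mouse_events", 0) > 0 and features.get("avg_dwell_seconds", 0) > 2:
        return "human", ["interaction events and plausible dwell time"]
    return "uncertain", ["no rule fired"]              # label left open, as in the paper
```

Returning the list of fired rules alongside the label is what makes such a system explainable by construction.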
Abstract: Because the Web log file contains a great deal of valuable information, the results of Web mining can be used to enhance decision making for electronic commerce (EC) operation and management. Owing to the ambiguity and abundance of the Web log file, a least-decision-making model based on rough set theory is presented for Web mining, and an example is given to illustrate the model. The model can simplify the decision table so that its minimal solution can be acquired; according to this minimal solution, the corresponding decision for an individual service can be made in sequence. Web mining based on rough set theory is currently an original and distinctive method.
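The core rough-set step, reducing a decision table to a minimal attribute subset that still determines the decision, can be sketched as below. This is a generic brute-force reduct search under the standard consistency criterion, not the paper's specific model; the example attributes are hypothetical.

```python
from itertools import combinations

def consistent(rows, attrs):
    """A decision table is consistent w.r.t. attrs if equal condition
    values never lead to different decisions."""
    seen = {}
    for *cond, decision in rows:
        key = tuple(cond[a] for a in attrs)
        if seen.setdefault(key, decision) != decision:
            return False
    return True

def reduct(rows, n_attrs):
    """Smallest attribute subset preserving consistency (brute force,
    fine for small tables; real reduct search uses discernibility matrices)."""
    for size in range(1, n_attrs + 1):
        for attrs in combinations(range(n_attrs), size):
            if consistent(rows, attrs):
                return attrs
    return tuple(range(n_attrs))
```

For a table whose first attribute alone already determines the decision, the reduct drops all the others, which is exactly the "simplify the decision table" step in the abstract.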
Abstract: The main thrust of this paper is the application of a novel data mining approach to the log of users' feedback to improve web multimedia information retrieval performance. A user space model is constructed based on data mining and then integrated into the original information space model to improve its accuracy. It can remove clutter and irrelevant text information and helps eliminate the mismatch between the page author's expression and the user's understanding and expectation. The user space model is also utilized to discover the relationship between high-level and low-level features for weight assignment. The authors propose an improved Bayesian algorithm for data mining; experiments show that the proposed algorithm is efficient.
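Learning relevance from logged feedback, as described above, can be illustrated with a plain naive Bayes model over page keywords. This is a generic sketch of the idea, not the authors' "improved Bayesian" algorithm, and the keyword features are hypothetical.

```python
from collections import Counter

def train_feedback_model(sessions):
    """Train a naive Bayes model from logged feedback.

    sessions: list of (keywords, label) pairs, label 1 = user judged
    the result relevant, 0 = not relevant."""
    counts = {0: Counter(), 1: Counter()}
    totals = {0: 0, 1: 0}
    for keywords, label in sessions:
        counts[label].update(keywords)
        totals[label] += 1
    return counts, totals

def relevance(keywords, model):
    """Posterior probability that a page with these keywords is relevant."""
    counts, totals = model
    n = totals[0] + totals[1]
    score = {}
    for c in (0, 1):
        p = (totals[c] + 1) / (n + 2)              # Laplace-smoothed class prior
        for w in keywords:
            p *= (counts[c][w] + 1) / (totals[c] + 2)
        score[c] = p
    return score[1] / (score[0] + score[1])
```

Scores above 0.5 indicate keywords that past users clicked on approvingly, which is the kind of signal used to re-weight the information space model.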
Abstract: With the increasing popularity and complexity of Web applications and the emergence of their new characteristics, the testing and maintenance of large, complex Web applications are becoming more difficult. Web applications generally contain many pages and are used by enormous numbers of users, so statistical testing is an effective way of ensuring their quality. Web usage can be accurately described by a Markov chain, which has been proved an ideal model for software statistical testing. One improvement in the extended Markov chain model (EMM) is that the results of unit testing can be utilized in later stages, an important strategy for bottom-up integration testing; the other is an error-type vector treated as part of each page node. This paper also proposes an algorithm for generating test cases of usage paths. Finally, optional usage reliability evaluation methods and an incremental usability regression testing model for testing and evaluation are presented.
Key words: statistical testing; evaluation for Web usability; extended Markov chain model (EMM); Web log mining; reliability evaluation. CLC number: TP311.5. Foundation item: Supported by the National Defence Research Project (No. 41315.9.2) and the National Science and Technology Plan (2001BA102A04-02-03). Biography: MAO Cheng-ying (1978-), male, Ph.D. candidate; research directions: software testing, advanced database systems, component technology and data mining.
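Generating a statistical test case from a Markov usage model amounts to a weighted random walk over the page-transition graph. The sketch below assumes a plain Markov chain with hypothetical page names; it does not reproduce the EMM's error-type vectors or unit-test reuse.

```python
import random

def generate_path(transitions, start="Home", end="Exit", max_len=50, rng=None):
    """Random-walk a usage Markov chain to produce one test-case path.

    transitions maps a page to [(next_page, probability), ...]; the
    probabilities for each page should sum to 1."""
    rng = rng or random.Random(0)
    path, page = [start], start
    while page != end and len(path) < max_len:
        nxt, r, acc = None, rng.random(), 0.0
        for candidate, p in transitions[page]:
            acc += p
            if r <= acc:
                nxt = candidate
                break
        page = nxt or transitions[page][-1][0]      # guard against rounding
        path.append(page)
    return path
```

Paths generated this way visit pages in proportion to real usage, so testing effort concentrates where users actually go, which is the premise of statistical testing.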
Abstract: Backdoors or information leaks in Web servers can be detected by applying Web mining techniques to abnormal Web log and Web application log data; the security of Web servers can thus be enhanced and the damage of illegal access avoided. Firstly, a system for discovering patterns of information leakage in CGI scripts from Web log data is proposed. Secondly, those patterns are provided to system administrators so that they can modify their code and enhance Web site security. The following aspects are described: one is combining the Web application log with the Web log to extract more information, so that Web data mining can discover what a firewall and an intrusion detection system cannot find; the other is an operation module for the Web site to enhance its security. In cluster server sessions, a density-based clustering technique is used to reduce resource cost and obtain better efficiency.
Abstract: A new method for fuzzy clustering of Web users based on an analysis of user interest characteristics is proposed in this article. The method first defines fuzzy page categories according to the links on the index page of the site, then computes the fuzzy degree of cross-page visits by aggregating Web log data. After that, using the fuzzy comprehensive evaluation method, it constructs user interest vectors from page viewing times and hit frequencies, and derives the fuzzy similarity matrix for the Web users from the interest vectors. Finally, the clustering result is obtained through the fuzzy clustering method. The experimental results show the effectiveness of the method.
Key words: Web log mining; fuzzy similarity matrix; fuzzy comprehensive evaluation; fuzzy clustering. CLC numbers: TP18; TP311; TP391. Foundation item: Supported by the Natural Science Foundation of Heilongjiang Province of China (F0304). Biography: ZHAN Li-qiang (1966-), male, Lecturer, Ph.D.; research directions: theory and methods of data mining, database theory.
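The last two steps, a fuzzy similarity matrix over interest vectors followed by clustering, can be sketched as below. The min/max similarity and the lambda-cut grouping are common textbook choices, assumed here for illustration rather than taken from the paper.

```python
def similarity(u, v):
    """Fuzzy similarity of two interest vectors (min/max ratio in [0, 1])."""
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den if den else 1.0

def lambda_cut_clusters(vectors, lam):
    """Cluster users whose pairwise similarity reaches the lambda threshold:
    connected components of the lambda-cut graph, via union-find."""
    n = len(vectors)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(vectors[i], vectors[j]) >= lam:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Raising lambda splits the users into finer clusters; lowering it merges them, which is the usual knob in fuzzy-clustering of interest profiles.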
Funding: Supported by the Huo Yingdong Education Foundation of China (91101).
Abstract: A semantic session analysis method for partitioning Web usage logs is presented. A semantic Web usage log preparation model enhances usage logs with semantics. A Markov chain model based on ontology semantic measurement is used to identify which active session a request belongs to, and a competitive method is applied to determine the end of sessions. Compared with other algorithms, more successful sessions are additionally detected by semantic outlier analysis.
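For context, the baseline that semantic sessionization improves on is the classic inactivity-timeout split sketched below. This is the simple heuristic, not the paper's semantic method; the 30-minute threshold is the conventional default.

```python
def sessionize(requests, timeout=1800):
    """Split one user's (timestamp, url) requests into sessions using the
    classic 30-minute inactivity timeout (timestamps in seconds)."""
    sessions, current, last_ts = [], [], None
    for ts, url in sorted(requests):
        if last_ts is not None and ts - last_ts > timeout:
            sessions.append(current)
            current = []
        current.append((ts, url))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

A purely temporal split like this cannot tell two interleaved tasks apart, which is exactly the gap the ontology-based session assignment targets.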
Abstract: In a web-based learning environment, college English writing has always been a thorny issue. Asynchronous and synchronous communication in college English writing together represent a new interactive teaching approach. This paper attempts to blend the two with traditional writing learning and teaching in college English in order to promote a more flexible, efficient and interactive learning environment in accordance with students' interests and needs.
Abstract: The increasing usage of the internet requires an effective system for communication. To provide effective communication for internet users based on the nature of their queries, the shortest routing path is usually preferred for data forwarding. But when too many flows choose the same path, a bottleneck occurs in the traffic, which leads to data loss or delivers irrelevant data to users. In this paper, a Rule Based System using Improved Apriori (RBS-IA) rule mining framework is proposed for effective monitoring and control of network traffic. The RBS-IA framework integrates both traffic control and decision making to enhance internet usage. At first, the network traffic data are analyzed and the incoming and outgoing data information is processed using the Apriori rule mining algorithm. After generating the set of rules, the network traffic condition is analyzed. Based on the traffic conditions, a decision rule framework is introduced which derives and assigns suitable rules to the appropriate states of the network. The decision rule framework improves the effectiveness of network traffic control by updating the traffic condition states to identify the relevant route path for packet data transmission. An experimental evaluation is conducted on the Dodgers loop sensor data set from the UCI repository to assess the effectiveness of the proposed RBS-IA rule mining framework. The performance evaluation shows that RBS-IA provides a significant improvement in managing the network traffic control scheme, evaluated over factors such as the accuracy of the decisions obtained, interestingness measure and execution time.
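The rule-generation core of the framework is the classic Apriori level-wise search. The sketch below is plain Apriori over small transaction sets, not the paper's improved variant, and the example items are hypothetical.

```python
def apriori(transactions, min_support):
    """Classic Apriori: level-wise search for frequent itemsets.

    Returns {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    def support(items):
        return sum(items <= t for t in transactions) / n
    singletons = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in singletons if support(s) >= min_support}
    while level:
        frequent.update({s: support(s) for s in level})
        # join step: merge pairs that differ by one item, prune by support
        level = {a | b for a in level for b in level
                 if len(a | b) == len(a) + 1 and support(a | b) >= min_support}
    return frequent
```

In the traffic setting, each "transaction" would be the set of conditions observed in one monitoring window, and the frequent itemsets become the antecedents of the decision rules.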
Abstract: With the growing popularity of the World Wide Web, large volumes of user access data have been gathered automatically by Web servers and stored in Web logs. Discovering and understanding user behavior patterns from log files can support Web personalized recommendation services. In this paper, a novel clustering method for log files is presented, called Clustering large Web logs based on the Key Path Model (CWKPM), which builds on a user browsing key path model to obtain user behavior profiles. Compared with the previous Boolean model, the key path model considers the major features of users' Web access: order, contiguity and duplication. Moreover, it has fewer dimensions for clustering. Analysis and experiments show that CWKPM is an efficient and effective approach for clustering large, high-dimensional Web logs.
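The contrast with the Boolean model can be made concrete with an order-aware path similarity. The longest-common-subsequence measure below is a generic illustration of comparing browsing paths while respecting order and duplicates; it is not the CWKPM algorithm itself.

```python
def lcs_len(p, q):
    """Longest common subsequence length of two browsing paths
    (pages may repeat and order matters)."""
    dp = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i, a in enumerate(p, 1):
        for j, b in enumerate(q, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if a == b else max(dp[i-1][j], dp[i][j-1])
    return dp[len(p)][len(q)]

def path_similarity(p, q):
    """Similarity in [0, 1] that respects visiting order, unlike a
    Boolean page-set model."""
    return lcs_len(p, q) / max(len(p), len(q)) if p or q else 1.0
```

Under a Boolean page-set model, ["A", "B"] and ["B", "A"] are identical; the order-aware measure scores them 0.5, which is the kind of distinction a key-path model preserves.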
Funding: Supported by the National Natural Science Foundation of China (61871046, 61601053).
Abstract: Attacks on web servers are among the most serious threats in network security. Analyzing logs of web attacks is an effective approach to identifying malicious behavior. Traditionally, machine learning models based on labeled data have been popular identification methods, and some deep learning models have recently been introduced for classifying web logs. However, model training is limited by the amount of labeled data: web logs labeled with specific attack categories are difficult to obtain. Consequently, it is necessary to address data generation, with a focus on learning feature representations similar to the original data and improving the accuracy of the classification model. In this paper, a novel framework is proposed which differs in two important aspects: first, long short-term memory (LSTM) is incorporated into generative adversarial networks (GANs) to generate logs of web attacks; second, a data augmentation model is proposed which adds the GAN-generated attack logs to the original dataset and improves the performance of the classification model. The experimental results demonstrate the effectiveness of the proposed method, which improved the classification accuracy from 89.04% to 95.04%.
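The augmentation loop itself is independent of how the synthetic logs are produced. Since an LSTM-GAN needs a deep learning framework, the sketch below substitutes a character-level Markov model as a lightweight stand-in generator, purely to show the "generate scarce-class logs and append them to the training set" step; it is not the paper's model.

```python
import random
from collections import defaultdict

def train_generator(samples, order=3):
    """Character-level Markov model: a lightweight stand-in for the
    LSTM generator, mapping each length-`order` context to next chars."""
    model = defaultdict(list)
    for s in samples:
        s = "^" * order + s + "$"                 # start padding / end marker
        for i in range(len(s) - order):
            model[s[i:i + order]].append(s[i + order])
    return model

def generate(model, order=3, max_len=200, rng=None):
    rng = rng or random.Random(0)
    out, ctx = [], "^" * order
    for _ in range(max_len):
        ch = rng.choice(model[ctx])
        if ch == "$":
            break
        out.append(ch)
        ctx = ctx[1:] + ch
    return "".join(out)

def augment(labeled_logs, label, n_new, order=3):
    """Append n_new synthetic lines of the scarce class to the dataset."""
    model = train_generator([x for x, y in labeled_logs if y == label], order)
    return labeled_logs + [(generate(model, order), label) for _ in range(n_new)]
```

The classifier is then retrained on the augmented set; in the paper this rebalancing of scarce attack classes is what lifts accuracy.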
Funding: Supported by a Royal Thai Government Scholarship and by resources from the Faculty of IT, Monash University.
Abstract: In this era of a data-driven society, useful data (Big Data) is often unintentionally ignored due to a lack of convenient tools and the expense of software. For example, web log files can be used to identify explicit information about browsing patterns when users access web sites. Some hidden information, however, cannot be derived directly from the log files, and external resources may be needed to discover more knowledge from browsing patterns. The purpose of this study is to investigate the application of web usage mining based on web log files. The outcome of this study sets further directions for this investigation into what implicit information is embedded in log files and how it can be extracted efficiently and effectively. Further work involves combining the use of social media data to improve the quality of business decisions.
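Extracting the explicit information mentioned above starts with parsing the raw log lines. The sketch below handles the Apache/NCSA Common Log Format, the usual starting point for web usage mining; the sample line is illustrative.

```python
import re

# Apache/NCSA Common Log Format:
# host ident authuser [timestamp] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of CLF fields, or None for a malformed line."""
    m = CLF.match(line)
    return m.groupdict() if m else None
```

Grouping the parsed records by host and timestamp then yields the per-user browsing sequences on which usage mining operates.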