Funding: Supported by the ABT SHIELD (Anti-Bot and Trolls Shield) project at the Systems Research Institute, Polish Academy of Sciences, in cooperation with EDGE NPD, under grant RPMA.01.02.00-14-B448/18-00 funded by the Regional Development Fund for the development of Mazovia.
Abstract: Distinguishing between web traffic generated by bots and humans is an important task in the evaluation of online marketing campaigns. One of the main challenges is the only partial availability of the performance metrics: although some users can be unambiguously classified as bots, the correct label is uncertain in many cases. This calls for classifiers capable of explaining their decisions. This paper demonstrates two such mechanisms based on features carefully engineered from web logs. The first is a hand-crafted rule-based system. The second is a hierarchical model that first performs clustering and then classification using human-centred, interpretable methods. The stability of the proposed methods is analyzed, and a minimal set of features that conveys the class-discriminating information is selected. The proposed data processing and analysis methodology is successfully applied to real-world data sets from online publishers.
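As a rough illustration of the two mechanisms described in this abstract, the Python sketch below pairs a hand-crafted rule set with a cluster-then-classify pipeline built from interpretable components. The feature names, thresholds, and model choices (KMeans followed by shallow decision trees) are assumptions for illustration, not the paper's actual design.

```python
# Hedged sketch of two explainable bot-detection mechanisms; all feature names and
# thresholds below are hypothetical, not the paper's engineered features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def rule_based_label(session):
    """Hand-crafted rules on web-log session features (hypothetical thresholds)."""
    if session["requests_per_minute"] > 120:        # unrealistically fast browsing
        return "bot"
    if session["robots_txt_requested"]:             # humans rarely fetch robots.txt
        return "bot"
    if session["avg_time_per_page_s"] > 5 and session["css_js_loaded"]:
        return "human"
    return "uncertain"                              # labels are only partially available

class ClusterThenClassify:
    """Hierarchical model: cluster sessions, then fit an interpretable
    classifier (a shallow decision tree) inside each cluster."""
    def __init__(self, n_clusters=4):
        self.clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.trees = {}

    def fit(self, X, y):
        clusters = self.clusterer.fit_predict(X)
        for c in np.unique(clusters):
            idx = clusters == c
            tree = DecisionTreeClassifier(max_depth=3, random_state=0)
            tree.fit(X[idx], y[idx])
            self.trees[c] = tree
        return self

    def predict(self, X):
        clusters = self.clusterer.predict(X)
        return np.array([self.trees[c].predict(x.reshape(1, -1))[0]
                         for c, x in zip(clusters, X)])
```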
Funding: Supported by a Royal Thai Government Scholarship and by resource support from the Faculty of IT, Monash University.
Abstract: In this era of a data-driven society, useful data (Big Data) is often unintentionally ignored because convenient tools are lacking and commercial software is expensive. For example, web log files can be used to identify explicit information about browsing patterns when users access web sites. Some hidden information, however, cannot be derived directly from the log files; external resources may be needed to discover more knowledge from browsing patterns. The purpose of this study is to investigate the application of web usage mining based on web log files. The outcome of this study sets further directions for this investigation into what implicit information is embedded in log files and how it can be efficiently and effectively extracted. Further work involves combining the use of social media data to improve the quality of business decisions.
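For readers unfamiliar with what "explicit" browsing information in a web log looks like, the minimal Python sketch below parses log lines and groups requested paths by client host. It assumes the Apache/NGINX Common Log Format; the study's actual log format and tooling are not specified here.

```python
# Minimal sketch of extracting explicit browsing information from a web server log,
# assuming the Common Log Format; the field layout is an assumption.
import re
from collections import defaultdict

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def browsing_patterns(log_lines):
    """Group requested paths by client host -- the simplest explicit pattern."""
    visits = defaultdict(list)
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m:
            visits[m.group("host")].append(m.group("path"))
    return visits

sample = ['203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326']
print(browsing_patterns(sample))  # {'203.0.113.7': ['/index.html']}
```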
Funding: The National Natural Science Foundation of China (61871046, 61601053).
Abstract: Attacks on web servers are among the most serious threats in the field of network security. Analyzing logs of web attacks is an effective approach to identifying malicious behavior. Traditionally, machine learning models based on labeled data are popular identification methods, and some deep learning models have recently been introduced for classifying web logs. However, such models are limited by the amount of labeled data available for training: web logs labeled with specific attack categories are difficult to obtain. Consequently, it is necessary to address the problem of data generation, learning feature representations similar to those of the original data in order to improve the accuracy of the classification model. In this paper, a novel framework is proposed that differs from previous work in two important aspects: first, long short-term memory (LSTM) networks are incorporated into generative adversarial networks (GANs) to generate web attack logs; second, a data augmentation model is proposed that adds the GAN-generated attack logs to the original dataset, improving the performance of the classification model. Experimental results demonstrate the effectiveness of the proposed method, which improved the classification accuracy from 89.04% to 95.04%.
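A hedged PyTorch sketch of the core idea, an LSTM-based generator and discriminator trained adversarially on tokenized log sequences, is given below. The tokenization scheme, layer sizes, and training loop are assumptions for illustration and do not reproduce the paper's architecture.

```python
# Hedged sketch: LSTM generator inside a GAN producing token sequences that stand in
# for web-attack log entries. Dimensions and training details are assumptions.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, NOISE, HIDDEN = 256, 64, 32, 128  # assumed sizes

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(NOISE, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, z):                      # z: (batch, SEQ_LEN, NOISE)
        h, _ = self.lstm(z)
        return torch.softmax(self.out(h), -1)  # soft tokens keep the graph differentiable

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, HIDDEN)  # accepts one-hot or soft token vectors
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, 1)

    def forward(self, x):                      # x: (batch, SEQ_LEN, VOCAB)
        h, _ = self.lstm(self.embed(x))
        return torch.sigmoid(self.out(h[:, -1]))

def train_step(G, D, real, opt_g, opt_d, bce=nn.BCELoss()):
    """One adversarial step; `real` is a float one-hot tensor (batch, SEQ_LEN, VOCAB)."""
    batch = real.size(0)
    fake = G(torch.randn(batch, SEQ_LEN, NOISE))
    # Discriminator: real sequences -> 1, generated sequences -> 0
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    d_loss.backward(); opt_d.step()
    # Generator: try to fool the discriminator
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(batch, 1))
    g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```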
Abstract: Web logs contain a great deal of information related to user activities on the Internet. How to mine user browsing interest patterns effectively is an important and challenging research topic. Based on an analysis of the advantages and disadvantages of existing algorithms, we propose a new concept: support-interest. Its key insight is that visitors backtrack when they do not find the information where they expect it, and the point from which they backtrack is the expected location for the page. We present the User Access Matrix and a corresponding algorithm for discovering such expected locations that can handle page caching by the browser. Since the URL-URL matrix is sparse, it can be represented as a list of 3-tuples, and user-preferred sub-paths can be mined from computations on this matrix. All the sub-paths are then merged, and user-preferred paths are formed. Experiments showed that the method is accurate and scalable. It is suitable for website-based applications, such as optimizing a website's topological structure or designing personalized services.
Keywords: Web mining, user preferred path, Web-log, support-interest, personalized services. CLC number: TP 391.
Foundation item: Supported by the National High Technology Development Program of China (863 Program) (2001AA113182).
Biography: ZHOU Hong-fang (1976-), female, Ph.D. candidate; research direction: data mining and knowledge discovery in databases.
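The sketch below illustrates the two data structures mentioned in this abstract: the sparse URL-URL transition matrix stored as a list of 3-tuples, and a simple backtrack detector that proposes expected locations. It is an illustrative simplification; the paper's support-interest measure and caching-aware algorithm are not reproduced.

```python
# Illustrative sketch, not the paper's exact algorithm: a sparse URL-URL transition
# matrix as 3-tuples (src, dst, count), and backtrack points flagged as candidate
# "expected locations" for the page the visitor eventually reached.
from collections import Counter

def transition_tuples(sessions):
    """sessions: list of URL sequences; returns the sparse matrix as (src, dst, count)."""
    counts = Counter()
    for path in sessions:
        for src, dst in zip(path, path[1:]):
            counts[(src, dst)] += 1
    return [(src, dst, n) for (src, dst), n in counts.items()]

def expected_locations(path):
    """A backtrack (returning to an already-visited page) marks the page where the
    visitor expected to find the content reached next."""
    seen, expectations = [], []
    for i, url in enumerate(path):
        if url in seen and i + 1 < len(path):
            expectations.append((url, path[i + 1]))  # target expected under `url`
        seen.append(url)
    return expectations

session = ["/home", "/products", "/home", "/about"]
print(expected_locations(session))   # [('/home', '/about')] -- /about expected under /home
print(transition_tuples([session]))
```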
Abstract: In this paper we investigate the effectiveness of ensemble-based learners for identifying web robot sessions from web server logs. We also perform multi-fold robot session labeling to improve learner performance. We conduct a comparative study of various ensemble methods (Bagging, Boosting, and Voting) against simple classifiers from a classification perspective, and evaluate the effectiveness of these classifiers (both ensemble and simple) on five data sets of varying session length. At present, the results of web server log analyzers are not very reliable because the input log files are heavily inflated by sessions of automated web traversal software, known as web robots. The presence of web robot traffic entries in web server log repositories makes it very difficult to extract actionable and usable knowledge about the browsing behavior of actual visitors. Web robot sessions therefore need to be detected accurately and quickly in web server log repositories so that knowledge about genuine visitors can be extracted and log analyzers can produce correct results.
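A hedged scikit-learn sketch of such a comparison is shown below, with Bagging, Boosting, and Voting ensembles evaluated against simple classifiers by cross-validation. The synthetic features stand in for engineered session attributes; the paper's actual data sets and feature set are not reproduced.

```python
# Hedged sketch: ensemble methods versus simple classifiers for web-robot session
# identification. Features and data are synthetic placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# stand-in for engineered session features (e.g. hit rate, robots.txt access, depth)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes":   GaussianNB(),
    "bagging":       BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
    "boosting":      AdaBoostClassifier(n_estimators=25, random_state=0),
    "voting":        VotingClassifier([("dt", DecisionTreeClassifier(random_state=0)),
                                       ("nb", GaussianNB())], voting="soft"),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:14s} mean accuracy = {scores.mean():.3f}")
```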
Abstract: Because Web log files contain a great deal of valuable information, the results of Web mining can be used to enhance decision making for electronic commerce (EC) operation and management. Given the ambiguity and abundance of Web log files, a minimal decision-making model based on rough set theory is presented for Web mining, and an example is given to explain the model. The model can reduce the decision table so that a minimal solution of the table can be obtained. Based on this minimal solution, the corresponding decision for individualized service can be made in turn. Web mining based on rough set theory is also, at present, an original and distinctive method.
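The toy Python sketch below illustrates the rough-set idea of reducing a decision table to a minimal attribute subset (a reduct) that still determines the decision. The table, attributes, and decision values are invented for illustration and are not taken from the paper.

```python
# Hedged sketch of decision-table reduction in the rough-set style; the toy table is
# built from hypothetical Web-log features, not the paper's example.
from itertools import combinations

def consistent(rows, attrs, decision="decision"):
    """True if the chosen attributes still determine the decision uniquely."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, row[decision]) != row[decision]:
            return False
    return True

def minimal_reduct(rows, attributes):
    """Smallest attribute subset preserving consistency (exhaustive; fine for small tables)."""
    for k in range(1, len(attributes) + 1):
        for subset in combinations(attributes, k):
            if consistent(rows, subset):
                return subset
    return tuple(attributes)

table = [
    {"pages": "many", "duration": "long",  "repeat": "yes", "decision": "recommend"},
    {"pages": "many", "duration": "short", "repeat": "yes", "decision": "recommend"},
    {"pages": "few",  "duration": "short", "repeat": "no",  "decision": "ignore"},
    {"pages": "few",  "duration": "long",  "repeat": "no",  "decision": "ignore"},
]
print(minimal_reduct(table, ["pages", "duration", "repeat"]))  # ('pages',)
```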
Abstract: The World Wide Web has been an environment with many security threats and numerous reported cases of security breaches. Various tools and techniques have been applied in attempts to curb this problem; however, new attacks continue to plague the Internet. We discuss risks that affect web applications and explain how network-centric and host-centric techniques, crucial as they are in an enterprise, lack the depth needed to comprehensively analyze overall application security. The tendency of web applications to span a number of servers introduces a new dimension of security requirements and calls for a holistic approach that protects the information asset regardless of the physical or logical separation of its modules and tiers. We therefore classify security mechanisms as either infrastructure-centric or application-centric, based on which asset is being secured, and then describe requirements for such application-centric security mechanisms.