Funding: Project (2007CB311106) supported by the National Basic Research Program of China; Project (242-2009A82) supported by the National Information Security Special Plan Program of China
Abstract: With the great commercial success of several IPTV (Internet Protocol television) applications, PPLive has received increasing attention from both industry and academia. PPLive is currently one of the most popular IPTV applications and attracts a large number of users across the globe; however, its dramatic rise in popularity also makes it a more likely target for attack. The main contribution of this work is twofold. Firstly, a dedicated distributed crawler system was proposed and its crawling performance was analyzed; the crawler was then used to evaluate the impact of pollution attacks on a P2P live streaming system. The measurement results reveal that the crawler system with a distributed architecture could capture PPLive overlay snapshots more efficiently than previous crawlers. To the best of our knowledge, this study is the first to apply a distributed architecture to crawler design and to discuss the crawling performance of capturing accurate overlay snapshots for a P2P live streaming system. Secondly, a feasible and effective pollution architecture was proposed to deploy a content pollution attack in a real-world P2P live streaming system, PPLive, and the impact of the attack was evaluated in depth from the following five aspects: dynamic evolution of participating users, user lifetime characteristics, user connectivity performance, dynamic evolution of uploading polluted chunks, and dynamic evolution of pollution ratio. Specifically, the experimental results show that a single polluter is capable of compromising the entire system and that its destructiveness is severe.
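The idea of capturing an overlay snapshot with several cooperating crawlers can be illustrated with a small sketch. The abstract does not describe the crawler's internals, so everything here is an assumption made for illustration: the toy in-memory overlay `OVERLAY`, the hypothetical `query_peer_list` function standing in for a real peer-list request, and the hash-based assignment of peers to workers.

```python
from collections import deque
import zlib

# Toy overlay: peer -> neighbors it reports (hypothetical data, not PPLive measurements).
OVERLAY = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d", "e"],
    "d": ["b", "c"],
    "e": ["c"],
    "f": ["f"],  # not reachable from peer "a"
}


def query_peer_list(peer):
    """Stand-in for a network peer-list request (assumption, no real protocol)."""
    return OVERLAY.get(peer, [])


def owner(peer, n_workers):
    """Assign each peer to exactly one worker using a stable hash."""
    return zlib.crc32(peer.encode()) % n_workers


def crawl_snapshot(seeds, n_workers=3):
    """Breadth-first crawl in which each worker only queries the peers it owns."""
    queues = [deque() for _ in range(n_workers)]
    seen = set(seeds)
    for seed in seeds:
        queues[owner(seed, n_workers)].append(seed)
    edges = set()
    while any(queues):
        # Round-robin over the queues stands in for workers running in parallel.
        for queue in queues:
            if not queue:
                continue
            peer = queue.popleft()
            for neighbor in query_peer_list(peer):
                edges.add((peer, neighbor))
                if neighbor not in seen:
                    seen.add(neighbor)
                    queues[owner(neighbor, n_workers)].append(neighbor)
    return seen, edges
```

Calling `crawl_snapshot(["a"])` returns the set of discovered peers and the directed edges observed between them. Partitioning by hash means no peer is queried twice, which is one plausible way a distributed design could shorten the time needed for a snapshot.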
Funding: Project (2012BAH18B05) supported by the Supporting Program of the Ministry of Science and Technology of China
Abstract: Content extraction from HTML pages is the basis of web page clustering and information retrieval, so it is necessary to eliminate cluttered information and important to extract page content accurately. A novel and accurate solution for extracting the content of HTML pages was proposed. First, the HTML page is parsed into a DOM object and the IDs of all leaf nodes are generated. Second, the score of each leaf node is calculated and then adjusted according to its relationship with its neighbors. Finally, the information blocks are found according to their definition, and a universal classification algorithm is used to identify the content blocks. The experimental results show that the algorithm can extract content effectively and accurately, with recall and precision of 96.5% and 93.8%, respectively.
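The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not give the scoring formula, the block definition, or the classifier, so the text-length score, the link penalty, the neighbor weighting, the threshold of 25, and the "pick the highest-scoring block" rule are all hypothetical stand-ins.

```python
from html.parser import HTMLParser

PAGE = (
    "<html><head><title>T</title><script>var x=1;</script></head><body>"
    "<ul><li><a href='/'>Home</a></li><li><a href='/a'>About</a></li></ul>"
    "<p>This is the first long paragraph of the main article text.</p>"
    "<b>Note</b>"
    "<p>This is the second long paragraph of the main article text.</p>"
    "<div><a href='/c'>Contact</a></div></body></html>"
)


class LeafCollector(HTMLParser):
    """Step 1: walk the markup and give every text leaf a path-like ID."""
    SKIP = {"script", "style"}
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.leaves = []  # (leaf_id, text, inside_link)

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.SKIP & set(self.stack)):
            leaf_id = "/".join(self.stack) + f"#{len(self.leaves)}"
            self.leaves.append((leaf_id, text, "a" in self.stack))


def score_leaves(leaves):
    """Step 2: raw score per leaf, then adjust it using the two neighbors."""
    # Assumed heuristic: text length, with link text counted at a quarter.
    raw = [len(text) * (0.25 if in_link else 1.0) for _, text, in_link in leaves]
    adjusted = []
    for i, own in enumerate(raw):
        prev_score = raw[i - 1] if i > 0 else 0.0
        next_score = raw[i + 1] if i + 1 < len(raw) else 0.0
        adjusted.append(0.5 * own + 0.25 * (prev_score + next_score))
    return adjusted


def extract_content(html, threshold=25.0):
    """Step 3: group high-scoring neighbors into blocks, keep the best block."""
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    blocks, current = [], []
    for (_, text, _), score in zip(parser.leaves, score_leaves(parser.leaves)):
        if score >= threshold:
            current.append((text, score))
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    if not blocks:
        return ""
    # Stand-in for the classifier: choose the block with the largest total score.
    best = max(blocks, key=lambda block: sum(score for _, score in block))
    return " ".join(text for text, _ in best)
```

On the sample `PAGE`, `extract_content(PAGE)` keeps the two paragraphs and the short "Note" leaf between them while dropping the navigation and footer links. The short leaf survives only because its long neighbors raise its adjusted score, which is the kind of effect the neighbor adjustment is meant to capture.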