Abstract: XML (eXtensible Markup Language) is a standard that is widely applied in data representation and data exchange. However, the DTD (Document Type Definition), an important concept of XML, is not taken full advantage of in current applications. In this paper, a new method for clustering DTDs is presented, which can also be used in XML document clustering. The two-level method clusters the elements in DTDs and the DTDs themselves separately. Element clustering forms the first level and provides element clusters, which are generalizations of relevant elements. DTD clustering utilizes this generalized information and forms the second level of the whole clustering process. The two-level method has the following advantages: 1) it takes into consideration both the content and the structure within DTDs; 2) the generalized information about elements is more useful than the separate words in the vector model; 3) it facilitates the search for outliers. The experiments show that this method is able to categorize relevant DTDs effectively.
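The two-level idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the similarity measures or the clustering procedure, so the sketch assumes string similarity on element names for the first level, Jaccard similarity over element-cluster ids for the second level, and a simple greedy "leader" clustering for both. It also ignores content models, so it covers only the naming side of the content and structure mentioned above. All function names, thresholds, and sample DTDs are hypothetical.

```python
import re
from difflib import SequenceMatcher

def element_names(dtd_text):
    """Collect the names declared by <!ELEMENT ...> in a DTD string."""
    return set(re.findall(r"<!ELEMENT\s+([\w.:-]+)", dtd_text))

def leader_cluster(items, sim, threshold):
    """Greedy clustering: an item joins the first cluster whose first
    member (the leader) is similar enough, else it starts a new cluster."""
    clusters = []
    for item in items:
        for cluster in clusters:
            if sim(cluster[0], item) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters

def name_sim(a, b):
    """Assumed element similarity: case-insensitive string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def two_level_cluster(dtds, elem_threshold=0.8, dtd_threshold=0.4):
    names = {key: element_names(text) for key, text in dtds.items()}

    # Level 1: cluster element names across all DTDs.
    vocabulary = sorted(set().union(*names.values()))
    elem_clusters = leader_cluster(vocabulary, name_sim, elem_threshold)
    cluster_of = {n: i for i, c in enumerate(elem_clusters) for n in c}

    # Level 2: describe each DTD by the element clusters it uses,
    # then cluster the DTDs on those generalized descriptions.
    profile = {key: frozenset(cluster_of[n] for n in ns)
               for key, ns in names.items()}
    dtd_clusters = leader_cluster(
        sorted(dtds),
        lambda a, b: jaccard(profile[a], profile[b]),
        dtd_threshold,
    )

    # DTDs left alone in a cluster are reported as outlier candidates.
    outliers = [c[0] for c in dtd_clusters if len(c) == 1]
    return elem_clusters, dtd_clusters, outliers

# Hypothetical sample DTDs.
dtds = {
    "books_a": "<!ELEMENT book (title, author+)>"
               "<!ELEMENT title (#PCDATA)><!ELEMENT author (#PCDATA)>",
    "books_b": "<!ELEMENT books (book_title, authors)>"
               "<!ELEMENT book_title (#PCDATA)><!ELEMENT authors (#PCDATA)>",
    "invoice": "<!ELEMENT invoice (customer, amount)>"
               "<!ELEMENT customer (#PCDATA)><!ELEMENT amount (#PCDATA)>",
}
elem_clusters, dtd_clusters, outliers = two_level_cluster(dtds)
```

In this toy input the two book DTDs share no element name verbatim, so a comparison on raw names alone would find no overlap between them. Grouping `book` with `books` and `author` with `authors` at the first level is what lets the second level place them together, which is the kind of benefit the abstract attributes to generalized element information.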