Abstract
Identifying categories of visitors to a Web site is very useful for improving Web design and Web applications. However, the large volume of data involved in mining access logs and visitation paths, and the uncertainty in fully identifying the visitor, demand efficient clustering algorithms that are also resistant to noise and outliers. Moreover, visitation paths are discrete, and evaluating the dissimilarity between visitation paths is sophisticated and results in attribute vectors of large dimension. We provide randomized, iterative clustering algorithms for generic dissimilarity between paths. Our algorithms are robust because they use medians rather than means as estimators of location, and the resulting representative of a cluster is an actual path in the data set. We demonstrate mathematically that our algorithms converge and have subquadratic complexity. We also show experimentally that they are resistant to noise by recovering clusters from synthetic data generated by a mixture of distributions of paths in a graph. The non-crisp method we propose generalizes approaches that allow a data item to have a degree of membership in a cluster.
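The idea of a cluster representative that is itself a path in the data set can be illustrated with a small sketch. This is not the algorithm from the paper: it is a plain, deterministic medoid-style alternation (assign each path to its nearest representative, then re-pick each representative as the member with the smallest total dissimilarity), so it has neither the randomization nor the subquadratic complexity the abstract claims. The edit-distance dissimilarity, the function names (`edit_distance`, `medoid`, `cluster_paths`), and the sample paths are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def edit_distance(a: Path, b: Path) -> int:
    """Levenshtein distance between two page sequences (an assumed dissimilarity)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute x by y
        prev = curr
    return prev[-1]


def medoid(cluster: Sequence[Path], d: Callable[[Path, Path], float]) -> Path:
    """Member of the cluster with the smallest total dissimilarity to the others."""
    return min(cluster, key=lambda c: sum(d(c, p) for p in cluster))


def assign(paths: Sequence[Path], reps: Sequence[Path],
           d: Callable[[Path, Path], float]) -> List[List[Path]]:
    """Put every path into the cluster of its nearest representative."""
    clusters: List[List[Path]] = [[] for _ in reps]
    for p in paths:
        nearest = min(range(len(reps)), key=lambda k: d(p, reps[k]))
        clusters[nearest].append(p)
    return clusters


def cluster_paths(paths: Sequence[Path], initial: Sequence[Path],
                  d: Callable[[Path, Path], float] = edit_distance,
                  max_iter: int = 100) -> Tuple[List[Path], List[List[Path]]]:
    """Alternate assignment and medoid updates until the representatives stop changing."""
    reps = list(initial)
    for _ in range(max_iter):
        clusters = assign(paths, reps, d)
        # An empty cluster keeps its old representative.
        new_reps = [medoid(c, d) if c else r for c, r in zip(clusters, reps)]
        if new_reps == reps:
            break
        reps = new_reps
    return reps, assign(paths, reps, d)


# Hypothetical visitation paths: two visitor groups plus one noisy path.
paths: List[Path] = [
    ("home", "products", "cart"),
    ("home", "products", "cart", "checkout"),
    ("home", "products"),
    ("blog", "post1", "post2"),
    ("blog", "post1"),
    ("blog", "post1", "post2", "about"),
    ("home", "x", "y", "z", "w"),  # noisy path
]
medoids, clusters = cluster_paths(paths, initial=[paths[1], paths[5]])
```

Because each representative is chosen among the cluster's own members, the noisy path can join a cluster without pulling the representative toward a location that corresponds to no real path, which is the robustness argument made for medians over means. Any other dissimilarity function on paths can be passed as `d`.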
| Original language | English |
|---|---|
| Pages (from-to) | 497-520 |
| Number of pages | 24 |
| Journal | International Journal of Foundations of Computer Science |
| Volume | 13 |
| Issue number | 4 |
| Publication status | Published - 2002 |
| Externally published | Yes |
Keywords
- Clustering
- Dissimilarity
- Visitation paths
- Web-User Mining