Advances in clustering based on inter-cluster mapping

  • Arshad M. Muhammad

Western Sydney University thesis: Doctoral thesis

Abstract

Data mining involves searching for certain patterns and facts about the structure of data within large, complex datasets. Data mining can reveal valuable and interesting relationships which can improve the operations of business, health and many other disciplines. Extraction of hidden patterns and strategic knowledge from large, electronically stored datasets is therefore a challenge faced by many organizations. One commonly used technique in data mining for producing useful results is cluster analysis. A basic issue in cluster analysis is deciding the optimal number of clusters for a dataset. A solution to this issue is not straightforward, because clustering is a form of unsupervised learning and no clear definition of cluster quality exists. In addition, this issue is more challenging and complicated for multi-dimensional datasets. Estimating the number of clusters and their quality is generally based on so-called validation indexes. A limitation of typical existing validation indexes is that they only work well with specific types of datasets compatible with their design assumptions. Their results may also be inconsistent, and an algorithm may need to be run multiple times to find a best estimate of the number of clusters. Furthermore, these existing approaches may not be effective for complex problems in large datasets with varied structure. To help overcome these deficiencies, an efficient and effective approach for stable estimation of the number of clusters is essential. Many clustering techniques are available, including partitioning, hierarchical, grid-based and model-based clustering. Here we consider only the partitioning method, e.g. the k-means clustering algorithm, for analysing data. This thesis describes a new approach for stable estimation of the number of clusters, based on use of the k-means clustering algorithm.
First, results obtained from the k-means clustering algorithm are used to obtain a forward and backward mapping of common elements for adjacent and non-adjacent clusters. These mappings are represented in the form of proportion matrices, which are used to compute combined mapped information using a matrix inner product similarity measure. This provides indicators for the similarity of mapped elements and overlap (dissimilarity), and for the average similarity and average overlap (average dissimilarity) between clusters. Finally, the estimated number of clusters is decided using the maximum average similarity, the minimum average overlap and the coefficient of variation measure. The new approach provides more information than an application of typical existing validation indexes. For example, the new approach offers not only the estimated number of clusters but also gives an indication of fully or partially separated clusters and defines a set of stable clusters for the estimated number of clusters. The advantage of the new approach over several existing validation indexes for evaluating clustering results is demonstrated empirically by applying it to a variety of simulated and real datasets.
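As a rough illustration of the forward and backward mapping described above, the following Python sketch builds two proportion matrices from two labelings of the same data points (for example, k-means results for two different numbers of clusters) and combines them with an inner product. This is only a sketch under stated assumptions, not the thesis's actual formulation: the function names `proportion_matrices` and `combined_similarity`, the definition of each proportion (shared elements divided by cluster size), and the normalisation by the smaller number of clusters are all hypothetical choices made for this example.

```python
from collections import Counter


def proportion_matrices(labels_a, labels_b):
    """Return (forward, backward) proportion matrices for two labelings.

    Assumed definitions (not taken from the thesis):
      forward[i][j]  = share of cluster i in A whose elements fall in cluster j in B
      backward[j][i] = share of cluster j in B whose elements fall in cluster i in A
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same data points")
    ids_a = sorted(set(labels_a))
    ids_b = sorted(set(labels_b))
    common = Counter(zip(labels_a, labels_b))  # counts of shared elements
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    forward = [[common[(a, b)] / size_a[a] for b in ids_b] for a in ids_a]
    backward = [[common[(a, b)] / size_b[b] for a in ids_a] for b in ids_b]
    return forward, backward


def combined_similarity(forward, backward):
    """Inner product of the forward matrix with the transposed backward matrix.

    Dividing by the smaller number of clusters is an assumption made here so
    that two identical clusterings score exactly 1.0.
    """
    inner = sum(
        forward[i][j] * backward[j][i]
        for i in range(len(forward))
        for j in range(len(backward))
    )
    return inner / min(len(forward), len(backward))


# Hypothetical labelings of six points with 2 and 3 clusters.
labels_k2 = [0, 0, 0, 1, 1, 1]
labels_k3 = [0, 0, 1, 1, 2, 2]
forward, backward = proportion_matrices(labels_k2, labels_k3)
similarity = combined_similarity(forward, backward)
```

In this sketch a higher score means the two clusterings map onto each other more cleanly. How the thesis turns such scores into average similarity, average overlap and a coefficient of variation is not specified in the abstract, so it is not reproduced here.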
Date of Award: 2016
Original language: English

Keywords

  • data mining
  • cluster analysis
