This thesis describes an approach for clustering multivariate time series whose variables take both categorical and continuous values. Time series of this type are frequent in healthcare, where they represent the health trajectories of individuals. The problem is challenging because categorical variables make it difficult to define a meaningful distance between trajectories. Clustering is one of the most common and useful tasks in data mining, so it is a well-studied problem. However, clustering sequential or longitudinal data is more challenging than traditional clustering, because sequences of observations must be processed rather than single data points. The analysis of longitudinal data is an interesting application area in epidemiology and clinical research, since it allows researchers to observe individual patterns of change and to capture the relationship between exposure and outcome. The typical approach to longitudinal clustering in health services research uses K-means clustering to form the health states (conditions) and a first-order Markov chain to describe the transitions between states. This procedure ignores information from temporally adjacent observations and prevents the uncertainty arising from parameter estimation and cluster assignment from being incorporated into the analysis. The approach proposed here was based on hidden Markov models (HMMs) and proceeds in three steps: first, map each trajectory into an HMM; then define a suitable distance between HMMs; and finally cluster the HMMs with a method based on a distance matrix. The assumption was made that the observed health conditions are just the manifestations of a true health state that cannot be observed directly and that therefore remains hidden. Accordingly, rather than modelling the transitions between observed health states, the transitions between hidden states were modelled, together with the probabilities of observing certain health conditions in each hidden state.
The approach was tested on a simulated, but realistic, data set of 1,255 trajectories of individuals from the 45 and Up data set, on a synthetic validation set consisting of 1,255 trajectories with known clustering structure, and on a smaller set of 268 trajectories extracted from the longitudinal Health and Retirement Survey. The proposed method can be implemented quite simply using standard packages in R and Matlab. It may therefore be a good candidate for solving the difficult problem of clustering multivariate time series with categorical variables using tools that do not require advanced statistical knowledge and are accessible to a wide range of researchers.
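
The three steps named in the abstract (fit one HMM per trajectory, compute a distance between HMMs, cluster from the distance matrix) can be illustrated with a minimal sketch. It is not the thesis's implementation, which the abstract says used standard packages in R and Matlab. Several things here are assumptions made only for illustration: each observation is a single categorical symbol coded as an integer (the thesis also handles continuous variables), the distance is a symmetrised per-observation log-likelihood difference (a common choice for comparing HMMs, not necessarily the one used in the thesis), the final step uses average-linkage agglomerative clustering, and all function names and the smoothing constant `eps` are hypothetical.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns alpha, beta and the log-likelihood."""
    T, K = len(obs), len(pi)
    alpha, beta, c = np.zeros((T, K)), np.ones((T, K)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    return alpha, beta, np.log(c).sum()

def fit_hmm(obs, n_states, n_symbols, n_iter=30, seed=0, eps=1e-2):
    """Step 1: Baum-Welch fit of a discrete HMM to ONE trajectory.
    eps is an assumed pseudo-count that keeps every probability positive."""
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n_states))
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    for _ in range(n_iter):
        alpha, beta, _ = forward_backward(obs, pi, A, B)
        gamma = alpha * beta                      # state posteriors per time step
        gamma /= gamma.sum(axis=1, keepdims=True)
        # xi[t, i, j]: posterior of moving from state i at t to state j at t+1
        xi = alpha[:-1, :, None] * A[None] * (B[:, obs[1:]].T * beta[1:])[:, None, :]
        xi /= xi.sum(axis=(1, 2), keepdims=True)
        pi = gamma[0] + eps
        A = xi.sum(axis=0) + eps
        B = np.stack([gamma[obs == k].sum(axis=0) for k in range(n_symbols)], axis=1) + eps
        pi /= pi.sum()
        A /= A.sum(axis=1, keepdims=True)
        B /= B.sum(axis=1, keepdims=True)
    return pi, A, B

def loglik(obs, model):
    """Log-probability of a trajectory under a fitted (pi, A, B) model."""
    return forward_backward(np.asarray(obs), *model)[2]

def hmm_distance(obs_i, model_i, obs_j, model_j):
    """Step 2: symmetrised log-likelihood distance (an assumed choice)."""
    d_ij = (loglik(obs_i, model_i) - loglik(obs_i, model_j)) / len(obs_i)
    d_ji = (loglik(obs_j, model_j) - loglik(obs_j, model_i)) / len(obs_j)
    return max(0.5 * (d_ij + d_ji), 0.0)          # clip small negative values

def average_linkage(D, n_clusters):
    """Step 3: agglomerative clustering that only needs the distance matrix."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    labels = np.empty(len(D), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

def cluster_trajectories(trajectories, n_states, n_symbols, n_clusters):
    """Run the three steps and return cluster labels and the distance matrix."""
    models = [fit_hmm(o, n_states, n_symbols, seed=i) for i, o in enumerate(trajectories)]
    n = len(trajectories)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = hmm_distance(trajectories[i], models[i],
                                             trajectories[j], models[j])
    return average_linkage(D, n_clusters), D

if __name__ == "__main__":
    # Toy trajectories over three made-up health-condition codes (0, 1, 2).
    toy = [[0, 0, 1, 0, 0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
           [2, 2, 1, 2, 2, 2, 2, 1, 2, 2], [2, 2, 2, 1, 2, 2, 2, 2, 2, 1]]
    labels, D = cluster_trajectories(toy, n_states=2, n_symbols=3, n_clusters=2)
    print(labels)
```

Because the last step needs only the pairwise distance matrix, any distance-based method (for example k-medoids) could replace the average-linkage routine without changing the first two steps. Fitting an HMM to a single short trajectory is statistically fragile, which is why the sketch smooths every estimate with a pseudo-count.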
| Date of Award | 2014 |
|---|---|
| Original language | English |
- time-series analysis
- Markov processes
- health
- data processing
- clustering
- parameter estimation
Clustering longitudinal health data using hidden Markov models
Ghassem Pour, S. (Author). 2014
Western Sydney University thesis: Doctoral thesis