Abstract
Data preparation typically consumes 80-90% of the total time taken to complete a data mining project. It is a crucial step, as the performance of any model depends heavily on the quality of its input data ("Garbage in, Garbage out"). A common problem at the pre-processing stage in a large dataset is missing values. Air pollution and meteorological data typically contain many missing values. Proper imputation should be carried out to avoid bias caused by missing values. The main objective of this study was to propose suitable techniques for preprocessing air pollution and meteorological data in the Sydney region, Australia. The dataset consists of hourly measurements of air pollution and meteorological variables from 1994-01-01 01:00:00 AEST (Australian Eastern Standard Time) to 2018-12-31 24:00:00 AEST recorded at each station in the Sydney region. The preprocessed data can be used in spatiotemporal analysis to assess the impact of climate change on different health aspects. Principal Component Analysis (PCA) was used to analyze the relationships among variables. Highly positively correlated variable groups were [CO, NO, NO2], [O3, temperature, wind speed], [visibility, PM2.5, PM10] and [wind direction, humidity]. Humidity was highly negatively correlated with O3 and temperature. Further, 82% of the total variation was explained by the first five principal components. Six well-established techniques for imputing missing values in time series data were compared: Mean Imputation, Spline Interpolation, Simple Moving Average, Exponentially Weighted Moving Average, Kalman Smoothing on Structural Time Series Models and Kalman Smoothing on Autoregressive Integrated Moving Average (ARIMA) models. The imputation method based on Kalman Smoothing on Structural Time Series models performed better than the other methods for values missing under the Missing Completely at Random (MCAR) mechanism for the data obtained in the Sydney area.
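The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the study's code or data: the hourly series is synthetic (a simple daily cycle), the helper names `mask_mcar`, `impute_mean`, `impute_ewma` and `rmse` are hypothetical, and the exponentially weighted fill is a simplified variant chosen for brevity. The Kalman-smoothing methods are not shown. Values are removed completely at random, each method fills the gaps, and the error is measured only at the removed positions.

```python
import math
import random


def mask_mcar(series, rate, seed=0):
    """Return a copy with values replaced by None completely at random."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else x for x in series]


def impute_mean(series):
    """Fill every gap with the mean of the observed values."""
    observed = [x for x in series if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in series]


def impute_ewma(series, k=4):
    """Fill each gap with a weighted mean of observed values within k steps
    on either side; the weight halves with each extra step of distance.
    Falls back to the overall mean if no neighbour is observed."""
    observed = [x for x in series if x is not None]
    fallback = sum(observed) / len(observed)
    filled = list(series)
    for i, x in enumerate(series):
        if x is not None:
            continue
        num = den = 0.0
        for d in range(1, k + 1):
            w = 0.5 ** d
            for j in (i - d, i + d):
                if 0 <= j < len(series) and series[j] is not None:
                    num += w * series[j]
                    den += w
        filled[i] = num / den if den else fallback
    return filled


def rmse(truth, filled, masked):
    """Root mean squared error over the positions that were masked."""
    idx = [i for i, x in enumerate(masked) if x is None]
    return math.sqrt(sum((truth[i] - filled[i]) ** 2 for i in idx) / len(idx))


# Synthetic hourly series with a 24-hour cycle (assumption, not study data).
truth = [20 + 5 * math.sin(2 * math.pi * t / 24) for t in range(240)]
masked = mask_mcar(truth, rate=0.1)

rmse_mean = rmse(truth, impute_mean(masked), masked)
rmse_ewma = rmse(truth, impute_ewma(masked), masked)
print(f"mean imputation RMSE: {rmse_mean:.3f}")
print(f"EWMA imputation RMSE: {rmse_ewma:.3f}")
```

On a smooth periodic series like this one, a fill based on nearby observations would be expected to give a lower error than the global mean; real station data with long gaps may behave differently.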
Original language | English |
---|---|
Title of host publication | Proceedings: International Conference on Environmental and Medical Statistics, 9-10 January 2020, Postgraduate Institute of Science, University of Peradeniya, Sri Lanka |
Publisher | University of Peradeniya |
Pages | 30-30 |
Number of pages | 1 |
Publication status | Published - 2020 |
Event | International Conference on Environmental and Medical Statistics - Duration: 1 Jan 2020 → … |
Conference
Conference | International Conference on Environmental and Medical Statistics |
---|---|
Period | 1/01/20 → … |
Keywords
- meteorology
- air
- pollution
- data processing
- Sydney (N.S.W.)