Data exploration and pre-processing techniques on air pollution and meteorological data in Sydney region

W. M. L. K. N. Wijesekara, L. Liyanage

Research output: Chapter in Book / Conference Paper > Conference Paper > peer-review

Abstract

Data preparation typically consumes 80-90% of the total time taken to complete a data mining project. It is a crucial step, as the performance of any model depends heavily on the quality of its input data ("Garbage in, Garbage out"). A common problem at the pre-processing stage in a large dataset is missing values. Air pollution and meteorological data typically contain many missing values, and proper imputation should be carried out to avoid the bias they cause. The main objective of this study was to propose suitable data pre-processing techniques for air pollution and meteorological data in the Sydney region, Australia. The dataset consists of hourly measurements of air pollution and meteorological variables from 1994-01-01 01:00:00 AEST (Australian Eastern Standard Time) to 2018-12-31 24:00:00 AEST, recorded at each station in the Sydney region. The pre-processed data can be used in spatiotemporal analysis to assess the impact of climate change on different health aspects. Principal Component Analysis (PCA) was used to analyze the relationships between variables. Highly positively correlated variable groups were [CO, NO, NO2], [O3, temperature, wind speed], [visibility, PM2.5, PM10] and [wind direction, humidity]. Humidity was highly negatively correlated with O3 and temperature. Further, 82% of the total variation was explained by the first five principal components. Six well-established techniques for imputing missing values in time series data were compared: Mean Imputation, Spline Interpolation, Simple Moving Average, Exponentially Weighted Moving Average, Kalman Smoothing on Structural Time Series Models and Kalman Smoothing on Autoregressive Integrated Moving Average (ARIMA) models. The imputation method based on Kalman Smoothing on a Structural Time Series model performed better than the other methods for values missing under the Missing Completely at Random (MCAR) mechanism in the data obtained for the Sydney area.
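The comparison of imputation methods under MCAR described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or data: the series is synthetic, the function names (`make_series`, `mask_mcar`, `impute_mean`, `impute_sma`, `rmse`) are hypothetical, the 10% missing rate and window size are arbitrary assumptions, and only two of the six methods (Mean Imputation and a Simple Moving Average) are shown. The idea is to delete values at random from a complete series, impute them, and score each method by its error on the deleted positions.

```python
import math
import random


def make_series(n=24 * 30, seed=0):
    """Synthetic hourly series with a daily cycle plus noise (illustrative only)."""
    rng = random.Random(seed)
    return [20 + 10 * math.sin(2 * math.pi * t / 24) + rng.gauss(0, 1) for t in range(n)]


def mask_mcar(series, frac=0.1, seed=1):
    """Delete a random fraction of values, independently of time and value (MCAR)."""
    rng = random.Random(seed)
    missing = set(rng.sample(range(len(series)), int(frac * len(series))))
    return [None if i in missing else v for i, v in enumerate(series)], sorted(missing)


def impute_mean(series):
    """Replace each gap with the mean of all observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


def impute_sma(series, k=2):
    """Replace each gap with the mean of observed values within k steps on either side."""
    out = list(series)
    for i, v in enumerate(series):
        if v is None:
            width = k
            while True:
                window = [x for x in series[max(0, i - width): i + width + 1] if x is not None]
                if window:
                    break
                width += 1  # widen the window if every neighbour is also missing
            out[i] = sum(window) / len(window)
    return out


def rmse(truth, imputed, idx):
    """Root mean squared error, evaluated only at the artificially deleted positions."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx))


truth = make_series()
gappy, missing_idx = mask_mcar(truth)
rmse_mean = rmse(truth, impute_mean(gappy), missing_idx)
rmse_sma = rmse(truth, impute_sma(gappy), missing_idx)
print(f"mean imputation RMSE: {rmse_mean:.2f}")
print(f"moving average RMSE:  {rmse_sma:.2f}")
```

On a series with a strong daily cycle, a method that uses neighbouring hours would be expected to beat a global mean; the same harness could in principle be extended to the other methods named in the abstract.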
Original language: English
Title of host publication: Proceedings: International Conference on Environmental and Medical Statistics, 9-10 January 2020, Postgraduate Institute of Science, University of Peradeniya, Sri Lanka
Publisher: University of Peradeniya
Pages: 30-30
Number of pages: 1
Publication status: Published - 2020
Event: International Conference on Environmental and Medical Statistics
Duration: 1 Jan 2020 → …

Conference

Conference: International Conference on Environmental and Medical Statistics
Period: 1/01/20 → …

Keywords

  • meteorology
  • air
  • pollution
  • data processing
  • Sydney (N.S.W.)
