Machine learning estimation of galaxy redshift for large radio surveys

Western Sydney University thesis: Doctoral thesis

Abstract

Obtaining redshifts for the radio galaxies detected by large-scale surveys is difficult. Ideally, the redshift of every known astrophysical source would be measured directly with spectroscopy, but there is not enough telescope time on the planet to gather spectra for the tens of millions of radio sources being discovered by the new radio surveys coming online. Alternatively, photometric template fitting can estimate redshifts with high accuracy. However, this process has difficulties with radio galaxies: current templates struggle to separate emission from star formation from emission from an Active Galactic Nucleus (AGN), which is problematic because a significant portion of radio galaxies host bright AGN. Finally, Machine Learning (ML) approaches have been shown to be effective for redshift estimation, particularly when only a limited number of photometric bands are available (as is the case for all-sky radio surveys). However, current methods tend to focus on optically selected samples, frequently training on the Sloan Digital Sky Survey (SDSS) Galaxy and QSO spectroscopic samples, so the training data are not representative of the galaxies detected by current and upcoming radio surveys.

In this thesis, I first compiled a radio-selected training sample spanning the redshift range 0 < 𝑧 < 7. The sample has observations in the optical g, r, i, and z bands from both the northern hemisphere (taken from the SDSS) and the southern hemisphere (taken from the Dark Energy Survey (DES)), and in the AllWISE/CatWISE W1, W2, W3, and W4 infrared bands. Because two different optical surveys were used, the photometry had to be homogenised by converting the SDSS photometry to DES photometry. Once converted, different ML algorithms were compared, with the simple 𝑘-Nearest Neighbours (kNN) and Random Forest (RF) algorithms benchmarked against the commonly used GPz and ANNz algorithms. For a point estimate, the kNN algorithm performed best. However, unlike the more complicated GPz and ANNz, the kNN algorithm is unable to provide uncertainties or Probability Distribution Functions (PDFs).
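
As an illustration of the kNN point estimate described above, the following is a minimal sketch, not the thesis's implementation. It assumes eight photometric features per source (standing in for g, r, i, z, W1, W2, W3, W4), plain Euclidean distance on standardised features, and the mean redshift of the neighbours as the estimate. The function name `knn_redshift`, the value `k=5`, and the synthetic magnitudes are all hypothetical and chosen only for demonstration.

```python
import numpy as np

def knn_redshift(train_X, train_z, query_X, k=5):
    """Estimate each query redshift as the mean redshift of its k nearest training sources."""
    # Standardise features with training-set statistics so no band dominates the distance.
    mu = train_X.mean(axis=0)
    sigma = train_X.std(axis=0)
    sigma[sigma == 0] = 1.0
    A = (train_X - mu) / sigma
    B = (query_X - mu) / sigma
    # Pairwise Euclidean distances, shape (n_query, n_train).
    d = np.linalg.norm(B[:, None, :] - A[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return train_z[nearest].mean(axis=1)

# Synthetic stand-in data (NOT real photometry): mock magnitudes that
# grow fainter with redshift, plus Gaussian noise.
rng = np.random.default_rng(0)
n = 500
z_true = rng.uniform(0.0, 3.0, n)
slopes = np.linspace(0.5, 1.5, 8)
X = 18.0 + z_true[:, None] * slopes[None, :] + rng.normal(0.0, 0.1, (n, 8))

# Train on the first 400 sources, predict the remaining 100.
z_pred = knn_redshift(X[:400], z_true[:400], X[400:], k=5)
mean_abs_err = float(np.mean(np.abs(z_pred - z_true[400:])))
```

The function returns a single value per source, which reflects the limitation noted above: a plain kNN average gives a point estimate with no accompanying uncertainty or PDF.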

Using the training sample and the algorithms identified above, I then estimated the redshift of all radio sources in the Evolutionary Map of the Universe – Pilot Survey 1 (EMU-PS) that had optical and infrared counterparts, resulting in a catalogue of ∼102,000 radio sources with redshifts, many of which had never been observed at radio wavelengths before. From this sample, I identified 3 sources at 𝑧 > 5, 28 sources at 𝑧 > 4, and 318 sources at 𝑧 > 3 using the kNN algorithm.

Finally, only ∼102,000 of the ∼220,000 EMU-PS sources have “complete” optical and infrared photometry, where “complete” means 𝑔, 𝑟, 𝑖, and 𝑧 optical magnitudes and at least the W1 and W2 infrared magnitudes. I therefore investigated different methods of filling in missing data, comparing simple methods, such as replacing missing photometry with mean, maximum, or minimum values, against more complex methods such as kNN imputation, Multiple Imputation by Chained Equations (MICE), and a Generative Adversarial Network (GAN) based approach. This work showed that for data that are missing at random, the complex algorithms work best at filling in missing data. It must be noted, however, that for many astronomical sources the data are not missing at random, but are often missing due to sensitivity limits.
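
The contrast between simple and more complex imputation can be sketched as follows. This is an illustrative example under stated assumptions, not the thesis's code. It compares only mean imputation with a basic kNN imputation; MICE and the GAN-based approach are not shown. The data are synthetic, values are removed completely at random, and the names `mean_impute`, `knn_impute`, and `rmse` are hypothetical. Missingness caused by sensitivity limits is not modelled here.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = col_means[cols]
    return filled

def knn_impute(X, k=5):
    """Fill NaNs with the column mean over the k nearest complete rows,
    measuring distance on the columns the incomplete row does have."""
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        missing = np.isnan(X[i])
        observed = ~missing
        d = np.linalg.norm(complete[:, observed] - X[i, observed], axis=1)
        nearest = np.argsort(d)[:k]
        filled[i, missing] = complete[nearest][:, missing].mean(axis=0)
    return filled

# Synthetic, strongly correlated mock magnitudes (NOT real photometry).
rng = np.random.default_rng(1)
z = rng.uniform(0.0, 3.0, 300)
truth = 18.0 + z[:, None] * np.linspace(0.5, 1.5, 8) + rng.normal(0.0, 0.1, (300, 8))

# Remove roughly 10% of the values completely at random.
mask = rng.random(truth.shape) < 0.1
X = truth.copy()
X[mask] = np.nan

def rmse(filled):
    """Root-mean-square error on the entries that were removed."""
    return float(np.sqrt(np.mean((filled[mask] - truth[mask]) ** 2)))

rmse_mean = rmse(mean_impute(X))
rmse_knn = rmse(knn_impute(X))
```

Because the mock bands are strongly correlated, the neighbour-based fill can exploit the observed bands of each source, whereas the column mean ignores them. If faint sources were preferentially missing, as with a sensitivity limit, both methods would be biased towards brighter values.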
Date of Award: 2025
Original language: English
Awarding Institution: Western Sydney University
Supervisors: Ray Norris, Rosalind Wang, Laurence Park, Miroslav Filipovic & Ying Guo
