EGRA-Xhosa-14.9k: Annotated Child Reading Audio Dataset

  • Sergio Chevtchenko (Creator)
  • Nikhil Navas (Creator)
  • Rafaella Vale (Creator)
  • Franco Ubaudi (Creator)
  • Sipumelele Lucwaba (Creator)
  • Cally Ardington (Creator)
  • Soheil Afshar (Creator)
  • Mark Antoniou (Creator)
  • Saeed Afshar (Creator)

Dataset

Description

The project involves collecting a child reading dataset for Xhosa, a South African Bantu language. The collected dataset is then processed with the help of native speakers and used to train state-of-the-art machine learning models that assess whether a child has spoken a word correctly. The dataset contains 14,972 recordings averaging 4 seconds each. Each recording captures a child speaking a particular Xhosa word or letter in a classroom setting and is annotated by three independent markers.
Date made available: 21 May 2025
Publisher: Western Sydney University
Date of data production: 1 Feb 2024 - 30 Nov 2024
