EGRA-Xhosa-14.9k: Annotated Child Reading Audio Dataset

  • Sergio Chevtchenko (Creator)
  • Nikhil Navas (Creator)
  • Rafaella Vale (Creator)
  • Franco Ubaudi (Creator)
  • Sipumelele Lucwaba (Creator)
  • Cally Ardington (Creator)
  • Soheil Afshar (Creator)
  • Mark Antoniou (Creator)
  • Saeed Afshar (Creator)

    Dataset

    Description

    This project collected a dataset of children reading aloud in Xhosa, a Bantu language of South Africa. The recordings were processed with the help of native speakers and used to train state-of-the-art machine learning models that assess whether a child has spoken a word correctly. The dataset contains 14,972 recordings with an average length of 4 seconds each. Each recording captures a child speaking a particular Xhosa word or letter in a classroom setting and is annotated by three independent markers.
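As a rough illustration of the figures above, the sketch below estimates the total audio duration from the stated recording count and average length, and shows one possible way to combine three markers' judgements into a single label. The majority-vote rule and the function names are assumptions made for illustration; the description does not say how the three annotations are reconciled.

```python
# Figures taken from the dataset description.
NUM_RECORDINGS = 14_972
AVG_SECONDS = 4  # stated average length per recording


def estimated_total_hours(num_recordings=NUM_RECORDINGS, avg_seconds=AVG_SECONDS):
    """Approximate total audio duration in hours (count x average length)."""
    return num_recordings * avg_seconds / 3600


def majority_label(marks):
    """Combine three correct/incorrect marks by majority vote.

    ASSUMPTION: majority voting is a hypothetical aggregation rule,
    not one documented by the dataset authors.
    """
    if len(marks) != 3:
        raise ValueError("expected exactly three marks")
    # With three boolean marks, two or more True values form a majority.
    return sum(bool(m) for m in marks) >= 2


if __name__ == "__main__":
    print(f"Estimated total audio: {estimated_total_hours():.1f} hours")
    print("Example label:", majority_label([True, False, True]))
```

Under these figures the collection amounts to roughly 16 to 17 hours of audio; the exact total depends on the true per-recording durations.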
    Date made available: 21 May 2025
    Publisher: Western Sydney University
    Date of data production: 1 Feb 2024 - 30 Nov 2024

    UN SDGs

    This dataset contributes to the following UN Sustainable Development Goals (SDGs):

    1. SDG 4 - Quality Education