TY - JOUR
T1 - A review of the machine learning datasets in mammography, their adherence to the FAIR principles and the outlook for the future
AU - Logan, Joe
AU - Kennedy, Paul J.
AU - Catchpoole, Daniel
N1 - Publisher Copyright: © 2023, Springer Nature Limited.
PY - 2023/12
Y1 - 2023/12
N2 - The increasing rates of breast cancer, particularly in emerging economies, have led to interest in scalable deep learning-based solutions that improve the accuracy and cost-effectiveness of mammographic screening. However, such tools require large volumes of high-quality training data, which can be challenging to obtain. This paper combines the experience of an AI startup with an analysis of the FAIR principles of the eight available datasets. It demonstrates that the datasets vary considerably, particularly in their interoperability, as each dataset is skewed towards a particular clinical use-case. Additionally, the mix of digital captures and scanned film compounds the problem of variability, along with differences in licensing terms, ease of access, labelling reliability, and file formats. Improving interoperability through adherence to standards such as the BIRADS criteria for labelling and annotation, and a consistent file format, could markedly improve access and use of larger amounts of standardized data. This, in turn, could be increased further by GAN-based synthetic data generation, paving the way towards better health outcomes for breast cancer.
UR - http://www.scopus.com/inward/record.url?scp=85170229138&partnerID=8YFLogxK
U2 - 10.1038/s41597-023-02430-6
DO - 10.1038/s41597-023-02430-6
M3 - Article
C2 - 37684306
AN - SCOPUS:85170229138
SN - 2052-4463
VL - 10
JO - Scientific Data
JF - Scientific Data
IS - 1
M1 - 595
ER -