TY - JOUR
T1 - Analysis of half a billion datapoints across ten machine-learning algorithms identifies key elements associated with insulin transcription in human pancreatic islet cells
AU - Wong, Wilson K. M.
AU - Thorat, Vinod
AU - Joglekar, Mugdha V.
AU - Dong, Charlotte X.
AU - Lee, Hugo
AU - Chew, Yi Vee
AU - Bhave, Adwait
AU - Hawthorne, Wayne J.
AU - Engin, Feyza
AU - Pant, Aniruddha
AU - Dalgaard, Louise T.
AU - Bapat, Sharda
AU - Hardikar, Anandwardhan A.
N1 - Publisher Copyright: Copyright © 2022 Wong, Thorat, Joglekar, Dong, Lee, Chew, Bhave, Hawthorne, Engin, Pant, Dalgaard, Bapat and Hardikar.
PY - 2022/3/23
Y1 - 2022/3/23
N2 - Machine learning (ML)-workflows enable unprejudiced/robust evaluation of complex datasets. Here, we analyzed over 490,000,000 data points to compare 10 different ML-workflows in a large (N=11,652) training dataset of human pancreatic single-cell (sc-)transcriptomes to identify genes associated with the presence or absence of insulin transcript(s). Prediction accuracy/sensitivity of each ML-workflow was tested in a separate validation dataset (N=2,913). Ensemble ML-workflows, in particular the Random Forest ML-algorithm, delivered high predictive power (AUC=0.83) and sensitivity (0.98), compared to other algorithms. The transcripts identified through these analyses also demonstrated significant correlation with insulin in bulk RNA-seq data from human islets. The top-10 features (including IAPP, ADCYAP1, LDHA and SST) common to the three Ensemble ML-workflows were significantly dysregulated in scRNA-seq datasets from Ire-1αβ-/- mice that demonstrate dedifferentiation of pancreatic β-cells in a model of type 1 diabetes (T1D) and in pancreatic single cells from individuals with type 2 diabetes (T2D). Our findings provide a direct comparison of ML-workflows in big data analyses, identify key elements associated with insulin transcription and provide workflows for future analyses.
AB - Machine learning (ML)-workflows enable unprejudiced/robust evaluation of complex datasets. Here, we analyzed over 490,000,000 data points to compare 10 different ML-workflows in a large (N=11,652) training dataset of human pancreatic single-cell (sc-)transcriptomes to identify genes associated with the presence or absence of insulin transcript(s). Prediction accuracy/sensitivity of each ML-workflow was tested in a separate validation dataset (N=2,913). Ensemble ML-workflows, in particular the Random Forest ML-algorithm, delivered high predictive power (AUC=0.83) and sensitivity (0.98), compared to other algorithms. The transcripts identified through these analyses also demonstrated significant correlation with insulin in bulk RNA-seq data from human islets. The top-10 features (including IAPP, ADCYAP1, LDHA and SST) common to the three Ensemble ML-workflows were significantly dysregulated in scRNA-seq datasets from Ire-1αβ-/- mice that demonstrate dedifferentiation of pancreatic β-cells in a model of type 1 diabetes (T1D) and in pancreatic single cells from individuals with type 2 diabetes (T2D). Our findings provide a direct comparison of ML-workflows in big data analyses, identify key elements associated with insulin transcription and provide workflows for future analyses.
UR - https://hdl.handle.net/1959.7/uws:65361
U2 - 10.3389/fendo.2022.853863
DO - 10.3389/fendo.2022.853863
M3 - Article
SN - 1664-2392
VL - 13
JO - Frontiers in Endocrinology
JF - Frontiers in Endocrinology
M1 - 853863
ER -