TY - JOUR
T1 - MLDeCNV
T2 - a machine learning approach for predicting copy number variation types in plant genomes
AU - Das, Parinita
AU - Saha, Bibek
AU - Sharma, Nitesh Kumar
AU - Iquebal, Mir Asif
AU - Papanicolaou, Alexie
AU - Angadi, U. B.
AU - Kumar, Dinesh
AU - Jaiswal, Sarika
PY - 2026/1/15
Y1 - 2026/1/15
N2 - Copy number variations (CNVs) play a crucial role in shaping genetic diversity and influencing various plant traits. However, existing methods for CNV characterization often face challenges due to the complexity and repetitive nature of plant genomes. Here, we present MLDeCNV (Machine Learning for Decoding Copy Number Variation), a novel open-source machine-learning-based tool optimized for predicting CNV types (deletions, duplications, and non-CNVs) in plant genomes. Built on the XGBoost model, MLDeCNV utilizes 32 selected CNV-related features derived from coverage metrics, nucleotide composition, and sequencing statistics. The model was trained on a high-confidence CNV dataset comprising experimentally validated and computationally predicted CNVs. It exhibits strong performance across various CNV size ranges and training set sizes, achieving an accuracy of 89.27%, with precision, recall, and F1-score all at 89.3%, and an Area Under Curve of 0.9783, underscoring its robustness and reliability. Extensive comparisons with traditional machine learning models reveal that XGBoost outperforms other methods, particularly in handling complex, nonlinear interactions within the CNV data. Additionally, while MLDeCNV does not perform de novo CNV detection, it classifies CNV types in pre-identified genomic regions, making it a post-detection classification tool. This tool, accessible at http://46.202.167.198:5004/, can be integrated downstream of CNV detection pipelines, enhancing the accuracy of CNV type categorization. The precise classification of CNV types from pre-identified genomic regions will streamline downstream genomic analyses, facilitating enhanced understanding and utilization of genetic variation in plants.
AB - Copy number variations (CNVs) play a crucial role in shaping genetic diversity and influencing various plant traits. However, existing methods for CNV characterization often face challenges due to the complexity and repetitive nature of plant genomes. Here, we present MLDeCNV (Machine Learning for Decoding Copy Number Variation), a novel open-source machine-learning-based tool optimized for predicting CNV types (deletions, duplications, and non-CNVs) in plant genomes. Built on the XGBoost model, MLDeCNV utilizes 32 selected CNV-related features derived from coverage metrics, nucleotide composition, and sequencing statistics. The model was trained on a high-confidence CNV dataset comprising experimentally validated and computationally predicted CNVs. It exhibits strong performance across various CNV size ranges and training set sizes, achieving an accuracy of 89.27%, with precision, recall, and F1-score all at 89.3%, and an Area Under Curve of 0.9783, underscoring its robustness and reliability. Extensive comparisons with traditional machine learning models reveal that XGBoost outperforms other methods, particularly in handling complex, nonlinear interactions within the CNV data. Additionally, while MLDeCNV does not perform de novo CNV detection, it classifies CNV types in pre-identified genomic regions, making it a post-detection classification tool. This tool, accessible at http://46.202.167.198:5004/, can be integrated downstream of CNV detection pipelines, enhancing the accuracy of CNV type categorization. The precise classification of CNV types from pre-identified genomic regions will streamline downstream genomic analyses, facilitating enhanced understanding and utilization of genetic variation in plants.
KW - Bio-curation
KW - Copy number variant
KW - Machine learning
KW - Pangenome
KW - Plant genomics
UR - http://www.scopus.com/inward/record.url?scp=105025712411&partnerID=8YFLogxK
UR - https://go.openathens.net/redirector/westernsydney.edu.au?url=https://doi.org/10.1016/j.compbiomed.2025.111394
U2 - 10.1016/j.compbiomed.2025.111394
DO - 10.1016/j.compbiomed.2025.111394
M3 - Article
AN - SCOPUS:105025712411
SN - 0010-4825
VL - 201
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 111394
ER -