MLDeCNV: a machine learning approach for predicting copy number variation types in plant genomes

Parinita Das, Bibek Saha, Nitesh Kumar Sharma, Mir Asif Iquebal, Alexie Papanicolaou, U. B. Angadi, Dinesh Kumar, Sarika Jaiswal

Research output: Contribution to journal › Article › peer-review

Abstract

Copy number variations (CNVs) play a crucial role in shaping genetic diversity and influencing various plant traits. However, existing methods for CNV characterization often face challenges due to the complexity and repetitive nature of plant genomes. Here, we present MLDeCNV (Machine Learning for Decoding Copy Number Variation), a novel open-source, machine-learning-based tool optimized for predicting CNV types (deletions, duplications, and non-CNVs) in plant genomes. Built on the XGBoost model, MLDeCNV uses 32 selected CNV-related features derived from coverage metrics, nucleotide composition, and sequencing statistics. The model was trained on a high-confidence CNV dataset comprising experimentally validated and computationally predicted CNVs. It exhibits strong performance across various CNV size ranges and training set sizes, achieving an accuracy of 89.27%, with precision, recall, and F1-score all at 89.3%, and an area under the curve (AUC) of 0.9783, underscoring its robustness and reliability. Extensive comparisons with traditional machine learning models reveal that XGBoost outperforms other methods, particularly in handling complex, nonlinear interactions within the CNV data. MLDeCNV does not perform de novo CNV detection; it classifies the CNV type of pre-identified genomic regions, making it a post-detection classification tool. The tool, accessible at http://46.202.167.198:5004/, can be integrated downstream of CNV detection pipelines to improve the accuracy of CNV type categorization. Precise classification of CNV types from pre-identified genomic regions will streamline downstream genomic analyses, facilitating enhanced understanding and utilization of genetic variation in plants.
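The performance figures quoted in the abstract (accuracy, precision, recall, F1-score) are standard multi-class classification metrics. As a minimal sketch of how such figures can be computed for the three CNV types, the Python snippet below derives them from true and predicted labels. The abstract does not state which averaging scheme was used, so macro-averaging is an assumption here; the function name `cnv_classification_metrics` and the toy labels are hypothetical and are not taken from MLDeCNV or its dataset.

```python
from collections import Counter

# The three CNV types named in the abstract.
CLASSES = ("deletion", "duplication", "non-CNV")


def cnv_classification_metrics(y_true, y_pred, classes=CLASSES):
    """Return accuracy and macro-averaged precision, recall and F1.

    Hypothetical helper for illustration only. Macro-averaging is an
    assumption; the abstract does not say how the scores were averaged.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled region is required")

    # Count every (true label, predicted label) pair once.
    pairs = Counter(zip(y_true, y_pred))

    precisions, recalls, f1_scores = [], [], []
    for cls in classes:
        tp = pairs[(cls, cls)]
        fp = sum(n for (t, p), n in pairs.items() if p == cls and t != cls)
        fn = sum(n for (t, p), n in pairs.items() if t == cls and p != cls)

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)

        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    correct = sum(pairs[(cls, cls)] for cls in classes)
    n_classes = len(classes)
    return {
        "accuracy": correct / len(y_true),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1_scores) / n_classes,
    }


# Toy labels (not real data): four regions per class, one error in each class.
y_true = ["deletion"] * 4 + ["duplication"] * 4 + ["non-CNV"] * 4
y_pred = [
    "deletion", "deletion", "deletion", "duplication",
    "duplication", "duplication", "duplication", "non-CNV",
    "non-CNV", "non-CNV", "non-CNV", "deletion",
]
metrics = cnv_classification_metrics(y_true, y_pred)
print(metrics)
```

On a class-balanced toy set with evenly spread errors like this one, the four metrics coincide, which is one way near-identical accuracy, precision, recall, and F1 values can arise; whether that explains the reported figures is not something the abstract confirms.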

Original language: English
Article number: 111394
Number of pages: 12
Journal: Computers in Biology and Medicine
Volume: 201
Publication status: Published - 15 Jan 2026

Keywords

  • Bio-curation
  • Copy number variant
  • Machine learning
  • Pangenome
  • Plant genomics
