Prediction of DNA i-motifs via machine learning

Bibo Yang, Dilek Guneri, Haopeng Yu, Elisé P. Wright, Wenqian Chen, Zoë A.E. Waller, Yiliang Ding

    Research output: Contribution to journal › Article › peer-review

    14 Citations (Scopus)
    12 Downloads (Pure)

    Abstract

    i-Motifs (iMs) are secondary structures formed in cytosine-rich DNA sequences and are involved in multiple functions in the genome. Although putative iM-forming sequences are widely distributed in the human genome, the folding status and strength of putative iMs vary dramatically. Much previous research on iMs has focused on assessing the iM folding properties using biophysical experiments. However, there are no dedicated computational tools for predicting the folding status and strength of iM structures. Here, we introduce a machine learning pipeline, iM-Seeker, to predict both folding status and structural stability of DNA iMs. The programme iM-Seeker incorporates a Balanced Random Forest classifier trained on genome-wide iMab antibody-based CUT&Tag sequencing data to predict the folding status and an Extreme Gradient Boosting regressor to estimate the folding strength according to both literature biophysical data and our in-house biophysical experiments. iM-Seeker predicts DNA iM folding status with a classification accuracy of 81% and estimates the folding strength with a coefficient of determination (R²) of 0.642 on the test set. Model interpretation confirms that the nucleotide composition of the C-rich sequence significantly affects iM stability, with a positive correlation with sequences containing cytosine and thymine and a negative correlation with guanine and adenine.
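
As a rough illustration of two quantities the abstract mentions, the sketch below computes the nucleotide composition of a C-rich sequence and the coefficient of determination (R²) between measured and predicted values. It is not the iM-Seeker implementation: the function names, the example sequence, and the choice of per-base fractions as features are assumptions made here for illustration only.

```python
from collections import Counter


def composition(seq):
    """Return the fraction of A, C, G and T in a DNA sequence.

    Hypothetical helper: per-base fractions are an assumed feature
    set, not necessarily the features used by iM-Seeker.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {base: counts.get(base, 0) / len(seq) for base in "ACGT"}


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Illustrative C-rich sequence (an assumption, not data from the paper).
example = "CCCTAACCCTAACCCTAACCC"
features = composition(example)
```

Here `features["C"]` is the cytosine fraction of the example sequence. An R² of 1 means the predictions match the measurements exactly, and 0 means they do no better than always predicting the mean.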

    Original language: English
    Pages (from-to): 2188-2197
    Number of pages: 10
    Journal: Nucleic Acids Research
    Volume: 52
    Issue number: 5
    DOIs
    Publication status: Published - 21 Mar 2024
