k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage

Lauren M. Bragg, Glenn Stone

    Research output: Contribution to journal › Article › peer-review

    4 Citations (Scopus)

    Abstract

    Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make it difficult to achieve the optimal EST clustering. Single-linkage algorithms are particularly vulnerable to chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence-similarity thresholds, which reduce the sensitivity of the clustering algorithm.

    Results: We introduce the concept of k-link clustering for EST data and evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases as the number of shared ESTs (i.e. links) required increases. We observe a base level of Type II error, likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link that modifies the required number of links with respect to the size of the clusters being compared.
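
    The abstract does not spell out the k-link procedure, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that pairwise similarity hits are already available as a symmetric mapping (the hypothetical `similar` argument), and that a direct hit between two ESTs is kept only when it is supported by at least `k` links in total: the hit itself plus one for every EST similar to both. With `k = 1` this reduces to single linkage. The size-dependent extension mentioned in the abstract is not modelled here.

```python
def k_link_clusters(similar, k):
    """Cluster ESTs, keeping a hit only if it has at least k supporting links.

    similar: dict mapping each EST id to the set of EST ids it is similar to
             (assumed symmetric; every EST appears as a key).
    k:       required number of links (assumption: the direct hit counts as
             one link, each EST similar to both endpoints counts as one more).
    """
    # Union-find structure: every EST starts in its own cluster.
    parent = {est: est for est in similar}

    def find(x):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, hits in similar.items():
        for j in hits:
            support = 1 + len(similar[i] & similar[j])
            if support >= k:
                parent[find(i)] = find(j)

    clusters = {}
    for est in similar:
        clusters.setdefault(find(est), []).append(est)
    return sorted(sorted(members) for members in clusters.values())


# Toy data (invented for illustration): two genes with three ESTs each,
# plus a chimera "c" that hits one EST from each gene.
similar = {
    "a1": {"a2", "a3", "c"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2"},
    "b1": {"b2", "b3", "c"},
    "b2": {"b1", "b3"},
    "b3": {"b1", "b2"},
    "c": {"a1", "b1"},
}

single_linkage = k_link_clusters(similar, 1)  # chimera bridges both genes
two_link = k_link_clusters(similar, 2)        # chimera hits lack support
three_link = k_link_clusters(similar, 3)      # too strict for clusters of three
```

    In this toy example, `k = 1` lets the chimera merge both genes into one cluster, `k = 2` keeps the two genes apart and leaves the chimera on its own, and `k = 3` splits everything into singletons because no hit in a three-member cluster can gather enough support. The last case illustrates the trade-off the abstract describes: stricter linkage removes chimera-driven merges but starts to break up small, genuine clusters.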
    Original language: English
    Pages (from-to): 2302-2308
    Number of pages: 7
    Journal: Bioinformatics
    Volume: 25
    Issue number: 18
    Publication status: Published - 2009
