Fast and scalable protein motif sequence clustering based on Hadoop framework

Erfan Farhangi, Nasser Ghadiri, Mahsa Asadi, Mohammad Amin Nikbakht, Sylvain Pitre

Research output: Chapter in Book / Conference Paper › Conference Paper › peer-review

2 Citations (Scopus)

Abstract

In recent years, large amounts of sporadic, unstructured data have accumulated on the web. With the explosive growth of such data, there is a growing need for effective methods, such as clustering, to analyze it and extract information. Biological data forms an important part of the unstructured data on the web, and protein sequence databases are considered a primary source of biological data. Clustering can help organize sequences into homologous and functionally similar groups and can improve the speed of data processing and analysis. Proteins are responsible for most of the activities in cells, and the majority of proteins carry out their function through interaction with other proteins. Hence, prediction of protein interactions is an important research area in the biomedical sciences. Motifs are fragments that occur frequently in protein sequences. A well-known method for characterizing protein interactions is based on motif clustering. Existing motif clustering methods share a limitation on the number of clusters. However, given the vast number of motifs and the need for a large number of clusters, an efficient, scalable, and fast method is necessary to cluster such a large number of sequences. In this paper, we propose a novel approach to cluster a large number of motifs. Our approach includes extracting motifs from protein sequences, feature selection, preprocessing, dimension reduction, and applying BigFCM (a large-scale fuzzy clustering method) on several distributed nodes with the Hadoop framework to take advantage of MapReduce programming. Experimental results show very good performance of our approach.
Original language: English
Title of host publication: Proceedings of the 3rd International Conference on Web Research (ICWR), Tehran, Iran, 19-20 April, 2017
Publisher: IEEE
Pages: 24-31
Number of pages: 8
ISBN (Print): 9781538604205
Publication status: Published - 2017
Event: International Conference on Web Research
Duration: 19 Apr 2017 → …

Conference

Conference: International Conference on Web Research
Period: 19/04/17 → …
