Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. The aim of this thesis was to integrate hydrogen peroxide response data with yeast gene expression data more effectively, in order to obtain a model of the protein response process and to label a set of important genes related to this approach. In biological studies, the Yeast Proteome Database (YPD) serves as a model repository for the organization and presentation of genome-wide functional data. Accordingly, gene expression data for yeast, a unicellular organism with 6500 genes, were selected, and the database was combined with a number of related datasets to create a general dataset. DNA-binding transcriptional regulators interpret the genome's regulatory code by binding to specific sequences to induce or repress gene expression. Gene products, including RNA and protein, are responsible for the development and functioning of all living organisms through a two-step process: transcription and translation. Various transcription factors control gene transcription by binding to promoter regions. Translation is the production of proteins from the mRNA produced in transcription.
In this study, out of the 169 transcription factors known in yeast, we consider the 22 thought to be involved in the response to hydrogen peroxide (H2O2). Each one is partitioned into 3 parts: TF with no H2O2, TF with low H2O2, and TF with high H2O2. Data were collected from multiple yeast datasets: the "Harbison data", which holds the features for the 22 transcription factors; the "Environ dataset", which includes the peroxide time features that help to create the Microarray Data Output (mRNA phase); and the "Microarray dataset", which contains the Protein Response to H2O2 feature. Together these build a general dataset with 110 variables and 6103 observations.
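The integration step described above can be illustrated with a minimal sketch. It assumes the three sources are joined on a shared gene identifier and that only genes present in every source are kept; the abstract does not state the join rule, and all gene identifiers, feature names, and values below are hypothetical.

```python
# Illustrative sketch only: toy tables keyed by a hypothetical gene identifier.
# The real Harbison, Environ, and Microarray datasets are not reproduced here.

CONDITIONS = ("none", "low", "high")  # no, low, and high H2O2


def tf_feature_names(tf_names):
    """Expand each transcription factor into one feature per H2O2 condition."""
    return [f"{tf}_{cond}" for tf in tf_names for cond in CONDITIONS]


def integrate(*tables):
    """Inner-join per-gene tables (gene_id -> feature dict) on gene_id.

    Assumption: genes missing from any source are dropped.
    """
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    merged = {}
    for gene in sorted(shared):
        row = {}
        for table in tables:
            row.update(table[gene])
        merged[gene] = row
    return merged


# Hypothetical binding features (one transcription factor, three conditions).
binding = {
    "G1": {"TF1_none": 0.90, "TF1_low": 0.20, "TF1_high": 0.01},
    "G2": {"TF1_none": 0.70, "TF1_low": 0.60, "TF1_high": 0.50},
    "G3": {"TF1_none": 0.10, "TF1_low": 0.10, "TF1_high": 0.10},
}
# Hypothetical peroxide time-course feature (mRNA phase).
environ = {
    "G1": {"h2o2_time_1": 1.2},
    "G2": {"h2o2_time_1": -0.4},
}
# Hypothetical protein response label.
response = {
    "G1": {"protein_response": "induced"},
    "G2": {"protein_response": "repressed"},
    "G4": {"protein_response": "induced"},
}

general = integrate(binding, environ, response)
for gene, features in general.items():
    print(gene, features)
```

With 22 transcription factors, `tf_feature_names` yields 66 binding features; the remaining variables of the general dataset would come from the other two sources.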
The data processing phase was carried out using Enterprise Miner and consisted of data integration, cleaning, variable transformation, and construction of the data for modeling. A decision tree model was used to identify possible clusters within the data. The analysis was prepared in three ways: gene to mRNA, gene to protein through mRNA, and gene to protein without mRNA. The same analysis was performed with the 5 transcription factors of the Alpha treatment and showed that, in contrast to the H2O2 treatment, there is no correlation between them and the protein response phase. Various studies have attempted to build genetic regulatory networks based on datasets derived from whole-genome methodologies. In addition, several computational methods based on microarray data are currently used to study genome-wide transcriptional regulation. Previous research on yeast gene expression data by Causton, H. et al. explains that a network describes interactions between diverse heterogeneous data leading to protein induction or repression in response to H2O2 treatment. The results of that study, which applied multiple stresses to yeast cells, showed that the partition of the data is still noisy and that the work needs to be evaluated for biological plausibility. The purpose of our study is to demonstrate that a large number of yeast genes should be involved in various biological response changes and to identify the global set of genes induced and repressed by binding DNA sequence, starting with good processing of the data source. This process identifies several important genes at each stage of the 4 ways discussed above.
Research Methodology
The research methodology is organized as follows:
- Understanding the application domain for gene expression data.
- Review of data mining techniques.
- Review of existing clustering strategies for high-dimensional dataset analysis.
- Understanding the yeast data.
- Data preparation: collecting, integrating, and collating the gene expression data in a database.
- Data analysis via decision tree methods using SAS software and data mining methods using Enterprise Miner software.
- Comparing the H2O2 treatment result with the Alpha treatment.
- Comparing the result with the Causton result.
- Conclusion and future direction.
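The decision tree analysis step can be sketched as a search for the single best binary split. This is a minimal illustration that assumes Gini impurity as the splitting criterion; the abstract does not say which criterion Enterprise Miner was configured to use, and the feature names, values, and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature, threshold, weighted_gini) for the best binary split.

    Candidate thresholds are midpoints between consecutive distinct values
    of each numeric feature. Returns None if no split is possible.
    """
    best = None
    n = len(rows)
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Hypothetical binding features for four genes and their response labels.
rows = [
    {"TF1_high": 0.9, "TF2_high": 0.4},
    {"TF1_high": 0.8, "TF2_high": 0.1},
    {"TF1_high": 0.2, "TF2_high": 0.5},
    {"TF1_high": 0.1, "TF2_high": 0.2},
]
labels = ["induced", "induced", "repressed", "repressed"]

feature, threshold, score = best_split(rows, labels)
print(feature, threshold, score)
```

A full tree would apply `best_split` recursively to the left and right partitions until the nodes are pure or a depth limit is reached; the features chosen near the root are the ones that would be labeled as most important.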
Date of Award: 2010
Original language: English
- yeast
- Saccharomyces cerevisiae
- gene expression
- genetics
- hydrogen peroxide
- data
- data mining
- decision tree models
- protein response
- proteomics
- Yeast Proteome Database (YPD)
Analysis of high dimensionality yeast gene expression data using data mining
Aouf, M. (Author). 2010
Western Sydney University thesis: Master's thesis