Using PCA and Factor Analysis for Dimensionality Reduction of Bio-informatics Data

Reading time: 6 minute
...

📝 Original Info

  • Title: Using PCA and Factor Analysis for Dimensionality Reduction of Bio-informatics Data
  • ArXiv ID: 1707.07189
  • Date: 2017-07-25
  • Authors: ** - M. Usman Ali (COMSATS Institute of Information Technology, Sahiwal, Pakistan) - Shahzad Ahmed (COMSATS Institute of Information Technology, Sahiwal, Pakistan) - Javed Ferzund (COMSATS Institute of Information Technology, Sahiwal, Pakistan) - Atif Mehmood (Riphah Institute of Computing and Applied Sciences, Riphah International University, Lahore, Pakistan) - Abbas Rehman (COMSATS Institute of Information Technology, Sahiwal, Pakistan) **

📝 Abstract

Large volume of Genomics data is produced on daily basis due to the advancement in sequencing technology. This data is of no value if it is not properly analysed. Different kinds of analytics are required to extract useful information from this raw data. Classification, Prediction, Clustering and Pattern Extraction are useful techniques of data mining. These techniques require appropriate selection of attributes of data for getting accurate results. However, Bioinformatics data is high dimensional, usually having hundreds of attributes. Such large a number of attributes affect the performance of machine learning algorithms used for classification/prediction. So, dimensionality reduction techniques are required to reduce the number of attributes that can be further used for analysis. In this paper, Principal Component Analysis and Factor Analysis are used for dimensionality reduction of Bioinformatics data. These techniques were applied on Leukaemia data set and the number of attributes was reduced from to.

💡 Deep Analysis

Deep Dive into Using PCA and Factor Analysis for Dimensionality Reduction of Bio-informatics Data.

Large volume of Genomics data is produced on daily basis due to the advancement in sequencing technology. This data is of no value if it is not properly analysed. Different kinds of analytics are required to extract useful information from this raw data. Classification, Prediction, Clustering and Pattern Extraction are useful techniques of data mining. These techniques require appropriate selection of attributes of data for getting accurate results. However, Bioinformatics data is high dimensional, usually having hundreds of attributes. Such large a number of attributes affect the performance of machine learning algorithms used for classification/prediction. So, dimensionality reduction techniques are required to reduce the number of attributes that can be further used for analysis. In this paper, Principal Component Analysis and Factor Analysis are used for dimensionality reduction of Bioinformatics data. These techniques were applied on Leukaemia data set and the number of attributes

📄 Full Content

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 5, 2017 415 | P a g e www.ijacsa.thesai.org Using PCA and Factor Analysis for Dimensionality Reduction of Bio-informatics Data M. Usman Ali Department of Computer Science COMSATS Institute of Information Technology Sahiwal, Pakistan Shahzad Ahmed Department of Computer Science COMSATS Institute of Information Technology Sahiwal, Pakistan Javed Ferzund Department of Computer Science COMSATS Institute of Information Technology Sahiwal, Pakistan Atif Mehmood Riphah Institute of Computing and Applied Sciences (RICAS) Riphah International University Lahore, Pakistan Abbas Rehman Department of Computer Science COMSATS Institute of Information Technology Sahiwal, Pakistan

Abstract—Large volume of Genomics data is produced on daily basis due to the advancement in sequencing technology. This data is of no value if it is not properly analysed. Different kinds of analytics are required to extract useful information from this raw data. Classification, Prediction, Clustering and Pattern Extraction are useful techniques of data mining. These techniques require appropriate selection of attributes of data for getting accurate results. However, Bioinformatics data is high dimensional, usually having hundreds of attributes. Such large a number of attributes affect the performance of machine learning algorithms used for classification/prediction. So, dimensionality reduction techniques are required to reduce the number of attributes that can be further used for analysis. In this paper, Principal Component Analysis and Factor Analysis are used for dimensionality reduction of Bioinformatics data. These techniques were applied on Leukaemia data set and the number of attributes was reduced from to. Keywords—Bioinformatics; Statistics; Microarray; Leukaemia; Feature Selection; Statistical tests; PCA; Factor Analysis; R tool I. INTRODUCTION Bioinformatics experiments are based on Genome, DNA, RNA and Chromosomes. Genomics plays an imperative role in this field. Huge amount of data has been produced in Genomics with a substantial portion produced in Functional Genomics (in the form of protein-protein association), Structural Genomics (in the form of 3-D structure). By using NGS (Next Generation Sequencing) technique, a lot of work has been done in the field of Microarray. This technique is helpful to identify human diseases. NGS is sequencing technique which is used to detect sequences of proteomics for next generation. Genetic Diseases are caused by Genetic disorders that are more complex because of multiple genes interaction. These disorders are breast cancer, colon cancer, skin cancer, autism, progeria, and haemophilia. These are caused by mutation in genes or sometimes inherit from parents. Leukaemia is a cancer of blood cells that occurs due to genome abnormality.
Microarray includes genes expression data that is present at large scale. Bioinformatics data needs to be store in an efficient manner and include a lot of Attributes (Variables). The major problem is that, most of tools crash when large data stored in it. Statistics plays superlative role in the field of Bioinformatics, Mathematics and Computer Science. It is used to extract, organise, analyse and visualise large amount of data. . For this purpose, a lot of tools like Excel, Weka, Matlab and R are available. Many Statistical tests are used for the extraction of relevant information. These are t-test, chi-squared test (χ2-test), ANOVA (Analysis of Variance), Kruskal-Wallis, Friedman and PCA (Principle Component Analysis) tests [1] Statistical t-test is used to check the difference between sample Mean and hypothesised value. ANOVA is parametric (distribution) test used to check the difference of dependent variables with levels of independent variables. Kruskal-Wallis is non-parametric (distribution free) test in which assumptions are not including unlike ANOVA. Friedman test is used when there is one distributed dependent variable and one independent variable with two or many levels. It is used to check the difference in reading and math scores and writing. Chi-squared test is used to compare the observed data with data according to specific hypothesis. PCA test is used to Select/Extract relevant and specific information about variables (attributes) in large dataset. In PCA, correlation is found between principal components and original data. All of these tests applied in Statistical tool. R is open source Statistical tool that is used for loading, extracting, interpretation and analysis of data. It includes many operations such as standard deviation, correlation, mean, variance, median, mode, graphs, plot, charts, and histograms. It automatically loads libraries and packages. It performs Machine Learning Classification and Clustering tasks very quickly and effective

…(Full text truncated)…

📸 Image Gallery

cover.png page_2.webp page_3.webp

Reference

This content is AI-processed based on ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut