arXiv:0804.2848v1 [stat.ML] 17 Apr 2008
Information Preserving Component Analysis:
Data Projections for Flow Cytometry Analysis
Kevin M. Carter¹, Raviv Raich², William G. Finn³, and Alfred O. Hero III¹
¹ Department of EECS, University of Michigan, Ann Arbor, MI 48109
² School of EECS, Oregon State University, Corvallis, OR 97331
³ Department of Pathology, University of Michigan, Ann Arbor, MI 48109
{kmcarter,wgfinn,hero}@umich.edu, raich@eecs.oregonstate.edu
Abstract
Flow cytometry is often used to characterize the malignant cells in leukemia and lymphoma patients,
traced to the level of the individual cell. Typically, flow cytometric data analysis is performed through
a series of 2-dimensional projections onto the axes of the data set. Through the years, clinicians have
determined combinations of different fluorescent markers which generate relatively known expression
patterns for specific subtypes of leukemia and lymphoma, cancers of the hematopoietic system. When
only a series of 2-dimensional projections is viewed, the high-dimensional nature of the data is
rarely exploited. In this paper we present a means of determining a low-dimensional projection which maintains
the high-dimensional relationships (i.e. information) between differing oncological data sets. By using
machine learning techniques, we allow clinicians to visualize data in a low dimension defined by a
linear combination of all of the available markers, rather than just 2 at a time. This provides an aid
in diagnosing similar forms of cancer, as well as a means for variable selection in exploratory flow
cytometric research. We refer to our method as Information Preserving Component Analysis (IPCA).
Index Terms
Flow cytometry, statistical manifold, information geometry, multivariate data analysis,
dimensionality reduction, clustering
Acknowledgement: This work is partially funded by the National Science Foundation, grant No. CCR-0325571.
September 30, 2018
DRAFT
I. INTRODUCTION
Clinical flow cytometric data analysis usually involves interpreting data sets (i.e. cancerous
blood samples) in which several measurements are recorded simultaneously for each cell. These
high-dimensional data sets capture the expression of different fluorescent markers, traced to the
level of the single blood cell. Typically, a diagnosis is reached by analyzing individual
2-dimensional scatter plots of the data, in which each point represents a unique blood cell and
the axes signify the expression of different biomarkers. By viewing a series of these scatter
plots, a clinician is able to determine a diagnosis for the patient, drawing on clinical
experience of the manner in which certain leukemias and lymphomas express certain markers.
Because the standard method of cytometric analysis involves projections onto the axes of the
data (i.e. visualizing the scatter plot of a data set with respect to 2 specified markers), the
multi-dimensional nature of the data is not fully exploited. As such, typical flow cytometric
analysis is comparable to hierarchical clustering methods, in which data is segmented on an
axis-by-axis basis. Marker combinations have been determined through years of clinical
experience, leading to relative confidence in analysis given certain axis projections. These
projection methods, however, carry the underlying assumption that marker combinations are
independent of each other, and do not utilize the dependencies which may exist within the data.
Ideally, clinicians would like to analyze the full-dimensional data, but this cannot be
visualized in more than 3 dimensions.
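The contrast between an axis projection and a projection built from all markers can be sketched
numerically. The example below is only an illustration under stated assumptions: the data are
synthetic, the helper names (axis_projection, random_orthonormal_projection) are hypothetical, and
the random orthonormal matrix is a placeholder for whatever matrix a method such as IPCA would
select; it is not the IPCA solution.

```python
import numpy as np

def axis_projection(n_markers, i, j):
    """Return a 2 x n_markers matrix that keeps only markers i and j.

    This is the matrix form of the usual 2-marker scatter plot.
    """
    A = np.zeros((2, n_markers))
    A[0, i] = 1.0
    A[1, j] = 1.0
    return A

def random_orthonormal_projection(n_markers, d, seed=0):
    """Return a d x n_markers matrix with orthonormal rows.

    Placeholder only: each row mixes all markers, but the matrix is
    random rather than chosen by any information criterion.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n_markers, d)))
    return Q.T

# Synthetic stand-in for one patient: 1000 cells, 7 markers (assumed sizes).
rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 7))

A_axis = axis_projection(7, 2, 5)           # uses 2 of the 7 markers
A_lin = random_orthonormal_projection(7, 2)  # uses all 7 markers

Y_axis = X @ A_axis.T  # identical to X[:, [2, 5]]
Y_lin = X @ A_lin.T    # each coordinate is a linear combination of all markers
```

Both results are 1000 x 2 arrays that can be drawn as scatter plots; the axis projection is
simply the special case in which each row of the matrix has a single nonzero entry.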
There have been previous attempts at using machine learning to aid in flow cytometry
diagnosis. Some have focused on clustering in the high-dimensional space [1], [2], while others
have utilized information geometry to identify differences in sample subsets and between data
sets [3], [4]. These methods do not fully solve the problem because they do not substantially
address visualization for ‘human in the loop’ diagnosis, and the ones that do [5], [6] apply
dimensionality reduction to only a single set at a time. The work most relevant to the present
paper is our recent work [7], in which we utilized information geometry to simultaneously embed
each patient data set into the same low-dimensional space, representing each patient as a single
vector. The current task differs in that we do not wish to reduce each set to a single point for
comparative analysis, but to use dimensionality reduction as a means to individually study the
distributions of each patient. As such, we aim to reduce the dimension of each patient data set
while maintaining the number of
data points (i.e. cells).
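As a rough illustration of this goal, the sketch below projects two synthetic "patient" data sets
with one shared matrix, so that every cell is retained, and then compares a between-set
dissimilarity before and after projection. Several things here are assumptions made for the
example and are not taken from the paper: the data are simulated, the shared matrix is random,
and the dissimilarity is the Kullback-Leibler divergence between Gaussians fitted to each set,
used purely as a convenient stand-in for the information measure the method actually employs.

```python
import numpy as np

def gaussian_kl(X0, X1):
    """KL divergence KL(N0 || N1) between Gaussians fitted to the rows of X0 and X1.

    Stand-in dissimilarity for illustration only.
    """
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = np.atleast_2d(np.cov(X0, rowvar=False))
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))
    k = m0.size
    diff = m1 - m0
    trace_term = np.trace(np.linalg.solve(S1, S0))
    maha_term = diff @ np.linalg.solve(S1, diff)
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (trace_term + maha_term - k + logdet1 - logdet0)

rng = np.random.default_rng(0)
n_markers = 6  # assumed number of markers

# Two synthetic patients with different cell counts and shifted marker means.
patient_a = rng.standard_normal((800, n_markers))
shift = np.array([1.5, 0.0, 0.0, 0.5, 0.0, 0.0])
patient_b = rng.standard_normal((1200, n_markers)) + shift

# One shared 2 x n_markers projection with orthonormal rows (random placeholder).
Q, _ = np.linalg.qr(rng.standard_normal((n_markers, 2)))
A = Q.T

# Each cell is kept; only the number of coordinates per cell changes.
proj_a = patient_a @ A.T
proj_b = patient_b @ A.T

kl_full = gaussian_kl(patient_a, patient_b)
kl_proj = gaussian_kl(proj_a, proj_b)

# A projection can only discard divergence, never create it, so
# kl_proj <= kl_full; a good projection keeps kl_proj close to kl_full.
print(proj_a.shape, proj_b.shape)
print(f"KL in full space: {kl_full:.3f}, after projection: {kl_proj:.3f}")
```

A random matrix will usually lose much of the divergence between the two sets; the point of the
sketch is only that the gap between the two printed values is the kind of quantity an
information-preserving projection would try to keep small.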
With input from the Department of Pathology at the University of Michigan, we have determined
that the ideal form of dimensionality reduction for flow cytometric visualization would have
several properties. The data nee
…(Full text truncated)…