%
% $Description: Author guidelines and sample document in LaTeX 2.09$
%
% $Author: ienne $
% $Date: 1995/09/15 15:20:59 $
% $Revision: 1.4 $
%
\documentclass[times, 10pt,twocolumn]{article}
\usepackage{latex8}
\usepackage{times}
%\documentstyle[times,art10,twocolumn,latex8]{article}
%-------------------------------------------------------------------------
% take the % away on next line to produce the final camera-ready version
%\pagestyle{empty}
%\setcounter{page}{228}
%-------------------------------------------------------------------------
\begin{document}
\title{Pattern Discovery in Data Hypercubes}
\author{
Jia Li\\
Department of Statistics\\
The Pennsylvania State University\\
University Park, PA 16802\\
{\tt jiali@stat.psu.edu}\\
% For a paper whose authors are all at the same institution,
% omit the following lines up until the closing ``}''.
% Additional authors and addresses can be added with ``\and'',
% just like the second author.
\and
Hongyuan Zha\\
Department of
Computer Science and Engineering\\
The Pennsylvania State University\\
University Park, PA 16802\\
{\tt zha@cse.psu.edu}\\
}
\maketitle
%\thispagestyle{empty}
\begin{abstract}
Data hypercubes are used to represent data along multiple dimensions
of interest. Many data mining applications naturally lead to modeling
and inference problems utilizing data hypercubes
as the main data structure. The existing data
mining tools in OLAP for data hypercubes tend to
focus on efficient methods for
computing relatively simple statistics to answer database user queries.
In this extended abstract we point out several research issues in
developing computational and statistical methods that
can discover more insightful and complicated patterns in data hypercubes.
We concentrate on approaches
exploring simultaneous clustering, mixture modeling and outlier detection.
We also mention the potential
use of the methods in several interesting applications
including modeling
and sub-topic extraction for multilingual documents, and
microarray analysis of gene expression.
\end{abstract}
%-------------------------------------------------------------------------
\Section{Introduction}
Many conventional data mining tools focus on data objects modeled as
(multi-dimensional) vectors with either categorical or numerical attributes.
Algorithms and methods abound for classification, clustering and outlier
detection for this type of data object. With the expansion of data mining
to new application fields, we have seen many new types of data that do not
conform to the traditional multi-dimensional vector viewpoint. Examples
include data objects modeled as functions of several variables such as time
series, hyperspectral images, and video sequences. A great amount of effort
has been spent on extending conventional data mining tools to handle
those new types of data. In this report we discuss some research issues in
developing data mining algorithms for data hypercubes, a data structure that
arises in many applications, e.g., enterprise business decision
support, scientific databases containing 3-dimensional observational data in
astronomy, and demographic, transactional and activity data in link and group
detection. Data hypercubes, sometimes also called multi-way arrays,
multi-dimensional arrays or simply data cubes, represent data along multiple
dimensions of interest. For example, clothing sales data for a department
store chain can be represented through brand name, size, color,
location, advertising cost, etc.
We should emphasize that data hypercubes tend to be sparse, especially
in higher dimensions. They
present a simple, logical view of the data, even though the underlying
implementation may use a different, more sophisticated model to compress
and store the sparse data in memory and secondary storage.
In OLAP and data warehousing, database query languages
provide many useful operations, such as drill-down, roll-up, and
slice-and-dice, for exploring data hypercubes. To gain better insights and discover
more interesting and complicated information patterns, we need to develop
more sophisticated data mining tools for data hypercubes. In this
report, we approach
data mining for data hypercubes
through developing methods for
simultaneous clustering, mixture modeling and outlier detection. We
outline some of the research issues and approaches to address them,
and briefly discuss several applications of mining
data hypercubes.
\Section{Simultaneous Clustering and Grouping in Data Hypercubes}
One major research issue is what kind of patterns we are interested in
discovering in data hypercubes.
There are many possibilities; we discuss a few here
based on our own experiences in document modeling and microarray
gene expression analysis.
One type of interesting pattern corresponds to arranging the data hypercube
simultaneously along all the $d$ attributes, and extracting $K$
{\it dense} sub-hypercubes. In each of these $K$ sub-hypercubes, the items
along all the dimensions are closely tied to each other, and can be
considered as forming clusters. We call these patterns
{\it simultaneous} clusters because all the dimensions are simultaneously
involved in determining a cluster.
To this end, we denote a $d$-way data hypercube $H$ by \[
H=[h_{i_1,\dots,i_d}], \quad
i_j=1,\dots,n_j, j=1,\dots,d.\]
We first introduce a cost function that measures the quality of the
simultaneous clustering induced by a $K$-partition. A $K$-partition,
denoted by $\Pi$, consists of a simultaneous permutation of the data
hypercube along all $d$ dimensions, followed by a partition of each
dimension into $K$ groups. We denote the result as
\[\Pi(H) = [H_{s_1,\dots,s_d}], s_j =1,\dots, K,\]
and each $H_{s_1,\dots,s_d}$ is a sub-hypercube. The cost function for
$H$ with partition $\Pi$ is defined as
\[ J(H, \Pi) = \frac{s(H_{1,\dots,1})}{d(H_{1,\dots,1})}
+ \cdots + \frac{s(H_{K,\dots,K})}{d(H_{K,\dots,K})}\]
where $s(\cdot)$ denotes the sum of all the elements of the hypercube
in question, and $d(\cdot)$ denotes its size (number of elements). We
then seek a partition $\Pi^*$ such that
\[ \Pi^* = \mbox{\rm argmax}_{ \Pi} J(H, \Pi). \]
The intuitive idea is to permute the hypercube so that the relative densities
of the $K$ diagonal sub-hypercubes are large. The above problem can also
be considered as a graph-partition problem for $d$-partite hypergraphs.
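For concreteness, the cost $J(H,\Pi)$ can be computed directly from cluster labels; below is a minimal sketch for the $d=2$ (matrix) case, where the label arrays encoding the $K$-partition are our own illustrative names:

```python
def partition_cost(H, row_labels, col_labels, K):
    """Sum over the K diagonal blocks of s(block)/d(block), where
    s(.) is the sum of the block's elements and d(.) its size."""
    J = 0.0
    for k in range(K):
        rows = [i for i, l in enumerate(row_labels) if l == k]
        cols = [j for j, l in enumerate(col_labels) if l == k]
        size = len(rows) * len(cols)              # d(H_{k,k})
        if size == 0:
            continue                              # empty block contributes nothing
        s = sum(H[i][j] for i in rows for j in cols)  # s(H_{k,k})
        J += s / size
    return J
```

A good simultaneous clustering makes the diagonal blocks dense, i.e., yields a large $J$; exhaustive search over label assignments is feasible only for tiny hypercubes, which motivates the relaxations below.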
For the common case of $d=2$, we developed methods based on a
bipartite graph model (Zha et al., 2001).
The partition of the underlying bipartite graph
is constructed by minimizing a {\it normalized}
sum of edge weights between {\it unmatched} pairs of vertices
of the bipartite graph.
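A minimal sketch of this spectral approach for $K=2$, assuming numpy; the degree normalization and the median-split rule used here are common choices in spectral co-clustering, not necessarily the exact ones in Zha et al. (2001):

```python
import numpy as np

def spectral_copartition(W):
    """2-way co-partition of a bipartite graph with edge-weight matrix W
    (rows = one vertex set, columns = the other)."""
    d1 = np.maximum(W.sum(axis=1), 1e-12)   # row-vertex degrees
    d2 = np.maximum(W.sum(axis=0), 1e-12)   # column-vertex degrees
    # normalized weight matrix D1^{-1/2} W D2^{-1/2}
    Wn = W / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, Vt = np.linalg.svd(Wn)
    # the second singular vector pair carries the partition information
    u2 = U[:, 1] / np.sqrt(d1)
    v2 = Vt[1] / np.sqrt(d2)
    row_labels = (u2 >= np.median(u2)).astype(int)
    col_labels = (v2 >= np.median(v2)).astype(int)
    return row_labels, col_labels
```

Only a partial SVD (the two leading singular triplets) is actually needed, so sparse iterative solvers apply when the weight matrix is large.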
We show that an approximate solution to the minimization
problem can be obtained by computing
a partial singular value decomposition (SVD)
of the associated edge weight
matrix of the bipartite graph. There is also a close connection
of the partition algorithm to correspondence analysis used in
multivariate analysis. There are many possibilities to extend our
methods to handle the general $d$-attribute case. One approach is to
conceptually flatten the hypercube by constructing the edge incidence matrix
for the associated $d$-partite hypergraph;
it can be shown that a $K$-partition of the hypercube corresponds to a
checker-board $K$-partition of the incidence matrix. The latter can be
constructed by extending the spectral methods for the $d=2$ case to
incorporate permutation constraints.
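The flattening step can be sketched as follows for a sparse hypercube stored as a dict from index tuples to weights; the vertex numbering and matrix orientation are our own choices for illustration:

```python
def hypergraph_incidence(entries, dims):
    """Flatten a sparse d-way hypercube into the edge incidence matrix of
    the associated d-partite hypergraph: one vertex per (dimension, index)
    pair, one weighted hyperedge per nonzero cell."""
    offsets = [sum(dims[:j]) for j in range(len(dims))]  # vertex-id offsets
    n_vertices = sum(dims)
    incidence = []  # one row per hyperedge, one column per vertex
    for idx, w in sorted(entries.items()):
        row = [0.0] * n_vertices
        for j, i in enumerate(idx):
            row[offsets[j] + i] = w  # the hyperedge touches one vertex per dimension
        incidence.append(row)
    return incidence
```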
Another approach is to start with a 2-dimensional marginal hypercube, find
its $K$-partition and then expand the marginal hypercube one attribute at
a time until we get back the original hypercube.
An interesting variation of the
above type of patterns is to consider patterns
consisting of a collection of possibly
overlapping sub-hypercubes such that, within each sub-hypercube, the values of the
elements are about the same. Each such sub-hypercube can be modeled
by a suitable uniform distribution, and the original hypercube can be
considered as a mixture of those uniform distributions. We will have
more to say about mixture modeling in the next section.
In many applications such as intrusion detection and link/group detection,
it is usually the deviation from normal behaviors that is of greater
interest. We can make use of simultaneous clustering to construct
models for normal behaviors and detect, say, intrusion elements by
measuring their deviation from the cluster model. Another interesting
approach is based on the idea that not all $K$ clusters are
created equal: intrusion elements tend to form smaller clusters than
the normal activities.
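A minimal sketch of the first idea for $d=2$: summarize the cluster model by per-block means and standard deviations, then score each cell by its standardized deviation from its own block (the z-score rule is our illustrative choice):

```python
from collections import defaultdict

def block_outlier_scores(H, row_labels, col_labels):
    """Score each matrix cell by its deviation from the mean of the
    (row-cluster, column-cluster) block it belongs to, in block-std units."""
    sums, sqs, cnt = defaultdict(float), defaultdict(float), defaultdict(int)
    for i, r in enumerate(row_labels):
        for j, c in enumerate(col_labels):
            v = H[i][j]
            sums[(r, c)] += v
            sqs[(r, c)] += v * v
            cnt[(r, c)] += 1
    scores = {}
    for i, r in enumerate(row_labels):
        for j, c in enumerate(col_labels):
            mu = sums[(r, c)] / cnt[(r, c)]
            var = max(sqs[(r, c)] / cnt[(r, c)] - mu * mu, 1e-12)
            scores[(i, j)] = abs(H[i][j] - mu) / var ** 0.5
    return scores
```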
%-------------------------------------------------------------------------
\Section{Mixture Modeling}
Clustering based on the mixture of multivariate normals was considered by
Day (1969), Binder (1978), and Symons (1981). Banfield and Raftery (1993)
advanced the work by considering different structural constraints on the
covariance matrices of the Gaussian components in the mixture and by
investigating mixtures of some non-Gaussian distributions. By employing
mixture models, standard statistical inference methodologies, such as maximum
likelihood estimation and Bayesian approaches, lend themselves naturally to
clustering. Since the theoretical and practical properties of these
statistical methods have been extensively studied, they are very helpful for
understanding the algorithmic aspects of a clustering procedure.
Furthermore, statistical models shed light into the hidden structure in data
and make clustering results easier to interpret. For classification,
also referred to as supervised clustering, mixture models are used in similar
ways. Pioneering work was done by Hastie and Tibshirani (1996).
Conventional mixture models assume data are independent samples from a
certain distribution and treat each sample as a vector. Relations among
variables inside the vector are not explored. In this sense, a clustering
algorithm based on the mixture model treats data as a sequence rather than an
array or, more generally, a hypercube. In many applications we face nowadays,
clustering in this one-dimensional (or sequential) manner is inefficient at
discovering structures in data. For instance, a sample in a microarray data
set often contains expression levels of thousands of genes. Disease
detection or discrimination calls for classification techniques. Treating
each sample as a vector leads to classification in very high dimensional
space. An even more daunting problem is that the number of samples available for
training is usually small, on the order of a few hundred, due to the high
cost of obtaining the microarray data. To achieve good classification
performance, patterns governing the relation among the variables, or genes in
particular for microarray data, need to be extracted and exploited, that is,
data should be viewed as a hypercube. A similar scenario arises
in document analysis.
In order to discover patterns in data represented as hypercubes, we extend
traditional mixture models to multi-way mixture models. In a multi-way
mixture model, a hierarchy of mixture models is assumed. Using the
microarray data as an example, the top level mixture model contains
components intended to summarize patterns across samples. Within each
top-level component, a mixture model is assumed for the variables, i.e., the
gene expression levels. There is great flexibility to construct models under
this framework. Different structural constraints can be further assumed on
the multi-way mixture model to suit different applications. To establish a
multi-way mixture model using training data, various statistical methods can
be applied, e.g., maximum likelihood estimation using the EM algorithm. A
particular form of the multi-way mixture model has been studied by Li and Zha
(2002). The application of the model to gene clustering and disease type
classification using microarray expression data has demonstrated promising
results. Under the umbrella of the multi-way mixture model, pattern
discovery can be optimized globally on a rigorous statistical
foundation, in contrast to sequential clustering along each direction of a
hypercube.
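To make the hierarchy concrete, the following is a sketch of the log-likelihood of one sample under a hypothetical instance of such a model: $K$ top-level components over samples, each carrying its own clustering of the variables into groups modeled here by univariate Gaussians. The Gaussian choice and all names are our illustrative assumptions, not the exact form studied in Li and Zha (2002):

```python
import math

def two_way_mixture_loglik(x, priors, var_clusters, params):
    """log p(x) under a two-level mixture: top-level component k has weight
    priors[k]; under component k, variable j belongs to variable cluster
    var_clusters[k][j], with Gaussian parameters params[k][cluster]."""
    def norm_logpdf(v, mu, sigma):
        return (-0.5 * math.log(2 * math.pi * sigma ** 2)
                - (v - mu) ** 2 / (2 * sigma ** 2))
    comp = []
    for k, pi_k in enumerate(priors):
        ll = math.log(pi_k)
        for j, v in enumerate(x):
            mu, sigma = params[k][var_clusters[k][j]]
            ll += norm_logpdf(v, mu, sigma)
        comp.append(ll)
    m = max(comp)  # log-sum-exp over components, for numerical stability
    return m + math.log(sum(math.exp(c - m) for c in comp))
```

Maximum likelihood fitting of the component parameters and the per-component variable clusterings would then proceed by EM, treating both memberships as hidden.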
\Section{Applications}
In this section, we discuss some applications of mining data hypercubes.
We only list applications with which we already have experience.
\SubSection{Multilingual document modeling}
As pointed out in \cite{tides},
English is only a small fraction of the total textual information available;
a great deal of vital information exists only in, or appears
sooner in, other languages. We are interested in text
analysis methods that seek to compare and highlight
similarities and differences of multilingual documents, providing
the first step towards translingual summarization.
In particular,
we are interested in monitoring newswires from several sources or
countries. The goal is to be able to track events and discover similarities
and differences in event reporting and opinions. One of the tasks is to
correlate several news articles in different languages and to discover
the sentences and paragraphs representing overlapping or shared
sub-topics. We can build a hypercube with each attribute corresponding
to the sentences or paragraphs in a document. The hypercube element
$h_{i_1,\dots,i_d}$ characterizes the similarity among the sentences
$i_1, \dots,i_d$ each from one of the $d$ documents. By applying
simultaneous clustering to this hypercube, we can extract shared topics
in the document set.
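Building such a hypercube can be sketched as follows, assuming sentences are already mapped into a shared term space (e.g., via a bilingual lexicon or translation) and taking the minimum pairwise cosine as the joint similarity; both choices are our illustrative assumptions:

```python
import itertools

def similarity_hypercube(docs):
    """docs: d documents, each a list of sentences represented as
    term-count dicts over a shared vocabulary. Returns a sparse hypercube
    H[(i_1,...,i_d)] = minimum pairwise cosine similarity among the chosen
    sentences, one from each document."""
    def cosine(a, b):
        num = sum(w * b.get(t, 0) for t, w in a.items())
        na = sum(w * w for w in a.values()) ** 0.5
        nb = sum(w * w for w in b.values()) ** 0.5
        return num / (na * nb) if na and nb else 0.0
    H = {}
    for idx in itertools.product(*(range(len(doc)) for doc in docs)):
        sents = [docs[j][i] for j, i in enumerate(idx)]
        H[idx] = min(cosine(a, b)
                     for a, b in itertools.combinations(sents, 2))
    return H
```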
\SubSection{Microarray analysis for gene expression }
DNA microarray is a rapidly developing technology that allows the gene
expression in an organism to be examined on a genomic scale, simultaneously
measuring the transcription levels of tens of thousands of genes. A
particular research topic based on gene-expression microarray data is to
detect disease or to distinguish disease types by applying classification
techniques. A major challenge in classification of microarray data is high
dimensionality. Tibshirani et al.~\cite{Tibshirani:2002} proposed a
method of shrinking centroids of gene expression. Another approach to
treating high dimensionality is to cluster gene profiles. An extensive review
of the topic is provided in~\cite{Ben:2002}. Although many clustering methods have
been developed, they are targeted at discovering patterns of gene profiles.
They ignore the fact that gene expression levels are measured from samples
that are themselves categorized, and that sometimes come with category
identities provided. Hypercube pattern extraction techniques can address this problem
directly. By simultaneous clustering of gene profiles and samples, more
biologically meaningful gene clusters may be discovered. Furthermore,
disease type classification can be improved by optimizing gene clustering in
a manner to best assist classification.
%\bibliographystyle{latex8}
%\bibliography{latex8}
\begin{thebibliography}{99}
\bibitem{Banfield:1993}
Banfield, J. D., and A. E. Raftery (1993), ``Model-Based Gaussian and
Non-Gaussian Clustering,'' {\em Biometrics}, Vol. 49, Sep., pp. 803-821.
\bibitem{Ben:2002}
Ben-Dor, A., B. Chor, R. Karp, and Z. Yakhini (2002), ``Discovering Local
Structure in Gene Expression Data: the Order-Preserving Submatrix Problem,''
{\em Proc. RECOMB}, pp. 49-57.
\bibitem{Binder:1978}
Binder, D. A. (1978), ``Bayesian Cluster Analysis,'' {\em Biometrika},
Vol. 65, pp. 31-38.
\bibitem{Day:1969}
Day, N. E. (1969), ``Estimating the Components of a Mixture of Normal
Distributions,'' {\em Biometrika}, Vol. 56, pp. 463-474.
\bibitem{Hastie:1996}
Hastie, T. and R. Tibshirani (1996) ``Discriminant Analysis by Gaussian
Mixtures,'' {\em JRSSB}, Vol. 58, pp. 155-176.
\bibitem{Li:2002}
Li, J. and H. Zha (2002), ``Simultaneous Classification and Feature Clustering
Using Discriminant Vector Quantization with Applications to Microarray
Data Analysis,'' {\em IEEE Computer Society Bioinformatics Conference (CSB)},
Stanford, CA, IEEE.
\bibitem{Symons:1981}
Symons, M. J. (1981), ``Clustering Criteria and Multivariate Normal
Mixtures,'' {\em Biometrics}, Vol. 37, Mar., pp. 35-43.
\bibitem{Tibshirani:2002}
Tibshirani, R., T. Hastie, B. Narasimhan and G. Chu (2002), ``Diagnosis of
Multiple Cancer Types by Shrunken Centroids of Gene Expression,''
{\em PNAS}, Vol. 99, pp. 6567-6572.
\bibitem{tides}
Translingual Information Detection, Extraction and Summarization.\\
{\small {\tt http://www.darpa.mil/ito/research/tides/}}
\bibitem{zhhd:01}
Zha, H., H. He, C. Ding, M. Gu and H. Simon (2001),
``Bipartite Graph Partitioning and Data Clustering'',
{\em Proceedings of ACM CIKM 2001,
the Tenth International Conference on
Information and Knowledge Management}, pp. 25-32.
\bibitem{zha:02}
Zha, H. (2002),
``Generic Summarization and Keyphrase Extraction Using
Mutual Reinforcement Principle and
Sentence Clustering'', {\em
Proceedings of the 25th Annual International ACM SIGIR
Conference on Research and
Development in Information Retrieval}, pp. 113-120.
\bibitem{zhji:02}
Zha, H. and X. Ji (2002),
``Correlating Multilingual Documents via Bipartite
Graph Modeling'', {\em
Proceedings of the 25th Annual International ACM SIGIR
Conference on Research and
Development in Information Retrieval}, pp. 443-444.
\end{thebibliography}
\end{document}