Efficient Data Reduction with EASE
Written with
Bin Chen, Manoranjan Dash, Peter Haas, and Peter Scheuerman.
-
Version to appear at Knowledge and Data Discovery (KDD'03).

-
Code which was can be used to perform the experiments in the
paper (chi2.cpp, freq1.cpp: ask the author).
Abstract:
A variety of mining and analysis problems --- ranging from association-rule
discovery to contingency table analysis to materialization of certain
approximate datacubes --- involve the extraction of knowledge from a set of
categorical count data. Such data can be viewed as a collection of
``transactions,'' where a transaction is a fixed-length vector of
counts. Classical algorithms for solving count-data problems require one or
more computationally intensive passes over the entire database and can be
prohibitively slow. One effective method for dealing with this
ever-worsening scalability problem is to run the algorithms on a small
sample of the data. We present a new data-reduction algorithm, called
EASE,
for producing such a sample. Like the FAST algorithm introduced by Chen et
al., EASE is especially designed for count data applications. Both EASE and
FAST take a relatively large initial random sample and then
deterministically produce a subsample whose ``distance'' --- appropriately
defined --- from the complete database is minimal. Unlike FAST, which
obtains the final subsample by quasi-greedy descent, EASE uses
epsilon-approximation methods to obtain the final subsample by a process of
repeated halving. Experiments both in the context of association rule mining
and classical χ2 contingency-table analysis show that
EASE outperforms
both FAST and simple random sampling, sometimes dramatically.
Related publications
Copyright © 2002, 2003, Hervé Brönnimann, hbr@poly.edu
Last modified: July 7th, 2003