Efficient
Data-Reduction Methods for On-Line Association Rule Discovery
With
Bin Chen, Manoranjan Dash, Peter Haas, Yi Qiao and Peter Scheuerman.
-
Extended version to appear in MIT Press.

-
Version presented at NSF Workshop on Next-Generation Data Mining (NGDM'02).

Abstract:
Classical data mining algorithms that require one or more
computationally
intensive passes over the entire database can be prohibitively slow. One
effective method for dealing with this ever-worsening scalability
problem is
to run the algorithms on a small sample of the data. We present and
empirically compare two data-reduction algorithms for producing such a
sample; these algorithms, called FAST and EA, are tailored to
``count''
data applications such as association-rule mining. The algorithms are
similar in that both attempt to produce a sample whose ``distance'' ---
appropriately defined --- from the complete database is minimal. They
differ
greatly, however, in the way that they greedily search through the
exponential number of possible samples. FAST, originally presented in
obtain the final sample. Unlike FAST, the EA algorithm provides a
guaranteed level of accuracy. Our experiments show that EA is more
expensive to run than FAST, but yields more accurate results for a
given
sample size. Thus, depending on the specific problem under
consideration,
the user can trade off speed and accuracy by selecting the appropriate
method. We conclude by showing how the EA data-reduction approach can
potentially be adapted to provide data-reduction schemes for streaming
data
systems. The proposed schemes favor recent data while still retaining
partial information about all of the data seen so far.
Related publications
Copyright © 2002, 2003, Hervé Brönnimann, hbr@poly.edu
Last modified: Jul 7th, 2003