Deterministic Sampling beyond EASE: Reducing MultiDimensional Data
Written with
Hüseyin Akcan, Alex Astashyn, and Leon Bukhman.
-
Version submitted to SIAM DM'06
(920K)
Abstract:
A wide range of mining and analysis problems involve
extracting knowledge from count data. A general approach
that scales well with the data is sampling, and a deterministic
algorithm for efficiently reducing such data is EASE.
We present a vastly simplified and more powerful algorithm
(Biased-L2), and show how to use it for deterministically
sampling streaming count data and iceberg cubes. The
algorithm produces samples better than EASE, and more
than an order of magnitude better than random samples
(as measured by RMS errors of the frequency vectors over
items). Coupling with other methods such as Lossy Counting
further improves on the memory footprint and accuracy.
We also present a related deterministic variant of reservoir
sampling (DRS) which produces samples of similar quality
and can help maintain a sample for a data stream, for a table
in a relational database or data cube under insertions,
and has an excellent recovery rate in case the distribution
of item changes.
Related publications
-
Data-Reduction Methods for On-Line Association Rule
Discovery,
with
Bin Chen, Manoranjan Dash, Yi Qiao and Peter Scheuerman.
Efficient Data Reduction with EASE,
with
Bin Chen, Manoranjan Dash, Peter Haas and Peter Scheuerman.
-
Data Reduction (MS Thesis, Polytechnic University),
by Alex Astashyn.
-
Approximation of Iceberg
Cubes (MS Thesis, Polytechnic University),
by Leon Bukhman.
-
Notes on Distributed Deterministic Sampling, with H. Akcan.
Copyright © 2005, Hervé Brönnimann, hbr@poly.edu