Deterministic Sampling beyond EASE: Reducing MultiDimensional Data

Written with Hüseyin Akcan, Alex Astashyn, and Leon Bukhman.
Abstract: A wide range of mining and analysis problems involve extracting knowledge from count data. A general approach that scales well with the data is sampling, and a deterministic algorithm for efficiently reducing such data is EASE. We present a vastly simplified and more powerful algorithm (Biased-L2), and show how to use it for deterministically sampling streaming count data and iceberg cubes. The algorithm produces samples better than EASE, and more than an order of magnitude better than random samples (as measured by RMS errors of the frequency vectors over items). Coupling with other methods such as Lossy Counting further improves on the memory footprint and accuracy. We also present a related deterministic variant of reservoir sampling (DRS) which produces samples of similar quality and can help maintain a sample for a data stream, for a table in a relational database or data cube under insertions, and has an excellent recovery rate in case the distribution of item changes.
Accuracies and RMS of SRS, FAST, EASE and BL2

Related publications



Copyright © 2005, Hervé Brönnimann, hbr@poly.edu