Streaming Self-Scaling Histograms with Stability and Optimality
Guarantees
Written with Joannes Vermorel.
-
Version submitted.

-
The code which was used to perform the experiments in the
paper (To appear).
Abstract:
Histograms provide discrete efficient representation of continuous
data and can be constrained to fit various purposes. We provide
the first offline sub-quadratic algorithm which applies to any
sub-additive histogram cost functions. Sub-additive cost
functions includes in particular all the Lp norms. This
algorithm outputs a histogram with at most 2m buckets and costs
at most (1+&eps;) the optimal cost achievable with m
buckets. Then we show how to adapt our algorithm in a streaming
context but restricted for a certain cost function that bounds the
maximum error for one-dimensional aggregate range queries. An
advantage of our approach is that is only uses O(m)
space and O(log(m)) per item processing time, m
being the number of buckets, independent from N the number of
values processed. We also prove that the histogram outputs of our
streaming algorithm are very stable. While reading the streams,
the output histogram undergoes at most O(m log(N))
boundary displacements. To our knowledge, this concept of
stability has never been studied for streaming histograms.
Stability implies that extra properties associated to the buckets
don't have to be reconstructed too often, strengthening their
quality and accuracy. Our empirical evaluations confirm the good
behavior of our streaming algorithm in practice.
Related Publications:
Copyright © 2004, Hervé Brönnimann, hbr@poly.edu