Streaming Self-Scaling Histograms with Stability and Optimality Guarantees

Written with Joannes Vermorel.
Abstract: Histograms provide discrete efficient representation of continuous data and can be constrained to fit various purposes. We provide the first offline sub-quadratic algorithm which applies to any sub-additive histogram cost functions. Sub-additive cost functions includes in particular all the Lp norms. This algorithm outputs a histogram with at most 2m buckets and costs at most (1+&eps;) the optimal cost achievable with m buckets. Then we show how to adapt our algorithm in a streaming context but restricted for a certain cost function that bounds the maximum error for one-dimensional aggregate range queries. An advantage of our approach is that is only uses O(m) space and O(log(m)) per item processing time, m being the number of buckets, independent from N the number of values processed. We also prove that the histogram outputs of our streaming algorithm are very stable. While reading the streams, the output histogram undergoes at most O(m log(N)) boundary displacements. To our knowledge, this concept of stability has never been studied for streaming histograms. Stability implies that extra properties associated to the buckets don't have to be reconstructed too often, strengthening their quality and accuracy. Our empirical evaluations confirm the good behavior of our streaming algorithm in practice.

Related Publications:



Copyright © 2004, Hervé Brönnimann, hbr@poly.edu