Deterministic Data Reduction Methods for Transactional Data Sets
Thesis for the degree of Master of Science (Computer Science)
by Alexander Astashyn, June 2004
-
Final master thesis

Abstract:
As the volumes of available data grow too fast for the hardware to keep
up, the data mining tasks encounter the scalability problem. An
effective way to address the issue is to run the mining algorithms on
sample of the data. Gearing the data reduction algorithms to generate
high-quality samples for the problem at hand provides better
approximation results compared to simple random sampling (SRS). This
thesis discusses two existing sampling algorithms, FAST and EA, along
with their variations, which are tailored to work with count data sets,
such as used for association rule mining in transactional databases.
Both algorithms attempt to produce a sample which is as close as
possible (in terms of metrics defined), to the original dataset. FAST
accomplishes that by using large SRS to obtain accurate representation
of the data, and then trimming away the outlier transactions while
attempting to maintain the property of the sample to resemble the
original picture. EA repeatedly halves the dataset until the sample of
desired size or quality is obtained. During each halving, the algorithm
splits the dataset in two, while trying to keep their properties
symmetrical. Certain heuristics are introduced to improve the
effectiveness of FAST and several one-pass extensions to the EA
algorithm are developed to alleviate the need for multiple halvings.
The EA algorithm was found to outperform FAST in all aspects; the
biased extension to the EA significantly reduces the memory requirements
and reduces the processing time by a factor of 2-4 compared to the
original. A simplified version of EA that uses common sum-of-squares
metric was found to be as effective as the original with optimum
parameter settings, making it the algorithm of choice among the ones
discussed. The deterministic sampling algorithms were found to be
well-suited for preprocessing large datasets for count data
applications. The thesis concludes with the discussion of possible
modifications of the above algorithms to work with other types of data,
as well as possible applications for applying sampling in streaming
data systems.
Related Publications:
Copyright © 2004, Alex Astashyn, Advisor: Hervé
Broönnimannhbr@poly.edu