Deterministic Data Reduction Methods for Transactional Data Sets

Thesis for the degree of Master of Science (Computer Science)

by Alexander Astashyn, June 2004


Abstract: As the volumes of available data grow too fast for the hardware to keep up, the data mining tasks encounter the scalability problem. An effective way to address the issue is to run the mining algorithms on sample of the data. Gearing the data reduction algorithms to generate high-quality samples for the problem at hand provides better approximation results compared to simple random sampling (SRS). This thesis discusses two existing sampling algorithms, FAST and EA, along with their variations, which are tailored to work with count data sets, such as used for association rule mining in transactional databases. Both algorithms attempt to produce a sample which is as close as possible (in terms of metrics defined), to the original dataset. FAST accomplishes that by using large SRS to obtain accurate representation of the data, and then trimming away the outlier transactions while attempting to maintain the property of the sample to resemble the original picture. EA repeatedly halves the dataset until the sample of desired size or quality is obtained. During each halving, the algorithm splits the dataset in two, while trying to keep their properties symmetrical. Certain heuristics are introduced to improve the effectiveness of FAST and several one-pass extensions to the EA algorithm are developed to alleviate the need for multiple halvings. The EA algorithm was found to outperform FAST in all aspects; the biased extension to the EA significantly reduces the memory requirements and reduces the processing time by a factor of 2-4 compared to the original. A simplified version of EA that uses common sum-of-squares metric was found to be as effective as the original with optimum parameter settings, making it the algorithm of choice among the ones discussed. The deterministic sampling algorithms were found to be well-suited for preprocessing large datasets for count data applications. The thesis concludes with the discussion of possible modifications of the above algorithms to work with other types of data, as well as possible applications for applying sampling in streaming data systems.

Related Publications:



Copyright © 2004, Alex Astashyn, Advisor: Hervé Broönnimannhbr@poly.edu