The smallest set of constraints that explains the data: a randomization approach

Reference:

Jefrey Lijffijt, Panagiotis Papapetrou, Niko Vuokko, and Kai Puolamäki. The smallest set of constraints that explains the data: a randomization approach. Technical Report TKK-ICS-R31, Aalto University School of Science and Technology, Department of Information and Computer Science, May 2010.

Abstract:

Randomization methods can be used to assess statistical significance of data mining results. A randomization method typically consists of a sampler which draws data sets from a null distribution, and a test statistic. If the value of the test statistic on the original data set is more extreme than the test statistic on randomized data sets we can reject the null hypothesis. It is often not immediately clear why the null hypothesis is rejected. For example, the cost of clustering can be significantly lower in the original data than in the randomized data, but usually we would also like to know why the cost is small. We introduce a methodology for finding the smallest possible set of constraints, or patterns, that explains the data. In principle any type of patterns can be used as long as there exists an appropriate randomization method. We show that the problem is, in its general form, NP-hard, but that in a special case an exact solution can be computed fast, and propose a greedy algorithm that solves the problem. The proposed approach is demonstrated on time series data as well as on frequent itemsets in 0-1 matrices, and validated theoretically and experimentally.

Keywords:

hypothesis testing, randomization, significant patterns, time series, frequent patterns.

Suggested BibTeX entry:

@techreport{TKK-ICS-R31,
    author = {Jefrey Lijffijt and Panagiotis Papapetrou and Niko Vuokko and Kai Puolam{\"a}ki},
    institution = {Aalto University School of Science and Technology, Department of Information and Computer Science},
    month = {May},
    number = {TKK-ICS-R31},
    pages = {28},
    title = {The smallest set of constraints that explains the data: a randomization approach},
    type = {Technical Report},
    year = {2010},
}

This work is not available online here.