Theory of Data Mining
When people ask what data mining is, I often reply: "It is hard to define, but if you look at what data miners do, you quickly notice that they work on problems composed of four underlying subproblems: classification, clustering, pattern search and outlier detection." Can these four problems be abstracted further? I think they can, and the closest familiar abstraction is mixture modeling. My brave conjecture is that
The theory of data mining will emerge as the theory to solve a generalized form of constrained mixture modeling (GCMM).
In fact, the data mining community would do itself a great service if we agreed on this upfront and spent the rest of our careers solving incremental versions of GCMM.
It is clear to those who follow the literature that classification, clustering and outlier detection can easily be modeled as instances of mixture modeling. How about association rule mining? I think the correct way to think of association rule mining, or frequent pattern mining, is that the frequent patterns provide constraints on the space of mixture models. This all fits together nicely. For a particular problem we design an appropriate anti-monotonic measure to mine the patterns. The mined patterns are then specified as constraints in the GCMM. QED!
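To make the idea a little more concrete, here is a minimal sketch in Python. Everything in it is my own assumption for illustration and not a definition of GCMM: a two-component Bernoulli mixture over three items with hand-picked (not fitted) parameters, a toy transaction list, and a tolerance-based reading of "constraint" in which the itemset support implied by the mixture must stay close to the support observed in the data. Clustering and classification become an argmax over component posteriors, outlier detection becomes a low-density test, and a mined pattern becomes a check on the model.

```python
import math

# Hypothetical two-component Bernoulli mixture over three items (A, B, C).
# Weights and per-item probabilities are hand-picked for illustration, not fitted.
WEIGHTS = [0.5, 0.5]
THETA = [
    [0.9, 0.9, 0.1],  # component 0: mostly contains A and B
    [0.1, 0.1, 0.9],  # component 1: mostly contains C
]

# Toy transactions as binary vectors over (A, B, C); also an assumption.
TRANSACTIONS = [[1, 1, 0]] * 4 + [[0, 0, 1]] * 5 + [[1, 0, 1]]


def component_likelihood(x, theta):
    """P(x | component) for a binary vector x under independent Bernoullis."""
    p = 1.0
    for xi, t in zip(x, theta):
        p *= t if xi else (1.0 - t)
    return p


def density(x):
    """Mixture density: sum over components of weight * likelihood."""
    return sum(w * component_likelihood(x, th) for w, th in zip(WEIGHTS, THETA))


def cluster(x):
    """Clustering: index of the component with the largest posterior.
    If components carry class labels, the same argmax is a classifier."""
    joint = [w * component_likelihood(x, th) for w, th in zip(WEIGHTS, THETA)]
    return max(range(len(joint)), key=lambda k: joint[k])


def outlier_score(x):
    """Outlier detection: negative log density, larger means more unusual."""
    return -math.log(density(x))


def model_support(itemset):
    """Support of an itemset implied by the mixture:
    sum over components of weight * product of item probabilities."""
    total = 0.0
    for w, th in zip(WEIGHTS, THETA):
        p = w
        for j in itemset:
            p *= th[j]
        total += p
    return total


def empirical_support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if all(t[j] for j in itemset))
    return hits / len(transactions)


def satisfies(itemset, transactions, tol=0.1):
    """One possible constraint: model support within tol of observed support."""
    gap = abs(model_support(itemset) - empirical_support(itemset, transactions))
    return gap <= tol


if __name__ == "__main__":
    print(cluster([1, 1, 0]), cluster([0, 0, 1]))
    print(outlier_score([1, 0, 1]) > outlier_score([1, 1, 0]))
    print(satisfies([0, 1], TRANSACTIONS))  # itemset {A, B}
```

In a fuller treatment the constraints would restrict the parameter search itself (for example, inside EM) rather than being checked afterwards; the after-the-fact check is only the simplest way to show how a mined pattern and a mixture model can talk to each other.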
