Thursday, October 19, 2006

The Curse of Data Mining

Most data miners are familiar with the Curse of Dimensionality - computation time grows exponentially with the number of dimensions. The statistical version is that the amount of data needed to carry out a low-variance inference grows exponentially with the dimension.
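One way to see the statistical version of the curse: consider how long the side of a sub-cube of the unit hypercube must be to cover a fixed 1% of its volume. The sketch below (numbers chosen only for illustration) shows that as the dimension grows, a "local" neighborhood has to stretch across nearly the full range of every variable:

```python
# Side length s of a sub-cube with s**d = fraction of the unit
# hypercube [0, 1]^d. As d grows, s -> 1: any region holding a fixed
# share of the data stops being local, so "nearby" points get scarce.
def side_for_volume(fraction, d):
    """Edge length s satisfying s**d = fraction."""
    return fraction ** (1.0 / d)

for d in (1, 2, 10, 100):
    print(d, round(side_for_volume(0.01, d), 3))
```

In one dimension a 1%-volume neighborhood has width 0.01; in 100 dimensions its edge length exceeds 0.95, which is essentially the whole cube.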

In data mining it is often said that 80% of a data miner's time is spent on data preparation. I think the nature of data mining suggests that it will be very hard to break this barrier. In some sense it is the curse of data mining.

Wednesday, October 11, 2006

Closet Data Miners

There is a tendency, especially among some statisticians and econometricians, to be condescending about "data mining". This is well known, understandable, and can be explained as ignorance. It is the job of data miners to educate them and bring them into the 21st century. Recently, I noticed that even people involved in "machine learning" tend to look down upon data mining. Now that is a surprise - especially given the history of failures associated with AI. Paradoxically, the leaders in different fields tend to be the most open to data mining, partly because they are not shy to admit that all scientific inquiry involves some sort of data mining activity. In some sense all scientists are closet data miners.


When one of our papers was rejected at IEEE ICDM 2006, one of the reviewers came down on "association rule mining" like a ton of bricks. He (or she) made the preposterous claim that association rule mining cannot be used for classification because it is based on frequency, and that we should look at other machine learning based algorithms! Hello, what about classification algorithms which use Bayes rule (e.g., Naive Bayes)? Aren't those based on frequency? Data mining is unashamedly an inductive task - and it has to be based on counting.
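To make the point concrete, here is a minimal Naive Bayes classifier on an invented toy dataset: it is nothing but frequency counting plus Bayes rule (the Laplace smoothing here is a simple per-class variant, chosen for brevity):

```python
from collections import Counter, defaultdict

# Toy training data (invented for illustration): (features, class label).
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

# The entire "model" is a set of counts.
class_counts = Counter(label for _, label in train)
feat_counts = defaultdict(Counter)  # (position, label) -> value counts
for features, label in train:
    for i, value in enumerate(features):
        feat_counts[(i, label)][value] += 1

def predict(features, alpha=1.0):
    """Pick the class maximizing P(c) * prod_i P(x_i | c), from counts."""
    best, best_score = None, -1.0
    for label, n_c in class_counts.items():
        score = n_c / len(train)  # prior: a frequency
        for i, value in enumerate(features):
            counts = feat_counts[(i, label)]
            # Conditional probability: another (smoothed) frequency.
            score *= (counts[value] + alpha) / (n_c + alpha * (len(counts) + 1))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("rainy", "cool")))  # the frequencies favor "yes"
```

Every quantity in the posterior is a count divided by a count - exactly the kind of frequency the reviewer dismissed.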


I like to think of the association rule mining framework as an "algorithm design pattern". In spirit it is similar to dynamic programming - "subproblems of optimal problems must be optimal". In association rule mining, the anti-monotone condition is the key insight - "subsets of interesting sets must be interesting". How we go about using this insight is up to us. Who would have thought that the Maximum A Posteriori (MAP) estimate for a Markov Random Field could be derived using a max-flow min-cut algorithm? Similarly, one of the most fundamental advances in modern statistics is probably Gibbs sampling, and that appeared in an IEEE PAMI paper on image segmentation. It looks like, for some machine learners, the world is bookended by PAC estimates and the kernel trick.
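The design pattern is easiest to see in the classic Apriori algorithm. The sketch below (toy transactions invented for illustration) applies the anti-monotone property directly: a k-itemset is kept as a candidate only if every one of its (k-1)-subsets was already found frequent:

```python
from itertools import combinations

# Toy market-basket data, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"butter", "milk"},
]

def apriori(transactions, min_support=2):
    """Find all itemsets appearing in at least min_support transactions."""
    support = lambda s: sum(1 for t in transactions if s <= t)
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join step: build size-k unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune step (anti-monotone): a candidate survives only if all
        # of its (k-1)-subsets are frequent - "subsets of interesting
        # sets must be interesting".
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

for s in sorted(apriori(transactions), key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
```

On this data the three singletons and three pairs are frequent, but {bread, butter, milk} appears in only one transaction, so it is pruned - the counting does all the work.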