Tuesday, August 14, 2007

Theory of Data Mining

When people ask what data mining is, I often reply by saying "It is hard to define, but if you look at what data miners do, one easily notices that they are working on problems which are a composition of four underlying subproblems: classification, clustering, pattern search and outlier detection." Can these four problems be abstracted further? I think they can, and the closest known and familiar abstraction is mixture modeling. My brave conjecture is that

The theory of data mining will emerge as the theory to solve a generalized form of constrained mixture modeling (GCMM).

In fact, the data mining community will do a great service to itself if people agree on this upfront and we spend the rest of our careers solving incremental versions of GCMM.

It is clear to those who follow the literature that classification, clustering and outlier detection can easily be modeled as instances of mixture modeling. How about association rule mining? I think the correct way of thinking about association rule mining, or frequent pattern mining, is that the frequent patterns provide constraints on the space of mixture models. This all fits together nicely: for a particular problem we design an appropriate anti-monotonic measure to mine the patterns, and the mined patterns are then specified as constraints in the GCMM. QED!
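As a concrete (and much simplified) illustration of the unconstrained base case, here is a minimal EM fit of a two-component one-dimensional Gaussian mixture in plain NumPy. The data, the initialization, and the iteration count are all hypothetical choices of mine for illustration, not part of any GCMM proposal:

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (the unconstrained case)."""
    # Crude initialization: split the data around its quartiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 500)])
pi, mu, var = em_gmm_1d(x)
```

A constrained version would restrict the E- or M-step to models consistent with the mined patterns; that is exactly the part GCMM would have to formalize.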

Sunday, May 20, 2007

Economists and Data Mining

Economists are some of the strongest critics of data mining. They often (perhaps partially correctly) argue that the theory must come before the data; that data-driven theory is "bad science"; that data mining is data fishing and data dredging, etc. So I was quite surprised when I read the following article by Ted Goertzel. As it turns out, most econometric studies which use regression never use it for prediction but only for description. In other words, many of them don't understand the concept of "generalization" and tend to use overfitted models! Unbelievable.

Thursday, March 29, 2007

Shortest Path to Statistical Competence

We get a lot of students in our data mining course who have little or no statistical background. This semester I am using Larry Wasserman's excellent textbook "All of Statistics" as a lead-in to Tan, Steinbach and Kumar's book on data mining. Maybe I am being overambitious, but I think it is pedagogically important that students understand statistical inference before they jump into the field of data mining. In particular, students should be comfortable with (i) the axioms of probability, (ii) the use of Bayes' theorem to reason from effect to cause, and (iii) elementary random variables and probability distributions. In my lecture I gave them a couple of examples of heavy-tailed distributions and their relationship to real-world events like the frequency of hurricanes. One thing I discovered, though I need to flesh it out further, is that it works well to introduce probability distributions in conjunction with outlier detection. One can immediately start talking about tail bounds and, most importantly, bring out the completely non-obvious relationship between the Mahalanobis distance and the chi-square distribution.
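That relationship is easy to check empirically: for Gaussian data in d dimensions, the squared Mahalanobis distance from the mean follows a chi-square distribution with d degrees of freedom. With d = 2 this reduces to an exponential with mean 2, so the 95% quantile is −2 ln 0.05 ≈ 5.99 and points beyond it are natural outlier candidates. A small NumPy sketch (the mean and covariance below are toy values of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])                 # hypothetical 2-D Gaussian
cov = np.array([[2.0, 0.6], [0.6, 1.0]])
x = rng.multivariate_normal(mean, cov, size=20000)

# Squared Mahalanobis distance of each point from the mean.
inv_cov = np.linalg.inv(cov)
diff = x - mean
d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# d2 ~ chi-square with 2 degrees of freedom, i.e. Exp(1/2):
# about 95% of the mass lies below -2*ln(0.05) ~= 5.99.
cutoff = -2 * np.log(0.05)
frac_inside = np.mean(d2 <= cutoff)
```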

After the intro to statistical inference I will jump straight into chapters eight and nine on clustering. Chapter 8 will introduce students to the combinatorial aspects of data mining and chapter 9 will cover the EM framework. Then I will move to classification and finally association rule mining.

Teaching real and serious data mining is a challenge but it is not unexciting.

Tuesday, January 02, 2007

IEEE ICDM Conference in HK

The IEEE ICDM conference in HK was quite interesting, especially the "panel" discussion on the last day. The top 10 algorithms in data mining were "explained" and discussed. All the panelists did a wonderful job of explaining each of the algorithms. I think this was a great effort because it injected common knowledge into the data mining research community. At a minimum, it will help us do a better job of reviewing papers.

Thursday, October 19, 2006

The Curse of Data Mining

Most data miners are familiar with the curse of dimensionality: computation time is an exponential function of the number of dimensions. The statistical version is that the amount of data needed to carry out a low-variance inference is an exponential function of the dimension.
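A closely related symptom, easy to demonstrate, is the concentration of distances: in high dimensions the nearest and farthest points from a query look almost equally far away, so distance-based notions like "nearest neighbor" lose their bite. A quick NumPy illustration (the dimensions and sample size are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=500):
    """(farthest - nearest) / nearest distance from the origin,
    for n points drawn uniformly from the cube [-1, 1]^dim."""
    pts = rng.uniform(-1, 1, size=(n, dim))
    d = np.linalg.norm(pts, axis=1)
    return (d.max() - d.min()) / d.min()

low = relative_contrast(2)       # large contrast: distances are informative
high = relative_contrast(1000)   # tiny contrast: everything is roughly equidistant
```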

In data mining it is often said that 80% of a data miner's time is spent on data preparation. I think the nature of data mining suggests that it will be very hard to break this barrier. In some sense it is the curse of data mining.

Wednesday, October 11, 2006

Closet Data Miners

There is a tendency, especially among some statisticians and econometricians, to be condescending about "data mining". This is well known, understandable, and can be explained as ignorance. It is the job of data miners to educate them and bring them into the 21st century. Recently, I noticed that even people involved in "machine learning" tend to look down upon data mining. Now that is a surprise, especially given the history of failures associated with AI. Paradoxically, the leaders in different fields tend to be the most open to data mining, partly because they are not shy to admit that all scientific inquiry involves some sort of data mining activity. In some sense all scientists are closet data miners.


When one of our papers was rejected at IEEE ICDM 2006, one of the reviewers came down on "association rule mining" like a ton of bricks. He (or she) made the preposterous claim that association rule mining cannot be used for classification because it is based on frequency and that we should look at other machine-learning-based algorithms! Hello, what about classification algorithms which use Bayes' rule (e.g., naive Bayes)? Isn't that based on frequency? Data mining is unashamedly an inductive task, and it has to be based on counting.
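To make the point concrete: a naive Bayes classifier is nothing but frequency counts pushed through Bayes' rule. A toy sketch (the transaction features and labels below are invented purely for illustration):

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled transactions: (feature set, label).
data = [
    ({"overseas", "night"}, "fraud"),
    ({"overseas", "large"}, "fraud"),
    ({"local", "small"}, "ok"),
    ({"local", "night"}, "ok"),
    ({"local", "small"}, "ok"),
]

# Every model parameter is literally a count.
label_counts = Counter(label for _, label in data)
feat_counts = defaultdict(Counter)
for feats, label in data:
    for f in feats:
        feat_counts[label][f] += 1
vocab = {f for feats, _ in data for f in feats}

def predict(feats):
    best, best_score = None, float("-inf")
    for label, n in label_counts.items():
        # log prior + sum of log likelihoods with add-one smoothing --
        # nothing here but counting and Bayes' rule.
        score = math.log(n / len(data))
        for f in feats:
            score += math.log((feat_counts[label][f] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

pred = predict({"overseas", "night"})
```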


I like to think of the association rule mining framework as an "algorithm design pattern", similar in spirit to dynamic programming: "subproblems of optimal problems must be optimal". In association rule mining it is the anti-monotonic condition which is the key insight: "subsets of interesting sets must be interesting". How we go about using this insight is now up to us. Who would have thought that the maximum a posteriori (MAP) estimate for a Markov random field could be derived using a max-flow min-cut algorithm? Similarly, the most fundamental advance in modern statistics is probably Gibbs sampling, and that appeared in an IEEE PAMI paper on image segmentation. It looks like for some machine learners the world is bookended by PAC estimates and the kernel trick.
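The design pattern fits in a few lines. The sketch below mines frequent itemsets Apriori-style, with the anti-monotone property doing the pruning (the toy transactions and support threshold are my own):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent itemset mining via the anti-monotone insight:
    a k-itemset can be frequent only if every (k-1)-subset is."""
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items
                 if support(frozenset([i])) >= min_support}]
    while frequent[-1]:
        prev = frequent[-1]
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
        # Prune step: "subsets of interesting sets must be interesting".
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, len(c) - 1))}
        frequent.append({c for c in candidates if support(c) >= min_support})
    return set().union(*frequent)

transactions = [frozenset(t) for t in
                [{"milk", "bread"}, {"milk", "bread", "eggs"},
                 {"bread", "eggs"}, {"milk", "eggs"}]]
result = apriori(transactions, min_support=2)
```

The same skeleton works for any anti-monotonic interestingness measure, which is exactly why it reads as a design pattern rather than a single algorithm.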

Thursday, June 22, 2006

Industrial Strength Data Mining

In one of the panel sessions at PAKDD 2006, Prabhakar Raghavan, the head of Yahoo Research, revealed that Yahoo collects approx. 10 terabytes of data every day!

That's a lot of data. I wonder how much of it is actually stored in an RDBMS or managed using DBMS principles, and how much is processed by a "data mining engine." One thing is clear though: in the next few years data mining and data management will undergo a mutual inversion. Data mining will become an integral component in large enterprise data settings, and DBMS technology may permeate into smaller desktop applications (e.g., through Cloudscape/Apache Derby).

It is clear that in the next decade or so, Yahoo and Google are likely to spend more research dollars on "data mining" than "database management."

However, a lot of current practical data mining is hit and miss. If I do a "real" industrial project in data mining, then very soon into the project I have to use sheer brute force to get results. I have no conceptual-logical-physical modeling paradigm to help me, and neither do I have a "relational model" to decouple the application from the implementation. All I have are the four soft abstract methods: classification, clustering, association rule mining and outlier detection.

In its most commonly known form, association rule mining (ARM) is completely useless, though I feel this method has the most potential. Already, current classifiers based on ARM are outperforming known classifiers, including decision trees and SVMs. ARM partly suffers from what Jayant Haritsa (IISc, Bangalore) calls the YAAMA problem. YAAMA stands for "Yet Another Association Mining Algorithm." Yama, of course, is also the god of death in Hindu mythology.

Classification is useful, but there are many applications where getting labeled data is not only extremely expensive but impossible. How do you tell whether a particular insurance transaction is fraud or not? Outlier detection, as David Hand (Imperial College) says, is like "diamond mining", so an unstated corollary is that you cannot be lucky every time! Thus one is stuck using clustering and projecting results onto 2D using SVD over and over again. Also, I wish data mining would give me a test to conclude "OK, sorry, I am sure that there are no useful nuggets in your data - throw it away!"
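That cluster-then-project-with-SVD routine is itself only a few lines of NumPy. A sketch, with two synthetic clusters in 50 dimensions invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical high-dimensional data: two Gaussian clusters in 50-D,
# one shifted by 5 units along the first coordinate.
a = rng.normal(0, 1, size=(100, 50)) + 5 * np.eye(50)[0]
b = rng.normal(0, 1, size=(100, 50))
x = np.vstack([a, b])

# Project onto the top two right singular vectors (PCA via SVD).
centered = x - x.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[:2].T   # n x 2 coordinates, ready for a scatter plot
```

In this toy case the first singular direction picks up the between-cluster shift, so the two groups separate cleanly in the 2D plot; on real industrial data one does this "over and over again" precisely because there is no such guarantee.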

I think there are three things that Data Mining immediately needs if it wants to rise up and meet the data explosion challenge.

1. A "relational-like model" for data mining. However, it has to be more sophisticated than current proposals (e.g., adding a clustering operator to SQL).

2. A systematic way of incorporating the semantics of the domain into the mining process - or is this impossible?

3. Engineers who can show us how to build real, successful data mining systems - somebody equivalent to Jim Gray.

The challenge in data mining is NOT to design another efficient algorithm to estimate statistics in a large data set, but to leverage the availability of a large data set to search for an accurate model of the underlying process which generated the data. ARM is probably an example of such a model, but since then most of the data mining research that has appeared in database conferences seems to have completely missed the point.