Data Mining Research: March 2007

We get a lot of students in our data mining course who have little or no statistical background. This semester I am using Larry Wasserman's excellent statistics textbook "All of Statistics" as a lead-in to Tan, Steinbach and Kumar's book on Data Mining. Maybe I am being overambitious, but I think pedagogically it is important that students understand statistical inference before they jump into the field of data mining. In particular, students should be comfortable with (i) the axioms of probability, (ii) the use of Bayes' theorem to reason from effect to cause, and (iii) elementary random variables and probability distributions. In my lecture I gave them a couple of examples of heavy-tailed distributions and their relationship to real-world events like the frequency of hurricanes. One thing I discovered, though I need to flesh it out further, is that it works well to introduce probability distributions in conjunction with outlier detection. One can immediately start talking about tail bounds and, most importantly, bring out the completely non-obvious relationship between the Mahalanobis distance and the Chi-Square distribution.
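That last relationship can be checked with a small simulation. What follows is only a minimal sketch, not material from the course: it assumes two-dimensional Gaussian data with a known mean and covariance, in which case the squared Mahalanobis distance follows a Chi-Square distribution with 2 degrees of freedom, whose quantile has the closed form -2 ln(1 - p). The function names and the particular parameter values (mean, standard deviations, correlation) are made up for illustration.

```python
import math
import random


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of a 2-D point x, given a 2x2 covariance."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def chi2_quantile_2dof(p):
    """Quantile of the Chi-Square distribution with 2 degrees of freedom.

    Its CDF is 1 - exp(-x / 2), so the inverse is -2 * ln(1 - p).
    """
    return -2.0 * math.log(1.0 - p)


def sample_gaussian_2d(rng, mean, sx, sy, rho):
    """Draw one point from a 2-D Gaussian with correlation rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x = mean[0] + sx * z1
    y = mean[1] + sy * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return (x, y)


def tail_fraction(n=20000, p=0.95, seed=0):
    """Fraction of Gaussian samples flagged as outliers at level p."""
    rng = random.Random(seed)
    # Illustrative (hypothetical) parameters, not taken from any data set.
    mean = (1.0, -2.0)
    sx, sy, rho = 2.0, 0.5, 0.8
    cov = ((sx * sx, rho * sx * sy), (rho * sx * sy, sy * sy))
    threshold = chi2_quantile_2dof(p)
    flagged = 0
    for _ in range(n):
        point = sample_gaussian_2d(rng, mean, sx, sy, rho)
        if mahalanobis_sq(point, mean, cov) > threshold:
            flagged += 1
    return flagged / n


if __name__ == "__main__":
    print("threshold:", chi2_quantile_2dof(0.95))
    print("fraction flagged:", tail_fraction())
```

If the relationship holds, the flagged fraction should come out close to 1 - p (about 0.05 here), even though the data are strongly correlated and the two coordinates have very different scales. In higher dimensions, or when the mean and covariance are estimated from the data, the closed-form quantile no longer applies and one would need a statistics library for the threshold.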

After the intro to statistical inference, I will jump straight into chapters eight and nine on clustering. Chapter 8 will introduce students to the combinatorial aspects of data mining, and chapter 9 will cover the EM framework. Then I will move to classification and finally association rule mining.

Teaching real and serious data mining is a challenge, but it is not unexciting.

Thursday, March 29, 2007

Shortest Path to Statistical Competence
