Industrial Strength Data Mining
In one of the panel sessions at PAKDD 2006, Prabhakar Raghavan, the head of Yahoo Research, revealed that Yahoo collects approximately 10 terabytes of data every day!
That's a lot of data. I wonder how much of it is actually stored in an RDBMS or managed using DBMS principles, and how much is processed by a "data mining engine." One thing is clear though - in the next few years data mining and data management will undergo a mutual inversion: data mining will become an integral component of large enterprise data settings, while DBMS technology may permeate into smaller desktop applications (e.g. through Cloudscape/Apache Derby).
It seems likely that over the next decade or so Yahoo and Google will spend more research dollars on "data mining" than on "database management."
However, a lot of current practical data mining is hit and miss. Whenever I do a "real" industrial data mining project, I very soon have to fall back on sheer brute force to get results. I have no conceptual-logical-physical modeling paradigm to help me, nor do I have a "relational model" to decouple the application from the implementation. All I have are four soft, abstract methods: classification, clustering, association rule mining and outlier detection.
In its most commonly known form, association rule mining (ARM) is completely useless, though I feel this method has the most potential. Classifiers based on ARM are already outperforming established classifiers, including decision trees and SVMs. ARM partly suffers from what Jayant Haritsa (IISc, Bangalore) calls the YAAMA problem. YAAMA stands for "Yet Another Association Mining Algorithm." YAMA, of course, is also the God of Death in Hindu mythology.
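To make the ARM discussion a bit more concrete, here is a minimal sketch in Python. It assumes that the "most commonly known form" means the classic support/confidence framework; the basket data, the function names and the thresholds are all made up for illustration, and the brute-force enumeration over item pairs is not how a real miner would be built.

```python
from itertools import combinations

# Toy market-basket data (hypothetical, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(transactions, min_support, min_confidence):
    """Brute-force search for rules with one item on each side."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # the pair is not frequent enough to yield a rule
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


rules = mine_rules(transactions, min_support=0.5, min_confidence=0.8)
for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}  support={s:.2f}  confidence={c:.2f}")
```

With these thresholds only one rule (beer -> diapers) survives. Lowering min_confidence to 0.7 lets eight rules through on just five baskets, which hints at how quickly the number of reported rules can grow.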
Classification is useful, but there are many applications where getting labeled data is not just extremely expensive but impossible. How do you tell whether a particular insurance transaction is fraudulent or not? Outlier detection, as David Hand (Imperial College) says, is like "diamond mining" - so an unstated corollary is that you cannot be lucky every time! Thus one is stuck using clustering and projecting the results onto 2D using SVD, over and over again. I also wish data mining would give me a test to conclude "OK, sorry, I am sure that there are no useful nuggets in your data - throw it away!"
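The cluster-then-project loop described above might look roughly like the following sketch. It assumes NumPy is available and uses synthetic data (two Gaussian blobs) in place of real records; the helper names kmeans and project_2d, the choice of plain Lloyd-style k-means, and all parameter values are illustrative assumptions, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unlabeled data: 200 records, 10 numeric features,
# drawn from two shifted Gaussian blobs.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 10)),
    rng.normal(loc=4.0, scale=1.0, size=(100, 10)),
])


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd-style k-means; returns labels and centroids."""
    local_rng = np.random.default_rng(seed)
    centroids = X[local_rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids


def project_2d(X):
    """Project centered data onto its top two right singular vectors."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T, S


labels, _ = kmeans(X, k=2)
coords, S = project_2d(X)

# Summarize where each cluster lands in the 2D view.
for j in np.unique(labels):
    centre = coords[labels == j].mean(axis=0)
    print(f"cluster {j}: size={np.sum(labels == j)}, 2D centre={centre.round(2)}")
```

The 2D coordinates in coords are what one would hand to a scatter plot, colored by labels. Nothing in this loop says whether the clusters mean anything, which is exactly the gap the wished-for "no nuggets here" test would fill.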
I think there are three things that data mining needs right away if it wants to rise to the data explosion challenge.
1. A "relational-like model" for data mining. However, it has to be more sophisticated than current proposals (e.g., adding a clustering operator to SQL).
2. A systematic way of incorporating the semantics of the domain into the mining process - or is this impossible?
3. Engineers who can show us how to build real, successful data mining systems - somebody equivalent to Jim Gray.
The challenge in data mining is NOT to design yet another efficient algorithm for estimating statistics in a large data set, but to leverage the availability of a large data set to search for an accurate model of the underlying process that generated the data. ARM is probably an example of such a model, but since it was introduced most of the data mining research appearing in database conferences seems to have completely missed the point.
