Tuesday, March 11, 2008

Distributed Data Mining: Current Pleasures and Emerging Applications

Hillol Kargupta
University of Maryland, Baltimore County

11am, March 12, 2008
A.V. Williams 2120

Abstract

Distributed Data Mining (DDM) deals with the problem of analyzing data by paying careful attention to the distributed resources of data, computing, communication, and human factors in order to use them in a near optimal fashion. DDM algorithms offer communication efficient, scalable, and possibly privacy-preserving performance in large distributed multi-party environments. This talk will start by offering a perspective of the research in the field of distributed data mining over the last decade. It will identify some of the important application areas that have emerged and successfully entered the commercial domain. Next it will discuss a few algorithmic characteristics often needed for scalable performance in the emerging DDM applications. It will particularly focus on local algorithms for distributed data analysis. The talk will consider a few algorithmic approaches and discuss how scalable local DDM algorithms can be designed using simple primitives.

About the Speaker

Hillol Kargupta is an Associate Professor in the Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County. He received the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 1996. He is also a co-founder of Agnik LLC, a data analytics company for distributed, mobile, and embedded environments. His research interests include mobile and distributed data mining. Dr. Kargupta won a US National Science Foundation CAREER award in 2001 for his research on ubiquitous and distributed data mining. He along with his coauthors received the best paper award at the 2003 IEEE International Conference on Data Mining for a paper on privacy-preserving data mining. His papers were also selected for Best of 2008 SIAM Data Mining Conference (SDM'08) and Most Interesting Paper of WebKDD'06. He won the 2000 TRW Foundation Award, 1997 Los Alamos Award for Outstanding Technical Achievement, and 1996 SIAM annual best student paper award. His research has been funded by the US National Science Foundation, US Air Force, Department of Homeland Security, NASA, and various other organizations. He has published more than 80 peer-reviewed articles in journals, conferences, and books. He has co-edited several books. He is an associate editor of the IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Systems, Man, and Cybernetics, Part B and Statistical Analysis and Data Mining Journal. He is/was the General Chair of 2007 NSF Next Generation Data Mining Symposium, Program Co-Chair of 2005 SIAM Data Mining Conference, Program vice-chair of 2005 PKDD Conference, Program vice-chair of 2008 & 2005 IEEE International Data Mining Conference, Program Vice Chair for 2008 & 2005 Euro-PAR Conference, Associate General Chair of the 2003 ACM SIGKDD Conference, and chair of the 2002 NSF Next Generation Data Mining Workshop among others. 
He regularly serves on the organizing and program committees of many data mining conferences. More information about him can be found at http://www.cs.umbc.edu/~hillol.


2 comments:

chang said...

Data mining in a distributed sense is an emerging topic. It can be challenging because of limits on bandwidth, timeliness, or the availability of data to all nodes. The speaker addressed those challenges and talked about use cases, a product he built, and algorithms. The first half of the talk (products and applications) and the second half (algorithms) did not seem very connected. They were two very good talks on their own, though.

I particularly liked the second half, where a random walk algorithm was introduced for distributed sampling. The idea of constructing a subgraph within each node is creative. It is also similar to the distributed K-means algorithm that Google uses in its MapReduce course.
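To make the random-walk idea concrete, here is a minimal sketch of one standard way to sample peers roughly uniformly from a network: a Metropolis-Hastings random walk. This is not necessarily the algorithm presented in the talk, and it does not model the per-node subgraph construction mentioned above. The toy overlay, the function names, and the parameters are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical toy overlay network: peer id -> list of neighbouring peer ids.
TOY_OVERLAY = {
    0: [1, 2, 3, 4],
    1: [0, 2],
    2: [0, 1],
    3: [0],
    4: [0],
}


def mh_random_walk_sample(neighbors, start, steps, rng):
    """Return the peer reached after `steps` Metropolis-Hastings moves.

    A plain random walk visits high-degree peers more often. Accepting a
    proposed move with probability min(1, deg(current) / deg(candidate))
    corrects that bias, so on a connected graph the walk converges
    towards a uniform distribution over peers.
    """
    current = start
    for _ in range(steps):
        candidate = rng.choice(neighbors[current])
        accept = min(1.0, len(neighbors[current]) / len(neighbors[candidate]))
        if rng.random() < accept:
            current = candidate  # otherwise the walk stays where it is
    return current


def sample_frequencies(neighbors, start, steps, draws, rng):
    """Run `draws` independent walks and report how often each peer is hit."""
    counts = Counter(
        mh_random_walk_sample(neighbors, start, steps, rng) for _ in range(draws)
    )
    return {peer: counts[peer] / draws for peer in neighbors}
```

Each step only needs the current peer's neighbour list and the degree of one neighbour, so no peer ever needs a global view of the network. With enough steps per walk, `sample_frequencies(TOY_OVERLAY, 0, 50, 4000, random.Random(0))` should report a frequency close to 0.2 for each of the five peers.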

My thought related to the first part is that it would be interesting to see how cell phones could be used as a distributed computing framework. With their growing capacity, I certainly believe they can compute well. From the talk, we can see that we already have distributed sensors. What would be more interesting, though, is how to combine the two functionalities in the cell phone.

Unknown said...

I saw a talk by Dr. Kargupta last fall on this same problem, but it seems that they've made some real progress in thinking about how to deal with the issues that arise in highly distributed environments.

The formal thinking about things like the decomposability of the data representation and the creation of a virtual topology is really interesting. I especially like the idea of majority vote computation. I've heard that ensemble-based methods perform better than could really be expected for many tasks.

Although we've been doing computation in relatively closely coupled clusters, existing peer-to-peer networks are becoming a more fertile ground for mining opportunities, and will probably always contain more data than such clusters.

A feature of cloud computing is that you don't move the data to the computation, but rather, the computation to the data. But a lot of data is still moved in practice. For example, Google still has to do a crawl over the internet and bring back a copy of what they want to mine. But if they could effectively compute something as part of the crawl, things could really change. Mining that is performed truly 'in-place', meaning without moving the data at all, probably warrants more thinking.

I think algorithms that can sit and process a large stream as it goes by, as featured in Samir Khuller's recent class CMSC 498k, may be more useful for 'in-place' mining because of the relatively few resources they require, which may make it easier to have them run on lots of machines you don't own, such as cell phones.
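As an illustration of how little such an algorithm needs, here is a minimal sketch of reservoir sampling, a classic one-pass streaming technique. It is only an example of the genre, not necessarily something covered in the class or the talk, and the function name and parameters are assumptions.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are held in memory at any time, and each element is
    examined once as it goes by.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(iter(range(1_000_000)), 10, random.Random())` returns ten of the million values while never storing more than ten at once, which is the kind of footprint that could plausibly fit on a phone.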
