Tuesday, March 11, 2008

Distributed Data Mining: Current Pleasures and Emerging Applications

Hillol Kargupta
University of Maryland, Baltimore County

11am, March 12, 2008
A.V. Williams 2120

Abstract

Distributed Data Mining (DDM) deals with the problem of analyzing data by paying careful attention to the distributed resources of data, computing, communication, and human factors in order to use them in a near-optimal fashion. DDM algorithms offer communication-efficient, scalable, and possibly privacy-preserving performance in large distributed multi-party environments. This talk will start by offering a perspective on the research in the field of distributed data mining over the last decade. It will identify some of the important application areas that have emerged and successfully entered the commercial domain. Next, it will discuss a few algorithmic characteristics often needed for scalable performance in the emerging DDM applications, focusing in particular on local algorithms for distributed data analysis. The talk will consider a few algorithmic approaches and discuss how scalable local DDM algorithms can be designed using simple primitives.

About the Speaker

Hillol Kargupta is an Associate Professor in the Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County. He received the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 1996. He is also a co-founder of Agnik LLC, a data analytics company for distributed, mobile, and embedded environments. His research interests include mobile and distributed data mining. Dr. Kargupta won a US National Science Foundation CAREER award in 2001 for his research on ubiquitous and distributed data mining. He and his coauthors received the best paper award at the 2003 IEEE International Conference on Data Mining for a paper on privacy-preserving data mining. His papers were also selected for Best of 2008 SIAM Data Mining Conference (SDM'08) and Most Interesting Paper of WebKDD'06. He won the 2000 TRW Foundation Award, the 1997 Los Alamos Award for Outstanding Technical Achievement, and the 1996 SIAM annual best student paper award. His research has been funded by the US National Science Foundation, US Air Force, Department of Homeland Security, NASA, and various other organizations. He has published more than 80 peer-reviewed articles in journals, conferences, and books, and has co-edited several books. He is an associate editor of the IEEE Transactions on Knowledge and Data Engineering; the IEEE Transactions on Systems, Man, and Cybernetics, Part B; and the Statistical Analysis and Data Mining Journal. He is or has been the General Chair of the 2007 NSF Next Generation Data Mining Symposium, Program Co-Chair of the 2005 SIAM Data Mining Conference, Program Vice-Chair of the 2005 PKDD Conference, Program Vice-Chair of the 2008 and 2005 IEEE International Data Mining Conference, Program Vice-Chair of the 2008 and 2005 Euro-PAR Conference, Associate General Chair of the 2003 ACM SIGKDD Conference, and Chair of the 2002 NSF Next Generation Data Mining Workshop, among others. He regularly serves on the organizing and program committees of many data mining conferences. More information about him can be found at http://www.cs.umbc.edu/~hillol.


Monday, March 3, 2008

Storing and Processing Multi-dimensional Scientific Datasets

Alan Sussman
University of Maryland

11am, March 5, 2008
A.V. Williams 3174

Abstract

Large datasets are playing an increasingly important role in many areas of scientific research. Such datasets can be obtained from various sources, including sensors on scientific instruments and simulations of physical phenomena. The datasets often consist of a very large number of records, and have an underlying multi-dimensional attribute space. Because of such characteristics, traditional relational database techniques are not adequate to efficiently support ad hoc queries into the data. We have therefore developed algorithms and designed systems to efficiently store and process these datasets in both tightly coupled parallel computer systems and more loosely coupled distributed computing environments.

I will mainly discuss the design of two systems, the Active Data Repository (ADR) and DataCutter, for managing large datasets in parallel and distributed environments, respectively. Each of these systems provides both a programming model and a runtime framework for implementing high-performance data servers. These data servers provide efficient ad hoc query capabilities into very large multi-dimensional datasets. ADR is an object-oriented framework that can be customized to provide optimized storage and processing of disk-based datasets on a parallel machine or network of workstations. DataCutter is a component-based programming model and runtime system for building data-intensive applications that can execute efficiently in a Grid distributed computing environment. I will present optimization techniques that enable both systems to achieve high performance in a wide range of application areas, along with performance results from real applications on various computing platforms to support that claim.

About the Speaker

Alan Sussman is an Associate Professor in the Computer Science Department and Institute for Advanced Computer Studies at the University of Maryland, College Park. Working with students and other researchers at Maryland and other institutions, he has published over 80 conference and journal papers on various topics related to software tools for high-performance parallel and distributed (Grid) computing, and has contributed chapters to six books. Software tools he has built have been widely distributed and used in many computational science applications, in areas such as earth science, space science, and medical informatics. He received his Ph.D. in computer science from Carnegie Mellon University and his B.S.E. in Electrical Engineering and Computer Science from Princeton University.
