Friday, October 10, 2008

Problem Set 4: PageRank

Got comments, questions, gripes, issues, etc. on Problem Set 4: PageRank in my cloud computing course? Post here!

Sunday, October 5, 2008

Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective

Tamer Elsayed
University of Maryland

9:30am, October 6, 2008
Hornbake 2119

Abstract
In this talk, I will discuss the problem of computing pairwise document similarity in large text collections. This general problem appears in many different applications such as text clustering, ad-hoc retrieval, and co-reference resolution. A simple MapReduce solution to the problem will be presented, along with different optimizations that make the solution more efficient but still effective. I will then illustrate how the solution can be leveraged in the context of one application, identity resolution in email collections.
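For readers new to the problem, one natural MapReduce formulation (not necessarily the exact algorithm Tamer will present) pivots on the inverted index: for each term, emit a partial inner product for every pair of documents in its postings list, and let a reducer sum the partials for each pair. A rough sketch, assuming input records of the form "term <TAB> doc:weight doc:weight ...":

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairwiseSimilarity {
  // Input: one term per record, value = its postings as "doc:weight doc:weight ...".
  // For every pair of documents sharing the term, emit the product of their weights.
  // Assumes postings are sorted by document id, so each pair key is canonical.
  public static class PostingsMapper extends Mapper<Text, Text, Text, DoubleWritable> {
    @Override
    public void map(Text term, Text postings, Context context)
        throws IOException, InterruptedException {
      String[] entries = postings.toString().trim().split("\\s+");
      for (int i = 0; i < entries.length; i++) {
        String[] a = entries[i].split(":");
        for (int j = i + 1; j < entries.length; j++) {
          String[] b = entries[j].split(":");
          double partial = Double.parseDouble(a[1]) * Double.parseDouble(b[1]);
          context.write(new Text(a[0] + "," + b[0]), new DoubleWritable(partial));
        }
      }
    }
  }

  // Sum the partial inner products: the total is the similarity score of the pair.
  public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text pair, Iterable<DoubleWritable> partials, Context context)
        throws IOException, InterruptedException {
      double sim = 0.0;
      for (DoubleWritable p : partials) {
        sim += p.get();
      }
      context.write(pair, new DoubleWritable(sim));
    }
  }
}
```

The quadratic blow-up on frequent terms is presumably where the optimizations mentioned in the abstract come in.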

About the Speaker
Tamer Elsayed is a Ph.D. candidate in the Computer Science Department at the University of Maryland. He earned B.Sc. and M.Sc. degrees in Computer Science from Alexandria University (Egypt) and a second M.Sc. in Computer Science from the University of Maryland. His research interests include identity resolution in informal media and large-scale text processing using the MapReduce framework.

Problem Set 3: Boolean retrieval

Got comments, questions, gripes, issues, etc. on Problem Set 3: Boolean retrieval in my cloud computing course? Post here!

Problem Set 2: Inverted index construction

Got comments, questions, gripes, issues, etc. on Problem Set 2: Inverted index construction in my cloud computing course? Post here!

Sorry, I meant to post this before the problem set was due... but I'd still be interested in feedback, reactions, comments, etc.

Saturday, September 20, 2008

Thoughts on Bigram Counting

I found the bigram counting homework to be a nice and gentle introduction to MapReduce and computing on EC2. I'd just like to share some thoughts on how the choice of bigram counter implementation helps or hurts performance, along with my results running on EC2 versus a single local Hadoop process. I also played a bit with not-so-random sentence generation using the bigram model - thankfully, it's not a substitute for creativity :-)

Part 1. Count the bigrams.

For this part, I implemented two versions of the bigram counter: one using the "pairs" approach, and the other using the "stripes" approach. The pairs approach maps each bigram to a count of one, and the reducers then add up all the ones corresponding to each bigram. This is analogous to the method for single-word counting. In my implementation of the stripes approach, each mapper creates a HashMap<String, HashMapWritable<Text, IntWritable>> for each input line. A bigram "A B" is represented with "A" as the outer map key, "B" as the inner key, and the count as the value. Each time a bigram is seen in the sentence, its count is updated. The reducers receive all the maps corresponding to bigrams that start with a particular word, and combine the counts from all the maps to get the final counts.
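For concreteness, here is a minimal sketch of the pairs version (simplified, not my exact code, and written against the newer org.apache.hadoop.mapreduce API; tokenization is plain whitespace splitting):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BigramPairs {
  // Map each bigram "A B" to a count of one.
  public static class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text bigram = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] tokens = value.toString().trim().split("\\s+");
      for (int i = 0; i < tokens.length - 1; i++) {
        bigram.set(tokens[i] + " " + tokens[i + 1]);
        context.write(bigram, ONE);
      }
    }
  }

  // Sum the ones for each bigram; also usable as a combiner.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) {
        total += v.get();
      }
      sum.set(total);
      context.write(key, sum);
    }
  }
}
```

(Setting SumReducer as the combiner recovers some map-side aggregation for the pairs approach.)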

I found that in terms of performance, the "pairs" approach was somewhat faster than the "stripes": on a 2-node cluster with 10 mappers and 4 reducers, the pairs took 47s, while the stripes took 1m10s. Initially I thought the stripes approach would be faster, but it turned out the other way around. This is probably because of the small size of the input records given to the mapper: only one sentence at a time gets processed. This results in stripe hashes that nearly always contain bigram counts of all ones, which severely limits the amount of aggregation that can take place in the mapper. As is, the stripes approach thus amounts to simply adding serialization overhead to the map and reduce tasks. If the input were mapped in larger chunks (say, by books of the Bible or individual works of Shakespeare), the stripes approach might outperform the pairs approach thanks to better bigram aggregation in the mapper.
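For reference, here is the corresponding stripes sketch. My actual implementation used Cloud9's HashMapWritable; to keep this self-contained, I substitute Hadoop's built-in MapWritable, which changes nothing essential:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BigramStripes {
  // For each first word A in the line, build a stripe {B -> count} and emit (A, stripe).
  public static class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] tokens = value.toString().trim().split("\\s+");
      Map<String, MapWritable> stripes = new HashMap<String, MapWritable>();
      for (int i = 0; i < tokens.length - 1; i++) {
        MapWritable stripe = stripes.get(tokens[i]);
        if (stripe == null) {
          stripe = new MapWritable();
          stripes.put(tokens[i], stripe);
        }
        Text next = new Text(tokens[i + 1]);
        IntWritable count = (IntWritable) stripe.get(next);
        if (count == null) {
          stripe.put(next, new IntWritable(1));
        } else {
          count.set(count.get() + 1); // aggregate within the line
        }
      }
      for (Map.Entry<String, MapWritable> e : stripes.entrySet()) {
        context.write(new Text(e.getKey()), e.getValue());
      }
    }
  }

  // Element-wise sum of all stripes received for a word.
  public static class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    public void reduce(Text key, Iterable<MapWritable> values, Context context)
        throws IOException, InterruptedException {
      MapWritable merged = new MapWritable();
      for (MapWritable stripe : values) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          int add = ((IntWritable) e.getValue()).get();
          IntWritable cur = (IntWritable) merged.get(e.getKey());
          if (cur == null) {
            // copy the key: Hadoop reuses objects between iterations
            merged.put(new Text(e.getKey().toString()), new IntWritable(add));
          } else {
            cur.set(cur.get() + add);
          }
        }
      }
      context.write(key, merged);
    }
  }
}
```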

Before running the counters on the EC2 cluster, I did local testing on my machine (to save money!). Interestingly, performance was much better locally: only 20s for the pairs approach and 57s for the stripes. I suspect the difference lies in the communication overhead of the distributed filesystem and the distributed sort in the cluster. However, on a much larger dataset that loaded a single machine to capacity, the EC2 cluster would win out because of its much higher aggregate throughput.

Part 2. From counts to conditional probabilities.

For this part it was straightforward to modify the stripes bigram counter to compute conditional probabilities. The performance was almost the same as the stripes counter, running in 1m9s with 10 map tasks and 4 reduce tasks. On a single machine, it took 54s.
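The core change is in the reducer: merge the stripes for a word as before, sum the merged stripe to get the marginal count, then divide each cell by that marginal. A sketch (again with MapWritable standing in for the map type I actually used):

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

// P(B | A) = count(A, B) / sum over B' of count(A, B'). Merge the stripes for A,
// compute the marginal count, then divide each cell by it.
public class CondProbReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
  @Override
  public void reduce(Text key, Iterable<MapWritable> values, Context context)
      throws IOException, InterruptedException {
    MapWritable merged = new MapWritable();
    long marginal = 0;
    for (MapWritable stripe : values) {
      for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
        int add = ((IntWritable) e.getValue()).get();
        IntWritable cur = (IntWritable) merged.get(e.getKey());
        if (cur == null) {
          merged.put(new Text(e.getKey().toString()), new IntWritable(add));
        } else {
          cur.set(cur.get() + add);
        }
        marginal += add;
      }
    }
    MapWritable probs = new MapWritable();
    for (Map.Entry<Writable, Writable> e : merged.entrySet()) {
      double p = ((IntWritable) e.getValue()).get() / (double) marginal;
      probs.put(e.getKey(), new DoubleWritable(p));
    }
    context.write(key, probs);
  }
}
```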

Postscript.

Bigrams provide a simplistic model of the relationships between adjacent words in sentences. It can be entertaining to use them to generate weird Bible/Shakespeare stream-of-consciousness style narratives. I created a sentence generator that takes the conditional probabilities as input and generates such sentences (a sketch of the idea appears after the samples below). Here are a few of the more interesting (and comprehensible) ones:
decking with promise please't your lordships pleasures of swine because they in very greatly polluted my mouth as well my free from heartstring

defacers of this miserable change the woman and immediately provided to the other god from the years old limbs in any man's untrue'

shortcake upon earth for they cry out if god for your own affairs upon a year and when he say i will therefore thus much like bucklersbury

and humility sirrah take heed for the lord standeth in word of the precious fruit of your sake and drink nought so great favour in one good morrow general'

whereby thou must not thou spendest more violent hands that saul thou hast destroyed it remains alive and running to his garment'

for this is love for thus smiling with him fortune in the congregation of the house of judah from the wind and their tinkling cymbal

foulcankering rust doth he hath set up and to fall and follow the works can call him to pluck the heavens had even so much repairs itself or hot january

the wise go into an hundred pound of beaten gold look and come at dice for this ministry and bark the honey set in fulvia's phrase or any have no breath up the point affords this a lady percy and made thee art troubled in hunger and lay not die die

Eerie...

It's a practical way to simulate lots of monkeys hammering away at lots of typewriters, except that the monkeys are somewhat smarter: they sort of know how words fit together. If we kept extending the model to trigrams and beyond, the generated text would gradually become more locally coherent (though no more meaningful).
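For the curious, the generator itself needs no MapReduce: just walk the chain, sampling each next word from P(. | current) until there is no successor or a length cap is hit. A toy sketch (loading the model from the MapReduce output is omitted, and the nested-map representation is just for illustration):

```java
import java.util.Map;
import java.util.Random;

// Toy bigram sentence generator: repeatedly sample the next word from the
// conditional distribution of the current word.
public class BigramGenerator {
  private final Map<String, Map<String, Double>> model; // A -> {B -> P(B|A)}
  private final Random rng = new Random();

  public BigramGenerator(Map<String, Map<String, Double>> model) {
    this.model = model;
  }

  public String generate(String start, int maxWords) {
    StringBuilder sentence = new StringBuilder(start);
    String current = start;
    for (int i = 1; i < maxWords; i++) {
      Map<String, Double> dist = model.get(current);
      if (dist == null || dist.isEmpty()) {
        break; // dead end: no observed successors for this word
      }
      // roulette-wheel sampling over the conditional distribution
      double r = rng.nextDouble();
      double cumulative = 0.0;
      String next = null;
      for (Map.Entry<String, Double> e : dist.entrySet()) {
        cumulative += e.getValue();
        if (r <= cumulative) {
          next = e.getKey();
          break;
        }
      }
      if (next == null) { // guard against floating-point rounding
        next = dist.keySet().iterator().next();
      }
      sentence.append(' ').append(next);
      current = next;
    }
    return sentence.toString();
  }
}
```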

Monday, September 15, 2008

Problem Set 1: Bigram counts

Got comments, questions, gripes, issues, etc. on Problem Set 1: Bigram counts in my cloud computing course? Post here!

Tuesday, March 11, 2008

Distributed Data Mining: Current Pleasures and Emerging Applications

Hillol Kargupta
University of Maryland, Baltimore County

11am, March 12, 2008
A.V. Williams 2120

Abstract

Distributed Data Mining (DDM) deals with the problem of analyzing data by paying careful attention to the distributed resources of data, computing, communication, and human factors in order to use them in a near-optimal fashion. DDM algorithms offer communication-efficient, scalable, and possibly privacy-preserving performance in large distributed multi-party environments. This talk will start by offering a perspective on the research in the field of distributed data mining over the last decade. It will identify some of the important application areas that have emerged and successfully entered the commercial domain. Next it will discuss a few algorithmic characteristics often needed for scalable performance in the emerging DDM applications. It will particularly focus on local algorithms for distributed data analysis. The talk will consider a few algorithmic approaches and discuss how scalable local DDM algorithms can be designed using simple primitives.

About the Speaker

Hillol Kargupta is an Associate Professor in the Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County. He received the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 1996. He is also a co-founder of Agnik LLC, a data analytics company for distributed, mobile, and embedded environments. His research interests include mobile and distributed data mining. Dr. Kargupta won a US National Science Foundation CAREER award in 2001 for his research on ubiquitous and distributed data mining. He and his coauthors received the best paper award at the 2003 IEEE International Conference on Data Mining for a paper on privacy-preserving data mining. His papers were also selected for Best of 2008 SIAM Data Mining Conference (SDM'08) and Most Interesting Paper of WebKDD'06. He won the 2000 TRW Foundation Award, the 1997 Los Alamos Award for Outstanding Technical Achievement, and the 1996 SIAM annual best student paper award. His research has been funded by the US National Science Foundation, US Air Force, Department of Homeland Security, NASA, and various other organizations. He has published more than 80 peer-reviewed articles in journals, conferences, and books, and has co-edited several books. He is an associate editor of the IEEE Transactions on Knowledge and Data Engineering; the IEEE Transactions on Systems, Man, and Cybernetics, Part B; and the Statistical Analysis and Data Mining journal. He is or has been General Chair of the 2007 NSF Next Generation Data Mining Symposium, Program Co-Chair of the 2005 SIAM Data Mining Conference, Program Vice-Chair of the 2005 PKDD Conference, Program Vice-Chair of the 2008 and 2005 IEEE International Conference on Data Mining, Program Vice-Chair of the 2008 and 2005 Euro-Par Conference, Associate General Chair of the 2003 ACM SIGKDD Conference, and chair of the 2002 NSF Next Generation Data Mining Workshop, among others. He regularly serves on the organizing and program committees of many data mining conferences. More information about him can be found at http://www.cs.umbc.edu/~hillol.


Monday, March 3, 2008

Storing and Processing Multi-dimensional Scientific Datasets

Alan Sussman
University of Maryland

11am, March 5, 2008
A.V. Williams 3174

Abstract

Large datasets are playing an increasingly important role in many areas of scientific research. Such datasets can be obtained from various sources, including sensors on scientific instruments and simulations of physical phenomena. The datasets often consist of a very large number of records, and have an underlying multi-dimensional attribute space. Because of such characteristics, traditional relational database techniques are not adequate to efficiently support ad hoc queries into the data. We have therefore developed algorithms and designed systems to efficiently store and process these datasets in both tightly coupled parallel computer systems and more loosely coupled distributed computing environments.

I will mainly discuss the design of two systems, the Active Data Repository (ADR) and DataCutter, for managing large datasets in parallel and distributed environments, respectively. Each of these systems provides both a programming model and a runtime framework for implementing high performance data servers. These data servers provide efficient ad hoc query capabilities into very large multi-dimensional datasets. ADR is an object-oriented framework that can be customized to provide optimized storage and processing of disk-based datasets on a parallel machine or network of workstations. DataCutter is a component-based programming model and runtime system for building data intensive applications that can execute efficiently in a Grid distributed computing environment. I will present optimization techniques that enable both systems to achieve high performance in a wide range of application areas. I will also present performance results on real applications on various computing platforms to support that claim.

About the Speaker

Alan Sussman is an Associate Professor in the Computer Science Department and Institute for Advanced Computer Studies at the University of Maryland, College Park. Working with students and other researchers at Maryland and other institutions, he has published over 80 conference and journal papers on various topics related to software tools for high-performance parallel and distributed (Grid) computing, and has contributed chapters to 6 books. Software tools he has built have been widely distributed and used in many computational science applications, in areas such as earth science, space science, and medical informatics. He received his Ph.D. in computer science from Carnegie Mellon University and his B.S.E. in Electrical Engineering and Computer Science from Princeton University.

Monday, February 25, 2008

Cloud Computing: Project Overviews

Jimmy Lin
University of Maryland

11am, February 27, 2008
A.V. Williams 2120

Abstract

The University of Maryland is one of six U.S. universities involved in a new initiative led by Google/IBM to explore the MapReduce programming paradigm through an open-source implementation called Hadoop. Using MapReduce, programmers can harness enormous amounts of computing power centralized in large data centers. The initiative aims to provide faculty and students with access to next-generation "cloud computing" technologies, allowing them to think at "Web scale".

The Spring 2008 "cloud computing" course represents an attempt to integrate teaching and research in the context of this Google/IBM initiative. The basic idea is to bring together graduate and undergraduate students to tackle large-data problems, focusing on the areas of natural language processing and information retrieval. This semester, twelve students are working on six different projects:

  • Statistical Machine Translation
  • Language Modeling
  • Identity Resolution in Email Archives
  • Retrieval in the Biomedical Domain
  • Text-Background Separation in Picture Books
  • Biological Sequence Alignment

After a very brief introduction, students in the cloud computing course will provide an overview of their projects and outline what they plan to accomplish this semester. Additional information can be found at the projects overview page.
