Cloud Computing at Maryland

Tamer Elsayed
University of Maryland

9:30am, October 6, 2008
Hornbake 2119

Abstract
In this talk, I will discuss the problem of computing pairwise document similarity in large text collections. This general problem appears in many different applications such as text clustering, ad-hoc retrieval, and co-reference resolution. A simple MapReduce solution to the problem will be presented, along with different optimizations that make the solution more efficient but still effective. I will then illustrate how the solution can be leveraged in the context of one application, identity resolution in email collections.

About the Speaker
Tamer Elsayed is a Ph.D. candidate in the Computer Science Department at the University of Maryland. He has earned B.Sc. and M.Sc. degrees in Computer Science from Alexandria University (Egypt) and a second M.Sc. in Computer Science from the University of Maryland. His research interests include identity resolution in
informal media and large-scale text processing using the MapReduce framework.

I found the bigram counting homework to be a nice and gentle introduction to MapReduce and computing on EC2. I'd just like to share some thoughts about how choosing different implementations of the bigram counter can benefit or hurt performance, and my performance results when running on EC2 versus a single local Hadoop process. Also I played a bit with not-so-random sentence generation using the bigram model - thankfully it's not a substitute for creativity :-)

Part 1. Count the bigrams.

For this part, I implemented two versions of the bigram counter: one using the "pairs" approach, and the other using the "stripes" approach. The pairs approach is to map each bigram to a one, and then the reducers add all the ones corresponding to each bigram. This is analogous to the method for single word counting. In my implementation of the stripes approach, I had each mapper create an individual HashMap<String, HashMapWritable<Text, Integer>> for each input line. Bigrams "A B" are represented here by "A" being the first map key, B being the second key, and the count being the value. Each time a bigram was seen in the sentence, the count was updated. The reducers receive all maps corresponding to bigrams for a particular symbol, and combine the counts from all the maps to get final counts.

I found that in terms of performance, the "pairs" approach was somewhat faster than the "stripes": on a 2 node cluster with 10 mappers and 4 reducers, the pairs took 47s, while the stripes took 1m10s. Initially I thought that the stripes would be faster, but it turned out the other way around. This is probably because of the small size of the input records given to the mapper; only one sentence at a time gets processed. This results in stripe hashes that nearly always contain bigram counts of all ones, which severely limits the amount of aggregation that can take place in the mapper. As is, the stripes approach thus amounts to simply adding serialization overhead to the map and reduce tasks. If the input were mapped using larger chunks at a time (say, by books of the Bible or individual works of Shakespeare), the stripes approach might outperform the pairs approach due to better mapper bigram aggregation.

Before running the counters on the EC2 cluster, I did local testing on my machine (to save money!). Interestingly, the performance was much better: it only took 20s for the pairs approach, and 57s for the stripes. I suspect the difference is in the communication overhead for the distributed filesystem and the distributed sort in the cluster. However, once the single machine was loaded to capacity by a much larger dataset, the EC2 cluster would win out because it has much higher throughput.

Part 2. From counts to conditional probabilities.

For this part it was straightforward to modify the stripes bigram counter to compute conditional probabilities. The performance was almost the same as the stripes counter, running in 1m9s with 10 map tasks and 4 reduce tasks. On a single machine, it took 54s.

Postscript.

The bigrams provide a simplistic model of the relationships between words in sentences. It can be entertaining to use the bigrams to generate weird Bible/Shakespeare stream-of-consciousness style narratives. I created a sentence generator that takes the conditional probabilities as input and generates these sentences. Here are a few of the more interesting (and comprehensible) ones:

decking with promise please't your lordships pleasures of swine because they in very greatly polluted my mouth as well my free from heartstring

defacers of this miserable change the woman and immediately provided to the other god from the years old limbs in any man's untrue'

shortcake upon earth for they cry out if god for your own affairs upon a year and when he say i will therefore thus much like bucklersbury

and humility sirrah take heed for the lord standeth in word of the precious fruit of your sake and drink nought so great favour in one good morrow general'

whereby thou must not thou spendest more violent hands that saul thou hast destroyed it remains alive and running to his garment'

for this is love for thus smiling with him fortune in the congregation of the house of judah from the wind and their tinkling cymbal

foulcankering rust doth he hath set up and to fall and follow the works can call him to pluck the heavens had even so much repairs itself or hot january

the wise go into an hundred pound of beaten gold look and come at dice for this ministry and bark the honey set in fulvia's phrase or any have no breath up the point affords this a lady percy and made thee art troubled in hunger and lay not die die

Eerie...

It's a practical way to simulate lots of monkeys hammering away at lots of typewriters, except that the monkeys are somewhat smarter because they sort of know how words fit together. If we kept extending the model to trigrams and beyond, the generated text would gradually improve.

Cloud Computing at Maryland

Friday, October 10, 2008

Problem Set 4: PageRank

Sunday, October 5, 2008

Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective

Problem Set 3: Boolean retrieval

Problem Set 2: Invert index construction

Saturday, September 20, 2008

Thoughts on Bigram Counting

Monday, September 15, 2008

Problem Set 1: Bigram counts

Friday, September 12, 2008

Problem Set 0: Getting onto the cloud

Blog Archive

Contributors