Monday, September 15, 2008

Problem Set 1: Bigram counts

Got comments, questions, gripes, issues, etc. on Problem Set 1: Bigram counts in my cloud computing course?

Post here!

4 comments:

Mike Lieberman said...

Hi, I've been playing with Problem Set 1 and thought I'd share some info that may be useful. It seems that the default EC2 Hadoop configuration turns on output compression for MapReduce jobs, so when your job completes you get compressed output files. It can be a pain to write another map-only job just to decompress them. Instead, you can call this somewhere in your main function to turn off compressed output:

FileOutputFormat.setCompressOutput(conf, false);

Then you'll get readable output.
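If you'd rather not touch the driver code, the same switch can be flipped in the job configuration itself; as far as I can tell, setCompressOutput just writes the property below, so overriding it in your hadoop-site.xml (or per-job config) should have the same effect. A sketch, assuming the 0.18-era property name:

```xml
<!-- Disable MapReduce job output compression.
     Property name assumed from what setCompressOutput sets under the hood. -->
<property>
  <name>mapred.output.compress</name>
  <value>false</value>
</property>
```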

Gobar said...

Mike's hack works perfectly well. I suppose we could deserialize the compressed output by examining how it is serialized, but that logic is buried deep in the code. Any clues on where to look?

Also, is there a tutorial on how MapReduce works in detail on hadoop?

Parin Choganwala said...

Hi All,

This might help
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0

Jimmy Lin said...

> Also, is there a tutorial on
> how MapReduce works in detail
> on hadoop?

There's always the source code! ;)

No, seriously, I've frequently found myself consulting the underlying source code to figure out what's going on behind the scenes.

This is both the blessing and the curse of such an abstraction...
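For anyone still getting oriented: stripped of the Hadoop plumbing, the core computation in this problem set is just a map from adjacent word pairs to counts. A minimal plain-Java sketch of what the mappers and reducers jointly compute (class and method names are mine, not from the problem set):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of a bigram count: split the text into words,
// emit each adjacent pair, and sum the counts per pair. In the
// MapReduce version, the mapper emits (bigram, 1) and the reducer sums.
public class BigramCount {
    public static Map<String, Integer> countBigrams(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i < words.length - 1; i++) {
            String bigram = words[i] + " " + words[i + 1];
            Integer c = counts.get(bigram);
            counts.put(bigram, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "a rose" occurs three times, "rose is" and "is a" twice each.
        System.out.println(countBigrams("a rose is a rose is a rose"));
    }
}
```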
