4 comments:
Hi, I've been playing with problem set #1 and thought I'd share something that may be useful. It seems that the default EC2 Hadoop configuration turns on output compression for MapReduce jobs, so when your job completes you get compressed output files. Writing another map-only job just to decompress them is a pain. Instead, you can call this somewhere in your main function to turn off output compression:
FileOutputFormat.setCompressOutput(conf, false);
Then you'll get readable output.
Mike's hack works perfectly well. I guess we could also decode the compressed output by working out how it is serialized, but that is hidden deep in the code. Any clue where to look?
Also, is there a tutorial on how MapReduce works in detail in Hadoop?
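On reading the compressed output without another job: here is a minimal sketch, not taken from the Hadoop source. It assumes the job wrote plain text output with Hadoop's default codec, which as far as I know produces zlib-format `.deflate` files that `java.util.zip` can read directly. If your cluster is configured with a different codec (gzip, for example), this will not work as written. The class name `DeflateCat` and the `deflate` helper are made up for illustration; the helper exists only so the example can be tried without a cluster.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper class: prints the text lines of a zlib-compressed file.
final class DeflateCat {

    // Decompresses a zlib-format stream and returns its contents as text lines.
    // Assumption: the data is UTF-8 text, one record per line.
    static List<String> readLines(InputStream compressed) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new InflaterInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Compresses text to zlib format. Only used to fabricate sample input
    // so the sketch can be run locally.
    static byte[] deflate(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // With an argument, read that local file; otherwise use a built-in sample.
        InputStream in = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream(deflate("hello\t1\nworld\t2\n"));
        for (String line : readLines(in)) {
            System.out.println(line);
        }
    }
}
```

To try it on real output, copy one output part file from HDFS to local disk first and pass its path as the only argument.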
Hi All,
This might help
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0
> Also, is there a tutorial on
> how MapReduce works in detail
> on hadoop?
There's always the source code! ;)
No, seriously, I've frequently needed to consult the underlying source to figure out what's going on behind the scenes.
This is both the blessing and the curse of such an abstraction...