Wednesday, March 22, 2017

How Hadoop works

Hadoop splits the given input file into smaller pieces so that they can be processed in parallel. It uses its own distributed file system, called HDFS. Each split is assigned to a mapper, which Hadoop tries to run on the same physical machine that stores the chunk.

Each mapper processes its split (a piece of the main file; by default its size equals the HDFS block size) line by line in the map function and passes its results to the context.
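To make the line-by-line idea concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API: the class `LineMapperSketch`, the method `runOnChunk`, and the list that plays the role of the context are all made up for illustration. With text input, Hadoop passes the byte offset of the line as the key and the line itself as the value; the sketch uses a character offset instead.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a mapper; no Hadoop classes are used here.
class LineMapperSketch {

    // Called once per input line, like a map function.
    // Emits one (word, 1) pair per word into the "context" list.
    static void map(long offset, String line, List<Map.Entry<String, Integer>> context) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                context.add(new SimpleEntry<>(word, 1));
            }
        }
    }

    // Feeds one chunk to map() line by line, the way the framework drives a mapper.
    static List<Map.Entry<String, Integer>> runOnChunk(String chunk) {
        List<Map.Entry<String, Integer>> context = new ArrayList<>();
        long offset = 0;
        for (String line : chunk.split("\n")) {
            map(offset, line, context);
            offset += line.length() + 1; // +1 for the newline character
        }
        return context;
    }
}
```

Calling `LineMapperSketch.runOnChunk("a b\nb c")` yields four pairs, one for each word, each with the value 1.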

Hadoop uses its own compact serialization/deserialization mechanism instead of standard Java serialization to move keys and values between machines. That is why you see types such as IntWritable and LongWritable in the examples. You can write your own Writable classes by implementing the Writable interface according to your requirements.
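Here is a rough sketch of what a custom Writable could look like. So that it compiles without Hadoop on the classpath, it declares a local stand-in interface, `WritableSketch`, with the same two methods as `org.apache.hadoop.io.Writable`; in a real job you would implement Hadoop's interface instead. The `CountAndLength` type and the `roundTrip` helper are hypothetical and exist only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in with the same two methods as org.apache.hadoop.io.Writable,
// declared locally so the sketch compiles without Hadoop.
interface WritableSketch {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical custom value type: a count together with a total length.
class CountAndLength implements WritableSketch {
    int count;
    long totalLength;

    // A no-argument constructor lets the framework create an empty
    // instance and then fill it through readFields().
    CountAndLength() { }

    CountAndLength(int count, long totalLength) {
        this.count = count;
        this.totalLength = totalLength;
    }

    // Serialize the fields in a fixed order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeLong(totalLength);
    }

    // Deserialize the fields in exactly the same order.
    @Override
    public void readFields(DataInput in) throws IOException {
        count = in.readInt();
        totalLength = in.readLong();
    }

    // Writes the object to a byte array and reads it back into a new instance.
    static CountAndLength roundTrip(CountAndLength original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            CountAndLength copy = new CountAndLength();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The important detail is that `write` and `readFields` handle the fields in the same order; otherwise the values come back scrambled.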

Hadoop collects the outputs of all mappers, sorts them by key, and forwards the results to the reducers.

As the book says, all values with the same key go to the same reducer.
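The sort-and-group step can be imitated in plain Java as a small sketch. The class `ShuffleSketch` and its methods are invented for this post and are not part of Hadoop. The `partitionFor` method applies the same formula as Hadoop's default HashPartitioner, here to `String.hashCode()`, which is an assumption of the sketch since a real job would usually hash a Writable key such as Text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the shuffle step; no Hadoop classes are used.
class ShuffleSketch {

    // Picks a reducer index for a key. Equal keys always give the same
    // index, which is why all values of one key meet at one reducer.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups mapper output by key. A TreeMap keeps the keys sorted,
    // similar to the sorted order in which a reducer sees its keys.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }
}
```

For the pairs (b, 1), (a, 1), (b, 1), `groupByKey` returns the key a with one value followed by the key b with two values.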

map (Key inputKey, Value inputValue, Key outputKey, Value outputValue)

reduce (Key inputKeyFromMapper, Iterable<Value> inputValuesFromMapper, Key outputKey, Value outputValue)

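Hadoop calls the map function for each line of the given file, and calls the reduce function once for each distinct key, passing it all the values collected for that key.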
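A reduce step in the WordCount style can be sketched in plain Java as follows. Again, this is not the Hadoop API: `SumReducerSketch`, `runOnGroups`, and the map used as the context are illustrative names only. The point is that reduce receives one key together with all of its values.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a reducer; no Hadoop classes are used here.
class SumReducerSketch {

    // Called once per distinct key with every value emitted for that key.
    // Writes (key, sum of values) into the "context" map.
    static void reduce(String key, Iterable<Integer> values, Map<String, Integer> context) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        context.put(key, sum);
    }

    // Drives reduce() over already grouped mapper output, one call per key.
    static Map<String, Integer> runOnGroups(Map<String, List<Integer>> grouped) {
        Map<String, Integer> context = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            reduce(group.getKey(), group.getValue(), context);
        }
        return context;
    }
}
```

Given the groups a = [1] and b = [1, 1, 1], `runOnGroups` produces the counts a = 1 and b = 3.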

Finally, it writes the results of the reducers to HDFS.

See the WordCount example for a better understanding: hadoop-wordcount-example
