Monday, 27 July 2015

HADOOP: WORD COUNT PROBLEM



WHAT IS WORD COUNT
      Word count is a typical problem that runs on the Hadoop Distributed File System (HDFS) using MapReduce; the goal is to count the number of occurrences of each word in a provided input file. The word count operation takes place in two phases:
1.      Mapper phase: In this phase the text is first tokenized into words, and then a key-value pair is formed from each word, the key being the word itself and the value '1'. The Mapper class runs over the entire data set, splitting the words and forming the initial key-value pairs. Only after this entire process is completed does the reducer start.
2.      Reducer phase: In the reduce phase the keys are grouped together, and the values for identical keys are added. This gives the number of occurrences of each word in the input file. In effect, it is an aggregation phase over the keys.
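The two phases above can be sketched in plain Python (a simulation of the MapReduce flow, not actual Hadoop code):

```python
text = "deer bear river car car river deer car"

# Mapper phase: tokenize, then pair every word with the value 1.
mapped = [(word, 1) for word in text.split()]
# [('deer', 1), ('bear', 1), ('river', 1), ('car', 1), ...]

# Reducer phase: group identical keys and add up their values.
counts = {}
for word, one in mapped:
    counts[word] = counts.get(word, 0) + one

print(counts)  # {'deer': 2, 'bear': 1, 'river': 2, 'car': 3}
```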

MAP REDUCE ALGORITHM
1.      MapReduce is a programming model designed to compute large volumes of data in a parallel fashion.
2.      The map operation, written by the user, takes a set of input key/value pairs and produces a set of intermediate key/value pairs.
3.      The reduce function, also written by the user, accepts an intermediate key and the set of values for that key; it merges these values together to form a possibly smaller set of values.
4.      MapReduce operations are carried out in Hadoop.
5.      Hadoop is a distributed sorting engine.
  
Map and reduce in the word count problem (algorithm)
                             mapper(file-name, file-contents):
                                 for each word in file-contents:
                                     emit(word, 1)
                             reducer(word, values):
                                 sum = 0
                                 for each value in values:
                                     sum = sum + value
                                 emit(word, sum)
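The pseudocode above can be translated directly into runnable Python (plain Python standing in for Hadoop's mapper/reducer interfaces; the shuffle/sort that Hadoop performs between the two phases is simulated here with a sort and `groupby`):

```python
from itertools import groupby
from operator import itemgetter

def mapper(file_name, file_contents):
    # Emit a (word, 1) pair for every word in the file.
    for word in file_contents.split():
        yield (word, 1)

def reducer(word, values):
    # Sum all the 1s emitted for this word.
    total = 0
    for value in values:
        total += value
    yield (word, total)

def run_word_count(file_name, file_contents):
    # Map phase, then the shuffle/sort Hadoop performs between
    # map and reduce, then the reduce phase.
    intermediate = sorted(mapper(file_name, file_contents), key=itemgetter(0))
    result = {}
    for word, group in groupby(intermediate, key=itemgetter(0)):
        for key, total in reducer(word, (v for _, v in group)):
            result[key] = total
    return result

print(run_word_count("sample.txt", "to be or not to be"))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```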

    DATA FLOW DIAGRAM

[Figure: word count data flow diagram]

METHODOLOGY

MapReduce is a three-step approach to solving a problem:
Step 1: Map
The purpose of the map step is to group or divide the data into sets based on the desired value. While writing a map function we need to be careful about three things:
      1. How do we want to divide or group the data?
      2. Which part of the data do we need, and which part is extraneous?
      3. In what form or structure do we need our data?
 Step 2: Reduce
     The reduce operation combines the different values for each key using a user-defined function. It takes each key, picks up all the values created in the map step, and processes them one by one with custom logic. It takes two parameters:
1.    The key
2.    An array of values
 Step 3: Finalize
     The finalize step performs any required transformation on the final output of the reduce step.
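The three steps can be sketched end to end in plain Python; here the finalize step is a hypothetical transformation (sorting the reduced counts from most to least frequent), chosen only to illustrate where such a step fits:

```python
from collections import defaultdict

def map_step(text):
    # Step 1: divide the data into (word, 1) pairs.
    return [(word, 1) for word in text.split()]

def reduce_step(pairs):
    # Step 2: for each key, combine its array of values with a
    # user-defined function (here, addition).
    grouped = defaultdict(list)
    for word, value in pairs:
        grouped[word].append(value)
    return {word: sum(values) for word, values in grouped.items()}

def finalize_step(counts):
    # Step 3: transform the reduced output, e.g. sort by frequency.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

result = finalize_step(reduce_step(map_step("to be or not to be")))
print(result)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```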

FUTURE SCOPE
Because of its parallel processing, MapReduce is used in many applications, such as document clustering, web link-graph reversal, and inverted index construction. MapReduce increases the efficiency of handling big data, and it is used where data needs to be available at all times and security is needed. Some MapReduce-related deployments are:
·     Yahoo!: The Web Map application uses Hadoop to create a database of information on all known webpages.
·     Facebook: Hadoop runs Hive, Facebook's data warehouse.
·     Rackspace: It analyzes server log files and usage data using Hadoop.





