WHAT IS WORD COUNT
Word count is a typical problem that runs on the Hadoop Distributed File System (HDFS) with MapReduce and is intended to count the number of occurrences of each word in a provided input file. The word count operation takes place in two phases:
1. Mapper phase: In this phase the text is first tokenized into words, and a key-value pair is formed for each word, with the key being the word itself and the value '1'. The Mapper class runs over the entire data set, splitting the text into words and forming the initial key-value pairs. Only after this entire process is complete does the reducer start.
2. Reducer phase: In the reduce phase the keys are grouped together and the values for identical keys are added. This gives the number of occurrences of each word in the input file; the reducer is the aggregation phase for each key.
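For example, for the input line "hello world hello", the mapper phase emits the intermediate pairs (hello, 1), (world, 1), (hello, 1); after grouping, the reducer phase receives (hello, [1, 1]) and (world, [1]) and outputs (hello, 2) and (world, 1).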
MAP REDUCE ALGORITHM
1. MapReduce is a programming model designed to compute large volumes of data in a parallel fashion.
2. The map operation, written by the user, takes a set of input key/value pairs and produces a set of intermediate key/value pairs.
3. The reduce function, also written by the user, accepts an intermediate key and the set of values for that key, and merges these values together to form a possibly smaller set of values (the abstract signatures are given after this list).
4. MapReduce operations are carried out in Hadoop.
5. Hadoop is, in effect, a distributed sorting engine: intermediate keys are sorted and grouped by key before they reach the reducer.
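In abstract terms, following the original MapReduce model, the two user-defined functions have the signatures:
map (k1, v1) -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)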
Map and reduce in the word count problem (pseudocode; a Hadoop Java sketch follows):
mapper(filename, file-contents):
    for each word in file-contents:
        emit(word, 1)
reducer(word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit(word, sum)
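The same algorithm written against the standard Hadoop Java MapReduce API might look like the sketch below. The Hadoop classes (Mapper, Reducer, Job, Text, IntWritable) are part of the org.apache.hadoop libraries; the class names WordCount, WordCountMapper and WordCountReducer are our own choice.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper phase: tokenize each input line and emit (word, 1) for every word.
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit(word, 1)
            }
        }
    }

    // Reducer phase: for each word, sum the list of 1s produced by the mappers.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();         // sum = sum + value
            }
            result.set(sum);
            context.write(key, result);     // emit(word, sum)
        }
    }

    // Driver: wire the mapper and reducer into a job and point it at HDFS paths.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}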
DATA FLOW DIAGRAM
METHODOLOGY
MapReduce is a three-step approach to solving a problem:
Step 1: Map
The purpose of the map step is to group or divide the data into sets based on the desired value. While writing a map function we need to be careful about three things (a short sketch of these choices follows the list):
1. How do we want to divide or group the data?
2. Which part of the data do we need, and which part is extraneous?
3. In what form or structure do we need our data?
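As a sketch of these three choices, assume (purely as a hypothetical example) a web server log in which each line starts with a date followed by a URL. The mapper below, using the same Hadoop imports as the word count sketch above, groups by date, keeps only the URL, and drops everything else; the class and field names are our own.

// Hypothetical mapper: groups log lines by date and keeps only the URL field.
public static class LogByDateMapper extends Mapper<Object, Text, Text, Text> {
    private final Text date = new Text();
    private final Text url = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed line format: "<date> <url> <other fields...>"
        String[] fields = value.toString().split("\\s+");
        if (fields.length < 2) {
            return;                   // extraneous or malformed lines are dropped
        }
        date.set(fields[0]);          // 1. group the data by date
        url.set(fields[1]);           // 2. keep only the part we need (the URL)
        context.write(date, url);     // 3. output structure: (date, url) pairs
    }
}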
Step 2: Reduce
The reduce operation combines the different values for each given key using a user-defined function. It takes each key, picks up all the values created for that key in the map step, and processes them one by one using custom logic. It takes two parameters (a sketch follows the list):
1. Key
2. Array of values
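Continuing the hypothetical log example, the reducer below receives each date (the key) together with all URLs emitted for it (the values) and applies custom logic, here simply counting how many requests each date received. It again assumes the Hadoop imports from the word count sketch; the class name is our own.

// Hypothetical reducer: parameter 1 is the key (a date), parameter 2 is the
// collection of values (all URLs emitted for that date by the mappers).
public static class RequestsPerDateReducer extends Reducer<Text, Text, Text, IntWritable> {
    private final IntWritable count = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int requests = 0;
        for (Text url : values) {     // process the values one by one
            requests++;
        }
        count.set(requests);
        context.write(key, count);    // emit (date, number of requests)
    }
}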
Step 3: Finalize
This step is used to perform any required transformation on the final output of the reduce step.
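Hadoop itself has no dedicated finalize step, so one common way to get the same effect is a small follow-up job. The sketch below (the class name and the assumption that the first job wrote lines of the form "word<TAB>count" are ours) swaps each pair to (count, word) so that the follow-up job's shuffle sorts the words by frequency.

// Hypothetical finalize-style pass over the word count output: swap
// (word, count) to (count, word) so the next job's shuffle sorts by frequency.
public static class SwapForSortMapper extends Mapper<Object, Text, IntWritable, Text> {
    private final IntWritable count = new IntWritable();
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");   // assumed "word<TAB>count" lines
        if (parts.length != 2) {
            return;
        }
        word.set(parts[0]);
        count.set(Integer.parseInt(parts[1]));
        context.write(count, word);
    }
}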
FUTURE SCOPE
MapReduce is used in many applications because of its parallel processing, such as document clustering, web link-graph reversal, and inverted index construction. Because the work is split across many map and reduce tasks, it increases the efficiency of handling big data. It is used where data needs to be available at all times and where security is needed. MapReduce-related work includes:
· Yahoo!: The Web Map application uses Hadoop to create a database of information on all known web pages.
· Facebook: Hadoop powers the Hive data warehouse in Facebook's data centers.
· Rackspace: It analyzes server log files and usage data using Hadoop.