Tuesday 23 September 2014

Hadoop a little deeper


MapReduce: follows a Master-Slave Model
Master :
NameNode
JobTracker 

Slave :
DataNode
Task Tracker
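
The map -> shuffle -> reduce flow this model executes can be sketched in plain Python. This is a toy in-process word count to show the idea, not Hadoop's actual API; all names here are hypothetical:

```python
from itertools import groupby
from operator import itemgetter

# Toy sketch of the MapReduce model: map emits (key, value) pairs,
# a shuffle groups them by key, and reduce aggregates each group.

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values for the same key, as the framework does
    # between the map and reduce phases.
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(key, values):
    # Sum the counts for one word.
    return key, sum(values)

lines = ["hadoop follows master slave model", "hadoop stores data as blocks"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped))
print(counts["hadoop"])  # 2
```

In real Hadoop the map and reduce phases run as tasks on the slave nodes, and the shuffle moves data across the network between them.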


Dynamo / Cassandra => Peer to Peer

Client is neither Master nor Slave.
=> Submits MapReduce jobs.
=> Describes how the data should be processed.
=> Retrieves the results.

Hadoop Components  :

 
NameNode : 
=> Files are not stored in the NameNode; it only holds filesystem metadata that maps files to blocks.
=> Metadata also includes information such as disk space, last access time, and permissions.
=> The NameNode is rack aware: it knows which rack each DataNode is on.
=> The NameNode coordinates access to the DataNodes.
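
The mapping described above can be sketched in miniature. Hypothetical data; real NameNode metadata is far richer:

```python
# Toy model of the NameNode's in-RAM metadata: file names map to block
# IDs, and each block ID maps to its (DataNode, rack) replicas.
# File contents themselves never pass through the NameNode.

file_to_blocks = {
    "/logs/access.log": ["blk_001", "blk_002"],
}
block_locations = {
    "blk_001": [("datanode1", "rack1"), ("datanode2", "rack2")],
    "blk_002": [("datanode2", "rack2"), ("datanode3", "rack1")],
}

def locate(path):
    # A client asks the NameNode where a file's blocks live, then reads
    # the blocks directly from the DataNodes.
    return [(blk, block_locations[blk]) for blk in file_to_blocks[path]]

for blk, replicas in locate("/logs/access.log"):
    print(blk, [node for node, _ in replicas])
```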



Data Node:
=> Manages the actual data.
=> Sends heartbeat messages to the NameNode to say it is alive.
=> Communicates with other DataNodes to replicate data and to move and copy data around.
=> Stores data as blocks.
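
The heartbeat mechanism can be sketched as follows. The timeout value and all names are illustrative assumptions, not Hadoop's defaults:

```python
# Toy sketch of heartbeat-based liveness tracking: each heartbeat updates
# a timestamp, and a DataNode whose last heartbeat is older than the
# timeout is treated as dead.

HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative, not Hadoop's actual default

last_heartbeat = {}

def receive_heartbeat(datanode, now):
    last_heartbeat[datanode] = now

def live_nodes(now):
    return [dn for dn, t in last_heartbeat.items() if now - t <= HEARTBEAT_TIMEOUT]

receive_heartbeat("datanode1", now=0.0)
receive_heartbeat("datanode2", now=25.0)
print(live_nodes(now=40.0))  # ['datanode2']: datanode1 missed its window
```

When the NameNode declares a DataNode dead, it re-replicates that node's blocks onto the surviving nodes.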

JobTracker:
=> Manages jobs and resources in Hadoop.
=> Client applications submit MapReduce requests to the JobTracker.
=> Schedules client jobs and allocates tasks to TaskTrackers.

 
Task Tracker:
=> Slaves deployed on each machine.
=> They follow the JobTracker's instructions and run MapReduce tasks.
=> Handle movement of data between the map and reduce phases.


Secondary NameNode:
=> The name is a misnomer: it is not a standby for the NameNode.
=> It does housekeeping tasks in HDFS.
=> The NameNode keeps all filesystem metadata in RAM and records each change in an edit log on disk. The Secondary NameNode periodically (e.g. every hour) pulls the fsimage and edit log from the NameNode and merges them into a new file called a checkpoint.
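
The checkpoint merge can be sketched as follows, assuming the NameNode keeps an on-disk fsimage snapshot plus an edit log of recent changes. The data and operation names are hypothetical:

```python
# Toy sketch of the Secondary NameNode's checkpoint: replay the edit log
# on top of the last fsimage to produce a fresh fsimage, so the
# NameNode's edit log can be truncated.

fsimage = {"/a.txt": ["blk_1"]}          # last on-disk metadata snapshot
edit_log = [
    ("create", "/b.txt", ["blk_2"]),     # changes since that snapshot
    ("delete", "/a.txt", None),
]

def checkpoint(fsimage, edit_log):
    merged = dict(fsimage)
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged  # the new fsimage; the edit log can now start empty

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)  # {'/b.txt': ['blk_2']}
```

Without this merge, the edit log would grow without bound and NameNode restarts (which replay the log) would get slower and slower.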
 



 *Image taken from http://www.gigacom.com


Hadoop 1.x => Hadoop 2.x (limitations of 1.x that 2.x addresses):

1. Limited horizontal scaling
2. Single point of failure at the NameNode
3. Impossible to run non-MapReduce tools because of the tight coupling of JobTracker + MR
4. Does not support multitenancy
5. JobTracker overburdened because of too much work




1. Limited Horizontal Scaling

All metadata is stored in the NameNode's RAM.
RAM size is limited; you cannot grow it beyond a certain point.
This becomes a bottleneck after around 4000 nodes.

2. Single Point of Failure
There is no backup node if the NameNode fails.

3. Impossible to run non-MapReduce tools because of the tight coupling of JobTracker + MR
Only MapReduce processing can be achieved.
Real-time analytics and MPI are difficult; no graph processing.
You cannot do these in HDFS; you have to move the data out of HDFS.
Only batch processing in 1.x.


4. Multitenancy:
Only one type of job can run at a time; running jobs concurrently from different applications is not possible.


Hadoop 2.x extra components:
1. Resource Manager
2. NameNode High Availability
3. YARN: Yet Another Resource Negotiator


Instead of a single NameNode there are multiple NameNodes, independent of each other, each adhering to a specific namespace.

Both the JobTracker and TaskTracker are removed in Hadoop 2.
The JobTracker's responsibilities, resource management and job scheduling, were split into the components below.

New Components in Hadoop2.


1. Resource Manager:
=> Scheduler that allocates cluster resources to the various running applications.
=> Schedules tasks based on application containers.

2. Application Master:
=> One per application; negotiates containers from the Resource Manager and launches the application's tasks in them.
=> The Resource Manager restarts the Application Master's container on failure.

3. Node Manager:
=> Runs on each node.
=> Follows the orders of the Resource Manager.
=> Responsible for maintaining containers.
=> Manages the resources of a single node.
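
Container-based scheduling can be sketched like this. The capacities and the first-fit policy are illustrative assumptions; real YARN schedulers (capacity, fair) are far more sophisticated:

```python
# Toy sketch of container allocation: the Resource Manager hands out
# containers (slices of a node's memory) against the free capacity
# that Node Managers report for their nodes.

node_capacity = {"node1": 8192, "node2": 4096}   # free MB per node (illustrative)

def allocate(container_mb):
    # First-fit: pick the first node with enough free memory, reserve
    # it, and return the "container" granted to the application.
    for node, free in node_capacity.items():
        if free >= container_mb:
            node_capacity[node] = free - container_mb
            return {"node": node, "memory_mb": container_mb}
    return None  # cluster full: the request would wait in the scheduler queue

c1 = allocate(4096)
c2 = allocate(4096)
c3 = allocate(4096)
print(c1["node"], c2["node"], c3["node"])  # node1 node1 node2
```

Because containers are generic resource slices rather than fixed map/reduce slots, non-MapReduce frameworks can run on the same cluster.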


Other Features:

NameNode HA:
Automatic failover and recovery for the NameNode master service.
Two NameNodes, active and passive: when one fails, the other takes control.
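
The active/passive arrangement can be sketched as a tiny state machine. The names and the promotion logic are hypothetical; real HA failover is coordinated (e.g. via ZooKeeper) and must fence the old active node:

```python
# Toy sketch of active/passive NameNode failover: one node serves
# requests, and the standby is promoted when the active one dies.

namenodes = {"nn1": "active", "nn2": "standby"}

def active():
    # Find the NameNode currently serving client requests.
    return next(n for n, state in namenodes.items() if state == "active")

def failover(failed):
    # Mark the failed node down and promote a standby in its place.
    namenodes[failed] = "down"
    for n, state in namenodes.items():
        if state == "standby":
            namenodes[n] = "active"
            break

print(active())   # nn1
failover("nn1")
print(active())   # nn2
```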

Snapshots:
Point-in-time recovery for backups.

Federation:
Multiple independent NameNodes share the DataNodes as a generic block storage layer.
