Tuesday 23 September 2014

Hadoop a little deeper


Map Reduce :  Follows Master -  Slave Model 
Master :
NameNode
JobTracker 

Slave :
DataNode
Task Tracker


Dynamo/ Casandra => Peer to Peer

Client is neither master nor Slave.
=> submit map reduce task
=> describe how data to be processed.
=> retrieve data.

Hadoop Components  :

 
NameNode : 
=> Files are not stored in NameNode , it just contains filesystem meta data which points files to blocks 
=> Metadata also contains information such as DiskSpace, last access to NN, permissions . 
=> Name Node is rack aware. To what Rack data node is on.  
=> NameNode coordinates access to data node.



Data Node:
=> Manages Data 
=> Sends Heartbeat message to NN to say they are alive.
=> Comunicates with one another to replicate data, move and copy data around
=> Stores data as blocks.  

JobTracker: 
=> Manages job and resources in Hadoop. 
=> Client application submit MR request to Job Tracker
=> Schedule Client jobs and allocates task to Task Tracker.

 
Task Tracker: 
=> Slaves deployed at each machine. 
=> They follow instruction of Job Tracker and runs Map Reduce task
=> Handles movement of data between map and reduce phase.


Secondry NameNode:  
=> Name is a misnomer. 
=> It does all housekeeping task in HDFS. 
=> Namenode store all filesystem metadata in RAM. It doesnt have any capability to persist data to disk. 2ndNN sends message to NN every hour pull all the data from NameNode and merges into a file called Checkpoint.
 



 *Image taken from http://www.gigacom.com


Hadoop 1.**   => Hadoop 2.**:

1. Horizontal Scaling
2. Single Point of Failure for Name Node
3. Impossible to run Non Map Reduce tools because of tight coupling of JobTracker + MR
4. Does not support Multitanency
5. Job Tracker overburdened bcz of too much work.




1. Horizontal Scaling

NameNode all metadata stored in RAM of NameNode
RAM size is limited you cannot take it beyond certain point. 
Bottleneck after 4000 Nodes.

2. Single Point of Failure
No backup node if Namenode fails

3. Impossible to run Non Map Reduce tools because of tight coupling of JobTracker + MR
Only Map Reduce processing can be achieved
Realtime analytics , MPI difficult, No Graph processing
You cannot do in HDFS you have to move data out of HDFS. 
Only Batch processing in 1.** 


4. Multitanency :
Only 1 type of job at a time even if you run it from different application not possible.


Hadoop Component 2.** extra
1. Resource Manager
2. Namenode High Availability
3. YARN: Yet another resource Negotiator.


Instead of having single Name node multiple name nodes are there. Independent to each other. Adhere to specific namespace.

Both JobTracker and task Tracker removed in Hadoop 2 .
Job Tracker task : Resource Management & Job Scheduling was split into two components.below 

New Components in Hadoop2.


1. Resource Manager :  
=> Scheduler that allocates resources in the cluster to various running application. 
=> Schedules task based on Application Container. 

2. Application Manager
=> Launches task in containers. 
=> Starting Application Master container on failure.

3. Node Manager:  
=> Runs on each node
=> Follows orders of Resource Manager. 
=> Responsible for maintaining container 
=> manages resources of a single node.


Other Feautures:

Name Node HA:
Automatic failover and recovery for NameNode master survice.
2 Namenodes :   Active and Passive when one fails other take control.

Snapshots:
Point intime recovery for backup.

Federation :
Generic block storage layer.

Wednesday 10 September 2014

A * Star Algorithm

A* is a path finding/Graph Traversal  algorithm generally used in game programming to determine the shortest path to reach a particular destination following Best First Search approach. Can be called as an extension of Dijkstra and reduces the number of comparisons while guaranteeing optimality. Heuristic is used in A*. A* uses heuristic to determine most promising node.   






Consider a grid with each square numbered. As in the above figure , cross sign (Sq. 14) represents start point and cross sign (Sq 35) represents destination. All Blue bars in a square represents the block. 
Each square considered a node.
Some terminology used with A*
Node Data
H Value :  Heuristic Value
G Value : Movement Cost                
F Value : G + H Value
Parent :  Pointer Back to previous node.

List
Open Node : List of Nodes to be visited
Close Node :  Already Visited.

Task : Shortest Path from 14 to 35.
H Value : Distance of a Node from destination node in our case from Sq 35.
Precompute all H value

If H value is Zero A* behaves like Dijksatra Algorithm. From Node 17 H value is 3 as you can go from 17 to 23 to 29 leading to 35
For Node 25 H value is 5 as you go from node 25 to 26 to 27 to 28 to 29 leading to 35 , block nodes are also considered for precomputing H values.

G Value is movement cost from Start node to another node.
Consider Movement cost of 10 For Horizontal and Vertical and of 14 for diagonal movement.
Diagonal > Horizontal (Pythagorean theoram)
So to Move from 14 to 8 the cost is 10 and from 14 to 9 it is 14.
Each square has parents.
For Sq. 14 we need to get parent , which can be done through open list and close list.
Write now no node on Close or open list.

All Node parent to14.
CloseList : 14
Open List : 7,8,9,13,15,19,20,21
Calculate G Cost for all. for 8 it is for Node 14 it is 0 for Node 8 it is 10 so total is 5. For Node 9 it is 14.




The Numbers in Black represents the H value of Nodes ie Distance of Node to Node 35.
The numbers in Blue represent G Value Node 8 as G Value 10.
F Value is sum of G Value + H Value which is written in red for the nodes. (For node 7 it is 17 => 10+7)
Need to Use Node with smallest F Value.
So in our case square number 15 as it has F-Value 15 which is minimum. Node 20 too has value 15 and can be selected too.
Take Node 15 put in close list.
Now repeat process for Node 15.

Now Calculate all values for Node 15.
CloseList : 14 , 15
Open List : 7,8,9,13,19,20,21,10
We see Node 14 already on Close list dont do anything.
Consider Node 8 so we check if it is more faster to go to 8 from 14 or  from 15.
For Node 15 We see :
G Value for Node 15 was 10 . G Value for Node 8 considering the current node 14 is 14
So total Movement is 14+10 ie 24
If the total movement > current G value dont change values else
If G value wrt node 15 is less then G Value of Node 8 then re parent node  8 to node 15.

If(Parents G value+Current G < Parents New G){
Add new Block to close list.
New Block parent = Prev Block Parent.
}


This process continues till you are next to destination node.

Closed : 14,15,20
Open : 7,8,9,13,19,21,10,22,25,27
Keep Doing and then see neighbor is destination so make destination parented to the other node and you are done.
Trace back path from destination to source.
35 to 29 to 22 to 15 to 14.
Make sure you change the parent if you find new G value smaller then previous computed.



All Green Arrow denotes final position of main node parents. Red is the path from source to destination.

Shortest Path : 14 ->15->22->29->35

   

 
Useful links for the topic 

Wiki
Stanford Amits Page
Video  
Video
Video
Algo