Tuesday 19 November 2013

Install Hadoop steps - both single-node and multi-node clusters...


Hadoop installation from Windows (connecting to an EC2 Ubuntu instance with PuTTY)..


Steps:
A. Prerequisites

1. Install PuTTY: http://www.chiark.greenend.org.uk/~sgtatham/putty/
2. Get an EC2 instance (these steps assume Ubuntu, since they use apt-get).
3. Go to the EC2 console and note the instance's public DNS name.
4. Get the private key (.pem).

B. Convert .pem to .ppk

1. Start PuTTYgen.
2. Under "Type of key to generate", select SSH-2 RSA.

3. Click Load. By default, PuTTYgen displays only files with the extension .ppk; to locate your .pem file, select the option to display files of all types. Generate the .ppk from the .pem file you obtained from EC2 and save the private key.





4a. Start PuTTY. Click Load if you already have a saved session for the instance; otherwise enter its public DNS name.

4b. Under Connection > SSH > Auth, browse to the .ppk key you just generated, then open the connection.




5. Install Java (Hadoop is written in Java):
sudo apt-get install openjdk-7-jdk
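A quick sanity check (not part of the original steps) to confirm the JDK is installed:
java -version        # should report OpenJDK 1.7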

6. Install OpenSSH (Hadoop uses SSH to manage its nodes):
sudo apt-get install openssh-server

7. Create a dedicated Hadoop user so you can run Hadoop under its own account:
sudo addgroup hadoop                     # create the hadoop group
sudo adduser --ingroup hadoop hduser     # create hduser and add it to the group
sudo adduser hduser sudo                 # give hduser sudo rights
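The rest of the setup is easiest to do as this user; switch to it with:
su - hduser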

8. Generate SSH keys for hduser (no passphrase, so Hadoop can log in without prompting):
$ ssh-keygen -t rsa -P ''
...
Your identification has been saved in /home/hduser/.ssh/id_rsa.

...
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost

9. Disable IPv6 (Hadoop may bind to IPv6 addresses and become unreachable on an IPv4-only setup). Add the following lines to /etc/sysctl.conf:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
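To apply the change without a reboot and confirm it took effect (a small addition to the original steps):
sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6     # should print 1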

10. Install Hadoop

cd /usr/local/

sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz

# unpack the tarball
sudo tar xzf hadoop-0.20.203.0rc1.tar.gz

# rename the extracted directory
sudo mv hadoop-0.20.203.0rc1 hadoop

# give ownership to hduser
sudo chown -R hduser:hadoop hadoop


11. Update hduser's .bashrc with JAVA_HOME and HADOOP_HOME:

cd ~
vi .bashrc        # add the following lines at the end

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
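It also helps (not in the original notes) to put the Hadoop scripts on the PATH and reload the shell:

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin    # sbin exists only on the 2.x layout
source ~/.bashrc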
12. Set JAVA_HOME for Hadoop in hadoop-env.sh:
cd /usr/local/hadoop/etc/hadoop    # conf/ instead of etc/hadoop on the 0.20/1.x layout
vi hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
13. Create a tmp directory for HDFS:

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

14. Configure Hadoop - edit core-site.xml:

cd /usr/local/hadoop/etc/hadoop    # or /usr/local/hadoop/conf on the 0.20/1.x layout
vi core-site.xml

For versions < 2, add:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property> 

For versions >= 2 (fs.default.name still works but is deprecated in favour of fs.defaultFS):

<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>
15. Edit mapred-site.xml

For versions < 2:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property> 

For versions >= 2, the following aux-service settings go into yarn-site.xml instead:
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
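On Hadoop 2 you usually also tell MapReduce to run on YARN; a minimal mapred-site.xml snippet for that (not part of the original notes) is:

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>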

16. Set the replication factor - edit hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>


17. Format the NameNode:
hadoop namenode -format
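On Hadoop 2 the hadoop namenode command is deprecated; the equivalent is:
hdfs namenode -format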

18. Run start-all.sh
(deprecated for versions >= 2 but still works; it starts both the HDFS and the MapReduce daemons)
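On Hadoop 2 the usual replacement is to start HDFS and YARN separately (the scripts live in sbin/), then check with jps:
start-dfs.sh
start-yarn.sh
jps          # lists the running daemons, e.g. NameNode, DataNode, ResourceManager, NodeManager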
----- Single-node setup ends here -----

For a multi-node cluster:

19. Make sure everything above works on the single node.

20. Create a few more EC2 instances in the same way. On each machine update the hosts file (sudo vi /etc/hosts) and add the IP address and hostname of every node, e.g.:

10.23.45.67 master
10.11.11.11 slave1
10.12.12.12 slave2

21. Copy the master's public key into the authorized keys of each slave:
open /home/hduser/.ssh/id_rsa.pub on the master and append its contents to
/home/hduser/.ssh/authorized_keys on slave1 (and slave2).
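If password logins are enabled on the slaves, ssh-copy-id does the same thing in one command (on stock EC2 instances password auth is usually off, so you may still have to paste the key manually as above):
ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@slave1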

22. Verify that you can ssh from the master to both the master itself and the slaves.

23. Typing ssh slave1 should log you into the slave machine without a password prompt; exit to return to the master.

24. On the master, edit conf/masters and add:

master

25. On the master, edit conf/slaves and add the following (master is listed too because, in this setup, it also acts as a worker node):

master
slave1
slave2

26. Update core-site.xml on ALL machines:
 <property>
  <name>fs.default.name</name>
  <value>hdfs://IPADDRESS OF MASTER HERE:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

27. Update mapred-site.xml on all machines:
<property>
  <name>mapred.job.tracker</name>
  <value>masterIP ADDRESS:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property> 

28. Set the replication factor - update hdfs-site.xml (2, now that there is more than one DataNode):
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property> 

29. Format the NameNode on the master (hadoop namenode -format, as in step 17).

30. Run start-all.sh on the master.
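A quick way to check that the daemons came up (not in the original notes) is to run jps on every machine:
jps
# master (which is also listed in conf/slaves here): NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker
# slaves: DataNode, TaskTracker
# on Hadoop 2, ResourceManager/NodeManager appear instead of JobTracker/TaskTracker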


This is what I put together from reading various blogs. I have installed Hadoop many times and felt I should write the steps down. It is a mix of both single-node and multi-node cluster setup.

I used http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#java-io-ioexception-incompatible-namespaceids but modified it in many ways.
    

Sunday 17 November 2013

PREGEL : Graph Framework

PREGEL 


This paper describes Pregel, a large-scale graph processing framework designed by Google. It first highlights the main problems in processing large graphs and then explains how Pregel works. One advantage of Pregel is that Pregel programs scale automatically, so programmers can concentrate on the algorithm they are implementing without worrying about fault tolerance and other mechanisms. In short, Pregel is a simple, flexible yet powerful framework that will find use in distributed social networks and other large graph processing applications.

Summary:
1) As networks grow, representing them is becoming more and more difficult, with billions or even trillions of nodes. This paper provides a computation model for processing such large graphs.
2) Pregel is inspired by Valiant's Bulk Synchronous Parallel model.
3) It focuses on message passing instead of remote reads.
4) The authors present a model of computation: the input to Pregel is a large graph with each vertex identified by an ID, and a sequence of supersteps is executed until the algorithm terminates.
5) The paper provides a C++ API and examples of algorithms that can be written with Pregel, such as PageRank, clustering, and shortest paths.
6) The network is used only for sending messages, so communication overhead is significantly reduced.
7) Experiments show how the system scales as the number of vertices changes.

Cons:
1) The authors describe a fairly complex system but leave out many details.
2) Not many experiments are provided.
3) MapReduce is mentioned, but no evaluation comparing the two is provided.
4) What happens in case of a master failure is not answered.