Saturday, 13 September 2014

Data Analysis Yahoo Finance Data

What the Script does :
=> Downloads historical Stock Prices for a company from Yahoo Finance website.
=> Calculates the 10 day moving average.
=> Generate the graph for the data

a) Necessary Imports:
1. Import Panda. Panda is a data analysis library of python.  Pandas has tool to read and write between in-memory data structures and different file formats. It has efficient data frame object for data manipulation with better indexing support.
2. Import Datetime: Provides classes to manipulate date and time objects. Api to convert between different file formats.
3. Import matplotlib: Python 2D plotting library , simple to use and genrates graphs, plots etc with few lines of code. 
4. Import Numpy : Scientific computing package for python,N-Dimensional array object ,Linear algebra related functions.
5. Import urlib: URL handling module for python
import pandas as pd
from datetime import timedelta
import datetime as dt
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import numpy as numpy
from datetime import datetime
from matplotlib.pyplot import *
import matplotlib.dates as mdates

b) Create a class called Stock 
The main(Init) function should accept as parameter the company symbol, lookback period, window size,and the end date.
def __init__(self,symbol,lookback_period,window_size, 
Suppose your lookback period is 100 then get prices for 100 days.
SO to get prices for 100 days subtract from the end date the lookback period.
But for that you need to use same format . 
In this example I have used timedelta.  Time delta helps you get the start date by specifying the number of days  from given date. Like your end date is today and you want stoc prices for last 100 days ,we can use timedelta.

start = end - timedelta(days=lookback_period)
c) Convert date into required format.
start_date = start.isoformat() 
d)  get the required Url for data analysis. 
url = "{0}".format(symbol)
url += "&a={0}&b={1}&c={2}".format(start_month,start_day,start_year)
url += "&d={0}&e={1}&f={2}".format(end_month,end_day,end_year)
e) Parse data
df = pd.read_csv(urllib.request.urlopen(url))
#get the adj close from csv
saved_column = df['Adj Close']
#get the matching date
y_data = df['Date']
f) Get the moving average
def movingaverage(self,interval, window_size):
window = numpy.ones(int(window_size))/float(window_size)
return numpy.convolve(interval, window, 'same')

Moving Average smooths price fluctuations by removing the noise. It computes the averages of a subset of full data set.

Moving Average Wiki 

g) Generate the graph
 x_data = x_points[0:70]
#get the moving average
y_av = self.movingaverage(saved_column,window_size)
#generate graph
figure("Plot of stocks")
x = [dt.datetime.strptime(d,"%Y-%m-%d").date() for d in x_data]
ylabel("adjusted close")


Friday, 29 August 2014

A * Star Algorithm

A* is a path finding/Graph Traversal  algorithm generally used in game programming to determine the shortest path to reach a particular destination following Best First Search approach. Can be called as an extension of Dijkstra and reduces the number of comparisons while guaranteeing optimality. Heuristic is used in A*. A* uses heuristic to determine most promising node.   

Consider a grid with each square numbered. As in the above figure , cross sign (Sq. 14) represents start point and cross sign (Sq 35) represents destination. All Blue bars in a square represents the block. 
Each square considered a node.
Some terminology used with A*
Node Data
H Value :  Heuristic Value
G Value : Movement Cost                
F Value : G + H Value
Parent :  Pointer Back to previous node.

Open Node : List of Nodes to be visited
Close Node :  Already Visited.

Task : Shortest Path from 14 to 35.
H Value : Distance of a Node from destination node in our case from Sq 35.
Precompute all H value

If H value is Zero A* behaves like Dijksatra Algorithm. From Node 17 H value is 3 as you can go from 17 to 23 to 29 leading to 35
For Node 25 H value is 5 as you go from node 25 to 26 to 27 to 28 to 29 leading to 35 , block nodes are also considered for precomputing H values.

G Value is movement cost from Start node to another node.
Consider Movement cost of 10 For Horizontal and Vertical and of 14 for diagonal movement.
Diagonal > Horizontal (Pythagorean theoram)
So to Move from 14 to 8 the cost is 10 and from 14 to 9 it is 14.
Each square has parents.
For Sq. 14 we need to get parent , which can be done through open list and close list.
Write now no node on Close or open list.

All Node parent to14.
CloseList : 14
Open List : 7,8,9,13,15,19,20,21
Calculate G Cost for all. for 8 it is for Node 14 it is 0 for Node 8 it is 10 so total is 5. For Node 9 it is 14.

The Numbers in Black represents the H value of Nodes ie Distance of Node to Node 35.
The numbers in Blue represent G Value Node 8 as G Value 10.
F Value is sum of G Value + H Value which is written in red for the nodes. (For node 7 it is 17 => 10+7)
Need to Use Node with smallest F Value.
So in our case square number 15 as it has F-Value 15 which is minimum. Node 20 too has value 15 and can be selected too.
Take Node 15 put in close list.
Now repeat process for Node 15.

Now Calculate all values for Node 15.
CloseList : 14 , 15
Open List : 7,8,9,13,19,20,21,10
We see Node 14 already on Close list dont do anything.
Consider Node 8 so we check if it is more faster to go to 8 from 14 or  from 15.
For Node 15 We see :
G Value for Node 15 was 10 . G Value for Node 8 considering the current node 14 is 14
So total Movement is 14+10 ie 24
If the total movement > current G value dont change values else
If G value wrt node 15 is less then G Value of Node 8 then re parent node  8 to node 15.

If(Parents G value+Current G < Parents New G){
Add new Block to close list.
New Block parent = Prev Block Parent.

This process continues till you are next to destination node.

Closed : 14,15,20
Open : 7,8,9,13,19,21,10,22,25,27
Keep Doing and then see neighbor is destination so make destination parented to the other node and you are done.
Trace back path from destination to source.
35 to 29 to 22 to 15 to 14.
Make sure you change the parent if you find new G value smaller then previous computed.

All Green Arrow denotes final position of main node parents. Red is the path from source to destination.

Shortest Path : 14 ->15->22->29->35


Useful links for the topic 

Stanford Amits Page

Friday, 11 July 2014

Hadoop-HDFS Code Analysis-1

I had hard time understanding the HDFS code, not only because it was too big and complex but also because of lack of documentation available.

1. Create File System
FileSystem is an abstract class extending Hadoops configured class.
Like a Normal file system it defines all File functions.
All File System that Hadoop use indirectly extends FileSystem class.
Which File System to be invoked depends on  Hadoop Configuration file.
 Suppose we specify hdfs://localhost:9000 which tells hadoop uses DistributedFileSystem.
This file can be invoked from core-default.xml
file:/// invokes LocalFileSystem

2.Client invokes Server
mkdirs function assume calls the DistributedFileSystems mkdir

public boolean mkdirs(Path f, FsPermission permission) throws IOException {
    return mkdirsInternal(f, permission, true);

mkdir() calls mkdirInternal which inturn uses variable : dfs which is object of DFSClient
Apache File Descritption

DFSClient can connect to a Hadoop Filesystem and perform basic file tasks. It uses the ClientProtocol to communicate with a NameNode daemon, and connects directly to DataNodes to read / write block data. Hadoop DFS users should obtain an instance of DistributedFileSystem, which uses DFSClient to handle filesystem tasks.

Initialize Method is called.

All operation on DistributedFileSystem transferred to DFSClient.

public DFSClient(URI nameNodeUri, ClientProtocol rpcNamenode,
      Configuration conf, FileSystem.Statistics stats)
    throws IOException {

DFSclient method call forwarded to namenode and its assoicate variable rpcNameNode
Client Protocol provide all method through which methods from client to server can be invoked.
All method calls from rpcNamenode come down to Invokers invoke() method.
This way the calls are transferred to server side method calls.
Eventually it gets connected to Server class.
3 major components of server
Listener, Responder, Handler,


Thursday, 3 July 2014

Setting up Hadoop Development Envoirnment (Hadoop Snapshot Version 3)

If you want to play around with Hadoop code, you need to setup Hadoop Development environment.
Steps that needs to be followed regarding the same.

Some Softwares needed for Hadoop Installation:
4) JDK
5) Findbugs
6) Apache Ivy

1) Install Jdk

a) sudo apt-get install openjdk-7-jdk
b) set JAVA_HOME in .bashrc file
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH

2) Install Maven on your Ubuntu Machine

a) sudo wget
b) Once the tar is downloaded you need to untar it using
tar -xzf /apache-maven-3.2.2-bin.tar.gz -C /usr/local
c) ln -s apache-maven-3.2.2 maven
d) add following lines to .bashrc file
export MAVEN_HOME=/usr/local/maven

3) Install Protocol Buffer
a) Unpack
b) tar zxvf protobuf-2.5.tar.gz
c) cd protobuf-2.5
d) ./configure
e) make
f) sudo make install

4) Install Apache Ant, Apache Ivy and Firebug
Unpack them,set home and add them as a path in .bashrc file

5)Clone Hadoop Codebase
git clone git:// hadoop

6)Go to directory hadoop
mvn clean install -DskipTests -Pdist

That's it your Hadoop is ready for development

Next task run examples on hadoop

7) Go to Hadoop-dist folder for Hadoop current release

8) edit .bashrc
vi /home/hduser/.bashrc
HADOOP_HOME will be a folder which has bin in it so in our case /usr/local/hadop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT

Add following lines
HADOOP_HOME= export /usr/local/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT
export PATH=$PATH:$HADOOP_HOME/sbin 

9)Now update xmls 
a) cd /usr/local/hadop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/
vi ./etc/hadoop/
c)add the following
export JAVA_HOME= /usr/lib/jvm/java-7-openjdk-amd64
d)test by typing ./bin/hadoop version
 Hadoop 3.0.0-SNAPSHOT
Subversion git:// -r 2b58a0b0b7928c5de2f3bbee04606790fa8345d6
Compiled by hduser on 2014-07-04T00:29Z
Compiled with protoc 2.5.0
From source with checksum 98bf118b8fd146ceb73158fc38fe1e9
This command was run using /home/hduser/projects/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/hadoop-common-3.0.0-SNAPSHOT.jar
My output: 

Hadoop 3.0.0-SNAPSHOT
Subversion git:// -r 2b58a0b0b7928c5de2f3bbee04606790fa8345d6
Compiled by hduser on 2014-07-04T00:29Z
Compiled with protoc 2.5.0
From source with checksum 98bf118b8fd146ceb73158fc38fe1e9

This command was run using /home/hduser/projects/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/hadoop-common-3.0.0-SNAPSHOT.jar

e)Configure Hadoop env

vi ./etc/hadoop/core-site.xml

f) configure yarn-site.xml
#Paste following between <configuration>
<value>mapreduce_shuffle</value> </property> <property>    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>    <value>org.apache.hadoop.mapred.ShuffleHandler</value> 

g) cp ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml
h)Add following lines between Configuration tags for mapred-site.xml


i)Now make the directory for namenode and datanode
mkdir -p /tmp/namenode
mkdir -p /tmp/datanode

j)update hdfs-site.xml
vi ./etc/hadoop/hdfs-site.xml


k)./bin/hdfs namenode -format
Should give something like

32454 NameNode
760 ResourceManager
32706 DataNode
558 SecondaryNameNode
2516 Jps
1044 NodeManager

n) ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar pi 2 5

Thats it
You are now ready to run examples, write or modify code.

This is the way I installed Hadoop
Check these tutorial really awesome one for Hadoop Installation.