Wednesday, 24 June 2015

Playing with Tumblr API

1. Install OAuth2
2. Install Pytumblr
3. Register an application
4. Get your Tumblr OAuth credentials; you will receive them once you create the app

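5. Enter your credentials in the following code in a Python file
import pytumblr
client = pytumblr.TumblrRestClient('<consumer_key>', '<consumer_secret>', '<oauth_token>', '<oauth_secret>')
pytumblr is a library through which you can make calls to the Tumblr API.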

6. Code to get all blogs you are following
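off = 0
while True:
    my_dict = client.following(offset=off)
    res = my_dict['blogs']
    if not res:
        break  # an empty page means every followed blog has been listed
    for rs in res:
        print(rs['name'] + "...." + rs['title'])
    off += len(res)  # move on to the next page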
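The offset-based paging above can be pulled out into a small reusable helper. This is only a minimal sketch and is not part of pytumblr: paginate and fake_following are hypothetical names, and the fake function stands in for client.following so the loop can be run without credentials.

```python
def paginate(fetch, key, page_size=20):
    """Yield every item from an offset-paged endpoint.

    fetch is any callable that accepts offset= and limit= keyword
    arguments and returns a dict; key names the list inside that dict.
    """
    offset = 0
    while True:
        page = fetch(offset=offset, limit=page_size)[key]
        if not page:  # an empty page means there is nothing left
            return
        for item in page:
            yield item
        offset += len(page)


# Hypothetical stand-in for client.following, so the sketch runs offline.
_BLOGS = [{"name": "blog%d" % i, "title": "Blog %d" % i} for i in range(45)]


def fake_following(offset=0, limit=20):
    return {"blogs": _BLOGS[offset:offset + limit]}


names = [blog["name"] for blog in paginate(fake_following, "blogs")]
print(len(names))
```

With a real client, paginate(client.following, 'blogs') should behave the same way, assuming the endpoint accepts offset and limit as it does here.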
7. Number of posts liked for each blog
off = 0
like_dict = {}
while True:
    # The first argument is the blog name (left blank here).
    my_dict = client.blog_likes('', offset=off)
    res = my_dict['liked_posts']
    if not res:
        break  # no more liked posts to read
    for rs in res:
        strs = str(rs['tags']).strip('[]')
        # print(rs['blog_name'] + " " + strs)
        if rs['blog_name'] in like_dict:
            like_dict[rs['blog_name']] += 1
        else:
            like_dict[rs['blog_name']] = 1
    off += len(res)  # move on to the next page
for the_key, the_value in like_dict.items():
    print(the_key, 'corresponds to', the_value)
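The per-blog tally can also be written with collections.Counter from the standard library. The sketch below uses made-up records shaped like the items in 'liked_posts'; the blog names and counts are illustrative, not real API output.

```python
from collections import Counter

# Made-up records shaped like the items in 'liked_posts'; only the
# 'blog_name' field is used here.
liked_posts = [
    {"blog_name": "staff"},
    {"blog_name": "engineering"},
    {"blog_name": "staff"},
]

# Counter handles the "first time seen" case that the if/else covers above.
like_counts = Counter(post["blog_name"] for post in liked_posts)

for blog_name, count in like_counts.most_common():
    print(blog_name, "corresponds to", count)
```

most_common() lists the blogs from most liked to least liked, which is usually the order one wants to read them in.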
8. Sample output for the code in step 6
sportspage....Sports Page
themobilemovement....The Mobile Movement
adidasfootball....adidas Football
instagram-engineering....Instagram Engineering
sony....Sony on Tumblr
yahoolabs....Yahoo Labs
taylorswift....Taylor Swift
beyonce....Beyoncé | I Am
itscalledfutbol....Did someone say "futbol"?
futbolarte....Futbol Arte
fcyahoo....FC Yahoo
yahooscreen....Yahoo Screen
engineering....Tumblr Engineering
yahoodevelopers....Yahoo Developer Network
mongodb....The MongoDB Community Blog
yahooeng....Yahoo Engineering
marissamayr....Marissa's Tumblr
staff....Tumblr Staff

narendra-modi....Narendra Modi
nytvideo....New York Times Video
bonjovi-is-my-life....Bon Jovi♥ Is My Life
game-of-thrones....You win or you die.
teamindiacricket....Team India
gameofthrones....Game of Thrones: Cast A Large Shadow
forzaibra....Forza Ibra

Friday, 24 April 2015

Tumblr Blog

My Tumblr Blog

Why I am moving to Tumblr:
1. Lots of stuff to read
2. Simple to use.
3. 230 Million Blogs
4. No Charge.
5. Great Design optimized for Mobile.
6. Easier to get followers

7. I am joining Yahoo :)

This blog will remain active, but most of the new posts will be on Tumblr.

Thursday, 12 March 2015

Spark Streaming

In my previous post I mentioned the Spark stack. In this post I give a brief overview of one of its components, Spark Streaming.
Spark Streaming is an extension to Apache Spark that allows processing of live streams of data.
Data in Spark Streaming can be ingested from Kafka, Flume, Twitter, or TCP sockets.

Live data is broken into chunks (batches), one per predefined time interval. Each chunk of data is represented as an RDD and is processed using RDD operations. Once the operations have been performed, the results are also returned in chunks.
DStream (discretized stream) is the basic abstraction in Spark Streaming. A DStream represents a continuous stream of data and is implemented as a sequence of RDDs, one per batch interval. DStreams are created from streaming input sources such as Kafka or Twitter, or by applying transformation operations to an existing DStream.
Spark Streaming

As mentioned above, incoming data is processed at a predefined interval. All the data received during an interval is stored across the cluster for that interval, which results in the creation of a dataset. Once the interval has elapsed, the dataset is processed using various operations, such as map, reduce, or join.
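The batching idea can be illustrated without Spark. The following is a toy, single-machine sketch in plain Python: micro_batches is a hypothetical helper and the events are made up. It only shows how records are grouped by interval before each group is processed as a whole.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) pairs into fixed-width time windows.

    Returns a list of (window_start, values) pairs ordered by window start.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = int(timestamp // interval) * interval
        windows[window_start].append(value)
    return sorted(windows.items())


# Made-up events: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

batches = micro_batches(events, interval=1)
for window_start, values in batches:
    # Each batch is processed as a whole, here with a simple count.
    print(window_start, len(values))
```

In Spark Streaming the grouping happens continuously as data arrives and each batch is distributed across the cluster; the sketch groups a finished list only to keep the example short.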
StreamingContext is the main entry point of a Spark Streaming application.
val ssc = new StreamingContext(sparkContext, Seconds(1))
Using ssc, DStreams can be created that represent streaming data from input sources such as a TCP socket or Twitter.
val sData = ssc.socketTextStream("", 1000)
Here the first parameter is the IP address (or hostname) and the second is the port number. sData is the DStream of data that will be received from the server.
val tData = TwitterUtils.createStream(ssc, oauth)
Here oauth denotes the OAuth credentials. Twitter uses OAuth to authorize requests.

Once this is done, transformations are applied to the created DStream. One such transformation is flatMap.
val hashTag = tData.flatMap(status => getTag(status))
val words = sData.flatMap(_.split(" "))
flatMap is an operation similar to map, except that each input item is mapped to zero or more output items, and the results are flattened into a single sequence.
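The difference between map and flatMap is easy to see on a small list. This is a plain-Python sketch of the idea, not the Spark API; flat_map is a hypothetical helper built on itertools.chain.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and concatenate the resulting sequences."""
    return list(chain.from_iterable(func(item) for item in items))


lines = ["spark streaming", "", "hello world again"]

mapped = [line.split() for line in lines]  # one list per input line
words = flat_map(str.split, lines)         # one flat list of words

print(mapped)
print(words)
```

The empty line contributes an empty list under map but nothing at all under flat_map, which is the "zero or more output items" behaviour described above.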

Thursday, 25 December 2014

Apache Spark a brief overview

Apache Spark, which began as a PhD research project at UC Berkeley, came into the limelight recently when it broke the record for sorting a petabyte of data.
Various Sorting Records.

What is Spark?
Apache Spark™ is a fast cluster computing framework and general engine for large-scale data processing.
Spark Goals:
Generality: Diverse workloads, operators, job sizes.
Latency: Low latency
Fault Tolerance

Spark works with Hadoop, Amazon S3, and Cassandra, and with cluster managers such as YARN and Mesos. Spark does more of its data processing in memory, whereas Hadoop MapReduce relies on disk-based processing.

Spark stack comes bundled with tools like Spark SQL, MLlib, Spark Streaming and GraphX.

  1. Spark SQL: Unified access to structured data; provides compatibility with Apache Hive and standard connectivity to tools through JDBC and ODBC.
  2. Spark Streaming: For scalable, fault-tolerant streaming applications; Spark can run in both batch and interactive mode.
  3. MLlib: Scalable machine learning library.
  4. GraphX: Large-scale graph processing framework.
Learn More about Spark

Spark vs Hadoop

Spark can be up to 100x faster than Hadoop MapReduce for workloads that fit in memory.
The speed can be attributed to the fact that Spark keeps intermediate data cached in memory in the local JVM. Hadoop, on the other hand, writes intermediate data to disk in the name of fault tolerance, and disk I/O is expensive.
*Image from Spark

Spark doesn't replace anything in the Hadoop ecosystem; rather, it offers a readable, testable way to write programs, freeing us from painful MapReduce jobs. The MR model is unsuitable for iterative algorithms, and MR jobs are a pain to program too. Tools such as Hive and Cascading reduce the effort of writing MR jobs, but internally they still run MR jobs and so do not improve performance.

Spark programming model
The main abstraction for computation in Spark is the Resilient Distributed Dataset (RDD). "Resilient" means that lost data can be recomputed automatically.

What is an RDD?
RDD (Spark paper):
RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

Spark transformations apply the same operation to many data items. This gives better fault tolerance, because only the lineage of transformations is logged rather than the actual data. If some partitions of an RDD are lost, Spark has enough information about how they were derived from other RDDs to recompute them.
RDDs can be created through transformations (map, filter, join). Spark builds a Directed Acyclic Graph (DAG) of these transformations. Once the RDDs are defined through transformations, actions can be applied to them.
Actions are operations that return a value or write output (count, collect, and save). RDDs can be kept in memory or on disk by calling persist.
Since the result is just an RDD, it can be queried via the SQL interface, used by ML algorithms, and so on.
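The split between transformations (recorded, not executed) and actions (executed) can be sketched with a toy class. ToyRDD below is a hypothetical, single-machine illustration in plain Python. It is not the Spark API and handles only map and filter, but it shows how a lineage of steps is enough to compute, or recompute, a result.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original input data
        self.lineage = lineage   # tuple of (name, function) steps

    # Transformations: record a step, compute nothing.
    def map(self, func):
        return ToyRDD(self._source, self.lineage + (("map", func),))

    def filter(self, pred):
        return ToyRDD(self._source, self.lineage + (("filter", pred),))

    def _compute(self):
        # Replay the lineage over the source; this is also how a lost
        # result could be rebuilt.
        data = iter(self._source)
        for name, func in self.lineage:
            data = map(func, data) if name == "map" else filter(func, data)
        return data

    # Actions: run the recorded steps and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


numbers = ToyRDD(range(10))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print([name for name, _ in even_squares.lineage])
print(even_squares.collect())
print(even_squares.count())
```

Nothing is computed until collect or count is called, and each call replays the lineage from the source. Real Spark additionally partitions the data across machines and can cache results with persist.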

In short
User Program => 
Create Spark Context
sc = new SparkContext
Create distributed datasets called RDDs
Perform Operations.

Internally, the SparkContext acts as both client and master for each application.

Block tracker => tracks what is in memory and what is on disk
Shuffle => handles shuffle operations such as groupBy

The scheduler talks to the workers through the cluster manager.
Each worker contains a Block Manager for block management.
It receives tasks that run in thread pools.
Tasks can talk to HDFS.