Thursday 12 March 2015

Spark Streaming

In my previous post I mentioned the Spark stack. In this post I will give a brief overview of one of its components, Spark Streaming.
Spark Streaming is an extension of Apache Spark that allows processing of live streams of data.
Data can be ingested into Spark Streaming from sources such as Kafka, Flume, Twitter or TCP sockets.

Live data is broken into chunks (batches) at a predefined time interval. Each chunk of data is represented as an RDD and is processed using RDD operations. Once the operations are performed, the results are also returned in batches.
The DStream (discretized stream) is the basic abstraction in Spark Streaming. A DStream represents a continuous stream of data as a sequence of such RDDs, one per batch interval. DStreams are created from streaming input sources like Kafka and Twitter, or by applying transformation operations to an existing DStream.

As mentioned above, the incoming data is processed at a predefined interval. All the data received during an interval is stored across the cluster for that interval, which results in the creation of a dataset. Once the time interval is completed, the dataset is processed using various operations, such as map-reduce style transformations or joins.
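As a rough sketch of what such per-interval processing can look like (here lines is a hypothetical DStream[String], for example a socket text stream like the one created further below), a map-reduce style word count per batch would be:

// Hypothetical input: lines is a DStream[String], one RDD per batch interval.
val words = lines.flatMap(_.split(" "))     // map step: split each line into words
val pairs = words.map(word => (word, 1))    // emit (word, 1) pairs
val counts = pairs.reduceByKey(_ + _)       // reduce step: sum counts per word within each batch
counts.print()                              // print a few counts from every batch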
StreamingContext is the main entry point of a Spark Streaming application.
val ssc = new StreamingContext(sparkContext, Seconds(1))
Using ssc, DStreams can be created that represent streaming data from input sources such as TCP sockets or Twitter.
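If you do not already have a SparkContext at hand, the StreamingContext can also be built directly from a SparkConf. A minimal sketch (the application name and master are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Use at least 2 local threads so one can receive data while the other processes it.
val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
// Batch interval of 1 second: incoming data is grouped into 1-second RDDs.
val ssc = new StreamingContext(conf, Seconds(1))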
TCP
val sData = ssc.socketTextStream("1.2.3.4", 1000)
Here the first parameter is the IP address (or hostname) of the server and the second is the port number. sData is a DStream of the text data that will be received from that server.
Twitter
val tData = TwitterUtils.createStream(ssc, Some(oauth))
Here oauth is an OAuth authorization object; Twitter uses OAuth to authorize API requests.
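As a rough sketch of how such an object can be built (the credential strings are placeholders; this uses the twitter4j classes that the spark-streaming-twitter module builds on):

import twitter4j.auth.OAuthAuthorization
import twitter4j.conf.ConfigurationBuilder

// Placeholder credentials: register an application with Twitter to obtain real values.
val twitterConf = new ConfigurationBuilder()
  .setOAuthConsumerKey("consumerKey")
  .setOAuthConsumerSecret("consumerSecret")
  .setOAuthAccessToken("accessToken")
  .setOAuthAccessTokenSecret("accessTokenSecret")
  .build()

// This is the object passed to TwitterUtils.createStream via Some(oauth) above.
val oauth = new OAuthAuthorization(twitterConf)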

Once this is done, transformations are applied to the created DStreams. One such transformation is flatMap.
val hashTag = tData.flatMap(status => getTag(status))
val words = sData.flatMap(_.split(" "))
flatMap is similar to map, but each input item can be mapped to zero or more output items, so the result is a flattened sequence of data.
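Putting the pieces together, a minimal sketch of the remaining steps could look like this (getTag above is assumed to be a helper that extracts the hashtags from a tweet, and nothing actually runs until the streaming context is started):

// Count how often each hashtag appears in each batch.
val tagCounts = hashTag.map(tag => (tag, 1)).reduceByKey(_ + _)
tagCounts.print()

// Start the computation and wait for it to terminate.
ssc.start()
ssc.awaitTermination()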