## Sunday, 2 March 2014

### The Curious Case of Leonardo DiCaprio's Oscar: Sentiment Analysis

I was very excited last night for the Oscars, as Leonardo DiCaprio was among the final Best Actor nominees. Though he has done some brilliant films in the past and is a great actor, I was not confident that this movie would win him the award, as I felt he has done much better work in other films. But still, fingers were crossed for a brilliant actor like Leonardo. I was curious to see how Twitter was reacting to the Oscars, so I ran sentiment analysis on tweets to see what people's view of Leonardo was just before the ceremony: how many wanted him to win, and how many felt that Leonardo was not the right person for the Oscar and some other actor should win it.
Sentiment analysis on the tweets gave me interesting results.
Steps:
1. Extract tweets with a hashtag on Leonardo
2. Generate a CSV of the tweets
3. Extract the required information
4. Natural language processing: tokenizing, stemming, etc.
5. Label them as positive, negative, or neutral
6. Apply Naive Bayes.
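Steps 4 to 6 can be sketched as a minimal, self-contained Naive Bayes over word counts. This is pure Python; the tiny training set and the `clean` rules are illustrative assumptions, not my actual data or code:

```python
import math, re
from collections import Counter, defaultdict

def clean(tweet):
    """Strip usernames, hashtag markers, and links; lowercase and tokenize."""
    tweet = re.sub(r"http\S+|@\w+", " ", tweet)   # drop links and @usernames
    tweet = tweet.replace("#", " ")                # keep hashtag words, drop '#'
    return re.findall(r"[a-z']+", tweet.lower())

class NaiveBayes:
    def train(self, labelled_tweets):
        self.word_counts = defaultdict(Counter)    # label -> word frequencies
        self.label_counts = Counter()
        for tweet, label in labelled_tweets:
            self.label_counts[label] += 1
            self.word_counts[label].update(clean(tweet))
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def classify(self, tweet):
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label, n in self.label_counts.items():
            score = math.log(n / total)            # log prior of the class
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in clean(tweet):
                # Laplace smoothing so unseen words don't zero out the score
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# illustrative training data, labelled by hand
nb = NaiveBayes()
nb.train([("Leonardo better win an Oscar tonight", "positive"),
          ("hope Leonardo wins the oscar", "positive"),
          ("Leonardo doesn't deserve an oscar", "negative"),
          ("never deserved an oscar, deal with it", "negative")])
print(nb.classify("Leonardo deserves to win"))
```

In practice the same shape works with a real tokenizer and stemmer swapped into `clean`.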

Positive Tweets
RT @FindingSquishy_: If #Leonardo Di Caprio wins an Oscar tonight, Tumblr will probably break
if #Leonardo di Caprio doesn't win an oscar I am going to scream
RT @Mohammed_Meho: #Leonardo Di Caprio better win an Oscar tonight.
RT @Miralemcc: #The Wolf of the Wall Street and# Leonardo di Caprio for #Oscars2014

Negative Tweet
#Leonardo Di Caprio doesn't deserve and never has deserved an oscar. Deal with it

.............................................

Step 1 is scraping tweets for the required tag. This can be done using the Twitter API, or you can use online tweet-search sites and extract the search results from them. There are many sites that give you direct sentiment analysis results, like the NCSU project:
http://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
or the Stanford project, Sentiment140:
http://www.sentiment140.com/
But I chose TwitterSeeker, which just gives you search results without sentiments, since I wanted to do the sentiment analysis myself.
TwitterSeeker generates an Excel sheet with all the tweet information.

You can filter the results by selecting English as the language; in the image I applied no filter.
The generated Excel file has the user name, time of posting, the tweet itself, and many other optional fields. In the current case I am only concerned with the tweet.

STEP 2: Generate a CSV of tweets.
As input to the ML algorithms I used a CSV file. CSV is the Comma-Separated Values format, in which each column is separated by a delimiter. After getting the Excel file from TwitterSeeker, I converted it into a CSV file.
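Reading the tweet column back out of that CSV is straightforward with Python's `csv` module. A small sketch; the column name `tweet` and the file name are my assumptions about the export, not the exact headers TwitterSeeker produces:

```python
import csv

def load_tweets(path, column="tweet"):
    """Read one column from a CSV export, skipping rows where it is empty."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)          # first row is treated as the header
        return [row[column] for row in reader if row.get(column)]
```

Usage would look like `tweets = load_tweets("leonardo_tweets.csv")`.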

STEP 3: Extract the required information.
This is the step where your knowledge of data mining comes into use. At present I am only concerned with one column, the tweet. A tweet generally comes in a form
that can vary quite randomly.
So I removed all the unnecessary tokens from it: all usernames, tags, and links.
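That clean-up can be sketched with regular expressions; the exact patterns below are my assumption about what counts as noise:

```python
import re

def strip_noise(tweet):
    """Remove @usernames, #hashtags, retweet markers, and links from a tweet."""
    tweet = re.sub(r"http\S+", "", tweet)      # links
    tweet = re.sub(r"[@#]\w+", "", tweet)      # usernames and hashtags
    tweet = re.sub(r"\bRT\b", "", tweet)       # retweet marker
    return re.sub(r"\s+", " ", tweet).strip()  # collapse leftover whitespace

print(strip_noise("RT @FindingSquishy_: If #Leonardo Di Caprio wins an Oscar tonight"))
```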


## Saturday, 1 March 2014

### Keyword Analysis on Apache Hadoop Issue Tracker

In my recent project, titled "Recommending Similar Defects and Effort Estimates for the Apache Hadoop Issue Tracker",
I wrote Python code to extract the most-used Hadoop-specific keywords in the issue tracker, after removing irrelevant words and stop words from the list.
I am classifying them into various classes like HDFS, Hadoop, Error, Mining, DataNode, etc. Some of the words found on the list are posted.
Click for the word list

The list has approximately 3,700 words.
Duplicate words were removed from the list.
The list was built by analyzing the first 200 defects from Hadoop Common and Hadoop HDFS.
Both the summary and the description of each defect were analyzed, and words were selected based on their usefulness for defect analysis.

The stop-word list was prepared by combining various lists available online, like FoxStoplist.txt.
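The extraction could be sketched like this. Pure Python; the sample issue texts and the tiny stop-word set are illustrative stand-ins, not the combined list used in the project:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "when", "this", "or", "may", "an", "in", "not"}  # tiny sample

def keywords(texts, top=5):
    """Count non-stop-word tokens across issue summaries/descriptions."""
    counts = Counter()
    for text in texts:
        # keep dotted identifiers like dfs.nameservices as single tokens
        tokens = re.findall(r"[A-Za-z][\w.]*", text)
        counts.update(t for t in tokens if t.lower() not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top)]

issues = ["FileSystem.getUri returns hdfs URI when HA is enabled",
          "HA setup: FileSystem.getUri returns the nameservice host"]
print(keywords(issues))
```

Deduplication falls out of the `Counter` for free; dumping `counts.keys()` gives the unique word list.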

I believe this list might be useful for someone working on language processing over issue-related words and Hadoop-specific words.
Defect example: HDFS-6001
Description: When HDFS is set up with HA enabled, FileSystem.getUri returns hdfs://<dfs.nameservices>, where dfs.nameservices is what is defined when HA is enabled.
Per the documentation this is probably OK, or even intended. But a caller may further process the URI, for example by calling URI.getHost(). This will return 'mycluster', which is not a valid host anywhere.
Summary: In an HDFS HA setup, FileSystem.getUri returns hdfs://<dfs.nameservices>

Keywords: #Hdfs #dfs.nameservices #FileSystem #getUri #Nameservices #host #URI #HA #returns

## Saturday, 22 February 2014

### Supervised Learning : A Mathematical Foundation

Supervised learning, as per Wikipedia:

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances.

In short: learning, from historical data, a mapping between input and output variables, and applying it to predict the output for unseen data.

STEPS of SUPERVISED LEARNING

• Determine the type of training example
  • What kind of data is to be used (for handwriting analysis: a single handwritten character, a word, or a line)
• Gather a training set
• Determine the input feature representation of the learned function
  • The input object is transformed into a feature vector
• Determine the structure of the learned function and the corresponding learning algorithm
  • e.g. SVM or decision trees (DT)
• Complete the design
  • Run the algorithm on the training set
• Evaluate accuracy
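A toy end-to-end run of those steps, with a one-feature threshold classifier standing in for SVM/DT. All the data and the stump learner are illustrative assumptions:

```python
# Steps: training examples -> feature vectors -> pick a learner -> train -> evaluate.

def train_stump(examples):
    """Learn a threshold t on a 1-D feature minimising training error."""
    best = (None, float("inf"))
    for t, _ in examples:                       # candidate thresholds from the data
        errors = sum((x >= t) != y for x, y in examples)
        if errors < best[1]:
            best = (t, errors)
    return best[0]

def accuracy(t, examples):
    return sum((x >= t) == y for x, y in examples) / len(examples)

# Steps 1-3: training examples as (feature, label) pairs,
# e.g. stroke length -> "is a word" (purely made up)
train = [(1.0, False), (2.0, False), (3.0, True), (4.0, True)]
test  = [(1.5, False), (3.5, True)]

t = train_stump(train)      # steps 4-5: run the chosen learner on the training set
print(accuracy(t, test))    # step 6: evaluate on unseen data
```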

Mathematics behind Supervised Learning

Goal: infer a function (classifier)

$f : X \to Y$

from sample data

$A_n = ((x_1, y_1), \ldots, (x_n, y_n)) \in (X \times Y)^n,$

with input points $x_i \in X$ and output points $y_i \in Y$.

$y_i \in \mathbb{R}$ for regression problems; $y_i$ is discrete for classification problems, e.g. $y_i \in \{-1, +1\}$.

Two ingredients:
1) A function $f$ to model the dependency in $P(x, y)$.
2) An error, or loss, between the prediction $f(x)$ and the desired output $y$.

Loss function: $L : Y \times Y \to \mathbb{R}^+$

For binary classification with $Y = \{-1, +1\}$: $L(h(x), y) = \frac{1}{2} |h(x) - y|.$

For unsupervised learning, a loss on the prediction alone can be used, e.g. $L_u(h(x)) = -\log(h(x)).$

The risk, or generalization error, of a function is its expected loss. Classification seeks the function $f$ that minimises $R(f)$, but the joint probability $P(x, y)$ is unknown. Written in terms of the input and output random variables, the risk is

$R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y)\,dP(x, y).$

Empirical Risk Minimisation:

The learning algorithm seeks a hypothesis $h$ for which $R(h)$ is minimal:

$h^* = \arg \min_{h \in \mathcal{H}} R(h).$

$R(h)$ cannot be computed, since $P(x, y)$ is unknown, so it is approximated by the empirical risk, an average of the loss over the sample:

$\! R_\mbox{emp}(h) = \frac{1}{m} \sum_{i=1}^m L(h(x_i), y_i).$

By the law of large numbers, as $m$ goes to infinity $R_{\mbox{emp}}(h)$ converges pointwise to $R(h)$. The learner therefore minimises the empirical risk:

$\hat{h} = \arg \min_{h \in \mathcal{H}} R_{\mbox{emp}}(h).$
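Empirical risk minimisation over a finite hypothesis class can be written out directly. The threshold hypotheses and the sample below are illustrative, not tied to any real dataset:

```python
def loss(pred, y):
    """Binary loss L(h(x), y) = 1/2 |h(x) - y| for labels in {-1, +1}."""
    return 0.5 * abs(pred - y)

def emp_risk(h, sample):
    """R_emp(h): average loss of hypothesis h over the labelled sample."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

# finite hypothesis class H: sign thresholds at a few candidate points
H = [lambda x, t=t: 1 if x >= t else -1 for t in (0.0, 1.0, 2.0, 3.0)]

sample = [(0.5, -1), (1.5, -1), (2.5, 1), (3.5, 1)]
h_hat = min(H, key=lambda h: emp_risk(h, sample))   # arg min of R_emp over H
print(emp_risk(h_hat, sample))
```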

Consistency is important:

Thus learning depends on the function $h$ (or $f$ above) and the class it is drawn from.
The learned function has to have smooth boundaries, so regularization is applied: regularization theory.

Regularized Risk:

$R_{reg}(f) = R_{\mbox{emp}}(f) + \lambda\, \Omega(f),$

where $\Omega(f)$ is a roughness penalty and $\lambda$ trades it off against the empirical risk.

Risk Bounds

Given $\mathcal{H}$, the sample $A_n$, and $\delta > 0$, one wants a bound that holds for any $f \in \mathcal{H}$ with probability at least $1 - \delta$.

Consider the case of a finite class of functions, $|\mathcal{H}| = N$. Hoeffding's inequality, summed over the set (a union bound), gives, with probability at least $1 - \delta$,

$R(f) \leq R_{\mbox{emp}}(f) + \sqrt{\frac{\log N + \log(1/\delta)}{2n}}.$

The risk is thus bounded by the sum of two terms: the empirical error, and a term that depends on the size of the class. As $n \to \infty$ the second term tends to 0.
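The size of that second term is easy to tabulate; the values of N, delta, and n below are just sample numbers:

```python
import math

def hoeffding_term(N, delta, n):
    """Capacity term sqrt((log N + log(1/delta)) / (2n)) of the finite-class bound."""
    return math.sqrt((math.log(N) + math.log(1 / delta)) / (2 * n))

# the term shrinks as the sample grows, and grows with the class size N
for n in (100, 10_000, 1_000_000):
    print(n, round(hoeffding_term(N=1000, delta=0.05, n=n), 4))
```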
Bias-Variance Dilemma

If $\mathcal{H}$ is large, one can find an $f$ that fits the data, but it also fits the noise in the sample points, resulting in poor performance on new data.
This is called overfitting.

Overfitting according to Wiki
Generally, a learning algorithm is said to overfit relative to a simpler one if it is more accurate in fitting known data (hindsight) but less accurate in predicting new data (foresight).

VC dimension

A measure of the capacity of a function class: its ability to realize different labelings of a set of points (to shatter them).

STRUCTURAL RISK MINIMISATION
Structural risk minimization seeks to prevent overfitting by incorporating a regularization penalty into the optimization. The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.

In the SVM setting, this amounts to selecting the classifier that maximizes the margin $\gamma$.

*Formulas taken from the paper by Cunningham:
http://tinyurl.com/ll4jxhq