Saturday 1 March 2014

Keyword Analysis on Apache Hadoop Issue Tracker

In my recent project, titled "Recommending similar defects and Effort Estimate for Apache Hadoop Issue Tracker",
I wrote Python code to extract the most frequently used Hadoop-specific keywords in the issue tracker, after removing irrelevant words and stop words from the list.
I am classifying the keywords into categories such as HDFS, Hadoop, Error, Mining, DataNode, etc. Some of the words found on the list are posted below.
Click for the word list
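
The extraction step itself is straightforward; the sketch below shows one way it could look. This is a minimal illustration, assuming a defects.csv export with Summary and Description columns and a stopwords.txt file, not the exact code used in the project.

import csv
import re
from collections import Counter

def load_stop_words(path="stopwords.txt"):
    # One stop word per line, lowercased into a set
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def extract_keywords(defects_csv="defects.csv", stop_words=None, top_n=50):
    # Count word frequencies over the Summary and Description columns
    stop_words = stop_words or set()
    counts = Counter()
    with open(defects_csv, newline="") as f:
        for row in csv.DictReader(f):
            text = "%s %s" % (row.get("Summary", ""), row.get("Description", ""))
            # Keep alphanumeric tokens, including dotted names like dfs.nameservices
            for token in re.findall(r"[A-Za-z][A-Za-z0-9_.]*", text):
                word = token.lower().strip(".")
                if word and word not in stop_words:
                    counts[word] += 1
    return counts.most_common(top_n)

if __name__ == "__main__":
    for word, freq in extract_keywords(stop_words=load_stop_words()):
        print(word, freq)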



The list has approximately 4,700 words.
Duplicate words were removed from the list.
The list was built from the first 200 defects in Hadoop Common and Hadoop HDFS.
Both the Summary and Description of each defect were analyzed, and words were selected based on their usefulness for defect analysis.
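
The post does not reproduce the collection script, but for anyone who wants to repeat the exercise, the public JIRA REST API at issues.apache.org is one way to pull the same defects. The sketch below is an assumption about how that could be done, not the code used for this analysis.

import json
import urllib.parse
import urllib.request

JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"

def fetch_issues(project, max_results=200):
    # Return (summary, description) pairs for the oldest issues of a project
    params = urllib.parse.urlencode({
        "jql": "project = %s ORDER BY created ASC" % project,
        "fields": "summary,description",
        "maxResults": max_results,
    })
    with urllib.request.urlopen(JIRA_SEARCH + "?" + params) as resp:
        data = json.load(resp)
    return [(issue["fields"]["summary"], issue["fields"].get("description") or "")
            for issue in data["issues"]]

defects = fetch_issues("HADOOP") + fetch_issues("HDFS")
print(len(defects), "defects collected")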


The stop word list was prepared by combining various lists available online, such as FoxStoplist.txt and
stopwords-lists.
I believe that this list might be useful for anyone working on language processing of issue-tracker text and Hadoop-specific words.
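
Merging the downloaded lists is a one-off step; a small sketch is below. FoxStoplist.txt comes from the post, while the second file name is only a placeholder for whichever other lists are combined.

def merge_stop_lists(paths, out_path="combined_stopwords.txt"):
    # Union of several one-word-per-line stop lists, lowercased and sorted
    words = set()
    for path in paths:
        with open(path) as f:
            words.update(line.strip().lower() for line in f if line.strip())
    with open(out_path, "w") as out:
        out.write("\n".join(sorted(words)))
    return len(words)

print(merge_stop_lists(["FoxStoplist.txt", "other_stopwords.txt"]), "stop words written")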
Defect Example: HDFS-6001
Description: When HDFS is set up with HA enabled, FileSystem.getUri returns hdfs://, where dfs.nameservices is defined when HA is enabled. According to the documentation, this is probably OK or even intended, but a caller may further process the URI, for example by calling URI.getHost(); this will return 'mycluster', which is not a valid host anywhere.
Summary: In HDFS HA setup, FileSystem.getUri returns hdfs://


Keywords: #Hdfs #dfs.nameservices #FileSystem #getUri #Nameservices #host #URI #HA #returns
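
To make the classification mentioned above concrete, here is a minimal sketch that tags a defect's text with a few of those classes. The pattern map is an illustrative assumption, not the project's actual rules.

import re

CATEGORY_PATTERNS = {
    "HDFS": r"\bhdfs\b",
    "HA": r"\bha\b",
    "DataNode": r"\bdatanode\b",
    "Error": r"\b(error|exception|fail(ed|ure)?)\b",
}

def classify(text):
    # Return every class whose pattern occurs in the defect text
    text = text.lower()
    return [name for name, pattern in CATEGORY_PATTERNS.items()
            if re.search(pattern, text)]

print(classify("In HDFS HA setup, FileSystem.getUri returns hdfs://"))  # ['HDFS', 'HA']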

1 comment:

  1. Link doesn't work for the tech words. Thanks for the post.