CHAOSS - Augur: SentiCR
This is my first blog entry about my work at CHAOSS on Augur. I think it's important to keep this blog going because it helps me look back over my work and see whether I made any mistakes. This week was focused on SentiCR, a sentiment analysis tool for software engineering research. You can check out the research. This week's tasks with my mentor Sean are as follows.
Goals for the week:
- Look at implementing SentiCR as the sentiment analysis tool for the messages.
Goals
Goal 1:
For this goal we looked at replacing NLTK as our main tool for sentiment analysis. This doesn't mean we stop using NLTK: it is still used, but through SentiCR. SentiCR evaluates a number of different models (e.g. Naive Bayes, Linear Support Vector Classification, Gradient Boosting Tree (GBT)) to determine which performs best on the dataset; according to the paper, that was the GBT.
We first looked at sentiment analysis for the Pipermail mailing lists. For this I created a new Jupyter notebook called senticr_piper. In general everything remained the same, but some of my time was spent learning to use SentiCR: I am running it on Python 3, and it seems to have been written for Python 2. For example, the original code looped with for i, j in dic.iteritems(): which only works in Python 2. As seen below, I changed it so that it runs on Python 3.
SentiCR:
def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text
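To see the Python 3 loop in action, here is a small self-contained sketch. The emoticon mapping is made up for illustration and is not SentiCR's actual emodict.

```python
def replace_all(text, dic):
    # dict.items() works in Python 3; dict.iteritems() only exists in Python 2.
    for old, new in dic.items():
        text = text.replace(old, new)
    return text


# Hypothetical mapping, only for this example.
sample_emodict = {":)": "PositiveSentiment", ":(": "NegativeSentiment"}

result = replace_all("Nice patch :) but the tests fail :(", sample_emodict)
print(result)
```

Each key found in the text is swapped for its value, one dictionary entry at a time.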
Next I needed to decode the messages to 'utf-8'. Luckily someone responded to my issue on GitHub. Before the change, the text was encoded but never decoded back. I'm assuming that is fine in Python 2, but it definitely didn't work in Python 3.
def preprocess_text(text):
    comments = text.encode('ascii', 'ignore').decode('utf-8')
    comments = expand_contractions(comments)
    comments = remove_url(comments)
    comments = replace_all(comments, emodict)
    comments = handle_negation(comments)
    return comments
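The encode/decode step on the first line of that function can be tried in isolation. This is a minimal sketch; the helper name to_ascii and the sample string are my own and are not part of SentiCR.

```python
def to_ascii(text):
    # encode(..., "ignore") drops every character that is not ASCII and
    # returns bytes; decode() turns those bytes back into a str.
    return text.encode("ascii", "ignore").decode("utf-8")


raw = "It\u2019s a caf\u00e9"          # contains a curly apostrophe and an accented e
encoded_only = raw.encode("ascii", "ignore")
cleaned = to_ascii(raw)

print(type(encoded_only).__name__)    # bytes in Python 3
print(cleaned)
```

Without the decode call the result stays a bytes object in Python 3, so later string operations such as replace with str arguments would fail.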
Jupyter Notebooks:
I also had to learn how to use the class from my Jupyter notebook, which meant importing SentiCR into senticr_piper after switching directories.
import os

# When running from the notebooks directory, move to the SentiCR directory first.
if "notebooks" in os.getcwd():
    os.chdir("../SentiCR")
from SentiCR.SentiCR import SentiCR
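An alternative I did not use, sketched here only for comparison, is to leave the working directory alone and add the SentiCR directory to sys.path instead. The helper name and the assumed layout (notebooks/ and SentiCR/ as sibling directories) are assumptions. If the module opens its data files by relative path, changing directory as above may still be needed.

```python
import os
import sys


def add_to_import_path(directory):
    """Put directory at the front of sys.path if it is not already there."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


# Assumed layout: the notebook runs in notebooks/ and SentiCR/ is a sibling.
senticr_dir = add_to_import_path(os.path.join("..", "SentiCR"))

# In the notebook the import would then follow:
# from SentiCR.SentiCR import SentiCR
```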
The only other change in this Jupyter notebook was that the sentiment analysis tool is now SentiCR.
for group in grouped:
    parts = 0
    numb = len(df2.loc[df2['message_id'] == group]['message_parts_tot'].tolist())
    message = (df2.loc[df2['message_id'] == group]['message_text']).tolist()
    message_text = ''.join(message)
    #print(message_text)
    score = sentiment_analyzer.get_sentiment_polarity(message_text)
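The idea of that loop, joining all parts of a message and scoring the joined text once, can be sketched without pandas or SentiCR. Everything here is illustrative: the sample rows are invented, and fake_polarity is a stand-in scorer, not SentiCR's classifier.

```python
from collections import defaultdict

# Invented sample data: one row per message part.
rows = [
    {"message_id": 1, "message_text": "Thanks, "},
    {"message_id": 1, "message_text": "this works."},
    {"message_id": 2, "message_text": "This is broken."},
]


def join_message_parts(rows):
    # Collect the parts of each message in order, then join them into one string.
    parts = defaultdict(list)
    for row in rows:
        parts[row["message_id"]].append(row["message_text"])
    return {mid: "".join(texts) for mid, texts in parts.items()}


def score_messages(messages, score_fn):
    # Apply the scoring function once per full message.
    return {mid: score_fn(text) for mid, text in messages.items()}


def fake_polarity(text):
    # Stand-in scorer for the example only.
    return -1 if "broken" in text else 0


messages = join_message_parts(rows)
scores = score_messages(messages, fake_polarity)
print(scores)
```

In the notebook, the scoring function would be the get_sentiment_polarity method of the SentiCR object instead of the stand-in.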
Then I did the same thing for the GitHub issues and for the pull requests. Since they have the same basic structure as when I did sentiment analysis using only NLTK (Natural Language Toolkit), I won't go into them.
Resources: My branch on augur
Files Used: Python File - SentiCR
Jupyter Notebook - senticr_piper
Jupyter Notebook - github_issues_scores_senticr
Jupyter Notebook - github_pull_requests_scores_senticr