GSoC Week 13: Time to reconnect
This is my second-to-last week; it’s actually coming to an end. I’m pretty excited for my last post. It’s going to be a good one, well, I like to think so lol. The first part of this week was spent finishing last week’s task (Goal 1, this week), and most of the rest was spent on documentation. I will only have a short discussion of Goal 1 because it’s basically a rehash of what I did for GitHub issues last week, and then I will do a quick run-through of my documentation (I won’t have a separate blog post about my documentation since it just explains how the code works).
So this week’s task with my mentor Sean Goggins is as follows:
Goals for the week:
- Make the pipermail/github/perceval architecture pull only from the time of the last pull. So, if I pull on August 3, 2018 at 2:22pm UTC, the next run would check that last successful run time and only pull communication records created after that time.
- Investigate non-Perceval mechanisms for getting HTML stored email lists.
- Look at setting up SentiCR for sentiment analysis.
Extra
4) Add documentation
Goals
Goal 1:
The first goal picks up from last week: making sure that if I have already uploaded pull requests and comments to the database, a new run only uploads pull requests or comments made after that upload. I won’t go over what I added for this, since it is basically the same thing as last week, except that instead of downloading GitHub issues I’m downloading pull requests. I also renamed ‘PiperReader’ to ‘piper_reader’ to keep with the lowercase naming convention of the other Python files in the ‘augur’ folder.
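To make the "only pull what is new" idea concrete, here is a minimal sketch of the pattern. It is not the actual augur code: the `last_run` table, the `created_at` field, and every function name below are assumptions made up for illustration, and the records are plain dictionaries rather than real Perceval output.

```python
import sqlite3
from datetime import datetime, timezone


def ensure_table(conn):
    # Hypothetical bookkeeping table: one row per data source.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS last_run "
        "(source TEXT PRIMARY KEY, finished_at TEXT)"
    )


def get_last_run(conn, source):
    # Return the time of the last successful run, or None on the first run.
    ensure_table(conn)
    row = conn.execute(
        "SELECT finished_at FROM last_run WHERE source = ?", (source,)
    ).fetchone()
    return datetime.fromisoformat(row[0]) if row else None


def record_run(conn, source, finished_at):
    # Remember when this run finished so the next run can start from there.
    ensure_table(conn)
    conn.execute(
        "INSERT OR REPLACE INTO last_run (source, finished_at) VALUES (?, ?)",
        (source, finished_at.isoformat()),
    )
    conn.commit()


def incremental_pull(conn, source, records, now):
    # Keep only records created after the last successful run.
    last_run = get_last_run(conn, source)
    if last_run is None:
        fresh = list(records)
    else:
        fresh = [r for r in records if r["created_at"] > last_run]
    # (Uploading `fresh` to the database would happen here.)
    record_run(conn, source, now)
    return fresh


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    utc = timezone.utc
    records = [
        {"id": 1, "created_at": datetime(2018, 8, 1, tzinfo=utc)},
        {"id": 2, "created_at": datetime(2018, 8, 5, tzinfo=utc)},
    ]
    first = incremental_pull(
        conn, "github_pull_requests", records,
        datetime(2018, 8, 3, 14, 22, tzinfo=utc),
    )
    second = incremental_pull(
        conn, "github_pull_requests", records,
        datetime(2018, 8, 10, tzinfo=utc),
    )
    print(len(first), len(second))
```

The first run has no stored time, so it takes everything; the second run only sees the record created after August 3, 2018 at 2:22pm UTC. In practice it is probably cheaper to pass the stored time to the fetcher itself so older records are never downloaded, rather than filtering afterwards as this sketch does.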
Goal 2:
I’m pretty sure you can guess I haven’t started this, since GSoC is coming to an end.
Goal 3:
This is going to be one of the first things I start after my last day of GSoC. It’s going to be interesting because I’ll be looking at actual NLP (Natural Language Processing) techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
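For anyone who hasn't met TF-IDF before, here is a small sketch of the idea in plain Python. It is only an illustration and not part of my project code: it uses the basic unsmoothed formula (term frequency times log of total documents over documents containing the term), whitespace tokenization is a simplification, and the function name and example messages are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # tf = share of this document's words that are `term`
            # idf = log(total documents / documents containing `term`)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    messages = ["great patch thanks", "this patch breaks the build"]
    for message, weights in zip(messages, tf_idf(messages)):
        print(message, weights)
```

A word like "patch" that shows up in every message gets a weight of zero, while words that are specific to one message get a positive weight. Libraries such as scikit-learn typically use smoothed variants of this formula, so their numbers will differ from this sketch.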
Goal 4:
Mailing lists:
So I went about documenting all my code (this includes 1 Python file and 6 Jupyter notebooks). First off was piper_reader; an example would be:
Then we have PiperMail, which pulls the mailing lists and uses ‘piper_reader’ from above, and Sentiment_Piper, which does the sentiment analysis of the messages from the mailing lists.
Github Issues:
We then have github-issues, which pulls and uploads GitHub issues to the database. Then we use github_issues_scores to do the sentiment analysis on the messages in the database.
Github Pull Requests:
Next we have github_pull_requests which retrieves and uploads github pull requests to the database. Then we use github_pull_requests_scores to do the sentiment analysis on the messages in the database.
In the last blog post for GSoC I will be explaining how to run through my code and in what order.
Resources:
GSoC ideas (Specifically Ideas 2 & 3): Ideas
My proposal: My proposal
Files Used:
Python File - piper_reader
Jupyter Notebook - PiperMail
Jupyter Notebook - Sentiment_Piper
Jupyter Notebook - github-issues
Jupyter Notebook - github_issues_scores
Jupyter Notebook - github_pull_requests
Jupyter Notebook - github_pull_requests_scores