GSoC Week 7: Wasn't that kind of negative?
Now onto my fourth blog post. I have to admit I'm actually happy I'm writing these posts: they give me time to go over what I've done, see how much time I spent on a problem, and often spot a much easier solution in hindsight (always fun, haha). Also, someone asked me to include what my project actually is, so I've put a brief overview at the bottom. I applied for the second idea in that link, but I believe my project is really a combination of the second and third ideas, since I'm looking at mailing lists and trying to determine the positive/negative messages in them, while also pulling in new data sources. Pipermail is just one of them; as you can see, Perceval can pull from a lot more. This might be a pretty long intro, but let's say I'm making up for not having posted the first few weeks when GSoC started (haha). As usual we have our weekly goals, and as usual I expected this week to be a simple walk in the park, finished pretty soon. Of course I was wrong. Surprisingly, the goal I spent the most time on was goal 2, adding the score to the SQL table in the database. I thought goal 5 was going to take the most time, but it really just required a lot of thinking, and once I had the approach down it was easier to implement. (Was I just typing things and seeing what happened? Of course, but eventually you see a pattern, you know, haha.)
So this week's tasks with my mentor Sean Goggins are as follows:
Goals for the week:
- Set up an augur.config.json parameter for loading a set of mailing lists
- Do a simple sentiment analysis on each message. Perhaps keep the “score” in an additional column in the table.
- Work with @Derek, @Gabe and @Carter to get your fork merged
- Build a filter to disregard long attachments or included items …
- Experiment with ways of filtering out old message strings from subsequent messages on the mailing lists and identifying mailing list discussion threads (some lists have them automatically, some don’t) … when you reply to an email, it often includes the text you are responding to … it’s about getting rid of all the prior messages quoted before the message at the top …
Goals
Goal 1:
For this goal I first started off in the Jupyter notebook PiperMail. What I'm doing is getting the path (via the augurApp.piper() call below) of the file where all the mailing lists are stored. If the file was never created, it gets created, and the user has to enter the mailing lists and their links.
if "notebooks" in os.getcwd():
os.chdir("..")
Piper, path= augurApp.piper()
print(path,"Place")
if(not os.path.exists(path)):
file = open(path, "w+")
else:
file = open(path, "r")
print("yeah")
if (os.stat(path).st_size == 0):
file.write("Link,mail_list")
print("Please enter the mailing lists and the links for them please")
file.close()
To get the file path we have to go back to our friend application.py, which reads a file called “augur.config.json”; this config specifies exactly what the person is going to pull.
def piper(self):
    from augur.PiperReader import PiperMail
    # Look up where the mailing lists file lives in augur.config.json
    path = self.read_config('PiperMail', 'mailing_lists')
    if self.__piper is None:
        logger.debug('Initializing PiperMail')
        self.__piper = PiperMail()
    return self.__piper, path
In our case, the relevant part of “augur.config.json” is the PiperMail section.
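Since read_config('PiperMail', 'mailing_lists') looks up a section and a key, that part of the config presumably looks something like this sketch (the path value here is just an example, not the real one):

```json
{
    "PiperMail": {
        "mailing_lists": "augur/data/mailing_lists.csv"
    }
}
```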
To give you an example of how the file is supposed to look, see below; this lives in a file called “mailing_lists.csv”:
| Link | mail_list |
|---|---|
| https://lists.opendaylight.org/pipermail/ | aalldp-dev |
| https://lists.opendaylight.org/pipermail/ | archetypes-dev |
| https://lists.opendaylight.org/pipermail/ | announce |
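For completeness, here's a minimal sketch of how that CSV could be read back with pandas (hypothetical; the actual reading happens in the PiperRead code):

```python
import pandas as pd

# Hypothetical: load the mailing lists CSV created above
lists_df = pd.read_csv("augur/data/mailing_lists.csv")
for link, name in zip(lists_df["Link"], lists_df["mail_list"]):
    # The link plus the list name gives the archive URL,
    # e.g. https://lists.opendaylight.org/pipermail/aalldp-dev
    print(link + name)
```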
Goal 2:
Now onto this goal, which I may need to improve on, but for now it gets the job done: we analyse the messages again, but this time we add the score to the SQL table. We will look at “Sentiment_Piper.ipynb”, the Jupyter notebook seen below. One of the differences is that we now get the mailing lists from the database (the SELECT on mailing_list_jobs below) instead of reading them from a JSON file of mailing list names, as I did before.
if("mailing_list_jobs" in table_names):
lists_createdSQL = s.sql.text("""SELECT project FROM mailing_list_jobs""")
df1 = pd.read_sql(lists_createdSQL, connect.db)
print(df1)
val = True
For the next part, we look at how I calculate the scores. When using NLTK, the way we get the scores is to break each message into sentences (the tokenizer.tokenize call below) and analyse each part separately. Interestingly, this is actually supposed to give a better score, because when analysing the message as a whole, the NLP model has a harder time getting a grasp of it. To get the score of the whole message, I just average the scores of the individual sentences. This may not be the best implementation, but for now it works.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sid = SentimentIntensityAnalyzer()
col = ['score', 'sentiment']
df3 = pd.DataFrame(columns=col)

for project in df1['project'].values:
    print(project, "Mailing List")
    SQL = s.sql.text("""SELECT subject, message_id, message_text, message_parts_tot
                        FROM mail_lists WHERE mailing_list = """ + "'" + project + "'")
    df2 = pd.read_sql(SQL, connect.db)
    grouped = df2.groupby('message_id').groups
    for group in grouped:
        # A message can be split over several rows (parts); join them back into one text
        numb = len(df2.loc[df2['message_id'] == group]['message_parts_tot'].tolist())
        message = (df2.loc[df2['message_id'] == group]['message_text']).tolist()
        message_text = ''.join(message)
        # Score each sentence separately, then average the compound scores
        sentences = tokenizer.tokenize(message_text)
        compound = parts = 0
        sentiment = "Positive"
        for sentence in sentences:
            scores = sid.polarity_scores(sentence)
            compound += scores['compound']
            parts += 1
        avg_score = compound / parts if parts else 0
        if avg_score == 0:
            sentiment = "Neutral"
        elif avg_score < 0:
            sentiment = "Negative"
        # One score row per message part, so the rows line up with mail_lists later
        for _ in range(numb):
            df_temp = pd.DataFrame([[avg_score, sentiment]], columns=col)
            df3 = df3.append(df_temp)
The next part is where we combine the scores into a dataframe. Now here is where the real dilemma begins: in order to get the messages, I had to query them from the database, and now I need to add a column to the table I just queried. This is where it gets tricky, because from what I've seen I can't simply append a column like that, especially since I'm working with dataframes. So my solution was to first create a dataframe with all the scores, as seen below.
if len(list(df3)) == 2:
    # Turn the index into an explicit ID column that matches mail_lists.augurmsgID
    df3.reset_index(level=0, inplace=True)
    df3.columns = ['augurmsgID', 'score', 'sentiment']
    for i in range(len(df3['augurmsgID'])):
        df3.loc[i, 'augurmsgID'] = i + 1
    print(df3)
augurmsgID score sentiment
0 1 0.140263 Positive
1 2 0.140263 Positive
2 3 0.624650 Positive
3 4 0.624650 Positive
4 5 0.148000 Positive
5 6 0.148000 Positive
6 7 0.285950 Positive
7 8 0.285950 Positive
8 9 0.254578 Positive
9 10 0.254578 Positive
After this, I uploaded the scores dataframe to the SQL database.
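I haven't shown the upload line itself, but given that the join below reads from a sentiment_scores table, it presumably looks something like this sketch:

```python
# Sketch: upload the scores dataframe as the sentiment_scores table
# (table name inferred from the join query below)
df3.to_sql(name='sentiment_scores', con=connect.db, if_exists='replace', index=False)
```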
Next, I take those two tables, combine them into a single dataframe with a join, and upload that dataframe.
df7 = pd.read_sql("""SELECT
mail_lists.augurmsgID as augurmsgID,
mail_lists.backend_name as backend_name,
mail_lists.project as project,
mail_lists.mailing_list as mailing_list,
mail_lists.category as category,
mail_lists.message_part as message_part,
mail_lists.message_parts_tot as message_parts_tot,
mail_lists.subject as subject,
mail_lists.date as date,
mail_lists.message_from as message_from,
mail_lists.message_id as message_id,
sentiment_scores.score as score,
sentiment_scores.sentiment as sentiment,
mail_lists.message_text as message_text
FROM mail_lists,sentiment_scores
WHERE mail_lists.augurmsgID
= sentiment_scores.augurmsgID;""",connect.db)
df7.to_sql(name='mail_lists_sentiment',con=connect.db,if_exists='replace',index=False)
Looking back, since I had to pull the messages table initially anyway, I should have just pulled the whole table instead of taking only part of it and then pulling the full table again. I may need to fix that, but if I find a way to simply add the score column without any of that, I'll just do that.
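For what it's worth, one way that could work (a sketch, not what the notebook currently does): pull the full mail_lists table once, compute the scores, and assign them as new columns directly before uploading. Here, scores and sentiments are assumed to be lists aligned with the rows:

```python
# Sketch: pull the full table once, then add the score columns in place
df_full = pd.read_sql("SELECT * FROM mail_lists", connect.db)
df_full['score'] = scores          # hypothetical: per-row scores computed as in Goal 2
df_full['sentiment'] = sentiments  # hypothetical: per-row "Positive"/"Negative"/"Neutral" labels
df_full.to_sql(name='mail_lists_sentiment', con=connect.db,
               if_exists='replace', index=False)
```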
Goal 3:
This will be cool to eventually do, because I kind of like getting commits on my GitHub account (haha), but we haven't done this yet; at the moment I work on a branch called pipermail on augur.
Goal 4:
I haven't started this yet; I need to look up a bit more about how I could go about doing it, but I do feel this will be an ongoing thing.
Goal 5:
Now this was the real brain teaser. I remember just sitting for a while and thinking about how to approach this problem (no, I wasn't just lazing about, but from an outside perspective it would definitely look like that, haha). The problem: when I downloaded the mailing list emails, it downloaded all of them. That's great, but if you actually want to run queries and analyse emails, it eats up a lot of memory. So I had to find a way of getting rid of the emails that were already quoted inside another email, since these emails form a thread when you reply to one. So we head back to good old PiperMail (I do feel I need to do some name changes; everything sounds the same at this point).
So let's look at the code below. Is it a lot? Yes, yes it is. To be honest, as I'm writing this I'm realizing I need to do this again for the part where I go through the updated messages, but I'll be a good programmer (well, I'll try) and create a function for it (see the sketch after the code).
To explain this, let's walk through a normal example: what happens with the first email? For this case, let's assume it has no references to other emails (sometimes it does, which was surprising, because the referenced email isn't even stored). We enter the try statement and check whether the message has a “References” header; of course it doesn't, since it's the first email, so we fall into the except branch. Since this is our first run through the messages, “thread” and “store” are both None, so we just put the message in “store”. Now the next email may reference the one before, so we enter the try block: “thread” is still None, but “store” isn't, and luckily I check whether store's “Message-ID” is in the current message's references, which it is, so the new message goes into “thread”. Now say the next message is new, with no references: “thread” isn't None, so we write it out to the file, set “thread” to None, and set “store” to the new message. I could go through all the different scenarios, but I won't; I can say that when I tried it, I went through a bunch of messages, they were all there, and the repeated messages were skipped, so hopefully it's fine. I do believe it meets the criteria, but you know, sometimes you need people to test it for you, so over time I hope to get some feedback. Also, the check after the loop exists because if the last message were a new one with no references, it would never get written out, so I had to explicitly output that last message.
# "store" holds the latest message with no References header (a potential thread head);
# "thread" holds the latest reply in the current thread
thread = store = None
for message in repo.fetch():
    try:
        # Accessing 'References' raises KeyError when the message starts a new thread
        message['data']['References']
        if (thread is not None) and (thread['data']['Message-ID'] not in message['data']['References']):
            # The held reply isn't referenced here, so its thread is finished: write it out
            obj = json.dumps(thread, indent=4, sort_keys=True)
            outfile.write(obj)
            outfile.write('\n')
            store = None
        elif (store is not None) and (store['data']['Message-ID'] not in message['data']['References']):
            # The held thread head isn't referenced here, so write it out
            obj = json.dumps(store, indent=4, sort_keys=True)
            outfile.write(obj)
            outfile.write('\n')
            store = None
        thread = message
    except KeyError:
        # No 'References' header: this message starts a new thread,
        # so flush whatever we were holding on to
        if thread is not None:
            obj = json.dumps(thread, indent=4, sort_keys=True)
            outfile.write(obj)
            outfile.write('\n')
            thread = None
        elif store is not None:
            obj = json.dumps(store, indent=4, sort_keys=True)
            outfile.write(obj)
            outfile.write('\n')
            store = None
        store = message

# After the loop there may still be an unwritten message, so write the latest one
if (thread is None) and (store is not None):
    obj = json.dumps(store, indent=4, sort_keys=True)
    outfile.write(obj)
    outfile.write('\n')
elif (store is None) and (thread is not None):
    obj = json.dumps(thread, indent=4, sort_keys=True)
    outfile.write(obj)
    outfile.write('\n')
elif store['data']['Message-ID'] in thread['data']['References']:
    # The reply quotes the thread head, so the reply alone is enough
    obj = json.dumps(thread, indent=4, sort_keys=True)
    outfile.write(obj)
    outfile.write('\n')
else:
    obj = json.dumps(store, indent=4, sort_keys=True)
    outfile.write(obj)
    outfile.write('\n')

outfile.close()
mail_check[mail_list[x]] = 'new'
print("Created File", mail_list[x])
Resources:
GSoC ideas (Specifically Ideas 2 & 3): Ideas
My proposal: My proposal
Files Used:
Python File - PiperRead
Jupyter Notebook - PiperMail
Jupyter Notebook - Sentiment_Piper