GSoC Week 10: Could you not?
To be honest I’m really struggling with naming these posts, this week especially lol. But good news: I passed my second evaluation (phew, that was close). Just one more to go, and to be honest I’m very excited for that t-shirt lol.
So this week’s tasks with my mentor Sean Goggins are as follows:
Goals for the week:
- Add columns (comment id, timestamp and repository id) to the GitHub SQL table
- Upload the sentiment scores to the SQL database.
- Upload the emails from the Linux Kernel mailing list.
Bonus:
- Work on getting rid of text files altogether.
Goals
Goal 1:
This is a continuation of last week’s work. I spoke with Sean, and what we needed to do was better detail the tables. Shown below is the new set-up: we added the columns (comment id, timestamp and repository id), which just makes it easier for people to query things.
# imports and connection assumed from earlier in the notebook:
import pandas as pd
import sqlalchemy as s
# db is the SQLAlchemy engine/connection created elsewhere in the notebook

columns1 = ('augurmsgID', 'backend_name', 'repo_link', 'owner', 'repo', 'subject',
            'status', 'category', 'issue_number', 'timestamp',
            'issue_id', 'user', 'body')

# create an empty dataframe with the new column layout and use it to
# (re)create the github_issues table with explicit SQL types
df = pd.DataFrame(columns=columns1)
df.to_sql(name="github_issues", con=db,
          if_exists='replace', index=False,
          dtype={'augurmsgID': s.types.Integer,
                 'backend_name': s.types.VARCHAR(length=300),
                 'repo_link': s.types.VARCHAR(length=300),
                 'owner': s.types.VARCHAR(length=300),
                 'repo': s.types.VARCHAR(length=300),
                 'subject': s.types.VARCHAR(length=300),
                 'status': s.types.VARCHAR(length=10),
                 'category': s.types.VARCHAR(length=10),
                 'issue_number': s.types.Integer,
                 'timestamp': s.types.Integer,
                 'issue_id': s.types.Integer,
                 'user': s.types.VARCHAR(length=100),
                 'body': s.types.TEXT})
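To give an idea of why the extra columns help, here is a quick sketch of the kind of query someone could now run against the table. The query and the repo name 'augur' are purely illustrative, assuming the table above and the same db connection:
# hypothetical example: pull all issue messages for one repo, newest first
recent = pd.read_sql(
    "SELECT issue_number, user, timestamp, body "
    "FROM github_issues "
    "WHERE repo = 'augur' "
    "ORDER BY timestamp DESC",
    con=db)
print(recent.head())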
There is a problem with uploading the GitHub issues: some messages contain characters that are not in the character set used by the SQL database. The one I set the database to is ‘utf8’, but that doesn’t seem to be able to recognize some of the characters in these issues, so I will need to do some research about what can store them.
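One likely culprit, if the database is MySQL, is that MySQL’s ‘utf8’ charset only stores characters up to three bytes, so four-byte characters such as emoji get rejected; ‘utf8mb4’ handles those. Below is just a sketch of one possible fix, assuming a MySQL backend and placeholder connection details, not something I have actually done yet:
# hypothetical fix: connect with a 4-byte-safe charset and recreate the table
# (user/password/database name here are placeholders)
engine = s.create_engine(
    "mysql+pymysql://user:password@localhost/augur?charset=utf8mb4")
df.to_sql(name="github_issues", con=engine,
          if_exists='replace', index=False)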
Goal 2:
Previously I was doing the sentiment analysis on the messages but never actually uploading the results to the SQL database; now I go about uploading it. As you can see below, I create the dataframe, group the messages by subject, analyse each message related to that subject, determine the sentiment (positive, negative or neutral) of each message, and append this to a dataframe.
# assumed setup from earlier in the notebook (not shown in the original snippet):
# import nltk, pandas as pd
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# sid = SentimentIntensityAnalyzer()
# tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

grouped = df_users.groupby('subject').groups
num = 0
col = 'score', 'sentiment'
columns1 = ('augurmsgID', 'backend_name', 'repo_link', 'owner', 'repo', 'subject',
            'status', 'category', 'issue_number', 'timestamp',
            'issue_id', 'user', 'body')
df3 = pd.DataFrame(columns=col)
df_list = pd.DataFrame(columns=columns1)
df_list = df_list.append(df_users)

for group in grouped:
    issue_num = (df_users.loc[df_users['subject'] == group]['issue_number']).values
    messages = (df_users.loc[df_users['subject'] == group]['body']).tolist()
    for i in messages:
        # split the message body into sentences and average the VADER
        # compound score over all of them
        sentences = tokenizer.tokenize(i)
        compound = 0
        part = 0
        sentiment = "Positive"
        for sentence in sentences:
            scores = sid.polarity_scores(sentence)
            compound += scores['compound']
            part += 1
        avg = compound / part if part else 0  # guard against empty bodies
        if avg == 0:
            sentiment = "Neutral"
        elif avg < 0:
            sentiment = "Negative"
        # one row of (average score, label) per message
        li = [[avg, sentiment]]
        df_temp = pd.DataFrame(li, columns=col)
        df3 = df3.append(df_temp)
    num += 1
print(num)
I then combine the sentiment score dataframe and the original dataframe with all the messages.
print(df_list)
df3 = df3.reset_index(drop=True)
print(df3.head())
df_list = df_list.reset_index(drop=True)
# align both dataframes on a fresh integer index, then join the
# score/sentiment columns onto the original message columns
combine = df_list.join(df3)
After that I upload this to the SQL database as the table ‘github_issues_sentiment_scores’:
combine.to_sql(name='github_issues_sentiment_scores',\
con=db,if_exists='replace',index=False)
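As a quick sanity check (just a sketch, assuming the same db connection and the column names used above), the scores can be read straight back out of the new table:
# hypothetical check that the upload worked
check = pd.read_sql(
    "SELECT subject, score, sentiment "
    "FROM github_issues_sentiment_scores LIMIT 5",
    con=db)
print(check)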
Goal 3:
This is somewhat of a continuation of last week. I spoke with someone at CHAOSS and they gave me a link to a mailing archive that stores the Linux kernel mailing lists, but the problem was that it seemed like I could only download them in chunks. That’s a problem because the main point of what I’m doing is to make things easier to download and analyse. But fear not: I was put in touch with someone who actually has these emails. The only problem is that it is a large amount of email (more than 3 GB), so it will take a while to download and then upload.
Goal 4:
Now, to be honest, this was a goal I was planning to work on after I finished Google Summer of Code, because I do plan to continue working on this project; I believe it has a lot of potential and can be useful to people. What I wanted to start doing is get rid of the need to write the messages to text files at all. This entailed updating ‘write_message’ in the PiperMail Jupyter notebook so that, instead of writing to a text file, it stores the messages in a dictionary. Once it has collected around 5,000 of them (the ‘if(len(di) == 5000)’ check in the code below), it calls PiperReader via ‘Piper.make’ to upload those messages, and then it goes through the next set of messages. It does all of this for one mailing list, then it exits ‘write_message’ and the next mailing list is given. This gets rid of the need to store everything in a text file first and then read it back from the text file.
def write_message(repo, type_archive, mail_check, pos, db, res, session, archives, numb, mail_lists, time=None):
    thread = None
    store = None
    k = 0
    di = {}   # batch of messages waiting to be uploaded, keyed by a running counter
    if(pos == "lkml"):
        time = Piper.convert_date("Thu, 1 Jan 2013 20:37:11 +0000")
    for message in repo.fetch(from_date=time):
        if(type_archive == 'not_new'):
            mess_check = Piper.convert_date(message['data']['Date'])
        if(type_archive == 'not_new' and mess_check <= time):
            # message is older than what we already have, skip it
            continue
        elif(type_archive == 'not_new' and mess_check > time):
            mail_check[pos] = 'update'
        ID = message['data']['Message-ID']
        try:
            # messages with a References header are replies in an existing thread
            message['data']['References']
            if((thread is not None) and (thread['data']['Message-ID'] not in message['data']['References'])):
                di[k] = thread
                store = None
                k += 1
            elif((store is not None) and (store['data']['Message-ID'] not in message['data']['References'])):
                di[k] = store
                store = None
                k += 1
            thread = message
        except KeyError:
            # no References header: this message starts a new thread,
            # so flush whichever thread/store message we were holding
            if(thread is not None):
                di[k] = thread
                thread = None
                k += 1
            elif(store is not None):
                di[k] = store
                store = None
                k += 1
            store = message
        if(len(di) == 5000):
            # batch is full: hand it to PiperReader to upload, then start a new batch
            numb, mail_lists = Piper.make(connect.db, mail_check, archives, mail_lists, res, session, di, numb)
            di = {}
            k = 0
    # upload whatever partial batch is left over after the loop
    if(len(di) < 5000 and len(di) > 0):
        numb, mail_lists = Piper.make(connect.db, mail_check, archives, mail_lists, res, session, di, numb)
        di = {}
        k = 0
    else:
        di = {}
        k = 0
    # the very last thread/store message was never flushed inside the loop,
    # so pick whichever one is still outstanding
    # (previously these were dumped to a text file with json.dumps)
    if((thread is None) and (store is None)):
        pass
    elif((thread is None) and (store is not None)):
        di[k] = store
    elif((store is None) and (thread is not None)):
        di[k] = thread
    elif(store['data']['Message-ID'] in thread['data']['References']):
        di[k] = thread
    else:
        di[k] = store
    if(bool(di)):
        numb, mail_lists = Piper.make(connect.db, mail_check, archives, mail_lists, res, session, di, numb)
    return numb, mail_lists
PiperReader was also updated, but essentially all I did was get rid of the read_json function, and instead of referencing the text file at certain points it now only looks at the dictionary that is passed in.
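The real PiperReader does more than this, but a minimal sketch of the idea looks roughly like the following. The function name, table name and columns here are made up for illustration; only the shape of the batched dictionary comes from the code above:
# hypothetical sketch: turn the batched dictionary into rows and upload it,
# instead of re-reading a JSON text file from disk
def upload_batch(di, db):
    rows = []
    for message in di.values():
        data = message['data']
        rows.append({
            'message_id': data.get('Message-ID'),
            'subject': data.get('Subject'),
            'sender': data.get('From'),
            'date': data.get('Date'),
        })
    pd.DataFrame(rows).to_sql(name='mailing_list_messages', con=db,
                              if_exists='append', index=False)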
Resources:
GSoC ideas (Specifically Ideas 2 & 3): Ideas
My proposal: My proposal
Files Used:
Python File - PiperReader
Jupyter Notebook - PiperMail
Jupyter Notebook - Sentiment_Piper
Jupyter Notebook - github-issues
Jupyter Notebook - github_issues_scores