To be honest I’m really struggling with naming these posts, this week especially lol. But good news: I passed my second evaluation (phew, that was close), just one more to go. To be honest I’m very excited for that t-shirt lol.

So this week’s tasks with my mentor Sean Goggins are as follows:

Goals for the week:

  1. Add columns (comment id, timestamp and repository id) to the GitHub SQL table
  2. Upload the sentiment scores to the SQL Database.
  3. Upload the emails from the Linux Kernel mailing list

Bonus:

  4. Work on getting rid of the text files altogether.

Goals

Goal 1:
This is a continuation from last week. I spoke with Sean, and what we needed to do was better detail the tables. Below is the new set-up: we added the columns (comment id, timestamp and repository id), which makes it easier for people to query things (there’s an example query right after the snippet).

import pandas as pd
import sqlalchemy as s

# assumes `db` is an existing SQLAlchemy engine/connection to the database

# new columns: augurmsgID (comment id), timestamp, and the repository fields
columns1 = ('augurmsgID', "backend_name", 'repo_link', "owner", "repo",
            "subject", "status", "category", "issue_number", "timestamp",
            "issue_id", "user", "body")
df = pd.DataFrame(columns=columns1)
df.to_sql(name="github_issues", con=db,
          if_exists='replace', index=False,
          dtype={'augurmsgID': s.types.Integer,
                 'backend_name': s.types.VARCHAR(length=300),
                 'repo_link': s.types.VARCHAR(length=300),
                 'owner': s.types.VARCHAR(length=300),
                 'repo': s.types.VARCHAR(length=300),
                 'subject': s.types.VARCHAR(length=300),
                 'status': s.types.VARCHAR(length=10),
                 'category': s.types.VARCHAR(length=10),
                 'issue_number': s.types.Integer,
                 'timestamp': s.types.Integer,
                 'issue_id': s.types.Integer,
                 'user': s.types.VARCHAR(length=100),
                 'body': s.types.TEXT})
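
To illustrate the querying point, here’s the kind of lookup the new columns make straightforward (a hypothetical example: the issue number is made up, and `db` is the same connection as above):

import pandas as pd

# hypothetical query: all comments on one issue, in chronological order
query = """
    SELECT augurmsgID, user, timestamp, body
    FROM github_issues
    WHERE issue_number = 42
    ORDER BY timestamp
"""
comments = pd.read_sql(query, con=db)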

There is a problem with uploading the GitHub issues: some messages contain characters that are not in the character set the SQL database uses. I set the database to ‘utf8’, but that doesn’t seem to recognize the characters in some of these issues, so I will need to do some research into what can store them.
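
I haven’t confirmed the fix yet, but if the database is MySQL, its ‘utf8’ charset is only a 3-byte subset of UTF-8, so 4-byte characters like emoji (which do show up in GitHub comments) get rejected; ‘utf8mb4’ covers the full range. A minimal sketch of what switching might look like, assuming MySQL with the PyMySQL driver and placeholder credentials:

import sqlalchemy as s

# hypothetical connection details -- only the charset argument matters here;
# utf8mb4 is MySQL's full 4-byte UTF-8 encoding, unlike the 3-byte 'utf8'
db = s.create_engine(
    "mysql+pymysql://user:password@localhost/augur?charset=utf8mb4"
)

# an existing table would also need its columns converted, e.g.:
with db.connect() as conn:
    conn.execute(s.text(
        "ALTER TABLE github_issues "
        "CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
    ))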

Goal 2:
Previously I was doing the sentiment analysis on the messages but never actually uploaded the results to the SQL database, so now I go about uploading them. As you can see below, I create the dataframe, group the messages by subject, then analyse each message related to that subject: I average VADER’s compound score over the message’s sentences to determine its sentiment (positive, negative or neutral) and append the result to a dataframe.

import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# assumes `df_users` already holds all the GitHub issue messages
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sid = SentimentIntensityAnalyzer()

grouped = df_users.groupby('subject').groups
num = 0
col = 'score', 'sentiment'
columns1 = ('augurmsgID', "backend_name", 'repo_link', "owner", "repo",
            "subject", "status", "category", "issue_number", "timestamp",
            "issue_id", "user", "body")
df3 = pd.DataFrame(columns=col)
df_list = pd.DataFrame(columns=columns1)
df_list = df_list.append(df_users)
for group in grouped:
    # every message body that belongs to this subject (thread)
    messages = (df_users.loc[df_users['subject'] == group]['body']).tolist()
    for i in messages:
        sentences = tokenizer.tokenize(i)
        compound = 0
        part = 0
        sentiment = "Positive"
        for sentence in sentences:
            # VADER's compound score per sentence, averaged over the message
            scores = sid.polarity_scores(sentence)
            compound += scores['compound']
            part += 1
        avg = compound / part if part else 0
        if avg == 0:
            sentiment = "Neutral"
        elif avg < 0:
            sentiment = "Negative"
        df_temp = pd.DataFrame([[avg, sentiment]], columns=col)
        df3 = df3.append(df_temp)
    num += 1
print(num)
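
To sanity-check the classification rule on a single message, here is a tiny standalone example (the sample sentence is made up):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)  # VADER's lexicon
nltk.download('punkt', quiet=True)          # the Punkt sentence tokenizer

sid = SentimentIntensityAnalyzer()
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

message = "Thanks for the quick fix! This works great now."
scores = [sid.polarity_scores(s)['compound'] for s in tokenizer.tokenize(message)]
avg = sum(scores) / len(scores)
print(avg)  # > 0, so this message would be labelled "Positive"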

I then combine the sentiment score dataframe and the original dataframe with all the messages.

print(df_list)
# align the indices so the row-wise join lines up correctly
df3 = df3.reset_index(drop=True)
print(df3.head())
df_list = df_list.reset_index(drop=True)
combine = df_list.join(df3)

Afterwards I upload this to the SQL database as a table, ‘github_issues_sentiment_scores’:

combine.to_sql(name='github_issues_sentiment_scores',\
              con=db,if_exists='replace',index=False)
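
As a quick sanity check that the upload worked, the table can be read straight back and aggregated (a sketch, assuming the same `db` connection):

import pandas as pd

# count how many messages fell into each sentiment bucket
check = pd.read_sql(
    "SELECT sentiment, COUNT(*) AS n "
    "FROM github_issues_sentiment_scores "
    "GROUP BY sentiment",
    con=db,
)
print(check)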

Goal 3:
This is somewhat a continuation of last week. I spoke with someone at CHAOSS who gave me a link to a mailing archive that stores the Linux kernel mailing lists, but the problem was that it seemed like I could only download them in small chunks, which defeats the main point of what I’m doing: making things easier to download and analyse. But fear not, I was put onto someone who actually has these emails; the only problem is that it is a large amount (more than 3 GB), so it will take a while to download and then upload.

Goal 4:
Now, to be honest, this was a goal I was planning to work on after I finished Google Summer of Code, because I do plan to continue working on this project; I believe it has a lot of potential and can be useful to people. What I wanted to start doing is getting rid of the need to write the messages to text files. This entails updating ‘write_message’ in the PiperMail Jupyter notebook so that, instead of writing to a text file, it stores each message in a dictionary; after accumulating around 5,000 messages (the `len(di) == 5000` check below), it calls PiperReader via `Piper.make` to upload them, then moves on to the next batch. It does all of this for one mailing list, then exits ‘write_message’ and is given the next mailing list. This removes the need to store messages in a text file first and then read them back.

def write_message(repo,type_archive,mail_check,pos,db,res,session,archives,numb,mail_lists,time=None):
    # Batches messages into the dictionary `di` instead of a text file;
    # every 5000 messages the batch is handed to Piper.make for upload.
    thread = None
    store = None
    k = 0
    di = {}
    if pos == "lkml":
        # the LKML archive is enormous, so only fetch from 2013 onwards
        time = Piper.convert_date("Thu, 1 Jan 2013 20:37:11 +0000")
    for message in repo.fetch(from_date=time):
        if type_archive == 'not_new':
            mess_check = Piper.convert_date(message['data']['Date'])
            if mess_check <= time:
                # already in the database, skip it
                continue
            # at least one new message, so this list needs an update
            mail_check[pos] = 'update'
        try:
            # messages with a References header are replies in a thread
            message['data']['References']
            if thread is not None and thread['data']['Message-ID'] not in message['data']['References']:
                # this reply belongs to a different thread, flush the old one
                di[k] = thread
                store = None
                k += 1
            elif store is not None and store['data']['Message-ID'] not in message['data']['References']:
                di[k] = store
                store = None
                k += 1
            thread = message
        except KeyError:
            # no References header: this message opens a new thread,
            # so flush whichever pending message we were holding
            if thread is not None:
                di[k] = thread
                thread = None
                k += 1
            elif store is not None:
                di[k] = store
                store = None
                k += 1
            store = message
        if len(di) == 5000:
            # upload this batch and start collecting the next one
            numb, mail_lists = Piper.make(connect.db, mail_check, archives,
                                          mail_lists, res, session, di, numb)
            di = {}
            k = 0
    # upload whatever is left over from the last partial batch
    if 0 < len(di) < 5000:
        numb, mail_lists = Piper.make(connect.db, mail_check, archives,
                                      mail_lists, res, session, di, numb)
    di = {}
    k = 0
    # one of thread/store may still be pending; keep the right one
    if thread is None and store is None:
        pass
    elif thread is None:
        di[k] = store
    elif store is None:
        di[k] = thread
    elif store['data']['Message-ID'] in thread['data']['References']:
        di[k] = thread
    else:
        di[k] = store
    if bool(di):
        numb, mail_lists = Piper.make(connect.db, mail_check, archives,
                                      mail_lists, res, session, di, numb)
    return numb, mail_lists
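
For context, `repo` here is a backend whose `fetch(from_date=...)` yields messages one at a time; the interface matches Perceval’s Pipermail backend, so here is a hedged sketch of how `write_message` might be driven for a single list, assuming that backend (the URL and the database/session arguments are placeholders carried over from the notebook):

from perceval.backends.core.pipermail import Pipermail

# hypothetical mailing list; the second argument is where Perceval
# caches the downloaded mbox archives
repo = Pipermail("https://mail.example.org/pipermail/dev/", "/tmp/archives")

numb, mail_lists = write_message(repo, 'new', mail_check, 'dev',
                                 db, res, session, archives,
                                 numb, mail_lists)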

PiperReader was also updated, but essentially all I did was remove the read_json function; instead of referencing the text file at certain points, it now looks only at the dictionary that is passed in.
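
Roughly, the reader-side change has this shape (hypothetical, simplified functions, not the actual PiperReader code):

import json

# before: messages were round-tripped through a text file of JSON lines
def read_json(path):
    with open(path) as infile:
        return [json.loads(line) for line in infile]

# after: the reader consumes the batch dictionary built in write_message
def read_batch(di):
    return list(di.values())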

Resources: GSoC ideas (specifically Ideas 2 & 3): Ideas
My proposal: My proposal

Files Used:
Python File - PiperReader
Jupyter Notebook - PiperMail
Jupyter Notebook - Sentiment_Piper
Jupyter Notebook - github-issues
Jupyter Notebook - github_issues_scores