My fifth blog post, and don’t worry, this one is going to be pretty short. Maybe I’m giving you all a rest after last week, maybe... This week doesn’t actually have a plan of its own per se; I mostly worked on some things I still needed to complete from last week.

So this week’s tasks with my mentor Sean Goggins are as follows:

Goals for the week:

  1. Build a filter to disregard long attachments or included items …
  2. Do a simple sentiment analysis on each message. Perhaps keep the “score” in an additional column in the table.
  3. Experiment with ways of filtering out old message strings from subsequent messages on the mailing lists and identifying mailing list discussion threads (some lists have them automatically, some don’t) … so, when you reply to an email, it often includes the text you are responding to … it’s about getting rid of all the prior messages before the message at the top …

Goals

Goal 1:
I am still not entirely sure how to do this one, so I will only discuss it briefly, but I have been looking into it. The problem is how to disregard those long sections and still look at the text that comes after them, given that the section sizes can change from message to message.
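One possible starting point, offered only as a rough sketch and not as what the project ended up doing: split the message body into blocks separated by blank lines, drop any block longer than a threshold, and keep everything else, including the text after the dropped block. The function name "drop_long_blocks" and the 40-line threshold are assumptions made up for this illustration.

```python
def drop_long_blocks(body, max_lines=40):
    """Drop blank-line-separated blocks longer than max_lines; keep the rest.

    Hypothetical helper: the name and the default threshold are assumptions.
    """
    kept = []
    block = []
    # The trailing "" acts as a sentinel so the last block is flushed too.
    for line in body.splitlines() + [""]:
        if line.strip():
            block.append(line)
            continue
        # Blank line: the current block has ended, so decide whether to keep it.
        if block and len(block) <= max_lines:
            kept.append("\n".join(block))
        block = []
    return "\n\n".join(kept)


# A short greeting, a 50-line pasted log, then a short sign-off.
message = "Hi all,\n\n" + "\n".join(f"log line {i}" for i in range(50)) + "\n\nThanks!"
cleaned = drop_long_blocks(message)
```

Because each block is judged on its own length, the size of the long section does not need to be known in advance, and the text after it survives.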

Goal 2:
I almost forgot that I did this, but I believe this is a better implementation for combining the two dataframes: “df_list”, the initial table with all the messages, and “df3”, the dataframe with the scores (I might need to work on those names). I reset the index on both (lines 3 and 4) so the rows line up by position, and then simply combine them (line 5) into a new dataframe, “combine”.

1
2
3
4
5
print(df3)
print(df_list)
df3 = df3.reset_index(drop=True)
df_list = df_list.reset_index(drop=True)
combine = (df_list.join(df3))
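To see why the two reset_index calls matter, here is a small sketch with made-up data. pandas join() matches rows on index labels, so if the two frames carry different labels, the scores come back as NaN. The column names and values below are invented stand-ins, not the real tables.

```python
import pandas as pd

# Hypothetical stand-ins for the post's "df_list" (messages) and "df3" (scores).
df_list = pd.DataFrame({"subject": ["a", "b", "c"]}, index=[10, 11, 12])
df3 = pd.DataFrame({"score": [0.1, -0.4, 0.9]}, index=[0, 1, 2])

# Without resetting, join() finds no shared index labels, so every score is NaN.
misaligned = df_list.join(df3)

# After resetting both indexes to 0..n-1, the rows pair up by position.
combine = df_list.reset_index(drop=True).join(df3.reset_index(drop=True))
```

This only gives the right pairing when both frames are in the same row order, which is the assumption the original code makes as well.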

I then upload the new dataframe as a table (line 1, seen below). Of course, the problem is what happens when you have, for example, a SQL table with something like 1 GB of messages, so we will see how we can improve on that.

1
combine.to_sql(name='mail_lists_sentiment_scores',con=connect.db,if_exists='replace',index=False)
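One direction that might help with larger tables, sketched here under assumptions: write the dataframe in slices, replacing the table on the first slice and appending afterwards, so a scoring loop could process and upload one batch at a time. The in-memory SQLite connection, the batch size and the sample rows are stand-ins for illustration; the real code uses "connect.db".

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the real connection.
con = sqlite3.connect(":memory:")

# Made-up rows standing in for the combined messages and scores.
combine = pd.DataFrame({
    "subject": ["a", "b", "c", "d", "e"],
    "score": [0.1, -0.4, 0.9, 0.0, 0.3],
})

batch = 2  # hypothetical batch size; tune to the memory available
for start in range(0, len(combine), batch):
    part = combine.iloc[start:start + batch]
    # Replace the table on the first slice, then append the remaining slices.
    mode = "replace" if start == 0 else "append"
    part.to_sql("mail_lists_sentiment_scores", con, if_exists=mode, index=False)

rows = con.execute("SELECT COUNT(*) FROM mail_lists_sentiment_scores").fetchone()[0]
```

In this toy version the whole frame is still in memory; the point is only that the write side can work batch by batch once the messages are scored in batches too.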

Goal 3:
Coming back to this from last week: I was able to complete the task, but the problem was that it wasn’t set up for future messages. If I were to download new messages, it would still download all the messages related to a thread, and we want it to disregard those and only download the email that has the full thread in it. So what I did was create a separate function altogether that handles both cases: new messages being downloaded, or downloading the mail archive for the first time.

Seen below is the function “write_message”. It’s actually longer, but I only copied the first section of it because the rest of the function is more or less the same. We first enter the for loop (line 4) to go through each message. The first part of the loop is there to check whether the messages are being downloaded for the first time (lines 7, 11 and 13). The first if statement (line 7) checks whether the archive is not new; if so, it converts the date of that message to a datetime object (line 8). The next if statement (line 11) checks whether the message is no newer than the most recent message we already have; if it is, we go on to the next message (line 13). Otherwise (line 14) it changes the dictionary “mail_check” (line 16), which stores the state of each mailing list: it can be “False”, meaning it doesn’t need to be updated; it can be “update”, which I hope you know what it means (lol); or it can be “new”, meaning we have to add all the messages (in the previous Week 6 post I spoke about using this dictionary and declaring it).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def write_message(file,repo,type_archive,mail_check,pos,time=None):
    thread = None
    store = None
    for message in repo.fetch(from_date=time):
        #print(message,"\n\n\n\n\n\n\n\n")
        #print(message['data']['Message-ID'])
        if(type_archive == 'not_new'):
            mess_check = Piper.convert_date(message['data']['Date'])
            #mess_check = Piper.convert_date("Thu, 24 Mar 2019 20:37:11 +0000")
            #print(time)
        if(type_archive == 'not_new' and mess_check <= time ):
            print("Right here")
            continue            
        elif(type_archive == 'not_new' and mess_check > time):
            print("sigh")
            mail_check[pos] = 'update'
            
        ID = message['data']['Message-ID']
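The date check at the heart of that loop can be tried on its own. The sketch below uses the standard library's email.utils.parsedate_to_datetime in place of “Piper.convert_date”, on the assumption that the helper parses the same kind of Date header; the function name "is_new_message" and the sample headers are invented for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_new_message(date_header, last_seen):
    """Return True when the message is newer than the last stored one.

    Hypothetical helper, standing in for the checks on lines 11 and 14.
    """
    sent = parsedate_to_datetime(date_header)
    if sent.tzinfo is None:
        # A "-0000" offset yields a naive datetime; treating it as UTC is an assumption.
        sent = sent.replace(tzinfo=timezone.utc)
    return sent > last_seen


# Pretend the most recent stored message was sent at 20:00 UTC.
last_seen = datetime(2019, 3, 24, 20, 0, tzinfo=timezone.utc)
headers = [
    "Sun, 24 Mar 2019 19:00:00 +0000",  # older: would hit "continue"
    "Sun, 24 Mar 2019 20:37:11 +0000",  # newer: would mark the list as "update"
]
new_only = [h for h in headers if is_new_message(h, last_seen)]
```

Comparing timezone-aware datetimes avoids mistakes when list members post from different offsets.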

Resources: GSoC ideas (Specifically Ideas 2 & 3): Ideas
My proposal: My proposal

Files Used: Python File - PiperRead 12
Jupyter Notebook - PiperMail 8

Jupyter Notebook - Sentiment_Piper 6