So it’s my third post; maybe I’m actually going to keep this up, wouldn’t that be great. I passed my first evaluation, by the way! That honestly made my day, especially since some people apparently failed theirs due to issues with their mentors, so I’m very thankful that didn’t happen to me. This blog post is about improving some things in both the PiperReader python file and the jupyter notebook. We’re going to look at how to save some space, since I didn’t care about that before, but it becomes a real issue when you try to download, say, 1GB worth of emails (yes, that has happened). We’ll also touch a little on some NLP (Natural Language Processing); it’s in the very early stages, but I’m pretty excited about it and I hope it goes well.

So this week’s tasks from my mentor Sean Goggins are as follows:

Goals for the week:

  1. Make table revisions to add columns for tracking message parts
  2. Add a mailing list retrieval parameter to the augur.config.json file … You can use the .git repository retrieval specification in examples to see how this could be formatted. You may need to ask derek or carter how to then have your program parse out mailing list retrieval
  3. Include logic to keep track of the timestamp of the LAST message retrieved on each “load” …. add a “last_run” column to mailing_list_jobs, and then download only the mboxes not already included, and, for the oldest one, you will likely need to look for messages > the “last_run” column timestamp. All the UTC conversions need to occur before this comparison …
  4. Do a simple sentiment analysis on each message. Perhaps keep the “score” in an additional column in the table.
  5. (Stretch Goal) : Experiment with ways of filtering out old message strings from mailing lists and identifying mailing list discussion threads (some lists have them automatically, some don’t)

Goals

Goal 1:
Goal 1:
Now what is this about? In my last blog post we looked at how to split up really long messages, and after talking with Sean he suggested that, for people querying the SQL table, each row should record how many parts the message was split into and which part it holds. I therefore added some columns to the table; as you can see below, the new columns are ‘augurmsgID’, ‘message_part’ and ‘message_parts_tot’.

df5.to_sql(name=db_name, con=db,if_exists='replace',index=False,
            dtype={'augurmsgID': s.types.Integer,
                'backend_name': s.types.VARCHAR(length=300),
                'project': s.types.VARCHAR(length=300),
                'mailing_list': s.types.VARCHAR(length=1000),
                'category': s.types.VARCHAR(length=300),
                'message_part': s.types.Integer,
                'message_parts_tot': s.types.Integer,
                'subject': s.types.VARCHAR(length=400),
                'date': s.types.DateTime(),
                'message_from': s.types.VARCHAR(length=500),
                'message_id': s.types.VARCHAR(length=500),
                'message_text': s.types.VARCHAR(length=12000)
            })
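
With the two part columns in place, anyone querying the table can stitch a long message back together by ordering its parts. Here is a minimal sketch of such a query (hypothetical, not code from the repo; it assumes the messages live in the ‘mail_lists’ table, as in the notebook, and that ‘connect.db’ is the open database connection):

# Reassemble one split message: 'msg_id' is the Message-ID header of the
# message whose parts we want back in order.
parts_sql = s.sql.text("""SELECT message_part, message_text
                          FROM mail_lists
                          WHERE message_id = :mid
                          ORDER BY message_part""")
parts = pd.read_sql(parts_sql, connect.db, params={"mid": msg_id})
full_text = "".join(parts['message_text'])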

Here ‘message_parts_tot’ is the total number of parts the message was split into; in the function below it is simply the number of slice boundaries, computed before the loop, since each row stores at most 7000 characters of the message body and the final slice takes whatever remains. For ‘message_part’ I use a counter called row_num, set to zero before the loop and incremented by one on each pass, so each row records which part of the message it holds.

def add_row_mess(self,columns1,df,di,row,archives,augurmsgID):
    temp = di['data']['body']['plain']
    length = len(temp)
    # Slice boundaries for the message body: each row holds at most 7000
    # characters, and the final slice picks up whatever remains.
    if(length < 100):
        bounds = [length]
    else:
        bounds = list(range(100,length,7000)) + [length]
    mess_row_tot = len(bounds)  # total parts -> 'message_parts_tot'
    row_num = 0                 # current part -> 'message_part'
    prev = 0
    date = self.convert_date(di['data']['Date'])
    for j in bounds:
        row_num+=1
        li = [[augurmsgID,di['backend_name'],di['origin'],archives,
        di['category'], row_num, mess_row_tot, di['data']['Subject'],
        date, di['data']['From'],
        di['data']['Message-ID'],
        temp[prev:j] ]]
        df1 = pd.DataFrame(li,columns=columns1)
        df = df.append(df1)
        prev = j
        row+=1
        augurmsgID+=1
    return df, row, augurmsgID
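
Just to sanity-check that slicing arithmetic: a 15,500-character body, for example, ends up with boundaries at 100, 7100, 14100 and 15500, i.e. four parts:

length = 15500
bounds = list(range(100, length, 7000)) + [length]
print(bounds)       # [100, 7100, 14100, 15500]
print(len(bounds))  # 4, which is what goes in 'message_parts_tot'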

Also, overall I tried to make the python file more readable by putting everything inside the class. I may have arranged the functions in a slightly odd way; when I get further into documentation I’ll look up the conventional way to order them.

Goal 2:
I haven’t started this one yet; I wanted to get more clarification about how to implement it. Not that I was anywhere near starting it anyway, since I spent a fair bit of time on the next goal.
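
Since I haven’t settled the format yet, here is only a guess at what the entry could look like, loosely modelled on how the .git repository retrieval is specified (purely hypothetical; the actual keys still need to be confirmed with Derek or Carter):

{
    "Mail": {
        "lists": [
            "https://lists.opendaylight.org/pipermail/aalldp-dev/",
            "https://lists.opendaylight.org/pipermail/archetypes-dev/"
        ]
    }
}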

Goal 3:
Now for this we’re going to start with the jupyter notebook (PiperMail), since that’s where the idea occurred to me, and then move to the python file afterwards. The idea is this: when the user first downloads the messages, uploads them to the SQL database and creates their table, they also get a second table called ‘mailing_list_jobs’ (seen below) that holds the name of each mailing list along with the date of the most recent message retrieved for it. When they run the program again, say a day later, it downloads and uploads only the messages newer than that date. Before, I would re-download all of the messages every time, so this is really about saving space and bandwidth: if someone doesn’t download messages for a few months and then runs the program, that could be quite a lot of messages.

So I first check whether the SQL table ‘mail_lists’, which is where all the messages are stored, exists. I then check whether the table ‘mailing_list_jobs’, which stores the names of all the mailing lists, has been created; if it has, I read the mailing list names from it, and if it hasn’t, I create the table.

table_names = s.inspect(connect.db).get_table_names()
print(table_names)
val = False
mail_lists = True
if("mail_lists" not in table_names):
    mail_lists = False
if("mailing_list_jobs" in table_names):
    # The jobs table already exists: load the mailing lists it tracks.
    lists_createdSQL = s.sql.text("""SELECT project FROM mailing_list_jobs""")
    df1 = pd.read_sql(lists_createdSQL, connect.db)
    print(df1)
    val = True
else:
    # First run: create an empty jobs table to track each mailing list.
    columns2 = "backend_name","mailing_list_url","project","last_message_date"
    df_mail_list = pd.DataFrame(columns=columns2)
    df_mail_list.to_sql(name='mailing_list_jobs',con=connect.db,if_exists='replace',index=False)
    lists_createdSQL = s.sql.text("""SELECT project FROM mailing_list_jobs""")
    df1 = pd.read_sql(lists_createdSQL, connect.db)
    print("Created Table")
['issue_response_time']
Created Table

The next part is where I download the mail archives, and it splits into two cases. Either the mailing list is not yet in the SQL database, so we download the full archive and mark it as ‘new’ in a dictionary I made called ‘mail_check’ (the first branch of the loop below). Or the mailing list is already in the database, so we read its ‘last_message_date’ from ‘mailing_list_jobs’ and check whether any more recent messages exist; Perceval’s Pipermail class does this for us, and if there are new messages we write them to a JSON file and set that mailing list to ‘update’ in ‘mail_check’ (the else branch). Note that even when there are no recent messages, the most recent mail archive still has to be downloaded, because Pipermail has to check the messages inside it.

# create an Augur application so we can test our function
Piper = augurApp.piper()
print(os.getcwd())
print(perceval.__path__)
print(os.getcwd())
link = "https://lists.opendaylight.org/pipermail/"
mail = ["aalldp-dev","archetypes-dev"]
mail_check = {key:False for key in mail}
print(mail_check)
if "notebooks" in os.getcwd():
    os.chdir("..")
path = "/augur/data/opendaylight-"
for x in range(len(mail)):
    if(mail[x] not in df1['project'].values):
        # New mailing list: download the full archive into a JSON file.
        place = os.getcwd() + path + mail[x] + '.json'
        repo = Pipermail(url = link + mail[x] + "/",dirpath="tmp/archives_"+mail[x])
        outfile = open(place,"w+")
        for message in repo.fetch():
            obj = json.dumps(message, indent=4, sort_keys=True)
            outfile.write(obj)
            outfile.write('\n')
        outfile.close()
        mail_check[mail[x]] = 'new'
        print("Created File",mail[x])
    else:
        # Known mailing list: fetch only messages newer than the stored
        # 'last_message_date' and write them to a temporary JSON file.
        last_updatedSQL = s.sql.text("""SELECT last_message_date FROM 
        mailing_list_jobs WHERE project = """ +  "'" + mail[x] + "'")
        last_updated_df = pd.read_sql(last_updatedSQL, connect.db)
        time = last_updated_df['last_message_date'].astype(object)
        place = os.getcwd() + path + 'temp_' + mail[x] + '.json'
        repo = Pipermail(url = link + mail[x] + "/",dirpath="tmp/archives_"+mail[x])
        outfile = open(place,"w+")
        for message in repo.fetch(from_date=time[0]):
            mess_check = Piper.convert_date(message['data']['Date'])
            if (mess_check > time[0]):
                obj = json.dumps(message, indent=4, sort_keys=True)
                outfile.write(obj)
                outfile.write('\n')
                print("Updated messages downloaded")
                mail_check[mail[x]] = 'update'
        outfile.close()
        print("Checked for updated messages")
print(mail_check)
print("Finished downloading files")
2018-06-21 15:10:43 keanu-Inspiron-5567 perceval.backends.core.pipermail[21549] INFO Looking for messages from 'https://lists.opendaylight.org/pipermail/aalldp-dev/' since 1970-01-01 00:00:00+00:00
2018-06-21 15:10:43 keanu-Inspiron-5567 perceval.backends.core.pipermail[21549] INFO Downloading mboxes from 'https://lists.opendaylight.org/pipermail/aalldp-dev/' to since 1970-01-01 00:00:00+00:00


/home/keanu/temp/augur_push/augur4/augur/notebooks
_NamespacePath(['/home/keanu/anaconda3/envs/augur/lib/python3.6/site-packages/perceval'])
/home/keanu/temp/augur_push/augur4/augur/notebooks
{'aalldp-dev': False, 'archetypes-dev': False}


2018-06-21 15:10:45 keanu-Inspiron-5567 perceval.backends.core.pipermail[21549] INFO 1/1 MBoxes downloaded
2018-06-21 15:10:45 keanu-Inspiron-5567 perceval.backends.core.mbox[21549] INFO Done. 2/2 messages fetched; 0 ignored
2018-06-21 15:10:45 keanu-Inspiron-5567 perceval.backends.core.pipermail[21549] INFO Fetch process completed
2018-06-21 15:10:45 keanu-Inspiron-5567 perceval.backends.core.pipermail[21549] INFO Looking for messages from 'https://lists.opendaylight.org/pipermail/archetypes-dev/' since 1970-01-01 00:00:00+00:00
2018-06-21 15:10:45 keanu-Inspiron-5567 perceval.backends.core.pipermail[21549] INFO Downloading mboxes from 'https://lists.opendaylight.org/pipermail/archetypes-dev/' to since 1970-01-01 00:00:00+00:00


Created File aalldp-dev


2018-06-21 15:10:47 keanu-Inspiron-5567 perceval.backends.core.pipermail[21549] INFO 1/1 MBoxes downloaded
2018-06-21 15:10:47 keanu-Inspiron-5567 perceval.backends.core.mbox[21549] INFO Done. 2/2 messages fetched; 0 ignored
2018-06-21 15:10:47 keanu-Inspiron-5567 perceval.backends.core.pipermail[21549] INFO Fetch process completed


Created File archetypes-dev
{'aalldp-dev': 'new', 'archetypes-dev': 'new'}
Finished downloading files

Below is the outline of the ‘mailing_list_jobs’ table in the SQL database.

df_mail_list.to_sql(name="mailing_list_jobs",con=db,if_exists='replace',index=False,
                    dtype={'backend_name': s.types.VARCHAR(length=300),
                            'mailing_list_url': s.types.VARCHAR(length=300),
                            'project': s.types.VARCHAR(length=300),
                            'last_message_date': s.types.DateTime()
                    })

We now go back to the PiperReader python file. I’m mostly going to focus on these few lines of code, because the rest of the program stays largely the same. They use the ‘mail_check’ dictionary created earlier: if a mailing list’s value is ‘update’, the program reads the temporary JSON file holding only the new messages and uploads those to the database. If it’s ‘new’, it takes the JSON file that was created with all the messages and uploads all of them. And if it’s neither of these, i.e. still the default value of False, it just skips to the next mailing list.

for i in range(len(archives)):
    if(mail_check[archives[i]] == "update"):
        place = os.getcwd() + path + 'opendaylight-' + 'temp_' + archives[i]
    elif(mail_check[archives[i]] == 'new' ):
        place = os.getcwd() + path + 'opendaylight-' + archives[i]
        new = True
    else:
        print("Skipping")
        continue
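
One piece not shown here is where ‘last_message_date’ gets written back after an upload. Roughly, the idea is the sketch below (not the exact code from the file; ‘project_name’ stands for the mailing list being processed, and ‘df5’ is the dataframe that was just uploaded):

# Record the newest message date so the next run only fetches newer ones.
newest = df5['date'].max()
update_sql = s.sql.text("""UPDATE mailing_list_jobs
                           SET last_message_date = :newest
                           WHERE project = :project""")
db.execute(update_sql, {"newest": newest, "project": project_name})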

I also made convert_date its own function, because I use it in the jupyter notebook as well. It takes the date string from an email’s ‘Date’ header, strips off the timezone offset, and returns a datetime shifted to UTC, which pandas then stores as a datetime64.

# Requires: from dateutil.parser import parse; from datetime import timedelta
def convert_date(self,di):
    split = di.split()
    # The offset sign is inverted because we undo the offset to reach UTC.
    sign = split[5][0]
    if sign == '-':
        sign = +1
    else:
        sign = -1
    hours = int(split[5][1:3]) * sign
    mins = int(split[5][3:5]) * sign
    sep = " "
    # Parse the date and time without the offset, then shift to UTC.
    date = parse(sep.join(split[:5]))
    date = date + timedelta(hours = hours)
    date = date + timedelta(minutes = mins)
    return date
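
As a quick sanity check (hypothetical usage; ‘Piper’ stands in for an instance of the class, as in the notebook above):

# A -0400 offset means local time is four hours behind UTC,
# so convert_date adds the four hours back.
print(Piper.convert_date("Sun, 24 Mar 2019 20:37:11 -0400"))
# 2019-03-25 00:37:11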

Now let’s look at the Sentiment_Piper jupyter notebook, where we start looking at the sentiment of each email. You can already see some problems with what NLTK (the Natural Language Toolkit) is analysing: some of the emails contain things like error messages, which get scored even though they shouldn’t be. As for the tooling, we use NLTK together with VADER, a lexicon and rule-based sentiment model that ships with NLTK, so no extra training is needed to score messages. NLTK breaks each message into sentences, and VADER scores each one, returning four values: positive, negative, neutral, and a compound score. We’re going to have to fine-tune this a bit, for example to ignore error messages, and I’ll also have to look at stripping out quoted message threads.
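
To make the output below easier to read, here is a minimal standalone example of what VADER returns for a single sentence:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
sid = SentimentIntensityAnalyzer()
scores = sid.polarity_scores("Best Regards,\nAn Ho")
# 'scores' is a dict with the four values printed for every sentence
# below: 'neg', 'neu', 'pos' and 'compound'
print(scores)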

import os, json
import nltk
import nltk.data
from nltk.sentiment.vader import SentimentIntensityAnalyzer
#nltk.download('punkt')
#nltk.download('vader_lexicon')
#pip install twython

def read_json(p):
    # The JSON files hold one pretty-printed message object after another
    # (not a JSON array), so this walks the text character by character and
    # cuts at the first closing brace after both the "unixfrom" and "origin"
    # keys have been seen, which is the top-level close of one message.
    k = j = 0
    y = ""
    for ch in p:
        if(p[j:j+9] == "\"origin\":" or p[j:j+11] == "\"unixfrom\":"):
            k+=1
        y+=ch
        j+=1
        if(k==2 and ch == "}"):
            break
    return y,j

link = "https://lists.opendaylight.org/pipermail/"
archives = ["aalldp-dev","archetypes-dev"]
path = "/augur/data/"

if "notebooks" in os.getcwd():
    os.chdir("..")

# Initialize VADER and the sentence tokenizer once, outside the loops.
sid = SentimentIntensityAnalyzer()
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

for i in range(len(archives)):
    place = os.getcwd() + path + 'opendaylight-' + archives[i]
    f = open(place + '.json','r')
    x = f.read()
    f.close()
    data,j = read_json(x)
    di = json.loads(data)
    while(j < len(x)):
        message_text = di['data']['body']['plain']
        # Split the message into sentences and score each one with VADER.
        sentences = tokenizer.tokenize(message_text)
        for sentence in sentences:
            scores = sid.polarity_scores(sentence)
            print(sentence)
            for key in sorted(scores):
                print('{0}: {1}, '.format(key, scores[key]), end='')
            print("\n\n")
        data,r = read_json(x[j:])
        j += r
        if(j == len(x)):
            break
        print("*"*40,"NEW MESSAGE","*"*40,"\n\n")
        di = json.loads(data)
    print("\n\n\n")
    
Hi Brian Kaczynski <kaczynskib at avaya.com>, Dennis Flynn <drflynn at avaya.com>,

Please reply to this email to indicate that you are still an active committer on this project.
compound: 0.6124, neg: 0.0, neu: 0.839, pos: 0.161, 


Best Regards,
An Ho
compound: 0.6369, neg: 0.0, neu: 0.417, pos: 0.583, 


**************************************** NEW MESSAGE **************************************** 


Hi AALLDP Team,



1.
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 


Does your project have any plans to be archived?
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 


If your project has no plans for any future active development, project committers can vote to move a project to the Archived State by a Termination Review according to the OpenDaylight Project Lifecycle and Releases document [1].
compound: 0.128, neg: 0.058, neu: 0.871, pos: 0.071, 


2.
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 


If you projects plans to remain active, does it intend to participate in the Boron Simultaneous Release as documented in the Boron Release Plan [2]?
compound: 0.4019, neg: 0.0, neu: 0.899, pos: 0.101, 


If not, could someone please publicly decline to participate with a message like so: "The <PROJECT_NAME> project is formally declining to participate in the Boron SR".
compound: 0.1376, neg: 0.071, neu: 0.838, pos: 0.091, 


There are no project related activity on Gerrit.
compound: -0.296, neg: 0.239, neu: 0.761, pos: 0.0, 


Best Regards,

An Ho



[1] https://www.opendaylight.org/project-lifecycle-releases

[2] https://wiki.opendaylight.org/view/Simultaneous_Release:Boron_Release_Plan


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.opendaylight.org/pipermail/aalldp-dev/attachments/20160324/6e137881/attachment.html>
compound: 0.7506, neg: 0.0, neu: 0.726, pos: 0.274, 






This is just a test.
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 


-- 
Andrew J Grimberg
Lead, IT Release Engineering
The Linux Foundation

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://lists.opendaylight.org/pipermail/archetypes-dev/attachments/20180417/126c0b23/attachment-0001.sig>
compound: 0.296, neg: 0.0, neu: 0.932, pos: 0.068, 


**************************************** NEW MESSAGE **************************************** 


Hello archetypians,

Just to let you all know that
https://lists.opendaylight.org/pipermail/archetypes-dev/ works now (thanks
Andy!).
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 


Just in case anyone in the future every looks back at the first message on
this list, here is the background to how this all started:

* https://wiki.opendaylight.org/view/Archetypes

* https://lists.opendaylight.org/pipermail/tsc/2018-April/009333.html

* https://jira.linuxfoundation.org/browse/RELENG-854 and its linked issues

Tx,
M.
--
Michael Vorburger, Red Hat
vorburger at redhat.com | IRC: vorburger @freenode | ~ = http://vorburger.ch
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.opendaylight.org/pipermail/archetypes-dev/attachments/20180418/b55eba01/attachment.html>
compound: 0.5719, neg: 0.0, neu: 0.923, pos: 0.077, 

Resources: https://programminghistorian.org/en/lessons/sentiment-analysis

Files Used: Python File - PiperRead 12
Jupyter Notebook - PiperMail 4

Jupyter Notebook - Sentiment_Piper 2