Discover Email Structures


Overview

Emails are used for communication and therefore form a network. Is it possible to discover structures from our emails? This post answers that question: what does my email connection structure look like, who are the central people who connect teams, who do I interact with most, and which communities can be discovered?

Using Python, Pandas and NetworkX as the main analytics tools, a Jupyter notebook with matplotlib or D3.js for presentation, and a Python implementation of BigCLAM for community detection. For privacy reasons, all email addresses and names were replaced by User1, User2, etc.

There are a few steps involved in email discovery; each is described in more detail with code examples:

  • Step 1: Export email information from the mailing system (Outlook Exchange)
  • Step 2: Pre-process1 - Generate nodes (person/mail address) and edges (email between sender and recipient)
  • Step 3: Pre-process2 - Arrange and Filter the top connections
  • Step 4: Create a network graph using nodes and edges
  • Step 5: Apply different algorithms, such as node importance and community detection, to find relations in the email graph

Result

The graph below is the result after applying steps 1-5. Each node represents a mail address and each edge represents a communication between one mailer and another. The graph shows the top 500 connections - mailers that have the most interactions between them. The mailer's color is based on its domain, its size is based on the number of connections it has, and the link width is related to the number of mails between two connected mailers. Some mailers are used more for connecting or coordinating between teams. These relations in the graph are shown by clicking the buttons, which mark the relevant nodes.
The actual node names are masked for privacy.

  • Me - I'm not connected to all mailers; in fact, I don't even know many of them because I was just CCed on a mail or received it via a mailing list. I can see how I connect with other people and who is involved in the connection.
  • Betweenness - Mailers that rank highest in "betweenness centrality". These mailers lie on the shortest paths between other nodes and connect them.
  • Hubs and Authorities - Central mailers based on the HITS algorithm. Hubs are central based on outgoing links and point to many other mailers; authorities are central based on incoming links and represent mailers linked to by many different hubs.
  • Most Interactive - The mailers I interact with most, either receiving mails from them or sending mails to them.
  • Only Send - Mailers that only send mails. Usually these are ad/info mailers or people who publish data.
  • Only Receive - Mailers that only receive mails. Usually these are mailing lists.
  • Communities - Groups of mailers that are highly connected as a group, based on a modified version of the BigCLAM algorithm that I implemented in Python. A mailer can belong to multiple groups, and the probability of two nodes being highly connected grows with the number of groups they share. The algorithm found multiple such groups, which indeed correspond to teams.
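The community model in the last bullet can be sketched numerically. In BigCLAM, each node gets a nonnegative membership weight per community, and the probability of an edge between two nodes grows with their shared memberships. A minimal sketch with made-up membership vectors:

```python
import numpy as np

# Hypothetical membership weights for three mailers over 3 communities.
Fu = np.array([0.9, 0.0, 0.4])   # mailer u
Fv = np.array([0.8, 0.0, 0.5])   # mailer v: shares communities 0 and 2 with u
Fw = np.array([0.0, 1.0, 0.0])   # mailer w: no shared community with u

def edge_prob(Fa, Fb):
    # BigCLAM edge probability: P(edge) = 1 - exp(-Fa . Fb)
    return 1.0 - np.exp(-Fa.dot(Fb))

print(edge_prob(Fu, Fv))  # high: ~0.60, strong shared membership
print(edge_prob(Fu, Fw))  # no shared community -> 0.0
```

The more communities two mailers share (and the stronger their memberships), the larger the dot product and hence the edge probability.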

Here's a description of each step and its associated code:

Step 1: Export email information from mailing system

I'm using Outlook Exchange as my mailing system. To extract the mail info, I used a Visual Basic program that exports the sender and recipients of each mail, along with the mail header and time, into a CSV file.
Visual Basic was used since Outlook's standard export does not include the time.
The Visual Basic program below reads all emails in a folder. For each mail it outputs the sender address, sender name, the lists of recipient addresses and names, the time, and the email title. I saved the output in the file "nir_mails_merged.csv"

# Show / Hide code Button
from IPython.display import HTML
HTML('''<button class="toggle">Show/Hide Code</button>''')
Sub ExportToExcel()
' Need to enable Tools --> References (in the VBE). Check Microsoft Excel XXX Object Library
    Dim appExcel As Excel.Application
    Dim wkb As Excel.Workbook
    Dim wks As Excel.Worksheet
    Dim rng As Excel.Range
    Dim strSheet
    Dim strPath As String
    Dim intRowCounter As Integer
    Dim msg As Outlook.MailItem
    Dim nms As Outlook.NameSpace
    Dim fld As Outlook.MAPIFolder
    Dim itm As Object
    Dim Header As Variant

    'Select export folder
    Set nms = Application.GetNamespace("MAPI")
    Set fld = nms.PickFolder
    'Handle potential errors with Select Folder dialog box.
    If fld Is Nothing Then
        MsgBox "There are no mail messages to export", vbOKOnly, "Error"
        Exit Sub
    ElseIf fld.DefaultItemType <> olMailItem Then
        MsgBox "There are no mail messages to export", vbOKOnly, "Error"
        Exit Sub
    ElseIf fld.Items.Count = 0 Then
        MsgBox "There are no mail messages to export", vbOKOnly, "Error"
        Exit Sub
    End If

    'Open and activate Excel workbook.
    Set appExcel = CreateObject("Excel.Application")
    strSheet = appExcel.GetSaveAsFilename(Title:="Choose Output File", FileFilter:="Excel Files *.xls* (*.xls*),")
    If strSheet = False Then
        MsgBox "No file selected.", vbExclamation, "Sorry!"
        Exit Sub
    Else
        appExcel.Workbooks.Open (strSheet)
        Set wkb = appExcel.ActiveWorkbook
        Set wks = wkb.Sheets(1)
        wks.Activate
        appExcel.Application.Visible = True
    End If

    'Add header
    Header = Array("From: (Address)", "From: (Name)", "To: (Address)", "To: (Name)", "CC: (Address)", "CC: (Name)", "BCC: (Address)", "BCC: (Name)", "Received", "Subject")
    wks.Range("A1:J1") = Header

    'Copy field items in mail folder.
    intRowCounter = 1
    For Each itm In fld.Items
        If TypeOf itm Is MailItem Then
            Set msg = itm
            If msg.MessageClass <> "IPM.Note.SMIME.MultipartSigned" Then
                intRowCounter = intRowCounter + 1
                Set rng = wks.Cells(intRowCounter, 1)
                rng.Value = msg.SenderEmailAddress
                Set rng = wks.Cells(intRowCounter, 2)
                rng.Value = msg.SenderName
                Set rng = wks.Cells(intRowCounter, 3)
                rng.Value = getRecepiantAddress(msg, olTo)
                Set rng = wks.Cells(intRowCounter, 4)
                rng.Value = msg.To
                ' Match the header order: address column before name column
                Set rng = wks.Cells(intRowCounter, 5)
                rng.Value = getRecepiantAddress(msg, olCC)
                Set rng = wks.Cells(intRowCounter, 6)
                rng.Value = msg.CC
                Set rng = wks.Cells(intRowCounter, 7)
                rng.Value = getRecepiantAddress(msg, olBCC)
                Set rng = wks.Cells(intRowCounter, 8)
                rng.Value = msg.BCC
                Set rng = wks.Cells(intRowCounter, 9)
                rng.Value = msg.ReceivedTime
                Set rng = wks.Cells(intRowCounter, 10)
                rng.Value = msg.Subject
            End If
        End If
    Next itm    
End Sub

Function getRecepiantAddress(msg As Outlook.MailItem, rtype As OlMailRecipientType) As String
    Dim recip As Recipient
    Dim allRecips As String
    allRecips = ""
    For Each recip In msg.Recipients
        If recip.Type = rtype Then
            If (Len(allRecips) > 0) Then allRecips = allRecips & "; "
            allRecips = allRecips & recip.Address
        End If
    Next
    getRecepiantAddress = allRecips
End Function

Step 2: Pre-process1 - Generate nodes and edges

After running the Visual Basic program, I run a pre-processing script to create the list of nodes, addresses and associated IDs (attributes.csv), the number of emails sent from every sender to every recipient (email_counts.csv), and a mapping between group ID and group name/domain (groups.csv).
The scripts are a slightly modified version of: http://flowingdata.com/2014/05/07/downloading-your-email-metadata/

Importing the necessary Python libraries

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
import csv
import sys
import json
from datetime import datetime
import networkx as nx
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcl
%matplotlib inline

Input files

File "nir_mails_merged.csv" is the output of step 1; settings.json is given below.

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
{
  "no_special_characters":false,
  "group_by_address_domain":true,
  "include_email_in_attributes":true,
  "default_group_name":"DEFAULT",
  "address_delimiter": ";",
  "split_address_by_text": "CN=RECIPIENTS/CN=",
  "name_delimiter": ";",
  "output_delimiter": ",",
  "csv_keys": {
    "from_email": "From: (Address)",
    "to_email": "To: (Address)",
    "cc_email": "CC: (Address)",
    "bcc_email": "BCC: (Address)",
    "from_name":"From: (Name)",
    "to_name":"To: (Name)",
    "cc_name":"CC: (Name)",
    "bcc_name":"BCC: (Name)", 
    "date":"Received",
    "subject":"Subject"

  },
  "address_string_to_ignore":"/O=CISCO SYSTEMS"
}

Pre-processing code

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
# Run parameters
csv_input_file = "nir_mails_merged.csv"
# Prompts for what to include
include_cc = 'N'
include_bcc = 'N'
trigger_max_recipients = 'N'
max_recipients = 0
to_date = "10/8/2017  23:59"  # mails to this date are included
date_range_days = 365 # Include one year of mails

# Array of which types of emails to include
recipient_scope = ['to']
if include_cc == 'Y':
    recipient_scope.append('cc')
if include_bcc == 'Y':
    recipient_scope.append('bcc')

# Settings read into variables
settings_file = open('settings.json', 'r')
settings = settings_file.read()
settings = json.loads(settings)

csv_keys = settings['csv_keys']
no_special_characters = settings['no_special_characters']
output_delimiter = settings['output_delimiter'].encode('utf_8')
name_delimiter = settings['name_delimiter'].encode('utf_8')
address_delimiter = settings['address_delimiter'].encode('utf_8')
split_address_by_text = settings['split_address_by_text'].encode('utf_8')
group_by_address_domain = settings['group_by_address_domain']
default_group_name = settings['default_group_name'].encode('utf_8')
address_string_to_ignore = settings['address_string_to_ignore'].encode('utf_8')
include_email_in_attributes = settings['include_email_in_attributes']

# Helper functions
def strip_non_ascii(string):
    # Returns the string without non ASCII characters
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)

def process_name(string):
    string = string.strip().replace("'","")
    if string is None or string == "":
        string = "BLANK"
    if no_special_characters:
        string = strip_non_ascii(string)
    return string

def date_diff_days(from_date, to_date):
    fd = datetime.strptime(from_date, "%m/%d/%Y  %H:%M")
    td = datetime.strptime(to_date, "%m/%d/%Y  %H:%M")
    delta = td-fd
    return delta.days

# Open input file
input_file = csv.DictReader(open(csv_input_file))

# Variables to store node and tie details
nodes = {}
nodes_with_ids = {}
ties = {}

# Preparing text strings
group_text = "\"sep=" + output_delimiter + "\"" + "\n"
group_text += "group_id" + output_delimiter + "group_name\n"
ties_text = "\"sep=" + output_delimiter + "\"" + "\n"
ties_text += "sender" + output_delimiter + "recipient" + output_delimiter + "count\n"
attribute_text = ""  # \"sep=" + output_delimiter + "\"" + "\n"
attribute_text += "\"id\"" + output_delimiter + "\"name\"" + output_delimiter + "\"group_id\""
if include_email_in_attributes:
    attribute_text += output_delimiter + "\"email\""
attribute_text += "\n"

# Read each line in the CSV file
for row in input_file:
    
    msg_days = date_diff_days(row[csv_keys["date"]], to_date)
    if msg_days>0 and msg_days<date_range_days:
        
        # Iterate through each scope that has been included (to, cc, bcc)
        for scope in recipient_scope:
    
            # Populating nodes with recipients
            recipient_addresses = []
            sender_addresses = []
    
            if row[csv_keys[scope + '_email']] is not None and row[csv_keys[scope + '_name']] is not None:
                temp_recipient_addresses = row[csv_keys[scope + '_email']].upper().replace(address_delimiter, split_address_by_text).split(split_address_by_text)
                recipient_names = row[csv_keys[scope + '_name']].upper().split(name_delimiter)
    
                for temp_recipient_address in temp_recipient_addresses:
                    if address_string_to_ignore not in temp_recipient_address:
                        recipient_addresses.append(temp_recipient_address)
    
                if len(recipient_addresses) > len(recipient_names):
                    recipient_addresses.pop(0)
                elif len(recipient_addresses) != len(recipient_names):
                    print 'PROBLEM: ' + str(len(recipient_addresses)) + " ADDRESSES VS " + str(len(recipient_names)) + " NAMES"
    
            # Check against max count and skip if more included
            if trigger_max_recipients == 'N' or (trigger_max_recipients == 'Y' and len(recipient_names) <= int(max_recipients)):
                for idx, recipient_address in enumerate(recipient_addresses):
                    nodes[recipient_address.split(address_delimiter)[0]] = recipient_names[idx]
    
                # Populating nodes with senders
                if row[csv_keys['from_email']] is not None and row[csv_keys['from_name']] is not None:
                    temp_sender_addresses = row[csv_keys['from_email']].upper().replace(address_delimiter, split_address_by_text).split(split_address_by_text)
                    sender_names = row[csv_keys['from_name']].upper().split(";")
    
                    for temp_sender_address in temp_sender_addresses:
                        if address_string_to_ignore not in temp_sender_address:
                            sender_addresses.append(temp_sender_address)
    
                # Populating ties
                for idx, sender_address in enumerate(sender_addresses):
                    nodes[sender_address.upper().split(address_delimiter)[0]] = sender_names[idx]
        
                    if sender_address not in ties:
                        ties[sender_address] = {}
                    for recipient_address in recipient_addresses:
                        recipient_address = recipient_address.upper().split(";")[0]
                        if recipient_address not in ties[sender_address]:
                            ties[sender_address][recipient_address] = 0
                        ties[sender_address][recipient_address] += 1
                        

# Variables for setting ids and capture group details
node_id = 1
groups = {}
group_id = 1
nodes_listed_by_id = {}

# Structure node information and write group text
for node_email, node_name in nodes.iteritems():
    nodes_with_ids[node_email] = {}
    nodes_with_ids[node_email]['id'] = node_id
    nodes_with_ids[node_email]['name'] = node_name
    nodes_with_ids[node_email]['email'] = node_email
    nodes_listed_by_id[node_id] = node_email

    # Capturing group name if relevant - TODO: also if not set
    if group_by_address_domain:
        split_email = node_email.split("@")
        if len(split_email) > 1:
            group_name = split_email[1].strip()
        else:
            group_name = default_group_name
        if group_name not in groups.keys():
            groups[group_name] = group_id
            group_id += 1
    else:
        # Note: will be set to the same value over and over again
        group_name = default_group_name
        groups[group_name] = group_id 

    nodes_with_ids[node_email]['group_name'] = group_name
    nodes_with_ids[node_email]['group_id'] = groups[group_name]
    node_id += 1

# Write group text
for group_name, group_id in groups.iteritems():
    group_text += str(group_id) + output_delimiter + "\"" + group_name + "\"\n"

# Write tie text
for sender, recipients in ties.iteritems():
    for recipient, count in recipients.iteritems():
        ties_text += str(nodes_with_ids[sender]['id']) + output_delimiter + str(nodes_with_ids[recipient]['id']) + output_delimiter + str(count) + "\n"

# Write attribute text
for node_id, node_email in nodes_listed_by_id.iteritems():
    node_details = nodes_with_ids[node_email]

    attribute_text += str(node_details['id']) + output_delimiter + "\"" + process_name(node_details['name']) + "\"" + output_delimiter
    if group_by_address_domain:
        attribute_text += str(node_details['group_id'])
    else:
        attribute_text += str(1)
    if include_email_in_attributes:
        attribute_text += output_delimiter + "\"" + process_name(node_details['email']) + "\""
    attribute_text += "\n"        
    
# Write output files
attribute_file = open('attributes.csv', 'w')
attribute_file.write(attribute_text)
attribute_file.close()

groups_file = open('groups.csv', 'w')
groups_file.write(group_text)
groups_file.close()

email_counts_file = open('email_counts.csv', 'w')
email_counts_file.write(ties_text)
email_counts_file.close()

# Check that names are in ascii form, if not it creates an error in the graph visualization in networkx
nodeNames_df = pd.read_csv('attributes.csv', header=None, skiprows=1, names=['ID', 'Name', 'GroupID', 'Email'])
nodeNames_df.set_index(['ID'], inplace=True)
# Check ascii encoding
def check_non_ascii_labels(node_labels):
    non_ascii = False
    for n in node_labels.items():
        try:
            txt = n[1].decode('ascii')
        except:
            non_ascii = True
            n_pos = []
            for i,c in enumerate(n[1]):
                oc = ord(c)
                if oc >= 128: n_pos.append(i)
            print n[1], "   :Non ascii Pos:", n_pos
    if non_ascii == False: print "Non-ascii not found - attribute list is good to go!"
node_labels =dict(nodeNames_df['Name'])
check_non_ascii_labels(node_labels)
Non-ascii not found - attribute list is good to go!

Step 3: Pre-process2 - Arrange and Filter the top connections

This step reads the email_counts.csv and attributes.csv files created in step 2 and filters the top connections.
Here's how these tables look after reading them into Pandas:

email_counts.csv

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
email_dff = pd.read_csv('email_counts.csv', header=None, skiprows=2, names=['n1', 'n2', 'count'])
email_dff.head()
n1 n2 count
0 828 4139 2
1 828 1450 2
2 828 2439 2
3 828 2911 2
4 828 4156 2

attributes.csv

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
nodeNames_df = pd.read_csv('attributes.csv', header=None, skiprows=1, names=['ID', 'Name', 'GroupID', 'Email'])
nodeNames_df.set_index(['ID'], inplace=True)
# Replace names that include the ? char with the email address
hnodes = nodeNames_df['Name'].str.contains("\?")
nodeNames_df.loc[hnodes,'Name'] = nodeNames_df.loc[hnodes,'Email']
# Mask user names and keep my node as "Me"
nodeNames_df['MaskedName'] = ['User'+str(i) for i in nodeNames_df.index]
nodeNames_df.loc[nodeNames_df['Email'].str.contains("NIRBD"), 'MaskedName'] = 'Me'
nodeNames_df[['GroupID', 'MaskedName']].head()
GroupID MaskedName
ID
1 1 User1
2 1 User2
3 1 User3
4 2 User4
5 3 User5

Filter top connections

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
# Filter X top count edges
use_top_vals = True
num_top_vals = 500
doMaskNames = True
email_dff.sort_values(['count'], axis=0, ascending=False, inplace=True)
if use_top_vals:
    email_df = email_dff[:num_top_vals]
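As a side note, the same top-k selection can be written with the pandas DataFrame.nlargest method, which avoids sorting the whole frame in place. A minimal sketch on toy data (the column names mirror email_counts.csv):

```python
import pandas as pd

# Toy edge counts standing in for email_counts.csv
df = pd.DataFrame({'n1': [1, 1, 2, 3], 'n2': [2, 3, 3, 1],
                   'count': [5, 2, 9, 7]})
top2 = df.nlargest(2, 'count')   # same rows as sort_values(...)[:2]
print(top2['count'].tolist())    # [9, 7]
```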

Step 4: Create a network graph using nodes and edges

This step creates and visualizes the email connection graph from the selected connections.

To visualize the graph, I used a custom layout where nodes are placed according to the shortest path length from each node to the other nodes. The edge width represents the email count between two nodes, and the node color represents the email group/domain.

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
G = nx.from_pandas_dataframe(email_df, 'n1', 'n2', edge_attr='count', create_using=nx.DiGraph())
# Filter node labels for only those who are used in the top edges
nodeNames_df = nodeNames_df.loc[list(G.nodes()),:]

def layout_dist(G):
    # This controls the distance between nodes
    # It assigns shortest path length + 10, or 20 when no path exists
    ret = dict()
    for row, nr in enumerate(G):
        rdist = dict()
        for col, nc in enumerate(G):
            try:
                d = nx.shortest_path_length(G,source=nr,target=nc)
            except:
                d = 10
            rdist[nc] = d + 10
        ret[nr] = rdist
    return ret

pos = nx.kamada_kawai_layout(G, weight='count', dist=layout_dist(G))
node_color = np.array(nodeNames_df['GroupID'])
edge_width = np.array([G[u][v]['count'] for u,v in G.edges()])
edge_width = 1.0 + 3.0 * edge_width / max(edge_width)
if doMaskNames: 
    node_labels = dict(nodeNames_df['MaskedName'])
else:
    node_labels = dict(nodeNames_df['Name'])
    
# Function to draw the graph using Python and matplotlib.
# It's not used here; the graph at the bottom is based on D3.js
def draw_graph(G, pos, node_color, edge_width, node_labels, node_size=300, label_fsize=8):
    fig = plt.figure(figsize=(20,18))
    nodes = nx.draw_networkx_nodes(G, pos, node_shape='o', alpha=0.4, node_color=node_color, 
                                   node_size=node_size, cmap=plt.cm.terrain)
    nx.draw_networkx_labels(G, pos, labels=dict([(n, str(n)) for n in G.nodes()]), font_size=10, font_color='black')
    lpos = dict(map(lambda v:(v[0], np.array([v[1][0], v[1][1]+0.02])), pos.items()))
    nx.draw_networkx_labels(G, lpos, labels=node_labels , font_size=label_fsize, font_color='black')
    nx.draw_networkx_edges(G, pos, alpha=0.4, edge_color='.4', width=edge_width, arrows=False, cmap=plt.cm.Pastel1)
    # Reset graph edge color and width
    node_edgecolor = ['black'] * G.number_of_nodes()
    node_linewidth = [1] * G.number_of_nodes()
    nodes.set_edgecolor(node_edgecolor)
    nodes.set_linewidth(node_linewidth)
    plt.axis('off')
    plt.tight_layout();
    
    return nodes, fig

# nodes are pathCollection when drawing nodes, node_list are label of nodes to mark
def mark_nodes(nodes, node_list, color, reset):
    node_edgecolor = nodes.get_edgecolor()
    node_linewidth = nodes.get_linewidth()
    for i, n in enumerate(G):
        if n in node_list:
            node_edgecolor[i]=mcl.colorConverter.to_rgba(color)
            node_linewidth[i] = 5
        else:
            if reset:
                node_edgecolor[i] = mcl.colorConverter.to_rgba('black')
                node_linewidth[i] = 1
    nodes.set_edgecolor(node_edgecolor)
    nodes.set_linewidth(node_linewidth)

#nodes, fig = draw_graph(G, pos, node_color, edge_width, node_labels, node_size=400, label_fsize=12)

Step 5: Apply different algorithms

Node Importance

There are different measures of node importance.
Here, nodes are marked based on betweenness centrality: nodes that appear more often on shortest paths (i.e., that connect other nodes).

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
# Importance by betweenness: nodes that appear more often on shortest paths (connecting other nodes) are considered more important
#nodes, fig = draw_graph(G, pos, node_color, edge_width, node_labels, node_size=400, label_fsize=12)
ib = sorted(nx.betweenness_centrality(G, normalized=True, endpoints=True).items(), key=lambda x: x[1], reverse=True)
#mark_nodes(nodes, [n[0] for n in ib[0:10]], 'red', True)
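To see what betweenness centrality captures, here is a toy example unrelated to the mail data: a small bridge-shaped graph where one node lies on every shortest path between the two sides.

```python
import networkx as nx

# A "bridge" topology: node 2 connects nodes 3 and 4 to the 0-1 chain
B = nx.Graph([(0, 1), (1, 2), (2, 3), (2, 4)])
bc = nx.betweenness_centrality(B, normalized=False)
# Node 2 sits on the most shortest paths between other nodes
print(max(bc, key=bc.get))  # 2
```

In the email graph, the analogous top-betweenness mailers are the coordinators who connect otherwise separate teams.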

Another measure is Hubs and Authorities.
Authorities are estimated based on incoming links, hubs based on outgoing links.

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
#nodes, fig = draw_graph(G, pos, node_color, edge_width, node_labels, node_size=400, label_fsize=12)
(hubs, auth) = nx.hits(G)
ih = sorted(hubs.items(), key=lambda x: x[1], reverse=True)
ia = sorted(auth.items(), key=lambda x: x[1], reverse=True)
#mark_nodes(nodes, [n[0] for n in ih[0:10]], 'blue', True)
#mark_nodes(nodes, [n[0] for n in ia[0:10]], 'red', False)

People I most Interact with

The following shows the people I most interact with.
People in red are those who send the most mails to me or receive the most mails from me.

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
#nodes, fig = draw_graph(G, pos, node_color, edge_width, node_labels, node_size=400, label_fsize=12)
# My node
MyNodeNum = nodeNames_df[nodeNames_df['Email'].str.contains("NIRBD")].index[0]
# The count DF is already sorted, so take the top senders to me
MostSendingToMe = list(email_df[email_df['n2']==MyNodeNum][:6]['n1'])
MostReceivingFromMe = list(email_df[email_df['n1']==MyNodeNum][:6]['n2'])
MostInteractingWithMe = list(MostReceivingFromMe)
MostInteractingWithMe.extend(MostSendingToMe)
MostInteractingWithMe = set(MostInteractingWithMe)
#mark_nodes(nodes, MostInteractingWithMe, 'red', True)

Mailers that are only receiving or only sending

The following marks only-receiving mailers; these are mostly mailing lists. Only-sending mailers are mostly ads or distributors.

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
#nodes, fig = draw_graph(G, pos, node_color, edge_width, node_labels, node_size=400, label_fsize=12)
# Find nodes that only receive mails
only_receive = email_dff[~email_dff['n2'].isin(email_dff['n1'])].iloc[0:30]['n2'].unique()
#mark_nodes(nodes, only_receive, 'red', True)
# Find nodes that only send mails
only_send = email_dff[~email_dff['n1'].isin(email_dff['n2'])].iloc[0:]['n1'].unique()
#mark_nodes(nodes, only_send, 'blue', False)

Teams and Communities

The following is an implementation of the BigCLAM algorithm for finding communities. Each node can belong to multiple communities; the most probable community is displayed as a different color.

BigCLAM algorithm for finding communities

https://cs.stanford.edu/people/jure/pubs/bigclam-wsdm13.pdf

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
A = nx.adjacency_matrix(G, weight='count').todense()

def log_likelihood(F, A, IA):
    # IA is a matrix with one where A = zero
    As = F.dot(F.T)

    part1 = A*np.log(1.-np.exp(-1.*As))
    sum_edges = np.sum(part1)
    part2 = IA*As
    sum_nedges = np.sum(part2)

    log_likeli = sum_edges - sum_nedges
    return log_likeli

def gradient(F, A, sumF, i):
    N, C = F.shape

    neighbours = np.where(A[i])
    sum_neigh = np.zeros((C,))
    sum_nneigh = sumF.copy()
    for nb in neighbours[1]:
        dotproduct = F[nb].dot(F[i])
        sum_neigh += A[nb,i]*F[nb]*(np.divide(np.exp(-1.*dotproduct),1.-np.exp(-1.*dotproduct)))
        sum_nneigh -= F[nb]
        
    grad = sum_neigh - sum_nneigh
    return grad

def train(A, C, iterations = 100):
    # initialize an F
    N = A.shape[0]
    np.random.seed(0)
    F = np.random.rand(N,C)
    
    rows, cols = np.where(A==0)
    IA = np.zeros(np.shape(A), dtype=np.int)
    IA[rows,cols] = 1    

    for n in range(iterations):
        # Calculate sumF in advance so the gradient is computed over neighbours only
        sumF = F.sum(axis=0)
        for person in range(N):
            grad = gradient(F, A, sumF, person)
            F[person] += 0.005*grad
            F[person] = np.maximum(0.001, F[person]) # Handle negatives
        ll = log_likelihood(F, A, IA)
    return F
F = train(A, 6, iterations=10001)
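For reference, train performs gradient ascent on the BigCLAM log-likelihood from the paper linked above (the code above additionally weights the edge term by the mail counts in A, which is the modification mentioned earlier):

```latex
l(F) = \sum_{(u,v) \in E} \log\left(1 - e^{-F_u F_v^{\top}}\right)
     - \sum_{(u,v) \notin E} F_u F_v^{\top}
```

Here F_u is the row of nonnegative community-membership weights for node u; the first sum rewards explained edges and the second penalizes predicted edges that don't exist.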

Display Communities in the Graph

# Show / Hide code Button
HTML('''<button class="toggle">Show/Hide Code</button>''')
def find_nodes_for_community(node_communities, c):
    ret = []
    # c = community number
    for k,v in node_communities.items():
        if c in v[1]:
            ret.append(k)
    return ret

# Find communities each node belongs to
node_communities = {}
for i,n in enumerate(G):
    # For each graph node, include a tuple with the node index and list of communities
    node_communities[n] = (i, list(np.where(F[i]>0.01)[0]))
    
node_colors = list(np.argmax(F,1))
node_w_community = []
max_col = np.shape(F)[1]
# Color nodes that don't have a community with the extra color; create the list of nodes that belong to a community
for k,v in node_communities.items():
    if len(v[1]) == 0:  # Node doesn't belong to a community, color it with max_col
        node_colors[v[0]] = max_col
    elif len(v[1])>1 or (v[1][0] != 0):  # don't include community 0
        node_w_community.append(k)

# Draw the subgraph with associated communities
nodeset = find_nodes_for_community(node_communities, 3)
# show only nodes with community
GC = G.subgraph(node_w_community)
cpos = nx.spring_layout(GC)
cnode_color = [node_colors[node_communities[n][0]] for n in GC.nodes()]
cedge_width = np.array([GC[u][v]['count'] for u,v in GC.edges()])
cedge_width = 1.0 + 3.0 * cedge_width / max(cedge_width)
if doMaskNames:
    cnode_labels = dict(nodeNames_df['MaskedName'].loc[node_w_community])
else:
    cnode_labels = dict(nodeNames_df['Name'].loc[node_w_community])    
#cnodes, cfig = draw_graph(GC, cpos, cnode_color, cedge_width, cnode_labels, node_size=300, label_fsize=8)

Summary

It's possible to visualize our email structure and explore the relationships within it.


Further mail analysis could be done to:

  • Extract email patterns over time
  • Add text analysis to detect communities, sentiments, etc.
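As a starting point for the first bullet, the Received column exported in step 1 could be bucketed by hour or weekday with Pandas. A minimal, hypothetical sketch (toy rows stand in for the real CSV):

```python
import pandas as pd

# Toy rows standing in for nir_mails_merged.csv from step 1
mails = pd.DataFrame({
    'Received': ['10/2/2017  09:15', '10/2/2017  09:45', '10/3/2017  17:30'],
    'Subject': ['a', 'b', 'c'],
})
# Same date format used by date_diff_days in step 2
ts = pd.to_datetime(mails['Received'], format='%m/%d/%Y  %H:%M')
per_hour = ts.dt.hour.value_counts().sort_index()
print(per_hour.to_dict())  # {9: 2, 17: 1}
```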

About me

I'm a system/software architect working in the Core Software Group at Cisco Systems, with many years of experience in software development, management, and architecture. I have worked on various network services products in the areas of security, media, application recognition, visibility, and control. I hold a Master of Science degree in electrical engineering and business management from Tel Aviv University. Machine learning and analytics are among my favorite topics. More can be found on my website: http://nir.bendvora.com/Work