Be Prepared for the Unexpected

“There are known knowns, known unknowns, and unknown unknowns.”
–Hon. Donald Rumsfeld

Following up from the previous post, I built a simple Python program to read in our SampleEmailData.csv file. The file has two columns (Subj; Body) with the rest of the metadata removed. The data set is composed of some actual emails (sanitized) that are representative of the types of emails received in a real-world operating environment. To recap the previous post, our task is as follows:

Task: Automatically extract contact information from archived emails.

In our data set we have some emails with no contact info, some with basic Name/Email lines, and some with highly structured signature blocks. I also included some email chains that contain multiple contacts in a single message body, and some messages that contain special characters (e.g. “Meet 1400 @ building 149”), just to see how we can handle those.

First, I did a quick check to see whether there was any contact information in the message using Python’s regular expression (re) library. If the message body contains text of the form [\w-]+@[\w-]+\.\w{2,3} or \d{3}-\d{3}-\d{4}, then we have reason to believe there is an email address or a phone number (assuming American formatting) in the message body.

[Note: \w is the code for any character A-Z, a-z, 0-9, or _ and \d is any digit 0-9]
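
The quick check can be sketched roughly as follows. This is a minimal illustration, not the exact code from the program: the function name has_contact_info and the constant names are assumptions, and the patterns are written out with the quantifiers and escaped dot that Python's re syntax requires.

```python
import re

# Assumed patterns, following the forms described above.
EMAIL_RE = re.compile(r"[\w-]+@[\w-]+\.\w{2,3}")
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")  # American-style numbers only


def has_contact_info(body):
    """Return True if the body appears to hold an email address or phone number."""
    return bool(EMAIL_RE.search(body) or PHONE_RE.search(body))


print(has_contact_info("Reach me at jane-doe@example.com"))  # True
print(has_contact_info("Meet 1400 @ building 149"))          # False
```

Note that the stray "@" in the second message does not trigger a match, because the email pattern requires word characters immediately on both sides of the "@".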

We then build an object (class objData) to store the extracted contact information. We first split the email body into lines (on \r\n) and check each line for a matching expression (in dataLoad). Then we use a Python list comprehension to build a list of words (phrases) that most likely contain contact information.
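
A rough sketch of what such a class might look like is below. Only the names objData and dataLoad come from the program; the attributes, the constructor arguments, and the combined pattern are assumptions made for illustration. This version keeps only lines that contain an email address or phone number, so capturing name lines from a signature block would need additional logic.

```python
import re

# Assumed combined pattern: an email address or an American-style phone number.
CONTACT_RE = re.compile(r"[\w-]+@[\w-]+\.\w{2,3}|\d{3}-\d{3}-\d{4}")


class objData:
    """Holds the lines of one email body that look like contact information."""

    def __init__(self, subject, body):
        self.subject = subject
        self.contact_lines = []
        self.dataLoad(body)

    def dataLoad(self, body):
        # splitlines() handles \r\n as well as bare \n line endings.
        lines = body.splitlines()
        # List comprehension: keep only the lines with a pattern match.
        self.contact_lines = [ln.strip() for ln in lines if CONTACT_RE.search(ln)]


body = "Hi,\r\nSee you then.\r\nJane Doe\r\njane@example.com\r\n555-123-4567"
record = objData("Meeting", body)
print(record.contact_lines)  # ['jane@example.com', '555-123-4567']
```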

Once we have the objData objects built, we write each line of contact information to its own row of a .csv file. Remembering the last element of the military’s TPME planning scheme, ‘Effects’, let’s evaluate how effective our solution is at meeting our stated task. Here are the results. Each line of correct contact information is colored green, and each line of saved information that isn’t contact info (inside the signature block) is colored red.
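
Writing the extracted lines out could look something like the sketch below, using Python's csv module. The function name write_contacts, the output column names, and the (subject, lines) record shape are all assumptions for illustration; an in-memory buffer stands in for the real output file.

```python
import csv
import io


def write_contacts(records, out_file):
    """Write one CSV row per extracted line: (subject, contact line)."""
    writer = csv.writer(out_file)
    writer.writerow(["Subj", "Contact"])  # assumed header names
    for subject, contact_lines in records:
        for line in contact_lines:
            writer.writerow([subject, line])


# Usage with an in-memory buffer; a real run would pass
# open("contacts.csv", "w", newline="") instead.
buffer = io.StringIO()
write_contacts([("Meeting", ["jane@example.com", "555-123-4567"])], buffer)
print(buffer.getvalue())
```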

Manual labeling of the sample data shows 224 lines of contact info in the original data. We returned 207 lines of actual contact information, the true positives (TP), and 34 lines of incorrect information, the false positives (FP). The 224 - 207 = 17 lines of contact info we missed are the false negatives (FN). That gives us a precision of TP/(TP+FP) = 207/(207+34) = 0.8589 and a recall of TP/(TP+FN) = 207/(207+17) = 0.9241. The precision indicates that most of what the program saves is genuinely useful, and the recall indicates that it finds the large majority of the available contact information.
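
The arithmetic can be double-checked with a few lines of Python. The helper name precision_recall is just an illustrative choice; the counts are the ones reported above.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


total_contact_lines = 224       # from manual labeling
tp, fp = 207, 34                # lines returned: correct and incorrect
fn = total_contact_lines - tp   # contact lines we missed

precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.4f} recall={recall:.4f}")
# precision=0.8589 recall=0.9241
```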

