Difference between revisions of "APA Final Progress"
(Add Column information) |
|||
Line 31: | Line 31: | ||
<p> | <p> | ||
− | '''Before Cleaning'''<br> | + | <big>'''Before Cleaning'''<br></big> |
− | Our data consists of 14 columns | + | Our data consists of 14 columns as described below: |
{| class="wikitable" | {| class="wikitable" | ||
|+Column Explanations | |+Column Explanations | ||
Line 78: | Line 78: | ||
|Encoded Message ID | |Encoded Message ID | ||
|} | |} | ||
+ | |||
+ | {| class="wikitable" | ||
+ | |+Data Statistics | ||
+ | |- | ||
+ | |'''Number of rows''' | ||
+ | |121,154 | ||
+ | |- | ||
+ | |'''Date Range''' | ||
+ | |11/26/2016 8:00 am to 02/01/2017 12:00 am | ||
+ | |} | ||
+ | <br> | ||
+ | <big>'''After Cleaning'''<br></big> | ||
+ | For our analysis, we only need emails sent internally, that is, from one employee in the company to another. The data also contains instant messaging records, which we do not use in our analysis. We therefore filtered out rows with the following values:<br> | ||
+ | * Originator: Inbound and Outbound (emails received from or sent to external email addresses) | ||
+ | * Subject: im (instant messaging) | ||
+ | We also found several system email addresses that could skew the data because of mass emails. We therefore removed all emails to and from these addresses. Some of these addresses are listed below, with the number of times each occurs in the data set. | ||
+ | |||
+ | {| class="wikitable" | ||
+ | |+System email addresses (number of occurrences) | ||
+ | |- | ||
+ | |analytics@trustsphere.com (134) | ||
+ | |heartbeat@trustsphere.com (1658) | ||
+ | |- | ||
+ | |jira@trustsphere.com (1386) | ||
+ | |sfdc@trustsphere.com (899) | ||
+ | |- | ||
+ | |trustsphere.office@trustsphere.com (15) | ||
+ | |tv.reports@trustsphere.com (1394) | ||
+ | |- | ||
+ | |marketing.team@trustsphere.com (322) | ||
+ | |trustvault.selfservice@trustsphere.com (95) | ||
+ | |} | ||
+ | {| class="wikitable" | ||
+ | |+Data Statistics | ||
+ | |- | ||
+ | |'''Number of rows after removing instant messages and inbound/outbound emails''' | ||
+ | |45,855 | ||
+ | |- | ||
+ | |'''Number of rows after removing system email addresses''' | ||
+ | |29,797 | ||
+ | |} | ||
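The row filters described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the project: the field names `sender` and `recipient`, the `Internal` originator value, and the sample rows are hypothetical, and the address set contains only the system addresses listed in the table above.

```python
# Illustrative sketch only. "sender", "recipient" and the "Internal" value
# are assumed names, not taken from the actual data set.
SYSTEM_ADDRESSES = {
    "analytics@trustsphere.com",
    "heartbeat@trustsphere.com",
    "jira@trustsphere.com",
    "sfdc@trustsphere.com",
    "trustsphere.office@trustsphere.com",
    "tv.reports@trustsphere.com",
    "marketing.team@trustsphere.com",
    "trustvault.selfservice@trustsphere.com",
}


def is_internal_email(row):
    """True if the row is neither inbound/outbound nor an instant message."""
    originator = row["Originator"].strip().lower()
    subject = row["Subject"].strip().lower()
    return originator not in {"inbound", "outbound"} and subject != "im"


def involves_system_address(row):
    """True if either end of the exchange is a listed system address."""
    return (row["sender"].strip().lower() in SYSTEM_ADDRESSES
            or row["recipient"].strip().lower() in SYSTEM_ADDRESSES)


def clean_rows(rows):
    """Apply both filters and return the remaining rows."""
    return [row for row in rows
            if is_internal_email(row) and not involves_system_address(row)]


# Hypothetical sample rows, one per case.
sample_rows = [
    {"Originator": "Internal", "Subject": "em",
     "sender": "alice@trustsphere.com", "recipient": "bob@trustsphere.com"},
    {"Originator": "Inbound", "Subject": "em",
     "sender": "client@example.com", "recipient": "bob@trustsphere.com"},
    {"Originator": "Internal", "Subject": "im",
     "sender": "alice@trustsphere.com", "recipient": "bob@trustsphere.com"},
    {"Originator": "Internal", "Subject": "em",
     "sender": "jira@trustsphere.com", "recipient": "alice@trustsphere.com"},
]

cleaned = clean_rows(sample_rows)
print(len(cleaned))
```

Only the first sample row survives: the second is inbound, the third is an instant message, and the fourth comes from a system address.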
+ | |||
+ | Further, we removed columns that we do not need in our analysis. These include: | ||
+ | * Remote IP: Not needed, because we only use internal email communication. | ||
+ | * Remote Domain: Since we only have internal data, this is always TrustSphere's domain, so keeping it would be redundant. | ||
+ | * Local Domain: Since we only have internal data, this is always TrustSphere's domain, so keeping it would be redundant. | ||
+ | * Direction: This is always within TrustSphere. | ||
+ | * Inbound Count: At this stage, we are focusing on relationship analysis (between two employees) rather than individual analysis within a network. We may need this column in the second half of our project. | ||
+ | * Outbound Count: At this stage, we are focusing on relationship analysis (between two employees) rather than individual analysis within a network. We may need this column in the second half of our project. | ||
+ | * Subject: Since we only use email data, this always has the value 'em', so keeping it would be redundant. | ||
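As a rough sketch of the column removal listed above, each record can be reduced to the columns that are kept. The column names are the ones named in the list; the dictionary-per-row representation and the example values are assumptions made for illustration.

```python
# Illustrative sketch only. Rows are assumed to be dictionaries keyed by
# column name; the example values below are made up.
DROPPED_COLUMNS = {
    "Remote IP",
    "Remote Domain",
    "Local Domain",
    "Direction",
    "Inbound Count",
    "Outbound Count",
    "Subject",
}


def drop_unused_columns(row):
    """Return a copy of the row without the columns listed above."""
    return {name: value for name, value in row.items()
            if name not in DROPPED_COLUMNS}


example_row = {
    "Originator": "Internal",
    "Subject": "em",
    "Direction": "Internal",
    "Remote IP": "10.0.0.1",
    "Remote Domain": "trustsphere.com",
    "Local Domain": "trustsphere.com",
    "Inbound Count": 3,
    "Outbound Count": 5,
}

trimmed = drop_unused_columns(example_row)
print(sorted(trimmed))
```

Keeping the dropped names in one set makes it easy to restore Inbound Count and Outbound Count later if the second half of the project needs them.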
<ul> | <ul> |
Revision as of 12:56, 23 February 2017
Preliminary analysis