Difference between revisions of "Fu Yi - Data Preparation"
Revision as of 14:15, 8 July 2018
Data Preparation Question 1
a) Add titles
Open the 4 large tables (calls, emails, purchases, meetings) in Excel and add a title to each column (source, eType, target, time) in each of the 4 tables.
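For readers who prefer a script to Excel, the same step can be sketched in Python. This is only an illustration, not the method used above: it assumes the raw files are comma-separated with no header row and that the columns appear in the order source, eType, target, time. The helper name `add_header` and the sample rows are made up.

```python
import csv
import io

# Column titles taken from the text; the column order is an assumption.
COLUMNS = ["source", "eType", "target", "time"]

def add_header(src, dst, columns=COLUMNS):
    """Copy headerless CSV rows from src to dst, writing a header row first.

    Returns the number of data rows copied.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(columns)
    rows = 0
    for row in reader:
        # Fail loudly if a row does not match the expected four columns.
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}: {row}")
        writer.writerow(row)
        rows += 1
    return rows

# Tiny in-memory example with made-up rows (not real challenge data).
raw = io.StringIO("101,0,202,30\n303,0,404,95\n")
out = io.StringIO()
count = add_header(raw, out)
```

With tables of this size, streaming row by row as above avoids loading a whole file into memory.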
b) Change date
Import the tables into JMP. The real time should start from 11/05/2015 (11 May 2015), 14:00, so I created 2 new columns holding 11/05/2015 and 14:00 respectively, then combined the old time, the date, and the time of day to get the correct date.
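The date correction amounts to adding each old time value to the fixed start point. A minimal Python sketch of that arithmetic follows; it assumes the old time column holds seconds elapsed since the start, which the text does not state, and the function name `to_real_time` is hypothetical.

```python
from datetime import datetime, timedelta

# Start point from the text: 11/05/2015 (11 May 2015), 14:00.
START = datetime(2015, 5, 11, 14, 0)

def to_real_time(offset_seconds, start=START):
    """Convert an elapsed-time value into a calendar timestamp.

    Assumption: offset_seconds is the old time value, measured in seconds
    from the start point.
    """
    return start + timedelta(seconds=int(offset_seconds))

# One hour after the start point.
example = to_real_time(3600)
```

If the old time values use a different unit, only the `timedelta` argument needs to change.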
c) Remove duplicates
Check the summary of each table and eliminate any duplicate rows.
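Outside JMP, a duplicate check could look like the sketch below. It treats a row as a duplicate only when all four fields match an earlier row, which is an assumption about what counts as a duplicate here; the function name and sample rows are made up.

```python
def drop_duplicates(rows):
    """Return rows with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # all fields (source, eType, target, time) must match
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up rows: the third repeats the first.
sample = [
    ["101", "0", "202", "30"],
    ["303", "0", "404", "95"],
    ["101", "0", "202", "30"],
]
deduped = drop_duplicates(sample)
removed = len(sample) - len(deduped)
```

Comparing the row count before and after gives the number of duplicates removed.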
d) Clear out incomplete months
The data starts in May 2015, but the first 2 months are incomplete. I deleted the data for those 2 months (May and June 2015) so that the dataset contains only complete months.
The final 4 tables are described below:
- Calls table: 10,091,409 rows
- Emails table: 13,846,639 rows
- Purchase table: 723,586 rows
- Meetings table: 127,110 rows
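Dropping May and June 2015 is equivalent to keeping only rows dated 1 July 2015 or later. A minimal sketch of that filter, assuming each row is a dict whose "time" field already holds the corrected timestamp (the row layout and function name are illustrative only):

```python
from datetime import datetime

# First day kept: everything in May and June 2015 falls before this.
CUTOFF = datetime(2015, 7, 1)

def drop_incomplete_months(rows, cutoff=CUTOFF):
    """Keep only rows whose corrected timestamp is on or after the cutoff."""
    return [row for row in rows if row["time"] >= cutoff]

# Made-up rows spanning the cutoff.
sample = [
    {"source": "101", "target": "202", "time": datetime(2015, 5, 11, 14, 0)},
    {"source": "303", "target": "404", "time": datetime(2015, 6, 30, 23, 59)},
    {"source": "505", "target": "606", "time": datetime(2015, 7, 1, 0, 0)},
]
kept = drop_incomplete_months(sample)
```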