ISSS608 2016-17 T1 Assign3 Arcchit Mittal Data Preparation
DATASETS
There are two data sets:
- Communication Data : There is a separate file for each day from Friday to Sunday, three in total. The files record the communication between people in the theme park and with external parties, along with the location each message was sent from and its timestamp.
- Park Movement Data : The movement data is divided by day into three files. Each file records the check-ins and movements of people in the park, with coordinates and timestamps.
PARK MOVEMENT DATA
- Some data is missing from the Sunday file. Two rows were found to be erroneous and were removed.
- Two additional columns were created: the timestamp column was split into a date column and a time-of-day column using the new column formula option on the timestamp column.
- A new column, range, was added to hold the time spent by each id in the park. First, a summary data table was created by taking the range of the timestamp grouped by id, which gives one row per distinct id with the time that id spent in the park. This summary table was then used to update the main table.
- Theme park details were also added to the main table. The new columns are Ride No, Name of Ride, X, Y, Category and Sub Category.
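For readers who want to reproduce the timestamp split and the range step outside the original tool, the following is a minimal sketch in Python with pandas. It is not the method used in the assignment, which relied on a column formula and a summary table. The column names (Timestamp, id, X, Y, Date, Time, TimeSpent) and the sample rows are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are assumed,
# not taken from the real movement files.
movement = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2014-06-08 08:00:00",
        "2014-06-08 09:30:00",
        "2014-06-08 08:15:00",
        "2014-06-08 08:45:00",
    ]),
    "id": [1, 1, 2, 2],
    "X": [63, 50, 0, 10],
    "Y": [99, 57, 67, 70],
})

# Split the timestamp into a date column and a time-of-day column.
movement["Date"] = movement["Timestamp"].dt.date
movement["Time"] = movement["Timestamp"].dt.time

# Summary table: range of the timestamp (latest minus earliest) per id.
grouped = movement.groupby("id")["Timestamp"]
time_spent = (grouped.max() - grouped.min()).rename("TimeSpent").reset_index()

# Bring the summary back into the main table, one value per id.
movement = movement.merge(time_spent, on="id", how="left")
```

After the merge, every row of an id carries the same TimeSpent value, which mirrors updating the main table from the summary table.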
COMMUNICATION DATA
- In all three files, the external value in the to column was replaced with a numerical value (00) so that it can act as an ID.
- The to column was then changed to a numerical, continuous type.
- Lastly, the three files were concatenated into one file.
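The three communication steps above can be sketched in Python with pandas as follows. This is an illustrative equivalent, not the original workflow. The column names (from, to), the helper name prepare, and the sample rows are hypothetical. Note that once the column is numeric, the replacement value 00 is stored as 0.

```python
import pandas as pd

# Hypothetical stand-ins for the Friday, Saturday and Sunday files.
fri = pd.DataFrame({"from": [101, 102], "to": ["102", "external"]})
sat = pd.DataFrame({"from": [103], "to": ["external"]})
sun = pd.DataFrame({"from": [104, 105], "to": ["101", "103"]})

def prepare(df):
    """Replace 'external' with a numeric ID and make the to column numeric."""
    out = df.copy()
    out["to"] = out["to"].replace("external", "0")  # "00" becomes 0 once numeric
    out["to"] = pd.to_numeric(out["to"])
    return out

# Concatenate the three prepared tables into one.
comm = pd.concat([prepare(d) for d in (fri, sat, sun)], ignore_index=True)
```

Using 0 as the ID assumes that no real person in the data has that ID, which should be checked against the actual files.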