ISSS608 2016-17 T1 Assign3 Thian Fong Mei Data Preparation & Investigation Overview
|
|
|
|
|
|
Data preparation is done using JMP. The comms datasets are treated as the main datasets, with the movement datasets to augment to find the locations.
In the comms datasets, the following was done/observed:
- Add Day Column
- Rename From column to Sender column
- Rename To column to Receiver column
- No other missing data observed
- For external parties, only comms are received, never sent
- The Sender and Receiver columns are recoded as numerical, and the value for external receiver is recoded as 1. That is, comms sent to ID 1 are addressed to external parties
In the movement datasets, the following was done/observed:
- No missing data for Fri, Sat data
- There are 2 records with missing data. As the IDs, X and Y values are missing. It would not be meaningful to keep them and thus, are deleted
The Ride Details data set (from VAST) is used later to provide the meaning of X and Y to actual locations within DinoFun World. A new column is also created to register the location areas (e.g Wet Land etc) under which the DinoFun World amenity belong to.
Recognizing that movement records are much more than comms records, it would not be possible to analyse aggregated movement records meaningfully without crashing the JMP application. Movement data is useful in providing the x and y cooordinates.
The datasets are processed using the following manner, day by day, starting with Friday.
- The comms data is concatenated with the movement data with the column of the id changed to Sender.
- A new table results. This new table is sorted by ascending timestamp, followed by ascending Sender. This would sort the concatenated data by time, and by Sender.
- The assumption is the visitor movements are gradual changes with time, and the missing X,Y information from the comms data could be deduced from the movement data in the preceding time.
- Hence, 3 new columns, Sender_Movement, Sender_X, Sender_Y are created, with the following formulas applied respectively.
- If(Is Missing(:type), Lag(:Sender_Movement, 1), :type)
- If(Is Missing(:X), Lag(:Sender_X, 1), :X)
- If(Is Missing(:Y), Lag(:Sender_Y, 1), :Y)
- Go to Tables -> Missing Data Pattern to filter and select back the original comms records. Save this table
- Use this table as the baseline to gather the Receiver information and to repeat steps of concatentating with the movement data, but this time, with the column of id changed to Receiver.
- Repeat similar steps as above. Again this data is sorted, and 3 new columns created, of Receiver_Movement, Receiver_X, Receiver_Y, and formulas applied and modified accordingly.
- Go to Tables -> Missing Data Pattern to filter and select back the original comms records. Save this table.
Now we have all the Senders and Receivers X and Y information and movements for Friday.
- This table is joined with the Ride Details data set, using matching columns of X and Y, to create a new table which includes DinoFun World locations in accordance to its X and Y.
- The Friday data set is now structured.
- Similar steps are followed for the Saturday and Sunday datasets.
- When completed, the Friday, Saturday and Sunday comms datasets are checked that they still have the same records as the original, apart from having more columns, and they are then concatenated together.