ISSS608 2016-17 T1 Assign3 Lim Hui Ting Jaclyn - Data Cleaning

From Visual Analytics and Applications

Data Cleaning

Communication Data

The communication data had to be transformed so that it could be read by Gephi for basic network visualisations. I created two main sets of communication data. The first aggregated the communication data over three days, counting the number of times a source node interacted with a target node on the same day. The second covered the communication data from 11am to 1.30pm on Sunday, the time period in which the most communication took place in the application.

Aggregated 3-Days Communication Data

To create the edges csv file:

  1. Concatenate communication data over three days into one JMP file.
  2. Change column data type of "From" and "To" from continuous to nominal.
  3. Rename columns "From" and "To" to "Source" and "Target".
  4. Create a column "Type" and ensure that all the fields of the column are filled with "Directed" - This helps to specify the direction of the communication.
  5. Create a new column with the formula "Source || Target", as this concatenates the Source and Target column.
  6. Use the tabulate function to count the number of times the concatenated cell appears in the column - this helps to find out the weight of the unique direction of communication.
  7. Update the table with the weight values and rename the column to "Weight".
  8. To remove duplicates, change the concatenated column's data type to "continuous", sort the rows by the concatenated IDs, and compute the row difference. Rows with a row difference of "0" are the duplicates present in the data; eliminate those rows.
  9. For concatenated cells containing "external", recode "external" to an arbitrary numeric value (i.e. 9999999999), repeat the step above to remove the duplicates, then recode the value back to "external".
  10. Since the "Location" and "Timestamp" columns are redundant in this set, they can be hidden.
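The weighting and de-duplication steps above (5-9) can be sketched in Python for illustration; this is not the JMP workflow itself, and the IDs used here are made-up examples. Counting occurrences of each (Source, Target) pair collapses duplicate rows into a single weighted, directed edge in one pass:

```python
from collections import Counter

def build_weighted_edges(records):
    """records: iterable of (source, target) pairs, one per message.
    Returns (Source, Target, Type, Weight) rows, with duplicate
    directed pairs collapsed into a single edge whose Weight is
    the number of times that pair appeared."""
    counts = Counter(records)
    return [(src, tgt, "Directed", w)
            for (src, tgt), w in sorted(counts.items())]

# Example: three messages, two of them between the same pair of IDs
edges = build_weighted_edges([
    ("1001", "2002"),
    ("1001", "2002"),
    ("external", "1001"),
])
# edges -> [("1001", "2002", "Directed", 2), ("external", "1001", "Directed", 1)]
```

Because the counting is keyed on the ordered pair, "external" needs no special recoding here; the recode step in JMP is only required because the row-difference trick there relies on a numeric sort.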

To create the nodes csv file:

  1. Use the tabulate function in JMP to count the number of times each ID appears in the "Source" and "Target" columns. Merge the files together and tabulate again to find the unique nodes present in the edges csv file.
  2. Name the column "id", so that Gephi can detect the ids of the nodes.
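As a sketch of the same node-list logic outside JMP (assumed example IDs), the unique node set is simply every ID that appears as either a Source or a Target:

```python
def build_node_list(edges):
    """edges: iterable of (source, target) pairs from the edges file.
    Returns the sorted unique IDs for the nodes csv's "id" column."""
    ids = set()
    for src, tgt in edges:
        ids.add(src)
        ids.add(tgt)
    return sorted(ids)

nodes = build_node_list([("1001", "2002"), ("external", "1001")])
# nodes -> ["1001", "2002", "external"]
```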

Communication Data from 11am to 1.30pm on Sunday

To create the edges csv file:

  1. Carry out Steps 2-5 above for the communication data on Sunday.
  2. Retain only the communication data between 11am and 1.30pm.
  3. For the Timestamp column, use the timestamp format mm/dd/yyyy hh:mm.
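The time-window filter in step 2 can be sketched as follows; the timestamps are made-up examples in the mm/dd/yyyy hh:mm format mentioned above, and I assume the window is inclusive of both endpoints:

```python
from datetime import datetime, time

def in_window(timestamp, start=time(11, 0), end=time(13, 30)):
    """Keep a record only if its clock time falls between
    11:00 and 13:30, inclusive."""
    t = datetime.strptime(timestamp, "%m/%d/%Y %H:%M").time()
    return start <= t <= end

# Boundary check with example timestamps: 10:59 and 13:31 are dropped
kept = [ts for ts in ["06/08/2014 10:59", "06/08/2014 11:00",
                      "06/08/2014 13:30", "06/08/2014 13:31"]
        if in_window(ts)]
```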

To create the nodes csv file:

  1. Use the tabulate function in JMP to count the number of times each ID appears in the "Source" and "Target" columns. Merge the files together and tabulate again to find the unique nodes present in the edges csv file.
  2. Name the column "id", so that Gephi can detect the ids of the nodes.

Movement Data

Although the movement data was provided, I did not focus as much on cleaning this dataset, since I planned to use it only at later stages of the analysis. In the end, I used the movement data for both Friday and Sunday. The steps are as follows:

  • In the movement data, change the Timestamp column to continuous, with the format dd/mm/yyyy hh:mm:ss.
  • Change data type of "ID" column to nominal.
  • Create an additional data table that maps specific X & Y co-ordinates to their "name", "type" and "zone" values.

By taking these steps, I could easily visualise the movements taken by each visitor around the park.
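The lookup-table join in the last step can be sketched in Python; the IDs, co-ordinates and zone names here are assumed examples, not values from the actual dataset:

```python
def tag_movements(movements, zone_lookup):
    """movements: list of (visitor_id, x, y) rows.
    zone_lookup: dict mapping (x, y) co-ordinates to a zone label.
    Returns each movement row with its zone attached, so movements
    can be read as 'visitor N was in zone Z'."""
    return [(vid, x, y, zone_lookup.get((x, y), "unknown"))
            for vid, x, y in movements]

tagged = tag_movements([("1001", 63, 99)], {(63, 99): "entrance"})
# tagged -> [("1001", 63, 99, "entrance")]
```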