ISSS608 Visual Analytics VAST Challenge Mini-Challenge 2

Data overview

The data used for the analysis are listed below. They were obtained from VAST Challenge 2018: Mini-Challenge 2.

Data description

File name	Description	Remarks
1. Boonsong Lekagul waterways readings.csv	Water sensor readings from 10 locations in the preserve. The original data contains 5 columns: "id" (identifier of each record); "value" (the value of the corresponding measure); "location" (the location where the record was taken); "sample date" (the day the sample was taken); "measure" (the measure name of each record).	There are no missing values in this dataset, but a large number of records have a value of zero; records were taken from 11 Jan 1998 to 31 Dec 2016; 106 different measures in total; 10 locations in total; every row represents a unique record.
2. chemical units of measure.csv	This table contains two columns: the measure name and the unit corresponding to that measure.	Each measure has a corresponding unit, except for Macrozoobenthos.
3. Waterways Final.jpg	This picture is a map of the Boonsong Lekagul waterways.	Shows where the water sensors are located and an approximate dumping point.

Tools

R, Excel, Tableau

Data understanding

After the two tables are integrated, each record can be read in the format below:

On the same day, a chemical may be measured multiple times, at most three times per day in the data examined. One possible reason is that the sensor was triggered up to three times in one day (the ids are different) at different times of day; since timestamp information is not provided, this assumption cannot be verified. To keep comparisons fair, the mean value is taken instead of the sum of the values when grouping the data.
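The grouping rule can be illustrated with a small Python sketch. It is not the author's script: the rows are made up, the dates are written as YYYY-MM-DD for convenience, and grouping by location, sample date and measure is an assumption about what "grouping" covers.

```python
from collections import defaultdict
from statistics import mean

# Made-up readings for illustration only:
# (id, value, location, sample date, measure). "Iron" is a placeholder name.
readings = [
    (1, 2.0, "Achara", "2016-01-05", "Iron"),
    (2, 4.0, "Achara", "2016-01-05", "Iron"),
    (3, 6.0, "Achara", "2016-01-05", "Iron"),
    (4, 1.5, "Decha", "2016-01-05", "Iron"),
]


def daily_mean(rows):
    """Average repeated readings of one measure at one location on one day."""
    groups = defaultdict(list)
    for _id, value, location, sample_date, measure in rows:
        groups[(location, sample_date, measure)].append(value)
    # Using the mean keeps a day with three readings comparable
    # to a day with one reading; a sum would inflate it.
    return {key: mean(values) for key, values in groups.items()}


daily = daily_mean(readings)
print(daily[("Achara", "2016-01-05", "Iron")])
```

With a sum, the three Achara readings would count three times as much as the single Decha reading, which is the distortion the mean avoids.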

Data preparation

Step 1

1.1 Filtered the data to the years 2010 to 2016, for the reasons shown below.
1.2 In Tableau, created a new calculated field named "measure count" with the formula "COUNTD([Measures])". The value of this field reveals the number of distinct measures that appear in each year.

1.3 Dragged the calculated field (measure count) onto the Rows shelf, sample date onto the Columns shelf, and location onto the Color card as well as the Detail card.

From the graph above, note that the sensors in Achara, Decha and Tansanee started to work in 2009. In 2009, all locations show an abnormally high starting point; in other words, the number of chemical types measured at each location is higher than normal. Such an anomaly could be caused by storm water [1] or urban flooding [2], and this needs further investigation.
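The "measure count" field can also be reproduced outside Tableau. The following Python sketch counts distinct measures per year and location, which is what a distinct count over measures amounts to; the rows, the measure names and the YYYY-MM-DD date format are made up for illustration.

```python
from collections import defaultdict

# Made-up rows for illustration only: (location, sample date, measure).
rows = [
    ("Achara", "2009-03-01", "Iron"),
    ("Achara", "2009-06-01", "Zinc"),
    ("Achara", "2009-09-01", "Iron"),
    ("Achara", "2010-03-01", "Iron"),
    ("Decha", "2009-03-01", "Iron"),
]


def measure_count(rows):
    """Count distinct measures per (year, location)."""
    seen = defaultdict(set)
    for location, sample_date, measure in rows:
        year = int(sample_date[:4])  # assumes dates start with a 4-digit year
        seen[(year, location)].add(measure)
    return {key: len(measures) for key, measures in seen.items()}


counts = measure_count(rows)
for (year, location), n in sorted(counts.items()):
    print(year, location, n)
```

A location whose count for one year is well above its counts for other years would show up as the kind of high starting point described for 2009.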

Step 2

A large number of zero values were found in the collected data. One assumption is that the corresponding chemical was measured at that specific sensing but none of it was found. Hence, zeros should be removed.
One example:
Put location on the Rows shelf and sample date on the Columns shelf, expand sample date to month, drag value onto the Label card, drag year onto the filter and choose 2016, then drag measure onto the filter and choose gamma-Hexachlorocy.
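The same check can be sketched in Python as the share of zero readings per location and month for one measure in one year. The rows below are made up, the date format is assumed to be YYYY-MM-DD, and the measure name is copied as it appears in the Tableau filter above.

```python
from collections import Counter

# Made-up rows for illustration only: (location, sample date, measure, value).
rows = [
    ("Achara", "2016-01-10", "gamma-Hexachlorocy", 0.0),
    ("Achara", "2016-01-20", "gamma-Hexachlorocy", 0.02),
    ("Achara", "2016-02-10", "gamma-Hexachlorocy", 0.0),
    ("Decha", "2016-01-10", "gamma-Hexachlorocy", 0.01),
    ("Decha", "2015-01-10", "gamma-Hexachlorocy", 0.0),
]


def zero_share(rows, measure, year):
    """Fraction of zero readings per (location, month) for one measure and year."""
    total = Counter()
    zeros = Counter()
    for location, sample_date, row_measure, value in rows:
        if row_measure != measure or int(sample_date[:4]) != year:
            continue
        key = (location, int(sample_date[5:7]))
        total[key] += 1
        if value == 0:
            zeros[key] += 1
    # Counter returns 0 for keys that never saw a zero reading.
    return {key: zeros[key] / total[key] for key in total}


shares = zero_share(rows, "gamma-Hexachlorocy", 2016)
for key in sorted(shares):
    print(key, shares[key])
```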

Step 3: Using R to prepare the data

3.1 Comparisons should be consistent across the 10 locations; hence, data from 1998 to 2008 are filtered out.
3.2 An anomalous pattern that causes a sudden, simultaneous increase across the 10 locations should be ignored; hence, year 2009 is filtered out.
3.3 Zeros are meaningless in the data set and should be removed.
(The related R script will be uploaded to LMS.)
With the prepared data, fair comparisons can be performed across the preserve.
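Since the R script itself is not included here, the three filters of Step 3 can be sketched in Python as follows. This is an illustration under assumptions, not the author's script: the rows are made up, the dictionary keys follow the column names in the data description, and dates are assumed to be YYYY-MM-DD strings.

```python
# Made-up records for illustration only; keys follow the CSV column names.
records = [
    {"id": 1, "value": 0.5, "location": "Achara", "sample date": "2008-05-01", "measure": "Iron"},
    {"id": 2, "value": 0.7, "location": "Achara", "sample date": "2009-05-01", "measure": "Iron"},
    {"id": 3, "value": 0.0, "location": "Achara", "sample date": "2012-05-01", "measure": "Iron"},
    {"id": 4, "value": 0.9, "location": "Achara", "sample date": "2012-06-01", "measure": "Iron"},
    {"id": 5, "value": 1.1, "location": "Decha", "sample date": "2016-12-31", "measure": "Iron"},
]


def prepare(rows, first_year=2010, last_year=2016):
    """Keep records from first_year to last_year whose value is not zero."""
    kept = []
    for row in rows:
        year = int(row["sample date"][:4])
        # Steps 3.1 and 3.2: drop 1998 to 2008 and the anomalous year 2009.
        if not first_year <= year <= last_year:
            continue
        # Step 3.3: drop zero readings.
        if row["value"] == 0:
            continue
        kept.append(row)
    return kept


prepared = prepare(records)
print([row["id"] for row in prepared])
```

Keeping the year bounds as parameters makes it easy to rerun the preparation with 2009 included and compare the results.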

ISS608 2017-18 T3 G2 Assign ChenYanchong Data Prep
