ISSS608 2017-18 T3 Assign Yang Zhengyan Data Preparation
Contents
Data Description
Data field
Sample Data:
id,value,location,sample date,measure
2221,2,Boonsri,11-Jan-98,Water temperature
2223,9.1,Boonsri,11-Jan-98,Dissolved oxygen
2227,0.33,Boonsri,11-Jan-98,Ammonium
2228,0.01,Boonsri,11-Jan-98,Nitrites
2229,1.47,Boonsri,11-Jan-98,Nitrates
2230,0.06,Boonsri,11-Jan-98,Orthophosphate-phosphorus
2231,0.09,Boonsri,11-Jan-98,Total phosphorus
2232,13.9,Boonsri,11-Jan-98,Sodium
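As a minimal sketch of how these rows could be read outside the original tool, the Python below parses the sample with the standard library only. The function name parse_readings and the key names of the resulting dicts are assumptions made for illustration; the date format string is inferred from values such as 11-Jan-98.

```python
import csv
import io
from datetime import datetime

# The eight sample rows shown above, embedded so the sketch runs on its own.
SAMPLE = """id,value,location,sample date,measure
2221,2,Boonsri,11-Jan-98,Water temperature
2223,9.1,Boonsri,11-Jan-98,Dissolved oxygen
2227,0.33,Boonsri,11-Jan-98,Ammonium
2228,0.01,Boonsri,11-Jan-98,Nitrites
2229,1.47,Boonsri,11-Jan-98,Nitrates
2230,0.06,Boonsri,11-Jan-98,Orthophosphate-phosphorus
2231,0.09,Boonsri,11-Jan-98,Total phosphorus
2232,13.9,Boonsri,11-Jan-98,Sodium
"""


def parse_readings(text):
    """Turn CSV text into a list of dicts with typed fields (hypothetical helper)."""
    readings = []
    for row in csv.DictReader(io.StringIO(text)):
        readings.append({
            "id": int(row["id"]),
            "value": float(row["value"]),
            "location": row["location"],
            # Dates look like 11-Jan-98: day, abbreviated month, two-digit year.
            "date": datetime.strptime(row["sample date"], "%d-%b-%y").date(),
            "measure": row["measure"],
        })
    return readings


readings = parse_readings(SAMPLE)
```

Each record is one reading of one measure at one location on one date, so the data is in long format and would need to be pivoted before measures can be compared side by side.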
Data tools
Variable distribution
The dataset contains 10 locations and 106 measures, so it is very difficult to include every measure in the analysis. We also need to characterize the past and most recent situation with respect to chemical contamination in the Boonsong Lekagul waterways. I therefore first keep only the readings from recent years (2011-2016) and exclude the remaining data.
The following screenshots show the variables that remain in the dataset.
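The year filter can be sketched in a few lines of Python. This is only an illustration of the step, not the tool used for the assignment: the function name filter_years is hypothetical, and apart from the first row (taken from the sample data) the readings and the location name "Site B" are made up.

```python
from datetime import date

# Illustrative readings. Only the first row comes from the sample data;
# the other rows and the location "Site B" are hypothetical.
readings = [
    {"location": "Boonsri", "measure": "Nitrates", "date": date(1998, 1, 11), "value": 1.47},
    {"location": "Boonsri", "measure": "Nitrates", "date": date(2011, 3, 5), "value": 1.20},
    {"location": "Site B", "measure": "Sodium", "date": date(2016, 12, 30), "value": 14.0},
    {"location": "Site B", "measure": "Sodium", "date": date(2010, 12, 31), "value": 13.5},
]


def filter_years(rows, first_year=2011, last_year=2016):
    """Keep only rows whose sample year falls in [first_year, last_year]."""
    return [r for r in rows if first_year <= r["date"].year <= last_year]


recent = filter_years(readings)

# Count what remains after filtering.
n_locations = len({r["location"] for r in recent})
n_measures = len({r["measure"] for r in recent})
```

Counting distinct locations and measures after the filter is a quick way to see which variables remain for analysis.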
Variable clustering
Initial data exploration shows that some measures have similar trends and values, e.g. Orthophosphate-phosphorus, Total dissolved phosphorus, and Total phosphorus. We therefore need to exclude such similar measures and keep one representative measure from each group for further investigation.
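One simple way to quantify "similar trend" is the Pearson correlation between two measures sampled on the same dates. The sketch below is an assumption about how such a check could be done, not the method used in this assignment, and every value in the three series is made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical readings on five shared sample dates (not real data).
total_p = [0.09, 0.12, 0.08, 0.15, 0.11]
ortho_p = [0.06, 0.08, 0.05, 0.10, 0.07]
sodium = [13.9, 12.1, 14.4, 13.0, 15.2]

r_similar = pearson(total_p, ortho_p)   # close to 1: the two series move together
r_different = pearson(total_p, sodium)  # far from 1: no shared trend
```

A pair with a correlation near 1 is a candidate for keeping only one of the two measures; the cutoff to use is a judgment call.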
First, hierarchical clustering can handle this categorical data, so I group the records into 10 clusters and check the scatterplot matrix of the variables. To compare mean values across all measures, we can check the pairwise comparisons with the Tukey-Kramer HSD test.
Then we can see the differences between the measures and exclude those whose reading values follow a very similar pattern.
Finally, we choose 19 measures to explore for insights.
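The grouping idea can be illustrated with a small pure-Python sketch of average-linkage agglomerative clustering, using 1 - |r| (one minus the absolute Pearson correlation) as the distance between measures. This is an assumed stand-in for the clustering done in the analysis tool, it does not cover the Tukey-Kramer comparison, and all series values are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def cluster_measures(series, n_clusters):
    """Average-linkage agglomerative clustering with distance 1 - |r|."""
    names = list(series)
    # Pairwise distance between measures: small when trends are similar.
    dist = {
        frozenset((a, b)): 1 - abs(pearson(series[a], series[b]))
        for a, b in combinations(names, 2)
    }

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        # Merge the two closest clusters (i < j, so deleting j is safe).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical readings on five shared sample dates (not real data).
series = {
    "Total phosphorus": [0.09, 0.12, 0.08, 0.15, 0.11],
    "Orthophosphate-phosphorus": [0.06, 0.08, 0.05, 0.10, 0.07],
    "Total dissolved phosphorus": [0.07, 0.10, 0.06, 0.12, 0.09],
    "Sodium": [13.9, 12.1, 14.4, 13.0, 15.2],
}

clusters = cluster_measures(series, n_clusters=2)
# Keep the first measure of each cluster as its representative.
representatives = [cluster[0] for cluster in clusters]
```

With these made-up values the three phosphorus measures fall into one cluster and Sodium stays on its own, so one phosphorus measure would be kept as the representative.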