Difference between revisions of "ISSS608 2017-18 T3 Assign Gao Jiaoyang Data Preparation"

 

Latest revision as of 22:08, 17 July 2018

VAST Challenge 2018 MC2: Like a Duck to Water

Background

Data Description & Data Preparation

Visualization

Conclusion & Reference

 


Data Description

Data Source

The original data are provided on the challenge website. For this assignment I mainly use 2 documents and 1 image, as shown below.

Gjy data1.jpg

Data Overview

Boonsong Lekagul waterways readings.csv

This file contains all the readings for the Boonsong Lekagul waterways: 136,824 observations in 5 columns.

Column Overview
ID: Unique ID for each observation.
Value: Measured value of the chemical or property in this record.
Location: Name of the site where the sample was taken; there are 10 sites in total.
Sample date: Date the sample was taken at the location; readings were recorded on 1,965 days between 11/01/1998 and 31/12/2016.
Measure: Chemical or water property measured in the record; there are 106 different measures.
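The counts quoted in the column overview can be checked with a short script. The following is a minimal sketch in plain Python, not the JMP workflow used for this assignment. It assumes each reading is a dictionary keyed by the column names listed above (the header spelling in the actual CSV may differ), and the sample rows and site names are hypothetical.

```python
# Hypothetical sample rows; the keys mirror the column overview above.
rows = [
    {"ID": 1, "Value": 20.5, "Location": "Site A",
     "Sample date": "11/01/1998", "Measure": "Water temperature"},
    {"ID": 2, "Value": 0.5, "Location": "Site A",
     "Sample date": "11/01/1998", "Measure": "Iron"},
    {"ID": 3, "Value": 21.0, "Location": "Site B",
     "Sample date": "12/01/1998", "Measure": "Water temperature"},
]


def summarize(rows):
    """Count observations and distinct locations, sample days and measures."""
    return {
        "observations": len(rows),
        "locations": len({r["Location"] for r in rows}),
        "sample_days": len({r["Sample date"] for r in rows}),
        "measures": len({r["Measure"] for r in rows}),
    }


summary = summarize(rows)
print(summary)
```

For the real file, the rows could be read with `csv.DictReader` from the standard library and passed to the same function.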

chemical units of measure.csv

This file gives the unit for each of the 106 measures tested in the research. There are 4 kinds of unit: C for water temperature, blank for Macrozoobenthos, and mg/L or µg/L for the rest of the measures. These units may need to be unified before further analysis.

Measure Unit

Waterways Final.jpg

The map shows the geographic positions of the sampling locations and how they are connected to each other by rivers.

Waterways Final.jpg

Data Preparation

Data Standardization

The data are recorded in different units, which could distort the analysis results, so the units need to be made uniform before moving to the next stage. This can be done easily in JMP.

First, join the two tables by matching on the measure.

Then write a conditional function to convert the mg/L values to µg/L (1 mg/L = 1,000 µg/L).

Now the data include 3 kinds of units: C for water temperature, blank for Macrozoobenthos, and µg/L for the rest of the measures.
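The join and the conditional conversion were done in JMP. As a rough equivalent, the two steps can be sketched in plain Python. The sketch assumes the units file has been loaded into a measure-to-unit lookup; the measure names "Iron" and "Atrazine" and all values are illustrative only.

```python
# Hypothetical lookup built from the units file: measure -> unit.
units = {
    "Water temperature": "C",
    "Macrozoobenthos": "",
    "Iron": "mg/L",        # illustrative measure name
    "Atrazine": "µg/L",    # illustrative measure name
}

# Hypothetical readings (only the fields needed here).
readings = [
    {"Measure": "Iron", "Value": 0.5},
    {"Measure": "Atrazine", "Value": 12.0},
    {"Measure": "Water temperature", "Value": 20.0},
]


def standardize(readings, units):
    """Attach each reading's unit, then convert mg/L values to µg/L."""
    result = []
    for reading in readings:
        unit = units[reading["Measure"]]  # join on the measure name
        value = reading["Value"]
        if unit == "mg/L":                # conditional conversion
            value = value * 1000          # 1 mg/L = 1,000 µg/L
            unit = "µg/L"
        result.append({**reading, "Value": value, "Unit": unit})
    return result


standardized = standardize(readings, units)
for row in standardized:
    print(row)
```

Temperature and Macrozoobenthos readings pass through unchanged, so only three kinds of unit remain after this step.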

Remove Duplications

Looking at the raw data, I found multiple records for the same measure on the same date at the same location. To remove these duplicates, I decided to keep only the max id in each group and use the mean value as the final value. After this step, 67,503 observations remain.

GJY duplication.jpg
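The rule described above (one row per measure, date and location, keeping the max id and the mean value) can be sketched in plain Python as follows. This is an illustration under the same assumed column names, not the JMP steps actually used, and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: the first two duplicate the same measure, date and location.
rows = [
    {"ID": 10, "Value": 1.0, "Location": "Site A",
     "Sample date": "11/01/1998", "Measure": "Iron"},
    {"ID": 11, "Value": 3.0, "Location": "Site A",
     "Sample date": "11/01/1998", "Measure": "Iron"},
    {"ID": 12, "Value": 5.0, "Location": "Site B",
     "Sample date": "11/01/1998", "Measure": "Iron"},
]


def remove_duplicates(rows):
    """Collapse each (measure, date, location) group into a single row."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["Measure"], row["Sample date"], row["Location"])
        groups[key].append(row)

    result = []
    for (measure, date, location), group in groups.items():
        result.append({
            "ID": max(r["ID"] for r in group),         # keep the max id
            "Value": mean(r["Value"] for r in group),  # mean as final value
            "Location": location,
            "Sample date": date,
            "Measure": measure,
        })
    return result


deduplicated = remove_duplicates(rows)
for row in deduplicated:
    print(row)
```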

Missing Value

Missing values are easier to inspect after the measures are transformed into columns. This can be done easily in JMP. After splitting the raw dataset by measure, the dataset has 109 columns, 106 of which are measures, and 67,503 observations.

The column viewer shows a lot of missing values in this table, which means that the measures sampled and the sampling dates are irregular.

Missing.jpg
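The idea behind the split and the missing-value check can be sketched in plain Python. This is only an approximation of the JMP split: it assumes one wide row per location and sample date, which may not match the exact row layout JMP produced, and the measure name "Iron", the site names and the values are hypothetical.

```python
# Hypothetical long-format rows after deduplication.
rows = [
    {"Location": "Site A", "Sample date": "11/01/1998",
     "Measure": "Iron", "Value": 1.0},
    {"Location": "Site A", "Sample date": "11/01/1998",
     "Measure": "Water temperature", "Value": 20.0},
    {"Location": "Site B", "Sample date": "11/01/1998",
     "Measure": "Iron", "Value": 2.0},
    {"Location": "Site A", "Sample date": "12/01/1998",
     "Measure": "Water temperature", "Value": 21.0},
]


def split_by_measure(rows):
    """Turn each measure into a column; None marks a missing value."""
    measures = sorted({r["Measure"] for r in rows})
    wide = {}
    for r in rows:
        key = (r["Location"], r["Sample date"])
        # Start every new wide row with all measures set to None.
        wide.setdefault(key, dict.fromkeys(measures))
        wide[key][r["Measure"]] = r["Value"]
    return measures, wide


def count_missing(measures, wide):
    """Count the missing values in each measure column."""
    return {m: sum(1 for row in wide.values() if row[m] is None)
            for m in measures}


measures, wide = split_by_measure(rows)
missing = count_missing(measures, wide)
print(missing)
```

A measure that is sampled only at some sites or on some dates shows up here as a column with a high missing count, which is the pattern visible in the column viewer.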

Mapping Location

The locations on the waterways map need to be mapped to coordinates.


GJY XY.png

GJY map.jpg