Difference between revisions of "ISSS608 2017-18 T3 Assign Gao Jiaoyang Data Preparation"
Jygao.2017 (talk | contribs) |
Jygao.2017 (talk | contribs) |
||
Line 89: | Line 89: | ||
[[Image:Waterways Final.jpg|400px]] | [[Image:Waterways Final.jpg|400px]] | ||
− | + | ==<font size="3"><font color="#8B4513"> Data Preparation</font></font>== | |
− | + | ===<font size="2"><font color="#8B4513"> Data Standardization </font></font>=== | |
The data are recorded by different units, this may make analysis result deviate. So we need to uniform the units before moving to next stage. This progress could be easily done in JMP. | The data are recorded by different units, this may make analysis result deviate. So we need to uniform the units before moving to next stage. This progress could be easily done in JMP. | ||
Line 100: | Line 100: | ||
Now, the data includes 3 kinds of units which are C for water temperature, blank for Macrozoobenthos, and µg/L for the rest of measures. | Now, the data includes 3 kinds of units which are C for water temperature, blank for Macrozoobenthos, and µg/L for the rest of measures. | ||
− | + | ===<font size="2"><font color="#8B4513"> Remove Duplications </font></font>=== | |
When look at the raw data, I found that there are multiple records for the same measure in the same date at the same location. In order to avoid the duplications, I decided to only keep the max id and calculate the mean value as final value. Then the raw data only remain 67,503 observations. | When look at the raw data, I found that there are multiple records for the same measure in the same date at the same location. In order to avoid the duplications, I decided to only keep the max id and calculate the mean value as final value. Then the raw data only remain 67,503 observations. | ||
[[Image:GJY duplication.jpg|400px]] | [[Image:GJY duplication.jpg|400px]] |
Revision as of 07:08, 8 July 2018
|
|
|
|
Contents
Data Description
Data Souce
The original data is offered by the website, and the main data I use for this assignment includes 2 documents and 1 image as show as below.
Data Overview
Boonsong Lekagul waterways readings.csv
This file includes all the records for Boonsong Lekagul waterways by 5 columns with 136,824 observations.
Column | Overview |
---|---|
ID | Unique ID for each observation. |
Value | Measured value for the chemical or property in this record. |
Location | Name of the location sample was taken from which include 10 sites in total. |
Sample date | Date sample was taken from the location, and the data was recorded for 1,965 days from 11/01/1998 to 31/12/2016. |
Measure | Chemicals or water properties measured in the record which includes 106 different types. |
chemical units of measure.csv
This file explains the unit for each measure tested for the research, and includes 106 measures and 4 kinds of unit which are C for water temperature, blank for Macrozoobenthos and mg/L, µg/L for rest of the measures . May need to unify these units before move to further analysis.
Measure | Unit |
---|---|
Waterways Final.jpg
The map shows the geographic position of the location samples, and how they connected with each other through river.
Data Preparation
Data Standardization
The data are recorded by different units, this may make analysis result deviate. So we need to uniform the units before moving to next stage. This progress could be easily done in JMP.
Firstly, join the two tables by matching the measure. [[Image:|400px]]
Then write a parallel conditional function to transform the mg/L value to µg/L.
Now, the data includes 3 kinds of units which are C for water temperature, blank for Macrozoobenthos, and µg/L for the rest of measures.
Remove Duplications
When look at the raw data, I found that there are multiple records for the same measure in the same date at the same location. In order to avoid the duplications, I decided to only keep the max id and calculate the mean value as final value. Then the raw data only remain 67,503 observations.