Difference between revisions of "ISSS608 2017-18 T3 Assignment LIU YUQING Data Preparation"

From Visual Analytics and Applications
Jump to navigation Jump to search
(Created page with "<div style="background:#002060; border:#002060; text-align:center;"> 300px <font size = 5; color="#FFFFFF"><span style="font-family:Century Gothic;">ISSS...")
 
Line 1: Line 1:
<div style="background:#002060; border:#002060; text-align:center;">  
+
<div style=background:#002060 border:#C4F04E>
[[Image:Header.jpg|300px]]
+
[[Image:Header.jpg|300px]]  
 
<font size = 5; color="#FFFFFF"><span style="font-family:Century Gothic;">ISSS608 Visual Analytics and Applications_Vast Challenge_Mini Challenge2</span></font>  
 
<font size = 5; color="#FFFFFF"><span style="font-family:Century Gothic;">ISSS608 Visual Analytics and Applications_Vast Challenge_Mini Challenge2</span></font>  
 
</div>
 
</div>
 
<!--MAIN HEADER -->  
 
<!--MAIN HEADER -->  
 
{|style="background-color:#FFE79E;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |  
 
{|style="background-color:#FFE79E;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |  
| style="font-family:Century Gothic; font-size:100%; solid #000000; background:#002060; text-align:center;" width="25%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #000000; background:#C00000; text-align:center;" width="25%" |  
 
;  
 
;  
[[ISSS608 2016-17 T3 Assign CAO BO| <font color="#FFFFFF">Background</font>]]  
+
[[ISSS608_2017-18_T3_Assignment_LIU YUQING| <font color="#FFFFFF">Background</font>]]  
 
   
 
   
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#C00000; text-align:center;" width="25%" |   
+
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#002060; text-align:center;" width="25%" |   
 
;  
 
;  
[[CAO_BO_DataPreparation| <font color="#FFFFFF">Overview & Data Preparation</font>]]  
+
[[ISSS608_2017-18_T3_Assignment_LIU YUQING_Data Preparation| <font color="#FFFFFF">Data Preparation</font>]]  
 
   
 
   
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#002060; text-align:center;" width="25%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#002060; text-align:center;" width="25%" |
 
;  
 
;  
[[CAO_BO_PatternSummary| <font color="#FFFFFF">Pattern Summary</font>]]  
+
[[ISSS608_2017-18_T3_Assignment_LIU YUQING_Analysis&Insights| <font color="#FFFFFF">Analysis & Insights</font>]]  
 
   
 
   
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#002060; text-align:center;" width="25%" |   
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#002060; text-align:center;" width="25%" |   
 
;  
 
;  
[[CAO_BO_Comments| <font color="#FFFFFF">Comments & References</font>]]  
+
[[Analysis&Insights_Comments| <font color="#FFFFFF">Comments & References</font>]]  
 
   
 
   
 
|  &nbsp;  
 
|  &nbsp;  
|}
+
|}  
 +
 
 +
 
 +
=Dataset description=
 +
Mini-Challenge 2 provides two datasets - [Boonsong Lekagul waterways readings.csv] and [chemical units of measure.csv] and one map image about rivers and streams of some specific locations.
 +
[Boonsong Lekagul waterways readings.csv] includes 5 variables: ID, measure, value of masure, location and measuring date. There are 10 locations, 106 measures and measuring date span is from 1998 to 2016.
 +
[chemical units of measure.csv] includes measures and unit of each measure.
  
==Map View==
+
=Map view=
[[File:Lekagul Roadways labeled v2.jpg|400px|]]
+
According to the map image, we can observe geographical distribution of locations, in order to get specific coordinates of each location, we add the image into tableau as background and set basic range of coordinates of X axis and Y axis given by Mini-Challenge 2.
 +
[[Image:image1-1.jpg|800px|center]]
 +
Here we added point as annotate on the top of each arrow of each location, so X&Y coordinates of each location will show on the map. We do same for all other locations and we get the [Location.xlsx] file about coordinates of each location.
 +
[[Image:image1-2.jpg|500px|center]]
  
==Data Overview==
+
=Data augmentation and data pruning=
===Traffic Movement Data===
+
After overview of whole data, we observed value variance among measures are quite big, and measuring time span of each measure also varies greatly. Units of measures are mainly separated into ug/l and mg/l, it’s hard for us to estimate whether the value of this chemical is toxic for water or not. In this case, I found a file from the California Water Board as reference.
[[File:Cb_dataoverview.PNG|800px|center]]
+
[[Image:image1-3.jpg|800px|center]]
 +
This file provides MCL(California’s maximum contaminant levels) and DLR( Detection limits for purpose of reporting) of about 120 measures for regulated drinking water. Maximum level of contaminants in drinking water should be lower than raw water, so any value of chemical less than its corresponding level could be considered insignificant.
 +
<br>
 +
''Notice: all units are mg/l in this file.
  
==Data Discription==
+
''
Entry gates are positioned at the Preserve entrances.  Each vehicle receives an entry ticket at the gate and is assigned a vehicle class; the entry is recorded.  The entry ticket contains an RF-tag that enables the Preserve sensors to pick up the passage of a vehicle through the Preserve. Each vehicle surrenders their entry ticket when exiting the Preserve and the exit is recorded.When vehicles enter the Preserve, they must proceed through a gate and obtain a pass.
 
===Details Discription===
 
1. Entrances: All vehicles pass through an Entrance when entering or leaving the Preserve.
 
  
2. General-gates: All vehicles may pass through these gates.  These sensors provide valuable information for the Preserve Rangers trying to understand the flow of traffic through the Preserve.
+
=Data cleaning=
+
Steps:
3. Gates: These are gates that prevent general traffic from passing. Preserve Ranger vehicles have tags that allow them to pass through these gates to inspect or perform work on the roadway beyond.
+
 
 
+
1. Compare measures of Boonsong Lekagul waterways readings with measures given in California Water Board. Check measures one by one of both dataset, selected common measures of both. There are 25 measures appeared in CWB in total. Saved the 25 measures in [Measures with MCL] file
4. Ranger-stops: These sensors represent working areas for the Rangers, so you will often see a Ranger-stop sensor at the end of a road managed by a Gate. Some Ranger-stops are in other locations, however, so these sensors record all traffic passing by.
+
 
+
2. Unify units of MCL value to standard unit of these 25 measures, for measures with unit ug/l, changed the maximum value in CWB as ug/l for comparison.
5. Camping: These sensors record visitors to the Preserve camping areas.  Visitors pass by these entering and exiting a campground.
 
  
 +
3. Join [Measures with MCL] table to [Boonsong Lekagul waterways readings] table in JMP using left outer join so that measures not contained in CWS can also be reserved. We saved the joined data as excel file.
 +
[[Image:image1-4.jpg|500px|center]]
 +
'''Join table with MCL level'''
 +
[[Image:image1-5.jpg|500px|center]]
 +
Compare value of these measures in excel, for those lower than MCL level, delete the datapoints, others all stay same. Using if function: =IF(B:B<G:G,"null",B:B), column B is measured value, column G is MCL level, after calculation, data file looks like below:
 
==Step by Step Discription==
 
==Step by Step Discription==
 +
[[Image:image1-6.jpg|500px|center]]
 +
We deleted all records with “null” value. After data cleaning, 13 measures all cleaned, which means their value are all lower than MCL level over the date, they are:
 +
 +
1,2,4-Trichlorobenzene 2008 – 2009, Aldrin, Alachlor, Aluminium, Barium, Berilium, Endrin, Hexachlorobenzene, Mercury, Methoxychlor, Pentachlorobenzene, Selenium,Simazine.
 +
 +
Now data records are 110,581 and we save the remain data as [Measures exclude nonimpact] file.
 +
 
===Exclude misleading records===
 
===Exclude misleading records===
 
<table cellspacing="10" border="1">
 
<table cellspacing="10" border="1">

Revision as of 01:51, 8 July 2018

Header.jpg ISSS608 Visual Analytics and Applications_Vast Challenge_Mini Challenge2

Background

Data Preparation

Analysis & Insights

Comments & References

 


Dataset description

Mini-Challenge 2 provides two datasets - [Boonsong Lekagul waterways readings.csv] and [chemical units of measure.csv] and one map image about rivers and streams of some specific locations. [Boonsong Lekagul waterways readings.csv] includes 5 variables: ID, measure, value of masure, location and measuring date. There are 10 locations, 106 measures and measuring date span is from 1998 to 2016. [chemical units of measure.csv] includes measures and unit of each measure.

Map view

According to the map image, we can observe geographical distribution of locations, in order to get specific coordinates of each location, we add the image into tableau as background and set basic range of coordinates of X axis and Y axis given by Mini-Challenge 2.

Image1-1.jpg

Here we added point as annotate on the top of each arrow of each location, so X&Y coordinates of each location will show on the map. We do same for all other locations and we get the [Location.xlsx] file about coordinates of each location.

Image1-2.jpg

Data augmentation and data pruning

After overview of whole data, we observed value variance among measures are quite big, and measuring time span of each measure also varies greatly. Units of measures are mainly separated into ug/l and mg/l, it’s hard for us to estimate whether the value of this chemical is toxic for water or not. In this case, I found a file from the California Water Board as reference.

Image1-3.jpg

This file provides MCL(California’s maximum contaminant levels) and DLR( Detection limits for purpose of reporting) of about 120 measures for regulated drinking water. Maximum level of contaminants in drinking water should be lower than raw water, so any value of chemical less than its corresponding level could be considered insignificant.
Notice: all units are mg/l in this file.

Data cleaning

Steps:

1. Compare measures of Boonsong Lekagul waterways readings with measures given in California Water Board. Check measures one by one of both dataset, selected common measures of both. There are 25 measures appeared in CWB in total. Saved the 25 measures in [Measures with MCL] file

2. Unify units of MCL value to standard unit of these 25 measures, for measures with unit ug/l, changed the maximum value in CWB as ug/l for comparison.

3. Join [Measures with MCL] table to [Boonsong Lekagul waterways readings] table in JMP using left outer join so that measures not contained in CWS can also be reserved. We saved the joined data as excel file.

Image1-4.jpg

Join table with MCL level

Image1-5.jpg

Compare value of these measures in excel, for those lower than MCL level, delete the datapoints, others all stay same. Using if function: =IF(B:B<G:G,"null",B:B), column B is measured value, column G is MCL level, after calculation, data file looks like below:

Step by Step Discription

Image1-6.jpg

We deleted all records with “null” value. After data cleaning, 13 measures all cleaned, which means their value are all lower than MCL level over the date, they are:

1,2,4-Trichlorobenzene 2008 – 2009, Aldrin, Alachlor, Aluminium, Barium, Berilium, Endrin, Hexachlorobenzene, Mercury, Methoxychlor, Pentachlorobenzene, Selenium,Simazine.

Now data records are 110,581 and we save the remain data as [Measures exclude nonimpact] file.

Exclude misleading records

Exclusion criterion:

(1) id without full movement records( all visitors should come in and get out from both entrances so those ids that only have one entrance should be excluded).

(2)id without full timestamp records.( those id without the come in and get out records).

(3) repeat records

Extract the car sequences

The most essentail step for data preparation is to extract the car sequences which helps to summary the life patterns. This step is conducted in JMP with the function[Col Rank(Timestamp,car-id)]

Label vehicles in their movement order and then create Episode based on the movement order and id

Number the movement order for each car-id and then list the car-id with episode.

Level: The order of movement(entrance---gate :level of entrance is one and the level of gate is two)

Episode: Each entrance & exit is one episode, so normally each id should has only one episode.

Compute the last duration, camping hour and single checkpoint duration

Compute the multipul duration in excel for better visualization.Last duration(day),camping hour(hour),single checkpoint duration(min)

Coordinates Extractation and Preparation for Gephi

Extract the coordinates in tableau and then prepare the coordinates for Gephi visualization

Participent Segamentation

All the participents are classified into Camper,No-camper and Ranger