Data preparation

From Visual Analytics and Applications

VAST Mini-Challenge 2: Like a duck to water

Dataset description

Mini-Challenge 2 provides two datasets, [Boonsong Lekagul waterways readings.csv] and [chemical units of measure.csv], plus a map image of the rivers and streams around the sampling locations. [Boonsong Lekagul waterways readings.csv] contains 5 variables: ID, measure, value of the measure, location and measuring date. It covers 10 locations and 106 measures, with measuring dates spanning 1998 to 2016. [chemical units of measure.csv] lists each measure together with its unit.
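The counts quoted above (locations, measures, date span) can be checked with a short script. The following is a minimal sketch using only the Python standard library. The column names, the ISO date format and the sample rows are assumptions made for illustration; they are not taken from the real file.

```python
import csv
import io
from datetime import date

# Made-up sample standing in for the readings file.
# Column names and date format are assumptions, not the real schema.
SAMPLE = """id,measure,value,location,sample_date
1,Chem A,2.0,Site A,1998-01-11
2,Chem B,9.1,Site A,1998-01-25
3,Chem C,0.33,Site B,2016-01-11
"""

def summarize(csv_text):
    """Count distinct locations and measures and report the year span."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    dates = [date.fromisoformat(r["sample_date"]) for r in rows]
    return {
        "locations": len({r["location"] for r in rows}),
        "measures": len({r["measure"] for r in rows}),
        "first_year": min(dates).year,
        "last_year": max(dates).year,
    }

print(summarize(SAMPLE))
```

To run it on the real file, the text passed to `summarize` would come from reading the CSV, with the column names adjusted to match its header.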


Map data

The map image shows the geographical distribution of the locations. To get the coordinates of each location, we added the image to Tableau as a background image and set the X-axis and Y-axis coordinate ranges given by Mini-Challenge 2.

Image1-1.jpg

Here we added a point annotation at the tip of the arrow for a location, so that its X and Y coordinates show on the map. We did the same for all other locations and recorded the coordinates of each location in the [Location.xlsx] file.


Image1-2.jpg
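Placing the image over a known axis range amounts to a linear mapping from pixel positions to map coordinates. The sketch below illustrates that idea outside Tableau. It is not how Tableau is implemented, and the image size and axis ranges in the example call are placeholder values, not the ones given by the challenge.

```python
def pixel_to_map(px, py, img_w, img_h, x_range, y_range):
    """Linearly map a pixel position to map coordinates.

    Assumes the pixel origin is the top-left corner with y growing
    downward, while the map's Y axis grows upward.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    x = x_min + (px / img_w) * (x_max - x_min)
    y = y_max - (py / img_h) * (y_max - y_min)
    return x, y

# Placeholder image size (200 x 100 px) and axis ranges, for illustration only.
print(pixel_to_map(100, 50, 200, 100, (0, 200), (0, 100)))
```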


Data augmentation and data pruning

After an overview of the whole dataset, we observed that values vary widely among measures, and that the measuring time span also differs greatly from measure to measure. Units are mainly ug/l and mg/l, so it is hard to judge from the raw values alone whether the level of a chemical is toxic for the water. We therefore used a file from the California Water Board as a reference.

Image1-3.jpg

This file provides the MCL (California's maximum contaminant level) and DLR (detection limit for purposes of reporting) of about 120 measures for regulated drinking water. The maximum contaminant level allowed in drinking water should be lower than the level found in raw water, so any reading below its corresponding MCL can be considered insignificant.

Notice: all units are mg/l in this file.

Steps:

1. Join the two files on readings.measures = units.measures.

Image1-4.jpg

2. The resulting file is an augmented dataset with units included.

3. Gather the maximum contaminant levels for drinking water from an established standard (in our case, the California Water Board thresholds for 2018).

4. These are maximum levels for drinking water. Intuitively, contaminant levels in raw water should be higher.

5. Map each row of the dataset from step 2 to the corresponding maximum level from the California Water Board 2018 data. If the level in raw water is even lower than the drinking-water limit, delete the data point as insignificant.

Image1-7.jpg
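The join and pruning steps above can be sketched in plain Python. This is an illustrative sketch, not the workflow actually used for the project: the field names, measure names and threshold values are hypothetical, and the thresholds are assumed to be expressed in the same unit as the readings already.

```python
def join_units(readings, units):
    """Attach the unit of each measure to every reading (left join on measure)."""
    unit_by_measure = {u["measure"]: u["unit"] for u in units}
    return [dict(r, unit=unit_by_measure.get(r["measure"])) for r in readings]

def prune_below_mcl(rows, mcl_by_measure):
    """Drop readings strictly below the threshold; keep measures without one."""
    kept = []
    for r in rows:
        mcl = mcl_by_measure.get(r["measure"])
        if mcl is not None and r["value"] < mcl:
            continue  # below the drinking-water limit: treated as insignificant
        kept.append(r)
    return kept

# Hypothetical data for illustration only.
readings = [
    {"measure": "Chem A", "value": 0.5},
    {"measure": "Chem A", "value": 2.0},
    {"measure": "Chem B", "value": 7.0},
]
units = [
    {"measure": "Chem A", "unit": "mg/l"},
    {"measure": "Chem B", "unit": "ug/l"},
]
joined = join_units(readings, units)
kept = prune_below_mcl(joined, {"Chem A": 1.0})
print(kept)
```

Readings of measures with no threshold are kept unchanged, which matches the description that only values below a known maximum level are deleted.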

Data cleaning

Steps:

1. Compare the measures in the Boonsong Lekagul waterways readings with the measures listed by the California Water Board (CWB). Checking the measures of both datasets one by one, we selected those common to both. In total, 25 measures also appear in the CWB file.

2. Convert the MCL values to the unit used for each of these 25 measures: for measures recorded in ug/l, convert the CWB maximum value to ug/l for comparison.

3. Compare the values of these measures: delete data points lower than the MCL, and leave all others unchanged.
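Steps 1 and 2 can be sketched as a set intersection followed by a unit conversion (1 mg/l = 1000 ug/l). The measure names and threshold values below are made up for illustration and do not come from the CWB file.

```python
UG_PER_MG = 1000  # 1 mg/l equals 1000 ug/l

def common_measures(reading_measures, mcl_mg_per_l):
    """Measures that appear both in the readings and in the threshold table."""
    return sorted(set(reading_measures) & set(mcl_mg_per_l))

def mcl_in_reading_units(mcl_mg_per_l, unit_by_measure):
    """Express each threshold (given in mg/l) in the unit used by the readings."""
    converted = {}
    for measure, mcl in mcl_mg_per_l.items():
        unit = unit_by_measure.get(measure)
        if unit == "mg/l":
            converted[measure] = mcl
        elif unit == "ug/l":
            converted[measure] = mcl * UG_PER_MG
        # Measures absent from the readings or with other units are skipped.
    return converted

# Hypothetical thresholds (mg/l) and reading units, for illustration only.
mcl = {"Chem A": 0.5, "Chem B": 45.0, "Chem X": 1.0}
reading_units = {"Chem A": "ug/l", "Chem B": "mg/l", "Chem C": "mg/l"}

print(common_measures(reading_units, mcl))
print(mcl_in_reading_units(mcl, reading_units))
```

Converting the threshold rather than the readings leaves the original values untouched, which is the direction described in step 2.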

After data cleaning, 13 measures were removed entirely, which means all of their values are lower than the MCL across the whole period. They are:

Image1-5.jpg

Image1-6.jpg

There are 106 chemicals in total.

25 of them have a standard MCL value.

12 of the chemicals are below the MCL throughout the years, so after data cleaning there are 94 measures left in total.
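One way to verify such counts is to compare the set of measures before and after cleaning. The sketch below uses hypothetical rows; it only shows the bookkeeping and does not reproduce the actual figures.

```python
def fully_removed_measures(before, after):
    """Measures that had readings before cleaning but none afterwards."""
    return sorted({r["measure"] for r in before} - {r["measure"] for r in after})

# Hypothetical rows for illustration only.
before = [
    {"measure": "Chem A", "value": 0.5},
    {"measure": "Chem A", "value": 2.0},
    {"measure": "Chem B", "value": 0.1},
]
after = [{"measure": "Chem A", "value": 2.0}]

removed = fully_removed_measures(before, after)
remaining = len({r["measure"] for r in before}) - len(removed)
print(removed, remaining)
```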