Difference between revisions of "The Indian Story Data Prep"

From Visual Analytics and Applications
Jump to navigation Jump to search
Line 39: Line 39:
 
[[File:Data Structure.png|border|width = 100| 700px|centre]]  
 
[[File:Data Structure.png|border|width = 100| 700px|centre]]  
  
Every row has certain data attributes. Population information in each row has a hierachial structure beginning with the state code followed by the city/town name then whether it is an Urban or a rural city and finally the age group. Here however there isn't a particular state name given for the state code. This is where the second dataset comes in and helps us understand the meaning of the state codes.
+
Every row has certain data attributes. Population information in each row has a hierarchial structure beginning with the state code followed by the city/town name then whether it is an Urban or a rural city and finally the age group. Here however there isn't a particular state name given for the state code. This is where the second dataset comes in and helps us understand the meaning of the state codes.  
  
Next the variables also have a hierarchial structure.
+
Next the variables also have a hierarchial structure. You have the highest level popluation totals. These tell you total population, total male population and total female population. Then following that you have the division of '''Illiterate''' and '''Literate''' population numbers. Both of these have persons, male and female divisions. Next under Literate you have those with education and those without education. Again under Literate and without education you have population total number of male and female numbers. Then under Literate with education, you have the different levels of education! They go all the way from '''before Primary''' to '''Graduates'''. Then under each of these categories you have population numbers for total, male and female.
 +
 
 +
So let's start addressing one data problem after another. The following explains the problem and the solution used to solve them.
 +
 
 +
== Data Problem 1: Break the hierarchy ==
 +
=== Problem ===
 +
So the dataset itself came with a hierarchy which needed to be broken down.
 +
=== Solution ===
 +
This was done manually on excel. Although Python and R have code logic that can be used to merge the cells. The output of the new column names is printed here. This sheet was then stored as table 1.
 +
<pre>
 +
TableName
 +
State Code
 +
Town Code
 +
Town Name
 +
Urban/Rural
 +
Age-group
 +
Persons.Total
 +
Males.Total
 +
Females.Total
 +
Persons.Illiterate
 +
Males.Illiterate
 +
Females.illiterate
 +
Persons.Literate
 +
Males.Literate
 +
Females.Literate
 +
Persons.L.NoEdu
 +
Males.L.NoEdu
 +
Females.L.NoEdu
 +
Persons.BelowPrim
 +
Males.BelowPrim
 +
Females.BelowPrimary
 +
Persons.P
 +
Males.P
 +
Females.P
 +
Persons.M
 +
Males.M
 +
Females.M
 +
Persons.S
 +
Males.S
 +
Females.S
 +
Persons.PreUni
 +
Males.PreUni
 +
Females.PreUni
 +
Persons.Diploma(NT)
 +
Males.Diploma(NT)
 +
Females.Diploma(NT)
 +
Persons.Diploma(T)
 +
Males.Diploma(T)
 +
Females.Diploma(T)
 +
Persons.Graduates
 +
Males.Graduates
 +
Females.Graduates
 +
Persons.Unclassified
 +
Males.Unclassified
 +
Females.Unclassified
 +
</pre>
 +
 
 +
== Data Problem 2: State Codes are not Sufficient!==
 +
=== Problem Definition ===
 +
Imagine if you read a map, hover over a particular state in India and all you get to see is the state code. Unless you are an educated Indian, you wouldn't know what that means. Hence the second dataset was used. This can be downloaded [http://censusindia.gov.in/2011-prov-results/paper2/data_files/India2/Table_2_PR_Cities_1Lakh_and_Above.xls here]
 +
 
 +
 
 +
 
 +
=== Solution ===
 +
We wrote a beautiful R code to munge this.
 +
 
 +
<pre>
 +
# Level one cleaning
 +
## Import data
 +
table1<- readxl::read_excel("Table 1.xlsx")
 +
table2 <- readxl::read_excel("Table 2.xlsx")
 +
 
 +
## get the state code to name dictionary
 +
cl_table2 <- table2 %>% distinct()
 +
 
 +
## recode it into the State Code column of the original table
 +
table1$`State Code`<- recode(table1$`State Code`, "01" = "JAMMU & KASHMIR",
 +
                            "02" = "HIMACHAL PRADESH",
 +
                            "03" = "PUNJAB",
 +
                            "04" = "CHANDIGARH" ,
 +
                            "05" = "UTTARAKHAND",
 +
                            "06" = "HARYANA" ,
 +
                            "07" = "NCT OF DELHI",
 +
                            "08" = "RAJASTHAN",
 +
                            "09" = "UTTAR PRADESH",
 +
                            "10" = "BIHAR",
 +
                            "13" = "NAGALAND",
 +
                            "14" = "MANIPUR",
 +
                            "15" = "MIZORAM",
 +
                            "16" = "TRIPURA",
 +
                            "17" = "MEGHALAYA",
 +
                            "18" = "ASSAM",
 +
                            "19" = "WEST BENGAL",
 +
                            "20" = "JHARKHAND",
 +
                            "21" = "ORISSA",
 +
                            "22" = "CHHATTISGARH",
 +
                            "23" = "MADHYA PRADESH",
 +
                            "24" = "GUJARAT",
 +
                            "27" = "MAHARASHTRA",
 +
                            "28" = "ANDHRA PRADESH",
 +
                            "29" = "KARNATAKA",
 +
                            "32" = "KERALA",
 +
                            "33" = "TAMIL NADU",
 +
                            "34" = "PUDUCHERRY",
 +
                            "35" = "ANDAMAN & NICOBAR ISLANDS")
 +
</pre>

Revision as of 17:41, 17 July 2017

Banner.png Group 9-The Indian Story

Project Proposal

Data Preparation

Poster

Application

Report


from the data engineers desk

Data

Here we speak about the dataset and munging that we do upon it. In order to create the visualisations and eventually build the Application it is important to have the data in the right form. Let's start with examining the data in its initial form.

Data as downloaded from the website

The copy of the data can be downloaded from this link here

Here is a portion of the data excel data source that somewhat gives us an overview of the data. I'll explain more in the tables below. Sometimes visually looking at the data helps flare off ideas on how to munge it.

Project was data.JPG

The data file is quite a clean file with no missing values. Those with no values are replaced with 0 and they are all in terms of population numbers. No ratios have been computed except for totals. A glance at the data file above will give you the impression that there is more than one level of data and that information is in the hierarchical form. The picture below is what we visualise the dataset to be conceptually.

width = 100

Every row has certain data attributes. Population information in each row has a hierarchial structure beginning with the state code followed by the city/town name then whether it is an Urban or a rural city and finally the age group. Here however there isn't a particular state name given for the state code. This is where the second dataset comes in and helps us understand the meaning of the state codes.

Next the variables also have a hierarchial structure. You have the highest level popluation totals. These tell you total population, total male population and total female population. Then following that you have the division of Illiterate and Literate population numbers. Both of these have persons, male and female divisions. Next under Literate you have those with education and those without education. Again under Literate and without education you have population total number of male and female numbers. Then under Literate with education, you have the different levels of education! They go all the way from before Primary to Graduates. Then under each of these categories you have population numbers for total, male and female.

So let's start addressing one data problem after another. The following explains the problem and the solution used to solve them.

Data Problem 1: Break the hierarchy

Problem

So the dataset itself came with a hierarchy which needed to be broken down.

Solution

This was done manually on excel. Although Python and R have code logic that can be used to merge the cells. The output of the new column names is printed here. This sheet was then stored as table 1.

 
TableName
State Code
Town Code
Town Name
Urban/Rural
Age-group
Persons.Total
Males.Total
Females.Total
Persons.Illiterate
Males.Illiterate
Females.illiterate
Persons.Literate
Males.Literate
Females.Literate
Persons.L.NoEdu
Males.L.NoEdu
Females.L.NoEdu
Persons.BelowPrim
Males.BelowPrim
Females.BelowPrimary
Persons.P
Males.P
Females.P
Persons.M
Males.M
Females.M
Persons.S
Males.S
Females.S
Persons.PreUni
Males.PreUni
Females.PreUni
Persons.Diploma(NT)
Males.Diploma(NT)
Females.Diploma(NT)
Persons.Diploma(T)
Males.Diploma(T)
Females.Diploma(T)
Persons.Graduates
Males.Graduates
Females.Graduates
Persons.Unclassified
Males.Unclassified
Females.Unclassified

Data Problem 2: State Codes are not Sufficient!

Problem Definition

Imagine if you read a map, hover over a particular state in India and all you get to see is the state code. Unless you are an educated Indian, you wouldn't know what that means. Hence the second dataset was used. This can be downloaded here


Solution

We wrote a beautiful R code to munge this.

 
# Level one cleaning 
## Import data 
table1<- readxl::read_excel("Table 1.xlsx")
table2 <- readxl::read_excel("Table 2.xlsx")

## get the state code to name dictionary 
cl_table2 <- table2 %>% distinct()

## recode it into the State Code column of the original table
table1$`State Code`<- recode(table1$`State Code`, "01" = "JAMMU & KASHMIR",
                             "02" = "HIMACHAL PRADESH",
                             "03" = "PUNJAB",
                             "04" = "CHANDIGARH" ,
                             "05" = "UTTARAKHAND",
                             "06" = "HARYANA" ,
                             "07" = "NCT OF DELHI",
                             "08" = "RAJASTHAN",
                             "09" = "UTTAR PRADESH",
                             "10" = "BIHAR",
                             "13" = "NAGALAND",
                             "14" = "MANIPUR",
                             "15" = "MIZORAM",
                             "16" = "TRIPURA",
                             "17" = "MEGHALAYA",
                             "18" = "ASSAM",
                             "19" = "WEST BENGAL",
                             "20" = "JHARKHAND",
                             "21" = "ORISSA",
                             "22" = "CHHATTISGARH",
                             "23" = "MADHYA PRADESH",
                             "24" = "GUJARAT",
                             "27" = "MAHARASHTRA",
                             "28" = "ANDHRA PRADESH",
                             "29" = "KARNATAKA",
                             "32" = "KERALA",
                             "33" = "TAMIL NADU",
                             "34" = "PUDUCHERRY",
                             "35" = "ANDAMAN & NICOBAR ISLANDS")