ShinyNET Data Prep

From Visual Analytics and Applications

Latest revision as of 23:04, 6 August 2017

shinyNET: A web-based flight data visualisation toolkit using R Shiny and ggraph (Group 14)


Since our visualisation shows geospatial network data, we need both geolocation data and network data. The geolocation data consists of the location coordinates of the airports, of which there are 77 in total. To obtain these coordinates, we used batch geocoding.

We used an online batch geocoder to get the location coordinates of the 77 airports. We keyed in the names of the 77 airports and received the latitude and longitude values of all 77 inputs, in the same order, as output. The resulting latitude and longitude coordinates were combined with the airport names and stored in an Excel file.

For the network data, we need to create nodes and edges data. The nodes data is the Excel file created above, which holds each airport name with its latitude and longitude coordinates.
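The pairing of airport names with the geocoder output can be sketched in a few lines of Python. This is only an illustration, not the project's actual script: the function names are hypothetical, a CSV file stands in for the Excel file, and the two coordinate pairs are approximate placeholder values rather than values taken from the project data. It relies on the geocoder returning results in input order, as described above.

```python
import csv
import io


def build_nodes(names, coords):
    """Pair airport names with (latitude, longitude) tuples given in the same order."""
    if len(names) != len(coords):
        raise ValueError("expected exactly one coordinate pair per airport name")
    return [
        {"airport": name, "latitude": lat, "longitude": lon}
        for name, (lat, lon) in zip(names, coords)
    ]


def write_nodes_csv(nodes, handle):
    """Write the nodes table with a header row (CSV used here instead of Excel)."""
    writer = csv.DictWriter(handle, fieldnames=["airport", "latitude", "longitude"])
    writer.writeheader()
    writer.writerows(nodes)


# Placeholder input for illustration only; not taken from the project files.
names = ["Delhi", "Mumbai"]
coords = [(28.56, 77.10), (19.09, 72.87)]

nodes = build_nodes(names, coords)
buffer = io.StringIO()
write_nodes_csv(nodes, buffer)
```

The length check guards against the most likely failure of order-based pairing, namely a geocoder that silently drops an input it cannot resolve.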

Data5.png

The edges are the routes data, which give the departure and arrival airport details of every flight. The routes data was publicly available, but the schedules of the 12 carrier airlines were in PDF format, which cannot be used directly for analysis. The flight schedule data is effective from 26th March 2017 to 28th October 2017. We used an online PDF-to-Excel converter to convert the schedules into Excel documents. Some values were out of place after conversion, and those were corrected.

Data1.png

We need separate departure and arrival city columns for our analysis, so we transformed the converted Excel file into the format shown below.

Data2.png

The frequency column shows the days of the week on which a flight is available. It contains either “Daily” or a sequence of digits between 1 and 7 (for example 2357 or 12347). “Daily” means the flight is available on all 7 days of the week, whereas a value like 147 means the flight is available only on Sunday, Wednesday and Saturday (1-Sunday, 4-Wednesday, 7-Saturday).
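As a minimal sketch of how such a frequency value could be decoded in Python, the helper below turns a raw value into a list of day codes. The names `DAY_NAMES` and `parse_frequency` are hypothetical, and the mapping for codes 2, 3, 5 and 6 is inferred from the three codes given above (1-Sunday, 4-Wednesday, 7-Saturday).

```python
# Day codes as used in the schedule: 1 = Sunday ... 7 = Saturday.
# Codes 2, 3, 5 and 6 are inferred from the three documented codes.
DAY_NAMES = {
    1: "Sunday",
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday",
    7: "Saturday",
}


def parse_frequency(value):
    """Return the sorted day codes for a value such as 'Daily' or '147'."""
    text = str(value).strip()
    if text.lower() == "daily":
        return list(range(1, 8))
    # int() raises ValueError for any non-digit character.
    days = sorted({int(ch) for ch in text})
    if not days or any(day < 1 or day > 7 for day in days):
        raise ValueError(f"unexpected frequency value: {value!r}")
    return days


codes = parse_frequency("147")
day_names = [DAY_NAMES[code] for code in codes]
```

Converting through `str` first means the helper also accepts values that a spreadsheet reader has already turned into integers, such as `147`.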

Data3.png

We used Python to perform the following transformations:

1. Convert the departure and arrival times to HH:MM format.

2. Expand the frequency column by replicating each flight for every date on which it is available within the effective time period. For example, if the frequency is “1”, the flight is replicated for every Sunday between 26th March 2017 and 28th October 2017.

3. Add a carrier column indicating the carrier.

4. Add date and day columns to show the date and day of each flight.

5. Replace all null values in the stops column with 0.

6. Add a new frequency column (different from the original column). This column shows the number of times a particular flight operated within the effective time period.

7. Finally, concatenate all 12 carrier files into one single file.
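The steps above can be sketched with the Python standard library as follows. This is an illustration under stated assumptions, not the original script: the column names, the function names, the raw time formats (`930` or `9:30`) and the sample flight are all hypothetical, and each carrier's schedule is represented as a list of dictionaries rather than an Excel sheet.

```python
from datetime import date, datetime, timedelta

# Effective period of the schedule, as stated in the text.
EFFECTIVE_START = date(2017, 3, 26)
EFFECTIVE_END = date(2017, 10, 28)


def normalise_time(value):
    """Step 1: return a time such as '930' or '9:30' as 'HH:MM' (assumed raw formats)."""
    digits = str(value).replace(":", "").strip().zfill(4)
    return datetime.strptime(digits, "%H%M").strftime("%H:%M")


def day_code(d):
    """Map a date to the schedule's day code, where 1 = Sunday and 7 = Saturday."""
    return d.isoweekday() % 7 + 1


def operating_dates(frequency):
    """Step 2: list every date in the effective period on which the flight operates."""
    text = str(frequency).strip()
    codes = set(range(1, 8)) if text.lower() == "daily" else {int(ch) for ch in text}
    n_days = (EFFECTIVE_END - EFFECTIVE_START).days + 1
    all_dates = (EFFECTIVE_START + timedelta(days=i) for i in range(n_days))
    return [d for d in all_dates if day_code(d) in codes]


def expand_flight(row, carrier):
    """Steps 1 to 6 for one schedule row: one output row per operating date."""
    dates = operating_dates(row["frequency"])
    return [
        {
            **row,
            "carrier": carrier,                      # step 3
            "departure_time": normalise_time(row["departure_time"]),
            "arrival_time": normalise_time(row["arrival_time"]),
            "date": d.isoformat(),                   # step 4
            "day": d.strftime("%A"),                 # step 4
            "stops": row.get("stops") or 0,          # step 5
            "operating_days": row["frequency"],      # original day-code column
            "frequency": len(dates),                 # step 6: number of flights
        }
        for d in dates
    ]


def prepare_routes(carrier_tables):
    """Step 7: combine the expanded rows of all carriers into one list."""
    routes = []
    for carrier, rows in carrier_tables.items():
        for row in rows:
            routes.extend(expand_flight(row, carrier))
    return routes


# Hypothetical sample row: operates on Sundays (1) and Saturdays (7).
sample = {
    "flight": "XX 101",
    "origin": "Delhi",
    "destination": "Mumbai",
    "departure_time": "930",
    "arrival_time": "11:45",
    "stops": None,
    "frequency": "17",
}
routes = prepare_routes({"Carrier A": [sample]})
```

The effective period runs from a Sunday to a Saturday and spans exactly 31 weeks, so a flight with a single day code expands to 31 rows and a daily flight to 217 rows.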

A sample of the final prepared routes data file is shown below:

Data4.png

The nodes data and the edges (routes) data are now prepared and ready for visualisation.