Difference between revisions of "ShinyNET Data Prep"

From Visual Analytics and Applications
Jump to navigation Jump to search
Line 32: Line 32:
 
<font size = 5; color="#FFFFFF">Data Preparation</font>     
 
<font size = 5; color="#FFFFFF">Data Preparation</font>     
 
</div>
 
</div>
 +
 +
The data was available in public but the data of 12 carrier airlines were in PDF format. This data cannot be used for analysis. The flight schedule data is effective from 26th March 2017 to 28th October 2017. Since this data is in PDF format, it must be converted into excel first. We used online PDF to excel converter tool to convert this into excel documents. Some values in the excel format were out of place and those were corrected.<br />
 +
 +
<center>
 +
[[File:Data1.png|500px]]
 +
</center>
 +
We need a separate arrival and destination city column for our analysis so we transformed the converted excel file into the format shown below.<br />
 +
 +
<center>
 +
[[File:Data2.png|500px]]
 +
</center>
 +
The frequency columns show what days of the week are the flights available. It contains either “Daily” or a sequence of characters between 1 and 7 (2357, 12347). Daily means the flight is available on all 7 days in a week. Whereas, value like 147 means the flight is available only on Sunday, Wednesday and Saturday (1-Sunday, 4-Wednesday, 7-Saturday).<br />
 +
 +
<center>
 +
[[File:Data3.png|500px]]
 +
</center>
 +
We used python to do the following transformations:<br />
 +
 +
1. Departure and Arrival time must be in HH:MM format.<br />
 +
 +
2. The frequency column is transformed such a way that we replicated the flight based on what days are available between the effective time period. For example, if the frequency is “1”, then the particular flight is replicated for every Sundays between 26th March 2017 and 28th October 2017.<br />
 +
 +
3. Additional carrier column has to be added indicating the carrier.<br />
 +
 +
4. Adding date and day column to show date and day of the flight.<br />
 +
 +
5. Replace all null values in stops column to 0.<br />
 +
 +
6. Add frequency column (different from the original column). This frequency column shows the number of flights happened for a particular flight between the effective time period.<br />
 +
 +
7. And finally concatenate all the 12 carrier files into one single big file.<br />
 +
 +
A sample of the final prepared data file is shown below:<br />
 +
<center>
 +
[[File:Data4.png|500px]]
 +
</center>
 +
The data is now prepared and ready for visualisation.

Revision as of 21:11, 6 August 2017

shinyNET:A web-based flight data visualisation toolkit using R Shiny and ggraph

Project Proposal

Data Preparation

Poster

Application

Report

 

Data Preparation

The data was available in public but the data of 12 carrier airlines were in PDF format. This data cannot be used for analysis. The flight schedule data is effective from 26th March 2017 to 28th October 2017. Since this data is in PDF format, it must be converted into excel first. We used online PDF to excel converter tool to convert this into excel documents. Some values in the excel format were out of place and those were corrected.

Data1.png

We need a separate arrival and destination city column for our analysis so we transformed the converted excel file into the format shown below.

Data2.png

The frequency columns show what days of the week are the flights available. It contains either “Daily” or a sequence of characters between 1 and 7 (2357, 12347). Daily means the flight is available on all 7 days in a week. Whereas, value like 147 means the flight is available only on Sunday, Wednesday and Saturday (1-Sunday, 4-Wednesday, 7-Saturday).

Data3.png

We used python to do the following transformations:

1. Departure and Arrival time must be in HH:MM format.

2. The frequency column is transformed such a way that we replicated the flight based on what days are available between the effective time period. For example, if the frequency is “1”, then the particular flight is replicated for every Sundays between 26th March 2017 and 28th October 2017.

3. Additional carrier column has to be added indicating the carrier.

4. Adding date and day column to show date and day of the flight.

5. Replace all null values in stops column to 0.

6. Add frequency column (different from the original column). This frequency column shows the number of flights happened for a particular flight between the effective time period.

7. And finally concatenate all the 12 carrier files into one single big file.

A sample of the final prepared data file is shown below:

Data4.png

The data is now prepared and ready for visualisation.