Difference between revisions of "ISSS608 2017-18 T1 Assign NURUL ASYIKEEN BINTE AZHAR Data Preparation"

From Visual Analytics and Applications

Latest revision as of 21:50, 15 October 2017

Investigation of Smartpolis Epidemic Outbreak


 


Data prep.gif



The Dataset

Microblog Messages

There are 1,027,599 messages in the dataset from 73,928 unique IDs. This dataset will be primarily used to assess the epidemic situation at Smartpolis. It consists of the following fields:

Field Name    Field Description
ID            Personal identifier of the individual posting the message
Created_at    Date and time of the post
Location      Latitude and longitude coordinates of the mobile device at the time of the post
Text          The posted message. Most messages are in English; some are in other languages such as Spanish and Bahasa Indonesia.


Data Cleansing

The data was first explored for input errors. Errors were found in the "Created_at" field: in 21 rows, the time of message creation had been replaced by one of the strings "Fire-", "up th", "One", "Radio", "sc", ":*.", ".", "<3" or "...". After confirming that these messages were not related to any sickness, the strings were replaced with a time of 12:00 pm, so the substitution does not adversely affect the subsequent analysis.
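The report does not state which tool was used for this repair. As a minimal sketch of the same idea in Python, the function below tries to parse a "Created_at" value and, if the time portion is not a valid time, keeps the date and substitutes 12:00 pm. The function name `repair_created_at` and the date layout `%m/%d/%Y %H:%M` are assumptions for illustration; the actual layout of the field is not given in the text.

```python
from datetime import datetime

DEFAULT_TIME = "12:00"  # noon placeholder used for rows with a corrupted time


def repair_created_at(value, fmt="%m/%d/%Y %H:%M"):
    """Parse a Created_at string; fall back to 12:00 if the time is invalid.

    The format string is an assumed layout, not taken from the dataset.
    """
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        # Keep the date (everything before the first space) and replace
        # the corrupted remainder, e.g. "Fire-" or "<3", with noon.
        date_part = value.split(" ", 1)[0]
        return datetime.strptime(f"{date_part} {DEFAULT_TIME}", fmt)
```

A value with a valid time passes through unchanged, while a value whose time was overwritten by stray text keeps its date and receives the noon placeholder.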

Data Preparation

Location

The location field was split into two columns, "Latitude" and "Longitude". These new fields are pivotal in mapping the messages and locating ground zero of the epidemic.
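A minimal sketch of this split in Python is shown below. It assumes that the location value holds the latitude first and the longitude second, separated by whitespace; this ordering and separator are assumptions, and `split_location` is a hypothetical helper name.

```python
def split_location(location):
    """Split a 'latitude longitude' string into two floats.

    Assumes exactly two whitespace-separated numbers, latitude first.
    """
    lat_text, lon_text = location.split()
    return float(lat_text), float(lon_text)
```

The two returned values correspond to the new "Latitude" and "Longitude" columns.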

Text

The text messages provide information on what the citizens of Smartpolis were experiencing or witnessing, and were used to form a daily word cloud, further elaborated in ISSS608_2017-18_T1_Assign_NURUL ASYIKEEN BINTE AZHAR_Ground Zero. The information derived from the texts is used for the following:

  • Understand major events that occurred in the 3 weeks based on frequently occurring phrases
The following events are inferred to have occurred in Smartpolis:


Date           Event
2 May 2011     Car accident
4 May 2011     Antique convention
7 May 2011     Music festival
11 May 2011    Sham Wow convention
12 May 2011    Car accident
13 May 2011    Plane crash
14 May 2011    Major fire
16 May 2011    Comic convention; Bomb threats
17 May 2011    Explosion at Smogtown; Truck accident


  • Create new variable, "Sick Related", where 1 indicates that the text is related to sickness and 0 that it is not
From the frequently occurring words and phrases analysed in step 1, words related to symptoms of the disease were picked out. These words form the dictionary used to extract potentially relevant microblog messages. After the first round of extraction, the text was analysed again and a second dictionary was established to weed out the following:
  1. False positives, where a word related to sickness is used as slang. For example, "this movie is sick!"
  2. Microblog messages that refer to sickness experienced by someone else, as these observations cannot be accurately matched to the message creation time and location.
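The two-dictionary approach described above can be sketched in Python as follows. The word lists and patterns are hypothetical placeholders, since the actual dictionaries are not listed in the text, and `sick_related` is an illustrative function name. A message is flagged 1 only if it matches a symptom term and matches none of the exclusion patterns.

```python
import re

# Hypothetical dictionaries; the report does not list the actual terms.
SYMPTOM_TERMS = ["sick", "fever", "flu", "cough", "nausea", "chills"]
EXCLUDE_PATTERNS = [
    r"\b(?:movie|song|show|game) (?:is|was) sick\b",          # slang usage
    r"\b(?:my|his|her) (?:mom|dad|friend|brother|sister)\b",  # someone else
]

# Match a symptom term as a whole word, allowing simple suffixes
# such as "coughing" or "fevers".
SYMPTOM_RE = re.compile(
    r"\b(?:" + "|".join(SYMPTOM_TERMS) + r")(?:s|ing|ed)?\b", re.IGNORECASE
)
EXCLUDE_RE = re.compile("|".join(EXCLUDE_PATTERNS), re.IGNORECASE)


def sick_related(text):
    """Return 1 if the text looks sickness related, otherwise 0."""
    if not SYMPTOM_RE.search(text):
        return 0  # first pass: no symptom word at all
    if EXCLUDE_RE.search(text):
        return 0  # second pass: slang or third-party mention
    return 1
```

With these placeholder lists, "this movie is sick!" is rejected by the slang pattern, while a first-person message such as "I have a fever and chills" is flagged. In practice both dictionaries would be refined iteratively against the data, as the text describes.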



  • Create new variable, "Sick Mentions", that counts the sick-related messages posted by each ID
Once the sick-related messages have been identified, we are interested in the IDs that experienced prolonged symptoms of sickness. After counting the sick-related messages by ID, the analysis focuses only on IDs with more than one such message.
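As a minimal sketch, the count per ID and the "more than one message" filter could be computed in Python as below. The input is assumed to be a sequence of (ID, Sick Related) pairs; the function names are hypothetical.

```python
from collections import Counter


def sick_mentions(records):
    """Sum the Sick Related flag (0 or 1) for each ID.

    records: iterable of (user_id, sick_related_flag) pairs.
    """
    counts = Counter()
    for user_id, flag in records:
        counts[user_id] += flag
    return counts


def repeat_reporters(counts, minimum=2):
    """Keep only IDs with at least `minimum` sick-related messages."""
    return {uid: n for uid, n in counts.items() if n >= minimum}
```

The default `minimum=2` corresponds to the "more than 1" threshold stated above.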



  • Create new variable, "Type of Sickness", according to the symptom described in texts that are sickness related
The symptoms are categorised as either "Fever", "Flu", "Breathing Difficulty", "Stomach Related", "Nausea", "Enlarged Lymph Nodes", "Coughing", "Fatigue", "Pains" or "General Sickness".
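One possible way to assign these categories is a keyword lookup, sketched below in Python. The category names are those listed above; the keywords attached to each category are hypothetical, and returning the first matching category is an assumption, since the text does not say how messages describing several symptoms were handled.

```python
# Category names come from the report; the keywords are illustrative only.
CATEGORY_KEYWORDS = [
    ("Fever", ["fever", "chills"]),
    ("Flu", ["flu"]),
    ("Breathing Difficulty", ["breath", "wheez"]),
    ("Stomach Related", ["stomach", "diarrhea"]),
    ("Nausea", ["nausea", "vomit"]),
    ("Enlarged Lymph Nodes", ["lymph"]),
    ("Coughing", ["cough"]),
    ("Fatigue", ["fatigue", "tired"]),
    ("Pains", ["pain", "ache"]),
]


def type_of_sickness(text):
    """Return the first category whose keyword appears in the text.

    Falls back to "General Sickness" when no specific symptom is found.
    Uses plain substring matching, so unrelated words that contain a
    keyword (for example "fluffy") would also match.
    """
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "General Sickness"
```

The function is meant to be applied only to messages already flagged as sick related, so the fallback category covers messages that mention sickness without naming a specific symptom.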


Map

Data Preparation

The map shows where the 13 city zones are located and, through it, we can find out where each of the microblog messages was created. All messages were plotted on the map and segmented into the 13 city zones. A new variable, "Location", was then created to record the zone of each message.
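The report does not describe how the segmentation was carried out. If the zone boundaries were available as polygons, one way to assign a zone to each message would be a point-in-polygon test, sketched below in Python using the ray casting method. The zone names and polygon coordinates in the usage lines are made up for illustration, and `assign_zone` is a hypothetical helper.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # crossing to the right of the point
    return inside


def assign_zone(lon, lat, zones):
    """Return the name of the first zone containing the point, else None."""
    for name, polygon in zones.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Made-up example zones (two adjacent squares), not the real city zones.
example_zones = {
    "Zone A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Zone B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
zone = assign_zone(1, 1, example_zones)
```

The returned zone name would populate the "Location" variable; points outside every polygon return None and would need separate handling.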