Difference between revisions of "ISSS608 2017-18 T1 Assign NURUL ASYIKEEN BINTE AZHAR Data Preparation"

Latest revision as of 21:50, 15 October 2017

Investigation of Smartpolis Epidemic Outbreak


The Dataset

Microblog Messages

There are 1,027,599 messages in the dataset from 73,928 unique IDs. This dataset will be primarily used to assess the epidemic situation at Smartpolis. It consists of the following fields:

Field Name	Field Description
ID	Personal identifier of the individual posting the message
Created_at	Date and time of the post
Location	Latitude and longitude coordinates of the mobile device at the time of post
Text	The posted message, mostly in English. Some messages are in other languages such as Spanish and Bahasa Indonesia.

Data Cleansing

The data was first explored to check for input errors. Errors were found in the "Created_at" field: in 21 rows, the time of message creation had been replaced by the strings "Fire-", "up th", "One", "Radio", "sc", ":*.", ".", "<3" and "...". These strings were replaced with a time of 12:00 pm after confirming that the affected messages were not related to any sickness, so the substitution would not adversely affect the subsequent analysis.
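The substitution described above can be sketched as follows. This is only an illustration: the timestamp layout (`TIMESTAMP_FORMAT`) and the function name `clean_created_at` are assumptions, since the page does not state the exact format of "Created_at" or the tool used for cleansing.

```python
from datetime import datetime

# Assumed layout "<month>/<day>/<year> <hour>:<minute>"; the real export may differ.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"
DEFAULT_TIME = "12:00"  # 12:00 pm, the placeholder time chosen in the text


def clean_created_at(raw):
    """Parse a timestamp, substituting 12:00 pm when the time part is invalid."""
    raw = raw.strip()
    try:
        return datetime.strptime(raw, TIMESTAMP_FORMAT)
    except ValueError:
        # Keep the date (first token) and discard the corrupted time part.
        date_part = raw.split(" ", 1)[0]
        return datetime.strptime(f"{date_part} {DEFAULT_TIME}", TIMESTAMP_FORMAT)
```

Rows whose date part is also unparseable would still raise an error, which is arguably preferable to silently inventing a date.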

Data Preparation

Location

The location field was split into two columns, "Latitude" and "Longitude". These new fields are pivotal in mapping the messages and locating ground zero of the epidemic.
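A minimal sketch of this split is shown below. It assumes the two coordinates are stored in one string separated by whitespace or a comma, with latitude first; the page does not confirm the delimiter, and `split_location` is a hypothetical helper name.

```python
def split_location(location):
    """Split a 'latitude longitude' string into a pair of floats."""
    # Assumption: coordinates are separated by whitespace and/or a comma.
    parts = location.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError(f"expected two coordinates, got: {location!r}")
    latitude, longitude = (float(p) for p in parts)
    return latitude, longitude
```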

Text

The text messages provide information on what the citizens of Smartpolis were experiencing or witnessing, and were used to form a daily word cloud, further elaborated in ISSS608_2017-18_T1_Assign_NURUL ASYIKEEN BINTE AZHAR_Ground Zero. The information derived from the texts is used for the following:

Understand major events that occurred in the 3 weeks based on frequently occurring phrases

The following events are inferred to have occurred in Smartpolis:

Date	Event
2 May 2011	Car accident
4 May 2011	Antique convention
7 May 2011	Music festival
11 May 2011	Sham Wow convention
12 May 2011	Car accident
13 May 2011	Plane crash
14 May 2011	Major fire
16 May 2011	Comic convention; bomb threats
17 May 2011	Explosion at Smogtown; truck accident
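Events like those in the table can be surfaced by counting the most frequent words per day, which is also the input a daily word cloud needs. The sketch below is one way to do this; the stop-word list and the `(date, text)` input shape are assumptions, not the author's actual procedure.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "at", "on", "in", "to", "of", "and", "i"}


def daily_word_counts(messages):
    """Map each date to a Counter of words; `messages` yields (date, text) pairs."""
    counts = defaultdict(Counter)
    for date, text in messages:
        words = re.findall(r"[a-z']+", text.lower())
        counts[date].update(w for w in words if w not in STOP_WORDS)
    return counts
```

Calling `counts[date].most_common(20)` would then list the candidate words for that day's word cloud.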

Create new variable, "Sick Related", where 1 indicates that the text is related to sickness and 0 that it is not

From the frequently occurring words and phrases analysed in step 1, words related to symptoms of the disease were picked out to form a dictionary for extracting potentially relevant microblog messages. After the first round of extraction, the text was analysed again and a second dictionary was established to weed out the following:

False positives, where words related to sickness are used in the text as slang. For example, "this movie is sick!"
Microblog messages that refer to sickness experienced by someone else, as these observations cannot be accurately matched to the message creation time and location.
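The two-pass filtering described above can be sketched as follows. The word lists and patterns here are hypothetical stand-ins, since the author's actual dictionaries are not given on this page; only the overall include-then-exclude logic follows the text.

```python
import re

# Hypothetical dictionaries; the actual word lists are not given in the text.
SYMPTOM_TERMS = ["sick", "fever", "flu", "cough", "nausea", "chills"]
SLANG_PATTERNS = [
    r"\b(?:movie|song|game|party)\s+(?:is|was)\s+sick\b",
    r"\bsick\s+(?:beat|trick|ride)\b",
]
THIRD_PARTY_PATTERNS = [
    r"\b(?:my|his|her)\s+(?:mom|dad|friend|brother|sister|son|daughter|wife|husband)\b",
    r"\b(?:he|she|they)\s+(?:is|are|has|have|got)\b",
]


def sick_related(text):
    """Return 1 if the message reports the poster's own sickness, else 0."""
    lowered = text.lower()
    # First pass: keep messages containing a word that starts with a symptom term.
    if not any(re.search(rf"\b{re.escape(t)}", lowered) for t in SYMPTOM_TERMS):
        return 0
    # Second pass: drop slang uses and reports about someone else.
    for pattern in SLANG_PATTERNS + THIRD_PARTY_PATTERNS:
        if re.search(pattern, lowered):
            return 0
    return 1
```

Prefix matching lets "cough" also catch "coughing", at the cost of occasional false hits, which is why a second exclusion pass is useful.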

Create new variable, "Sick Mentions", that counts the sickness-related messages posted by each ID

Once the sick-related messages have been identified, we are interested in the IDs that experienced prolonged symptoms of sickness. After summing the number of sick-related messages by ID, the analysis focuses only on IDs with more than one sick-related message.
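As a rough illustration, the per-ID count and the "more than one" filter could look like this. The function names and the `(id, sick_related)` row shape are assumptions made for the sketch.

```python
from collections import Counter


def sick_mentions(rows):
    """Sum the 'Sick Related' flag per ID; `rows` yields (id, flag) pairs."""
    counts = Counter()
    for poster_id, flag in rows:
        counts[poster_id] += flag
    return counts


def repeat_reporters(counts):
    """Return the IDs with more than one sick-related message."""
    return {poster_id for poster_id, n in counts.items() if n > 1}
```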

Create new variable, "Type of Sickness", according to the symptom described in texts that are sickness related

The symptoms are categorised as either "Fever", "Flu", "Breathing Difficulty", "Stomach Related", "Nausea", "Enlarged Lymph Nodes", "Coughing", "Fatigue", "Pains" or "General Sickness".
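One possible way to assign these categories is a keyword lookup, sketched below. The category names come from the text above, but the keyword lists, the first-match-wins rule, and the single category per message are all assumptions.

```python
# Category names come from the text; the keyword lists are illustrative guesses.
CATEGORY_KEYWORDS = [
    ("Fever", ["fever", "chills"]),
    ("Flu", ["flu"]),
    ("Breathing Difficulty", ["breath", "wheez"]),
    ("Stomach Related", ["stomach", "diarrhea"]),
    ("Nausea", ["nausea", "vomit"]),
    ("Enlarged Lymph Nodes", ["lymph"]),
    ("Coughing", ["cough"]),
    ("Fatigue", ["fatigue", "tired"]),
    ("Pains", ["ache", "pain"]),
]


def type_of_sickness(text):
    """Return the first category whose keyword appears in a sick-related message."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return category
    # No specific symptom was recognised.
    return "General Sickness"
```

The function is meant to be applied only to messages already flagged as sick related, so substring matching is less risky than it would be on the full dataset.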

Map

Data Preparation

The map shows where the 13 city zones are located, and through it we can find out where each microblog message was created. All messages were plotted on the map and segmented into the 13 city zones. A new variable, "Location", was then formed from this segmentation.
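The page does not say how the segmentation was carried out. If the zone boundaries were available as polygons, a point-in-polygon test is one way to do it; the sketch below uses the standard ray-casting method, and the `zones` mapping of zone names to vertex lists is a hypothetical input.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_zone(longitude, latitude, zones):
    """Return the name of the first zone containing the point, or None."""
    for name, polygon in zones.items():
        if point_in_polygon(longitude, latitude, polygon):
            return name
    return None
```

Points lying exactly on a shared boundary may be assigned to either neighbouring zone, so boundary cases would need a tie-breaking rule in practice.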

