Difference between revisions of "ISSS608 2017-18 T1 Assign FOO CELONG RAYMOND/MakingSenseOfTheChatter"

From Visual Analytics and Applications

Revision as of 20:03, 14 October 2017

RaymHeader.png



Exploring and Organising the Data

First things first: I checked for dirty data. Sure enough, 21 microblogs had problems with the time values in their timestamps. I cleaned these up by setting the time to midnight (00:00).


RaymDirtyDates.png
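
The clean-up rule above can be sketched in Python. This is only a minimal illustration, not the tool used for this assignment: the timestamp layout ("M/D/YYYY H:MM") and the function name parse_created_at are assumptions.

```python
from datetime import datetime

# Assumed (hypothetical) timestamp layouts; adjust to the real data.
FULL_FORMAT = "%m/%d/%Y %H:%M"
DATE_ONLY_FORMAT = "%m/%d/%Y"


def parse_created_at(raw: str) -> datetime:
    """Parse a timestamp; if the time part is unusable, keep the date at 00:00."""
    raw = raw.strip()
    try:
        return datetime.strptime(raw, FULL_FORMAT)
    except ValueError:
        # Keep only the date portion; strptime then defaults the time to midnight.
        date_part = raw.split()[0]
        return datetime.strptime(date_part, DATE_ONLY_FORMAT)


print(parse_created_at("5/18/2011 13:26"))
print(parse_created_at("5/18/2011 ??:??"))
```

A row whose date portion is also broken would still raise ValueError here, which is probably what one wants: such rows need a manual look rather than a silent default.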


The number of microblogs is massive, so I looked at the distribution of posts over time.


RaymMicroblogPerDay.png


RaymMicroblogPerHour.png
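
The counts behind per-day and per-hour charts like the ones above can be computed with a simple tally. A minimal sketch, assuming the timestamps have already been parsed into datetime objects; the function names are illustrative only.

```python
from collections import Counter
from datetime import datetime


def posts_per_day(timestamps):
    """Count microblogs per calendar day."""
    return Counter(ts.date() for ts in timestamps)


def posts_per_hour(timestamps):
    """Count microblogs per hour, with each timestamp truncated to the hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


# Made-up sample timestamps, for illustration only.
sample = [
    datetime(2011, 5, 18, 13, 26),
    datetime(2011, 5, 18, 13, 59),
    datetime(2011, 5, 19, 0, 0),
]
for day, count in sorted(posts_per_day(sample).items()):
    print(day, count)
```

Note that the rows whose time was reset to midnight will all land in the 00:00 bucket of the hourly count, so a small spike there is an artefact of the cleaning step.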


The data needs to be organised so that it can be easily analysed later. The obvious choice is by city zone. I grouped the microblogs by zone by creating a column that stores the zone from which each microblog was transmitted.


RaymMicroblogByZone.png
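
One way to derive such a zone column is a lookup on the coordinates of each post. The sketch below assumes, purely for illustration, that each zone can be approximated by a latitude/longitude bounding box; the zone names and bounds are hypothetical, and real zone outlines would need a proper point-in-polygon test.

```python
from typing import NamedTuple, Optional


class Zone(NamedTuple):
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


# Hypothetical zones and bounds, not taken from the actual city map.
ZONES = [
    Zone("Zone A", 42.20, 42.25, -93.40, -93.30),
    Zone("Zone B", 42.25, 42.30, -93.40, -93.30),
]


def zone_of(lat: float, lon: float) -> Optional[str]:
    """Return the name of the first zone whose box contains the point, else None."""
    for zone in ZONES:
        in_lat = zone.min_lat <= lat < zone.max_lat
        in_lon = zone.min_lon <= lon < zone.max_lon
        if in_lat and in_lon:
            return zone.name
    return None


print(zone_of(42.22, -93.35))
```

Using half-open intervals means a post on the shared border of two boxes is assigned to exactly one zone.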


Next, I also created an indicator column for microblogs that were transmitted near the various points of interest.


RaymMicroblogByPlaceOfInterest.png
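
A "near a point of interest" indicator can be computed as a distance threshold. The following is a rough sketch under stated assumptions: the point names, their coordinates, and the 0.5 km radius are all hypothetical, and great-circle (haversine) distance is used as a stand-in for whatever notion of "near" was applied in the figure above.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical points of interest as (latitude, longitude).
POINTS_OF_INTEREST = {
    "Hospital X": (42.23, -93.35),
    "Stadium Y": (42.28, -93.32),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above 1 before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def near_poi_flags(lat, lon, radius_km=0.5):
    """Return one True/False indicator per point of interest."""
    return {
        name: haversine_km(lat, lon, poi_lat, poi_lon) <= radius_km
        for name, (poi_lat, poi_lon) in POINTS_OF_INTEREST.items()
    }


print(near_poi_flags(42.231, -93.35))
```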


That takes care of organising the geolocation information. The next step is to organise the texts by their content. I classified the microblogs by their similarity to each other, using latent class analysis parameterised to generate 10 clusters (classes).

  •  Classes 1 to 5 are common topics.
  •  Classes 6 and 7 contain messages about people suffering from the illness, but in many posts the illness is described from a third-person point of view, or it is unclear whether the author is talking about themselves. They might be useful for tracking the spread, but less so for tracing the source by tracking the whereabouts of the author.
  •  Class 8 contains microblogs in one or more foreign languages.
  •  Class 9 contains messages about conventions and election debates. These gatherings may be hotspots for the transmission of the disease.
  •  Class 10 is interesting. It contains conversations that use the phrase "lose my mind" in addition to describing symptoms. Most of the authors describe the symptoms in the first person, which makes this group suitable for tracing the source.
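
For readers unfamiliar with latent class analysis, the idea can be sketched as a mixture of independent yes/no term indicators fitted with EM. This is a toy illustration only, not the software or settings used for the classification above: the function name, the smoothing, the random initialisation, and the tiny term matrix are all assumptions.

```python
import math
import random


def fit_latent_classes(docs, n_classes, n_iter=50, seed=0):
    """Fit a Bernoulli mixture by EM and return one class label per document.

    docs: list of equal-length 0/1 lists (term absent / present).
    """
    rng = random.Random(seed)
    n_terms = len(docs[0])
    prior = [1.0 / n_classes] * n_classes
    # theta[k][j]: probability that term j appears in a document of class k.
    theta = [[rng.uniform(0.25, 0.75) for _ in range(n_terms)]
             for _ in range(n_classes)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each class for each document.
        resp = []
        for x in docs:
            log_p = []
            for k in range(n_classes):
                lp = math.log(prior[k])
                for j, present in enumerate(x):
                    lp += math.log(theta[k][j] if present else 1.0 - theta[k][j])
                log_p.append(lp)
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate parameters, smoothed so they stay strictly in (0, 1).
        for k in range(n_classes):
            n_k = sum(r[k] for r in resp)
            prior[k] = (n_k + 1.0) / (len(docs) + n_classes)
            for j in range(n_terms):
                hits = sum(r[k] * x[j] for r, x in zip(resp, docs))
                theta[k][j] = (hits + 1.0) / (n_k + 2.0)
    # Label each document with its most probable class from the last E-step.
    return [max(range(n_classes), key=lambda k: r[k]) for r in resp]


# Toy term-presence matrix: columns could stand for e.g. "fever", "cough", "debate", "vote".
toy_docs = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
print(fit_latent_classes(toy_docs, n_classes=2))
```

The class numbers that come out of such a fit are arbitrary, so the interpretation of each class (as in the list above) still has to be done by reading sample messages.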


RaymFormulaKeyword.png


Indicator columns are added to flag whether symptom keywords appear in a microblog's text. Because the authors are ordinary citizens and not doctors, they might use non-medical terms to describe their condition. A survey of the microblogs yields the following keywords and their classification.


 Classified as         Searched Terms
 flu                   flu, runny nose
 fever                 fever, temp, temperature
 chills                chill
 sweats                sweat
 aches                 ache, aching, sore, cramp
 pains                 pain, hurt
 fatigue               fatigue
 coughing              cough, pnenomia
 breathing difficulty  breath
 nausea                naus
 vomiting              vomit, throwing up, puk
 diarrhea              diarrhea
 enlarged lymph node   lymph
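
The keyword table above translates directly into a lookup of search terms per symptom. Below is a minimal Python sketch of such indicator columns, assuming a case-insensitive substring match; the function name symptom_indicators is illustrative, and the actual formula used may differ.

```python
# Search terms copied from the table above, spelled exactly as listed there.
SYMPTOM_TERMS = {
    "flu": ["flu", "runny nose"],
    "fever": ["fever", "temp", "temperature"],
    "chills": ["chill"],
    "sweats": ["sweat"],
    "aches": ["ache", "aching", "sore", "cramp"],
    "pains": ["pain", "hurt"],
    "fatigue": ["fatigue"],
    "coughing": ["cough", "pnenomia"],
    "breathing difficulty": ["breath"],
    "nausea": ["naus"],
    "vomiting": ["vomit", "throwing up", "puk"],
    "diarrhea": ["diarrhea"],
    "enlarged lymph node": ["lymph"],
}


def symptom_indicators(text):
    """Return one True/False flag per symptom, using substring matching."""
    lowered = text.lower()
    return {
        label: any(term in lowered for term in terms)
        for label, terms in SYMPTOM_TERMS.items()
    }


flags = symptom_indicators("Got a fever and can't stop coughing")
print([label for label, hit in flags.items() if hit])
```

Substring matching is deliberately loose, so stems like "naus" and "puk" catch several word forms. The trade-off is false positives: "temp" also matches "attempt", for example, so flagged messages are worth spot-checking.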


After exploring the visualisations, the following keywords are tracked as well.


 Classified as  Searched Terms
 Accident       acci
 Convention     conven
 Explosion      explo
 Truck          truck