ISSS608 2017-18 T1 Assign FOO CELONG RAYMOND/MakingSenseOfTheChatter

From Visual Analytics and Applications
Revision as of 00:24, 15 October 2017 by Raymondfoo.2016 (talk | contribs)
RaymHeader.png



Exploring and Organising the Data

First, I checked for dirty data. Sure enough, 21 microblogs have malformed time values in their date field. I cleaned these up by setting the time to midnight (00:00).


RaymDirtyDates.png
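
The clean-up rule above (keep the date, reset an unusable time to midnight) can be sketched as follows. This is a minimal illustration rather than the tool actually used; the two format strings and the function name `parse_timestamp` are assumptions, since the exact layout of the raw timestamps is not stated here.

```python
from datetime import datetime

# Assumed layouts of the raw values; adjust to match the real file.
DATETIME_FORMAT = "%m/%d/%Y %H:%M"
DATE_FORMAT = "%m/%d/%Y"


def parse_timestamp(raw):
    """Parse a timestamp, falling back to midnight if the time part is unusable."""
    raw = raw.strip()
    try:
        return datetime.strptime(raw, DATETIME_FORMAT)
    except ValueError:
        # Keep only the date portion; strptime then leaves the time at 00:00.
        date_part = raw.split()[0]
        return datetime.strptime(date_part, DATE_FORMAT)


print(parse_timestamp("5/18/2011 9:15"))
print(parse_timestamp("5/18/2011 ??:??"))
```

Counting how often the fallback branch is taken gives the number of dirty records.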


The number of microblogs is massive, so I looked at the distribution of posts over time, by day and by hour.


RaymMicroblogPerDay.png


RaymMicroblogPerHour.png
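
The per-day and per-hour distributions can be tallied with a simple counter. The sketch below assumes the timestamps are already parsed into `datetime` objects and that "per hour" means hour of the day; the function names and sample values are illustrative only.

```python
from collections import Counter
from datetime import datetime


def posts_per_day(timestamps):
    """Count posts for each calendar date."""
    return Counter(ts.date() for ts in timestamps)


def posts_per_hour(timestamps):
    """Count posts for each hour of the day (0 to 23)."""
    return Counter(ts.hour for ts in timestamps)


# Made-up sample timestamps, not taken from the data set.
sample = [
    datetime(2011, 5, 18, 9, 15),
    datetime(2011, 5, 18, 9, 40),
    datetime(2011, 5, 19, 22, 5),
]
print(posts_per_day(sample))
print(posts_per_hour(sample))
```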


The data needs to be organised so that it can be easily analysed later. The obvious choice is by city zone. I grouped the microblogs by zone, creating a column that stores the zone from which each microblog was transmitted.


RaymMicroblogByZone.png
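
One way to derive such a zone column is a point-in-polygon test on each microblog's coordinates. The sketch below uses the standard ray-casting test. The zone names and outlines in `ZONES` are hypothetical placeholders rather than the real city map, and the write-up does not say which method was actually used.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_zone(x, y, zones, default="Unknown"):
    """Return the name of the first zone containing (x, y)."""
    for name, polygon in zones.items():
        if point_in_polygon(x, y, polygon):
            return name
    return default


# Hypothetical zone outlines for illustration only.
ZONES = {
    "Downtown": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "Uptown": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
print(assign_zone(1, 1, ZONES))
```

Points that fall exactly on a shared boundary are assigned to whichever zone the test happens to accept first, which is usually acceptable at this level of aggregation.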


Next, I created an indicator column for microblogs that were transmitted near the various points of interest.


RaymMicroblogByPlaceOfInterest.png
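
A proximity indicator of this kind can be sketched as a distance threshold around each point of interest. The point names, coordinates and radius below are invented for illustration, and planar coordinates are assumed; real latitude and longitude values may call for a great-circle distance instead.

```python
import math

# Hypothetical points of interest (x, y) and radius; not from the data set.
POINTS_OF_INTEREST = {"Hospital": (2.0, 3.0), "Stadium": (7.0, 1.0)}
NEAR_RADIUS = 0.5


def near_flags(x, y, pois=POINTS_OF_INTEREST, radius=NEAR_RADIUS):
    """Return {name: 1 or 0} flagging whether (x, y) is within radius of each point."""
    return {
        name: int(math.hypot(x - px, y - py) <= radius)
        for name, (px, py) in pois.items()
    }


print(near_flags(2.1, 3.2))
```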


That organises the geolocation information. The next step is to organise the texts by their content. I classified the microblogs by their similarity to each other using latent class analysis, parameterised to generate 10 clusters.

  •  Clusters 1 to 5 are common topics.
  •  Clusters 6 and 7 contain messages about people suffering from the illness, but many posts are written from a third-person point of view, or it is unclear whether the author is talking about themselves. They might be useful for tracking the spread, but less so for tracing the source by tracking the whereabouts of the author.
  •  Cluster 8 contains microblogs in one or more foreign languages.
  •  Cluster 9 contains messages about conventions and election debates. These gatherings may be hotspots for transmission of the disease.
  •  Cluster 10 is interesting. It contains conversations that use the phrase "lose my mind" in addition to describing symptoms. Most of the authors describe the symptoms in the first person, which makes this group suitable for tracing the source.


RaymFormulaKeyword.png


Indicator columns are added to flag whether symptom keywords appear in a microblog's text. Because the authors are ordinary citizens and not doctors, they may use non-medical terms to describe their condition. A survey of the microblogs yields the following keywords and their classification.


Classified as          Searched Terms
flu                    flu, runny nose
fever                  fever, temp, temperature
chills                 chill
sweats                 sweat
aches                  ache, aching, sore, cramp
pains                  pain, hurt
fatigue                fatigue
coughing               cough, pneumonia
breathing difficulty   breath
nausea                 naus
vomiting               vomit, throwing up, puk
diarrhea               diarrhea
enlarged lymph node    lymph
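
The keyword indicators can be sketched as a case-insensitive substring search. That matching style is an assumption, suggested by stems such as "naus" and "puk". Only a few rows of the table are reproduced, and the names `SYMPTOM_TERMS` and `symptom_flags` are illustrative.

```python
# A subset of the classification table: label -> searched terms.
SYMPTOM_TERMS = {
    "flu": ["flu", "runny nose"],
    "fever": ["fever", "temp", "temperature"],
    "aches": ["ache", "aching", "sore", "cramp"],
    "vomiting": ["vomit", "throwing up", "puk"],
}


def symptom_flags(text, terms=SYMPTOM_TERMS):
    """Return {label: 1 or 0} flagging whether any searched term occurs in text."""
    lowered = text.lower()
    return {
        label: int(any(term in lowered for term in needles))
        for label, needles in terms.items()
    }


print(symptom_flags("Got a FEVER and keep throwing up"))
```

Plain substring matching is deliberately loose, so a stem like "temp" will also match unrelated words such as "attempt". The resulting flags are best treated as candidates for review rather than confirmed symptoms.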


After exploring the visualisations, the following keywords are tracked as well.


Classified as          Searched Terms
Accident               acci
Convention             conven
Explosion              explo
Truck                  truck


Converting the wind data

The wind data provided is in polar form: the average speed is the magnitude of the wind vector, and the direction the wind is blowing from is its angle. I converted the polar coordinates to Cartesian coordinates and reversed the angle so that it indicates where the wind is blowing to rather than where it originates. This information is used to generate a "wind compass" for visualisation.
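
The conversion can be sketched as below. It assumes the direction is a compass bearing in degrees, measured clockwise from north, which is the usual meteorological convention but is not confirmed by the data description here. The function name `wind_to_vector` is illustrative.

```python
import math


def wind_to_vector(speed, direction_from_deg):
    """Convert wind speed and 'blowing from' bearing to an (east, north) vector.

    The returned vector points to where the wind is blowing.
    Assumes bearings in degrees clockwise from north.
    """
    # Reverse the direction: 'from' bearing -> 'to' bearing.
    direction_to_deg = (direction_from_deg + 180.0) % 360.0
    theta = math.radians(direction_to_deg)
    # With compass bearings, east uses sine and north uses cosine.
    east = speed * math.sin(theta)
    north = speed * math.cos(theta)
    return east, north


# A wind of speed 10 blowing from the west points roughly due east.
print(wind_to_vector(10.0, 270.0))
```

If the source angle is instead a mathematical angle measured counter-clockwise from the x-axis, the sine and cosine swap roles, but the 180 degree reversal stays the same.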