ISSS608 2017-18 T1 Assign FOO CELONG RAYMOND/MakingSenseOfTheChatter
First things first: I checked for dirty data. Sure enough, 21 microblogs have malformed time values in the date field. I cleaned these up by setting the time to midnight (00:00).
The number of microblogs is massive, so I looked at the distribution of posts over time.
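The midnight fallback can be sketched roughly as below. This is only an illustration: the `M/D/YYYY H:MM` timestamp format and the function name `parse_timestamp` are assumptions, not details taken from the dataset.

```python
from datetime import datetime

# Assumed timestamp layout (hypothetical): "5/18/2011 13:26".
FULL_FORMAT = "%m/%d/%Y %H:%M"
DATE_ONLY_FORMAT = "%m/%d/%Y"

def parse_timestamp(raw: str) -> datetime:
    """Parse a timestamp; if the time part is unusable, keep the date at 00:00."""
    raw = raw.strip()
    try:
        return datetime.strptime(raw, FULL_FORMAT)
    except ValueError:
        # Fall back to the date alone, which yields midnight (00:00).
        date_part = raw.split()[0]
        return datetime.strptime(date_part, DATE_ONLY_FORMAT)
```

A record whose date part is also broken would still raise an error here, which seems preferable to silently inventing a date.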
The data needs to be organised so that it can be easily analysed later. The obvious choice is to organise it by city zone. I grouped the microblogs by zone, creating a column to store the zone from which each microblog was transmitted.
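One way to fill such a zone column is a point-in-polygon test against each zone boundary. The sketch below uses the standard ray-casting test; the zone names and polygon coordinates in the usage note are hypothetical placeholders, and the actual grouping may well have been done with a different tool.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order; the last vertex is
    joined back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_zone(x, y, zones):
    """Return the name of the first zone containing the point, or None."""
    for name, polygon in zones.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None
```

For example, with `zones = {"Zone A": [(0, 0), (2, 0), (2, 2), (0, 2)]}`, the call `assign_zone(1, 1, zones)` returns `"Zone A"`, while a point outside every polygon returns `None`.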
Next, I also created an indicator column for microblogs that were transmitted near the various points of interest.
That should organise the geolocation information. With that done, the next step is to organise the texts by their content. I classified the microblogs by their similarity to each other, using latent class analysis parameterised to generate 10 clusters.
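A proximity indicator of this kind could be computed as in the sketch below, which flags a microblog when its great-circle distance to a point of interest is within a chosen radius. The radius, the point-of-interest names and the coordinates are all assumptions for illustration; the write-up does not state how "near" was defined.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def near_poi_flags(lat, lon, pois, radius_km):
    """Return {poi_name: True/False} for whether the point is within radius_km."""
    return {
        name: haversine_km(lat, lon, poi_lat, poi_lon) <= radius_km
        for name, (poi_lat, poi_lon) in pois.items()
    }
```

Each key of the returned dictionary corresponds to one indicator column.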
- Classes 1 to 5 are common topics.
- Classes 6 and 7 contain messages about people suffering from the illness, but many posts are written from a third-person point of view, or it is not clear whether the author is talking about themselves. They might be useful for tracking the spread, but less so for tracing the source by tracking the whereabouts of the author.
- Class 8 contains microblogs in one or more foreign languages.
- Class 9 contains messages about conventions and election debates. These gatherings may be hotspots for the transmission of the disease.
- Class 10 is interesting. It contains conversations that use the phrase "lose my mind" in addition to describing symptoms. Most of the authors describe the symptoms in the first person, which makes this group suitable for tracing the source.
I added indicators for whether symptom keywords appear in each microblog's text. Because the authors are ordinary citizens and not doctors, they might use non-medical terms to describe their condition. A survey of the microblogs yields the following keywords and their classification.
Classified as | Searched Terms |
---|---|
flu | flu, runny nose |
fever | fever, temp, temperature |
chills | chill |
sweats | sweat |
aches | ache, aching, sore, cramp |
pains | pain, hurt |
fatigue | fatigue |
coughing | cough, pnenomia |
breathing difficulty | breath |
nausea | naus |
vomiting | vomit, throwing up, puk |
diarrhea | diarrhea |
enlarged lymph node | lymph |
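The keyword indicators can be sketched as a case-insensitive substring search. Only a few rows of the table above are reproduced here, and the names `SYMPTOM_TERMS` and `flag_symptoms` are hypothetical.

```python
# A subset of the classification table: label -> searched terms.
SYMPTOM_TERMS = {
    "flu": ["flu", "runny nose"],
    "fever": ["fever", "temp", "temperature"],
    "chills": ["chill"],
    "aches": ["ache", "aching", "sore", "cramp"],
    "vomiting": ["vomit", "throwing up", "puk"],
}

def flag_symptoms(text, terms=SYMPTOM_TERMS):
    """Return the set of labels whose searched terms occur in the text."""
    lowered = text.lower()
    return {
        label
        for label, keywords in terms.items()
        if any(keyword in lowered for keyword in keywords)
    }
```

Short stems such as "temp" also match unrelated words like "attempt", so flags produced this way are best treated as candidates for review, not confirmed symptoms.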
After exploring the visualisations, I tracked the following keywords as well.
Classified as | Searched Terms |
---|---|
Accident | acci |
Convention | conven |
Explosion | explo |
Truck | truck |
The wind data provided is in polar form: the average speed is the magnitude of the wind vector, and the direction the wind is blowing from is its angle. I converted the polar coordinates to Cartesian coordinates and corrected the angle to reverse the direction, so that it indicates where the wind is blowing to instead of where it originates. This information will be used to generate a "wind compass" for visualisation in Tableau.
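The conversion can be sketched as follows. It assumes the direction is given in degrees measured clockwise from north (the usual meteorological convention, which the write-up does not confirm), with x pointing east and y pointing north. The function name is hypothetical.

```python
import math

def wind_to_cartesian(speed, from_deg):
    """Convert wind speed and 'blowing from' direction into an (x, y) vector
    that points where the wind is blowing to.

    Assumes from_deg is in degrees clockwise from north; x is east, y is north.
    """
    # Reverse the direction: 'from' becomes 'to'.
    to_deg = (from_deg + 180.0) % 360.0
    theta = math.radians(to_deg)
    # A clockwise-from-north angle puts sine on the east axis and cosine on the north axis.
    x = speed * math.sin(theta)
    y = speed * math.cos(theta)
    return x, y
```

Under these assumptions, a wind of speed 10 coming from the north (0 degrees) gives a vector of approximately (0, -10), pointing south.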