Difference between revisions of "ISSS608 2017-18 T1 Assign FOO CELONG RAYMOND/MakingSenseOfTheChatter"

From Visual Analytics and Applications

Revision as of 21:54, 13 October 2017

RaymHeader.png



Exploring and Organising the Data

First things first: I checked for dirty data. Sure enough, 21 microblogs have problems with the time values in their dates. I cleaned these up by setting the time to midnight (00:00).


RaymDirtyDates.png
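
The clean-up rule above can be sketched in Python as follows. This is only an illustration, not the tool I actually used: the timestamp layout (`TIMESTAMP_FORMAT`), the function name `clean_timestamp` and the sample values in the usage note are all assumptions.

```python
from datetime import datetime

# Assumed layouts; the real timestamp column may be formatted differently.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"
DATE_FORMAT = "%m/%d/%Y"


def clean_timestamp(raw: str) -> datetime:
    """Parse a timestamp, falling back to midnight when the time part is unusable."""
    text = raw.strip()
    try:
        return datetime.strptime(text, TIMESTAMP_FORMAT)
    except ValueError:
        # Keep the date and drop the broken time component.
        # A date-only parse leaves the time at 00:00.
        date_part = text.split()[0]
        return datetime.strptime(date_part, DATE_FORMAT)
```

For example, a made-up value such as `"5/18/2011 25:99"` would come back as 18 May 2011 at 00:00, while a well-formed value keeps its original time. A value with a broken date part would still raise an error, which is deliberate: those need manual inspection.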


The number of microblogs is massive, so I looked at the distribution of posts over time.


RaymMicroblogPerDay.png


RaymMicroblogPerHour.png
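
A minimal sketch of how the per-day and per-hour counts behind these two charts could be computed, assuming the timestamps have already been parsed into `datetime` objects. The function names are hypothetical, and bucketing by calendar hour (rather than by hour of day) is my assumption about what the hourly chart shows.

```python
from collections import Counter
from datetime import datetime


def posts_per_day(timestamps):
    """Count microblogs per calendar day."""
    return Counter(ts.date() for ts in timestamps)


def posts_per_hour(timestamps):
    """Count microblogs per calendar hour (timestamp truncated to the hour)."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
```

Sorting the resulting counters by key gives the series to plot.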


The data needs to be organised in some manner so that it can be easily analysed later. The obvious choice is by city zone. I grouped the microblogs by zone by creating a column that stores the zone from which each microblog was transmitted.


RaymMicroblogByZone.png
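
The zone column could be derived along the following lines. This is a simplified sketch: it assumes each zone can be approximated by a latitude/longitude bounding box, which is unlikely to match the real zone outlines exactly, and the `Zone` type and `assign_zone` function are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Zone(NamedTuple):
    """A city zone approximated by a bounding box (an assumption)."""
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


def assign_zone(lat: float, lon: float, zones: Sequence[Zone]) -> Optional[str]:
    """Return the name of the first zone containing the point, or None."""
    for zone in zones:
        inside_lat = zone.min_lat <= lat <= zone.max_lat
        inside_lon = zone.min_lon <= lon <= zone.max_lon
        if inside_lat and inside_lon:
            return zone.name
    return None
```

Where boxes overlap, the first match wins, so the order of `zones` matters. Microblogs that fall outside every box get `None` and can be reviewed separately.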


Next, I also created an indicator column for microblogs that were transmitted near the various points of interest.


RaymMicroblogByPlaceOfInterest.png
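
One way to build such an indicator is to flag a microblog when it lies within a fixed radius of any point of interest. The sketch below uses the haversine great-circle distance; the radius and the names `haversine_km` and `is_near_poi` are assumptions, not the values or formula I actually used.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_near_poi(lat, lon, pois, radius_km=0.5):
    """Return 1 if the point is within radius_km of any (lat, lon) in pois, else 0."""
    near = any(haversine_km(lat, lon, p_lat, p_lon) <= radius_km for p_lat, p_lon in pois)
    return int(near)
```

The default radius of 0.5 km is arbitrary; a sensible value depends on the size of each point of interest.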


That should organise the geolocation information. The next step is to organise the texts by their content. I classified the microblogs by their similarity to each other using latent class analysis, parameterised to generate 10 clusters.

  •  Classes 1 to 5 are common topics.
  •  Classes 6 and 7 contain messages about people suffering from the illness, but many posts are written from a third-person point of view, or it is unclear whether the author is talking about himself. They might be useful for tracking the spread, but less so for tracing the source by tracking the whereabouts of the author.
  •  Class 8 contains microblogs in one or more foreign languages.
  •  Class 9 contains messages about conventions and election debates. These gatherings may be hotspots for the transmission of the disease.
  •  Class 10 is interesting. It contains conversations that use the phrase "lose my mind" in addition to describing symptoms. Most of the authors describe the symptoms in the first person, which makes this group suitable for tracing the source.
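
For readers unfamiliar with latent class analysis, the idea can be sketched as an EM fit of a mixture model over binary term-presence indicators. This is a toy illustration and not the software I used: representing each microblog as a row of 0/1 flags, the random initialisation, the smoothing constants and the function name `fit_latent_classes` are all assumptions.

```python
import math
import random


def fit_latent_classes(rows, n_classes, n_iter=50, seed=0):
    """Fit a latent class model to 0/1 rows with EM (n_iter must be >= 1).

    Returns (labels, class_weights, item_probs), where item_probs[k][j]
    estimates P(term j present | class k).
    """
    rng = random.Random(seed)
    n_items = len(rows[0])
    weights = [1.0 / n_classes] * n_classes
    # Random starting probabilities, kept away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]

    for _ in range(n_iter):
        # E-step: posterior probability of each class for each row.
        resp = []
        for row in rows:
            logs = []
            for k in range(n_classes):
                ll = math.log(weights[k])
                for x, p in zip(row, probs[k]):
                    ll += math.log(p if x else 1.0 - p)
                logs.append(ll)
            top = max(logs)
            scaled = [math.exp(v - top) for v in logs]
            total = sum(scaled)
            resp.append([s / total for s in scaled])

        # M-step: re-estimate class sizes and term probabilities,
        # with light smoothing so no probability reaches 0 or 1.
        for k in range(n_classes):
            mass = sum(r[k] for r in resp)
            weights[k] = (mass + 1e-6) / (len(rows) + n_classes * 1e-6)
            for j in range(n_items):
                hits = sum(r[k] * row[j] for r, row in zip(resp, rows))
                probs[k][j] = (hits + 0.01) / (mass + 0.02)

    labels = [max(range(n_classes), key=lambda k: r[k]) for r in resp]
    return labels, weights, probs
```

Calling it with `n_classes=10` mirrors the 10-cluster setting above. Each microblog is assigned to its most probable class, and the class numbering depends on the random start, so the class numbers from this sketch would not line up with the ones in the list.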


RaymFormulaKeyword.png


Indicators are added to flag whether symptom keywords are used in the microblogs' text. Because the authors are ordinary citizens and not doctors, they might use non-medical terms to describe their condition. A survey of the microblogs yields the following keywords and their classification.