Difference between revisions of "Group11 proposal"

From ISSS608-Visual Analytics and Applications
 
(147 intermediate revisions by 3 users not shown)
Line 1: Line 1:
<center>
+
[[File:G11 TitleBanner2.png|1700px|frameless|SGSAS]]
[[File:Group11banner.PNG.png|800px|frameless|center|Group 11: Google Analytics - Power Up!]]
 
</center>
 
 
<div>
 
<div>
{|style="background-color:#667181;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |
+
{|style="background-color:#607080;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#26a6d1; text-align:center;" width="25%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#20a0d0; text-align:center;" width="20%" |  
 
;
 
;
 
[[Group11_proposal| <font color="#FFFFFF">Proposal</font>]]
 
[[Group11_proposal| <font color="#FFFFFF">Proposal</font>]]
  
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#667181; text-align:center;" width="25%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="20%" |  
 
;
 
;
[[Group15_Poster| <font color="#FFFFFF">Poster</font>]]
+
[[Group11_poster| <font color="#FFFFFF">Poster</font>]]
  
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#667181; text-align:center;" width="25%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="20%" |  
 
;
 
;
[[Group15_Application| <font color="#FFFFFF">Application</font>]]
+
[[Group11_application| <font color="#FFFFFF">Application</font>]]
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#667181; text-align:center;" width="25%" |  
+
 
 +
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="20%" |  
 
;
 
;
[[Group15_Research Paper| <font color="#FFFFFF">Research Paper</font>]]
+
[[Group11_user_guide| <font color="#FFFFFF">Application User Guide</font>]]
  
| &nbsp;
+
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="20%" |
 +
;
 +
[[Group11_research_paper| <font color="#FFFFFF">Research Paper</font>]]
 
|}
 
|}
<br/>
 
 
 
</div>
 
</div>
 
<br>
 
<br>
== Background & Motivation ==
 
Twitter, an online social media platform with over 300 million monthly active users, serves as an important customer support platform for businesses. Over 27% of customers already take to Twitter to share product reviews, air their frustrations and connect with brands (ConverSocial, 2016). 
 
Research shows that when a customer tweets at a business and receives a response, they are willing to spend 3–20% more on an average-priced item from that business in the future (Alton, 2017). Thus, analysing these tweets helps to identify best practices and common issues, which is valuable for improving customer experience.
 
However, such unstructured data is difficult to analyse. Organisations often turn to commercial off-the-shelf tools to analyse text data and extract important information. Nevertheless, these tools are costly and cannot be customised to the needs of each organisation.
 
  
== Project Objective ==
+
== Background ==
Using airline tweets as a case study, this project aims to use the available text mining packages in R to build an interface that lets users perform text analytics (namely topic modelling and sentiment analysis) without needing to code. Users can also visualise the results interactively to uncover insights in the airline tweets.
+
[[File:G11 MapUK.png|1000px|frameless]] <br />
 +
Grocery data from in-store purchases at 411 Tesco shops in the Greater London area are used in this R Shiny application. In this project, we focus on the nutrient information in this dataset at four spatial granularities: Lower Layer Super Output Areas (LSOA), Middle Layer Super Output Areas (MSOA), wards and Local Authority Districts (LAD).
 +
 
 +
The analysis is organised into four sections:
 +
# Exploratory Data Analysis (EDA)
 +
# Exploratory Spatial Data Analysis (ESDA)
 +
# Clustering (Hierarchical, GeoSpatial, Skater Clustering)
 +
# Geographically weighted regression (GWR)
 +
 
 +
 
 +
== Motivation ==
 +
The recent release of this dataset gives us an opportunity to work with current information. The dataset also combines geospatial data with aspatial information, which allows us to apply geospatial regression and geospatial clustering techniques to understand nutrition and obesity at different geographic granularities.
 +
 
 +
Despite the importance of studying food consumption at scale, there is little data about what people actually eat over long periods of time. Our analysis will link the food consumption data for areas in Greater London through both aspatial and geospatial methods. We will attempt to analyze the eating habits of Londoners through a lens free of the bias and personalization prevalent in current web data from social media and geo-referenced media.
 +
 
 +
== Project Objectives ==
 +
The project aims to deliver an R-Shiny app that provides:
 +
# Interactive user interface design
 +
# Nutritional information interfaced with a visual map representation
 +
# Clustering techniques through both aspatial and geospatial methods
 +
# Geographically weighted regression (GWR) of nutritional data and obesity
 +
 
 +
 
 +
== Proposed Scope and Methodology ==
 +
# Analysis of Tesco Grocery dataset with background research
 +
# Exploratory Data Analysis (EDA) methods in R
 +
# Exploratory Spatial Data Analysis (ESDA) methods in R
 +
# Clustering methods for aspatial and geospatial information in R
 +
# Analysis of geographically weighted regression (GWR) in R
 +
# R Markdown development for functionality checks
 +
# R-Shiny app development for user interactivity
  
==Data Source==
 
We have chosen the Customer Support on Twitter dataset from Kaggle. (https://www.kaggle.com/thoughtvector/customer-support-on-twitter)
 
This dataset comprises over 2.8 million tweets from 108 companies and over 700,000 unique users, from May 2008 to December 2017. As we are focusing on the airline industry as a case study, we will filter for tweets that are related to airlines. <br>
 
[[File:Data pre-processing.png|600px]]<br>
 
The chart above shows the distribution of outbound tweets by author_id. Some of the airlines we will be looking at are circled in red.
 
  
The dataset contains both inbound and outbound tweets. However, we will only be using inbound tweets for our analysis, as companies would be more interested in customers' feedback and overall sentiment.
+
A generalized development timeframe for this project is shown below. <br />
  
==Methodology==
+
[[File:Gen Gantt2.png|1000px|frameless]]
For text analysis on the airline tweets, we would first perform text pre-processing, namely tokenisation, stop word removal and stemming. The processed tweets will then be analysed via two key techniques: topic modelling and sentiment analysis.
 
  
Topic modelling finds topics as sets of words that frequently co-occur. In this project, we would use Latent Dirichlet Allocation (LDA) to derive the topics for the tweets.
+
== Storyboard & Visualization Features ==
 +
There will be five sections in the final app. Data exploration will be done in the first two sections using scatterplots, correlation plots, and Local Indicators of Spatial Association (LISA).
 +
The next two sections will cover the clustering methods and geographically weighted regression.
 +
The last section will show the four transformed final data tables used in the application. <br />
 +
<p>Exploratory Data Analysis <br />[[File:G11 Stb A.jpg|800px|frameless|EDA]]
 +
<p>Exploratory Spatial Data Analysis <br />[[File:G11 Stb B.jpg|800px|frameless|ESDA]]
 +
<p>Clustering <br />[[File:G11 Stb C.jpg|800px|frameless|Clustering]]
 +
<p>Geographically weighted regression <br />[[File:G11 Stb D.jpg|800px|frameless|GWR]]
 +
<p>Data Table <br />[[File:Stb E.jpg|800px|frameless|Data Table]]
  
For sentiment analysis, we would focus on sentiment polarity classification, which lets airlines find out whether customers have a positive or negative sentiment about the company in general or about its services. The sentiment polarity of customers’ tweets would be determined either via the lexicon approach, using existing pre-compiled lexicons in R, or via the classification approach, where we will train a binary classifier to predict the sentiment polarity of a new tweet. Available sentiment analysis packages in R such as “sentimentr” would be explored.
 
  
==Visualisation Features==
+
== Software Tools ==
{| class="wikitable"
+
* RStudio: https://rstudio.com/
|- style="vertical-align: top;"
+
 
! Sketch !! Description
+
== R Packages ==
|-style="vertical-align: top;"
+
* shiny: https://shiny.rstudio.com
| <p>[[File:First Page.jpg|400px|center]]</p> || This is the first tab (i.e. the intro tab) of our R Shiny app that will be shown to users. This tab provides a short description of the app to give users a brief idea of the kinds of analysis it supports.
+
* shinythemes: https://cran.r-project.org/web/packages/shinythemes
|-style="vertical-align: top;"
+
* shinyWidgets: https://cran.r-project.org/web/packages/shinyWidgets
| <p>[[File:Second Page.png|200px|center]]</p> || This is where the user will upload data to our app. The data requirements will be specified on this tab. The upload feature allows users to analyse different kinds of tweets as long as the data structure adheres to our app's data requirements.
+
* RColorBrewer: https://cran.r-project.org/web/packages/RColorBrewer
|-style="vertical-align: top;"
+
* tidyverse: https://www.tidyverse.org
| <p>[[File:Topic Modeling.jpg |400px|center]]</p> || The third tab will show the topic modelling results in two sub-tabs. The first sub-tab will show the model comparison results between LDA and LSA for the airline and time period of the user’s choice. The user will then choose either the LDA or the LSA model to derive the topics. Users will also be shown the optimal number of topics, to guide the number of topics they enter in the next sub-tab, which will then generate a list of keywords for each topic.
+
* leaflet: https://cran.r-project.org/web/packages/leaflet
|-style="vertical-align: top;"
+
* tmap: https://cran.r-project.org/web/packages/tmap
| <p>[[File:Sentiment Analysis.jpg|400px|center]]</p> || We will use R libraries such as “sentimentr” to analyse the general sentiment towards each airline. Before deciding which model to deploy, we will use a confusion matrix and classification table to evaluate the different models. Users are then able to compare the general sentiment towards different airlines for the chosen time period.
+
* spdep: https://cran.r-project.org/web/packages/spdep
|}
+
* rgeos: https://cran.r-project.org/web/packages/rgeos
== References==
+
* sf: https://cran.r-project.org/web/packages/sf
Alton, L. (2017, December 5). 4 tips for providing effective customer support on Twitter. Retrieved from https://business.twitter.com/en/blog/4-tips-for-providing-effective-customer-support-on-Twitter.html <br>
+
* sp: https://cran.r-project.org/web/packages/sp
ConverSocial. (2016). The State of Social Customer Service. Retrieved from http://www.conversocial.com/hubfs/Conversocial-Report-The-State-of-Social-Customer-Service-16.pdf
+
* rgdal: https://cran.r-project.org/web/packages/rgdal
 +
* GWmodel: https://cran.r-project.org/web/packages/GWmodel
 +
* plotly: https://cran.r-project.org/web/packages/plotly
 +
* ClustGeo: https://cran.r-project.org/web/packages/ClustGeo
 +
* dendextend: https://cran.r-project.org/web/packages/dendextend
 +
* GGally: https://cran.r-project.org/web/packages/GGally
 +
* ggdendro: https://cran.r-project.org/web/packages/ggdendro
 +
* corrplot: https://cran.r-project.org/web/packages/corrplot
 +
* DT: https://cran.r-project.org/web/packages/DT
 +
 
 +
== Team Members ==
 +
* LI Junyi Darren
 +
* Muhammad Jufri Bin RAMLI
 +
* TEO Lip Peng Raymond
 +
 
 +
== References ==
 +
*[https://www.nature.com/articles/s41597-020-0397-7 Tesco Grocery 1.0, a large-scale dataset of grocery purchases in London]
 +
*[https://figshare.com/collections/Tesco_Grocery_1_0/4769354/2 Tesco Grocery 1.0 dataset]
 +
*[https://springernature.figshare.com/articles/Metadata_record_for_Tesco_Grocery_1_0_a_large-scale_dataset_of_grocery_purchases_in_London/11799765 Metadata record for: Tesco Grocery 1.0]
 +
*[https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-019-0191-y Large-scale and high-resolution analysis of food purchases and health outcomes]
 +
*[https://geoportal.statistics.gov.uk/datasets/guide-to-presenting-statistics-for-super-output-areas-june-2018 Guide to presenting statistics for Super Output Areas (June 2018)]
 +
*[https://webarchive.nationalarchives.gov.uk/20170110165409/https://www.noo.org.uk/visualisation Data on child obesity and excess weight at small area level]
 +
*[https://en.wikipedia.org/wiki/Greater_London Wikipedia: Greater London]
 +
*[https://geoportal.statistics.gov.uk/datasets/regions-december-2019-boundaries-en-bgc Regions (December 2019) Boundaries EN BGC]
 +
*[https://geoportal.statistics.gov.uk/datasets/local-authority-districts-december-2019-boundaries-uk-bgc Local Authority Districts (December 2019) Boundaries UK BGC]
 +
*[https://geoportal.statistics.gov.uk/datasets/wards-december-2019-boundaries-ew-bgc Wards (December 2019) Boundaries EW BGC]
 +
*[https://geoportal.statistics.gov.uk/datasets/middle-layer-super-output-areas-december-2011-boundaries-ew-bgc Middle Layer Super Output Areas (December 2011) Boundaries EW BGC]
 +
*[https://geoportal.statistics.gov.uk/datasets/lower-layer-super-output-areas-december-2011-boundaries-ew-bgc Lower Layer Super Output Areas (December 2011) Boundaries EW BGC]
 +
*[https://sk.sagepub.com/reference/geography/n406.xml Exploratory Spatial Data Analysis - Jin Chen]
 +
*[https://sk.sagepub.com/reference/geoinfoscience/n64.xml Exploratory Spatial Data Analysis (ESDA) - Chris Brunsdon]
 +
*[https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/how-geographicallyweightedregression-works.htm How Geographically Weighted Regression (GWR) works]
 +
*[https://arxiv.org/abs/1306.0413 GWmodel: an R Package for Exploring Spatial Heterogeneity using Geographically Weighted Models]
 +
*[https://arxiv.org/abs/1312.2753 The GWmodel R package: Further Topics for Exploring Spatial Heterogeneity using Geographically Weighted Models]
 +
*[https://arxiv.org/abs/1905.00266 Scalable GWR: A linear-time algorithm for large-scale geographically weighted regression with polynomial kernels]
 +
*[http://mural.maynoothuniversity.ie/7850/1/MC_Minkowski.pdf The Minkowski approach for choosing the distance metric in geographically weighted regression ]
 +
*[https://wiki.smu.edu.sg/1819t3isss608/Group09_Methodology UK's access to health assets and hazards]
 +
*[https://wiki.smu.edu.sg/18191isss608g1/ISSS608_Group07_Proposal Corn: The A-maize-ing Crop]

Latest revision as of 09:06, 3 May 2020


Background

[Figure: G11 MapUK.png]
Grocery data from in-store purchases at 411 Tesco shops in the Greater London area are used in this R Shiny application. In this project, we focus on the nutrient information in this dataset at four spatial granularities: Lower Layer Super Output Areas (LSOA), Middle Layer Super Output Areas (MSOA), wards and Local Authority Districts (LAD).

The analysis is organised into four sections:

  1. Exploratory Data Analysis (EDA)
  2. Exploratory Spatial Data Analysis (ESDA)
  3. Clustering (Hierarchical, GeoSpatial, Skater Clustering)
  4. Geographically weighted regression (GWR)
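
As a rough illustration of the clustering section, the sketch below shows one way attribute similarity and geographic proximity could be blended before hierarchical clustering. It is written in plain Python for portability, whereas the project itself is built in R. The function names `blended_distances` and `agglomerate`, the mixing parameter `alpha`, the average-linkage rule and the toy values are all assumptions made for this example; it is not the SKATER algorithm and not any specific R package's implementation.

```python
import math


def blended_distances(attrs, coords, alpha):
    """Mix attribute and geographic distances, each scaled to [0, 1].

    alpha = 0 uses attributes only; alpha = 1 uses geography only.
    """
    n = len(attrs)
    d_attr = [[math.dist(attrs[i], attrs[j]) for j in range(n)] for i in range(n)]
    d_geo = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    max_attr = max(max(row) for row in d_attr) or 1.0
    max_geo = max(max(row) for row in d_geo) or 1.0
    return [
        [(1 - alpha) * d_attr[i][j] / max_attr + alpha * d_geo[i][j] / max_geo
         for j in range(n)]
        for i in range(n)
    ]


def agglomerate(dist, k):
    """Average-linkage agglomerative clustering, stopping at k clusters."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                link /= len(clusters[a]) * len(clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up data: area 3 resembles areas 0 and 1 in its attribute,
# but lies next to area 2 on the map.
attrs = [(1.0,), (2.0,), (9.0,), (1.5,)]
coords = [(0, 0), (1, 0), (10, 0), (11, 0)]

by_attribute = agglomerate(blended_distances(attrs, coords, alpha=0.0), k=2)
by_geography = agglomerate(blended_distances(attrs, coords, alpha=1.0), k=2)
```

With these toy values, clustering on attributes alone groups area 3 with areas 0 and 1, while clustering on geography alone groups it with its neighbour, area 2. Intermediate values of `alpha` trade one criterion against the other.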


Motivation

The recent release of this dataset gives us an opportunity to work with current information. The dataset also combines geospatial data with aspatial information, which allows us to apply geospatial regression and geospatial clustering techniques to understand nutrition and obesity at different geographic granularities.

Despite the importance of studying food consumption at scale, there is little data about what people actually eat over long periods of time. Our analysis will link the food consumption data for areas in Greater London through both aspatial and geospatial methods. We will attempt to analyze the eating habits of Londoners through a lens free of the bias and personalization prevalent in current web data from social media and geo-referenced media.

Project Objectives

The project aims to deliver an R-Shiny app that provides:

  1. Interactive user interface design
  2. Nutritional information interfaced with a visual map representation
  3. Clustering techniques through both aspatial and geospatial methods
  4. Geographically weighted regression (GWR) of nutritional data and obesity
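
To make the GWR objective more concrete, here is a minimal sketch of the idea behind it: a separate weighted least-squares fit at each location, with nearer observations weighted more heavily. It assumes a single predictor, a Gaussian kernel with a fixed bandwidth, and made-up data. `gwr_local_fit` is a hypothetical helper in plain Python, not the R tooling the project uses.

```python
import math


def gwr_local_fit(target, coords, x, y, bandwidth):
    """Fit y = intercept + slope * x by weighted least squares around `target`.

    Weights follow a Gaussian kernel of the distance to `target` (an assumption;
    other kernels and adaptive bandwidths are also common).
    """
    weights = [
        math.exp(-0.5 * (math.dist(target, c) / bandwidth) ** 2) for c in coords
    ]
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data: the x-y relationship is steeper in the east than in the west.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]

west = gwr_local_fit((1, 0), coords, x, y, bandwidth=1.0)
east = gwr_local_fit((11, 0), coords, x, y, bandwidth=1.0)
```

Repeating the fit at every area centroid yields one set of coefficients per area, which can then be mapped. That local variation is what distinguishes GWR from a single global regression.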


Proposed Scope and Methodology

  1. Analysis of Tesco Grocery dataset with background research
  2. Exploratory Data Analysis (EDA) methods in R
  3. Exploratory Spatial Data Analysis (ESDA) methods in R
  4. Clustering methods for aspatial and geospatial information in R
  5. Analysis of geographically weighted regression (GWR) in R
  6. R Markdown development for functionality checks
  7. R-Shiny app development for user interactivity


A generalized development timeframe for this project is shown below.

[Figure: Gen Gantt2.png]

Storyboard & Visualization Features

There will be five sections in the final app. Data exploration will be done in the first two sections using scatterplots, correlation plots, and Local Indicators of Spatial Association (LISA). The next two sections will cover the clustering methods and geographically weighted regression. The last section will show the four transformed final data tables used in the application.
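
For readers unfamiliar with LISA, the sketch below computes local Moran's I, a widely used LISA statistic, for a handful of areas. It is a plain-Python illustration under stated assumptions: binary contiguity with row-standardised weights, the variance taken over all areas, and invented values. `local_morans_i` is a hypothetical helper, not the R routine the application relies on.

```python
def local_morans_i(values, neighbours):
    """Local Moran's I per area, using row-standardised binary weights.

    values:     one number per area
    neighbours: neighbours[i] lists the indices of the areas adjacent to area i
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]           # deviations from the mean
    m2 = sum(d * d for d in z) / n           # variance over all areas
    result = []
    for i in range(n):
        if not neighbours[i]:
            result.append(0.0)               # isolated area: no spatial lag
            continue
        lag = sum(z[j] for j in neighbours[i]) / len(neighbours[i])
        result.append(z[i] * lag / m2)
    return result


# Invented example: five areas in a row, each adjacent to the next.
values = [10.0, 12.0, 11.0, 30.0, 32.0]
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3]]
lisa = local_morans_i(values, neighbours)
```

Positive values mark areas that resemble their neighbours (high-high or low-low), and negative values mark spatial outliers such as area 2 here, which sits between a low and a high neighbour. Significance testing, typically done by permutation, is omitted from this sketch.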

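Exploratory Data Analysis
[Storyboard image: EDA]

Exploratory Spatial Data Analysis
[Storyboard image: ESDA]

Clustering
[Storyboard image: Clustering]

Geographically weighted regression
[Storyboard image: GWR]

Data Table
[Storyboard image: Data Table]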

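Software Tools

  • RStudio: https://rstudio.com/

R Packages

  • shiny: https://shiny.rstudio.com
  • shinythemes: https://cran.r-project.org/web/packages/shinythemes
  • shinyWidgets: https://cran.r-project.org/web/packages/shinyWidgets
  • RColorBrewer: https://cran.r-project.org/web/packages/RColorBrewer
  • tidyverse: https://www.tidyverse.org
  • leaflet: https://cran.r-project.org/web/packages/leaflet
  • tmap: https://cran.r-project.org/web/packages/tmap
  • spdep: https://cran.r-project.org/web/packages/spdep
  • rgeos: https://cran.r-project.org/web/packages/rgeos
  • sf: https://cran.r-project.org/web/packages/sf
  • sp: https://cran.r-project.org/web/packages/sp
  • rgdal: https://cran.r-project.org/web/packages/rgdal
  • GWmodel: https://cran.r-project.org/web/packages/GWmodel
  • plotly: https://cran.r-project.org/web/packages/plotly
  • ClustGeo: https://cran.r-project.org/web/packages/ClustGeo
  • dendextend: https://cran.r-project.org/web/packages/dendextend
  • GGally: https://cran.r-project.org/web/packages/GGally
  • ggdendro: https://cran.r-project.org/web/packages/ggdendro
  • corrplot: https://cran.r-project.org/web/packages/corrplot
  • DT: https://cran.r-project.org/web/packages/DT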

Team Members

  • LI Junyi Darren
  • Muhammad Jufri Bin RAMLI
  • TEO Lip Peng Raymond
