Difference between revisions of "Group11 proposal"

From ISSS608-Visual Analytics and Applications
 
(75 intermediate revisions by 3 users not shown)
Line 1: Line 1:
[[File:Group11banner.PNG.png|frameless|1100px|left|Group 11: Google Analytics - Power Up!]]
+
[[File:G11 TitleBanner2.png|1700px|frameless|SGSAS]]
 
<div>
 
<div>
 
{|style="background-color:#607080;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |
 
{|style="background-color:#607080;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |
| style="font-family:Stencil; font-size:100%; solid #103080; background:#20a0d0; text-align:center;" width="25%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#20a0d0; text-align:center;" width="20%" |  
 
;
 
;
 
[[Group11_proposal| <font color="#FFFFFF">Proposal</font>]]
 
[[Group11_proposal| <font color="#FFFFFF">Proposal</font>]]
  
| style="font-family:Stencil; font-size:100%; solid #103080; background:#607080; text-align:center;" width="25%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="20%" |  
 
;
 
;
 
[[Group11_poster| <font color="#FFFFFF">Poster</font>]]
 
[[Group11_poster| <font color="#FFFFFF">Poster</font>]]
  
| style="font-family:Stencil; font-size:100%; solid #103080; background:#607080; text-align:center;" width="25%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="20%" |  
 
;
 
;
 
[[Group11_application| <font color="#FFFFFF">Application</font>]]
 
[[Group11_application| <font color="#FFFFFF">Application</font>]]
  
| style="font-family:Stencil; font-size:100%; solid #103080; background:#607080; text-align:center;" width="25%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="20%" |
 +
;
 +
[[Group11_user_guide| <font color="#FFFFFF">Application User Guide</font>]]
 +
 
 +
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="20%" |  
 
;
 
;
 
[[Group11_research_paper| <font color="#FFFFFF">Research Paper</font>]]
 
[[Group11_research_paper| <font color="#FFFFFF">Research Paper</font>]]
 
 
|}
 
|}
 
</div>
 
</div>
Line 23: Line 26:
  
 
== Background ==
 
== Background ==
Google Analytics is a suite of analytical tools to provide insights on website access to aid businesses decisions. It allows businesses to profile their website visitors and how they interact with the content. It provides Analytics Intelligence for quick answers to common metrics, numerous online reports on audience, advertising, acquisition, behaviour, conversion and user flow, and data analysis with data filtering, manipulation, segmentation and visualization features. A paid version "Google Analytics 360" provides more advanced eCommerce features on which users are likely to convert to customers and how best to use the marketing dollars.
+
[[File:G11 MapUK.png|1000px|frameless]] <br />
 +
Grocery data from in-store purchases at 411 Tesco shops in the Greater London area are used in this R Shiny application. In this project, we focus on the nutrient information in this dataset at 4 different spatial granularities: Lower Layer Super Output Areas (LSOA), Middle Layer Super Output Areas (MSOA), wards and Local Authority Districts (LAD).
 +
 
 +
The analysis is organized into four sections:
 +
# Exploratory Data Analysis (EDA)
 +
# Exploratory Spatial Data Analysis (ESDA)
 +
# Clustering (Hierarchical, GeoSpatial, Skater Clustering)
 +
# Geographically weighted regression (GWR)
 +
 
  
 
== Motivation ==
 
== Motivation ==
Google Analytics delivers a ton of insights into the users visiting the website. However, the visualizations are limited to line charts, bar charts, pie charts, highlight tables and geo maps. Also, besides the use of totals, averages and proportions, there is no available statistical analysis and inference of the data. One possible reason for such approach could be due to the mass target audience nature of the tool, as it could be difficult to make advanced visualizations and statistical analysis easily understood by the common users.
+
The recent availability of this dataset gives us an opportunity to work with current information. The dataset also combines geospatial data with aspatial information, which allows us to apply geospatial regression and geospatial clustering techniques to understand nutrition and obesity at different geographic granularities.
  
The third motivation stems from the fact that Google Analytics is a hugely popular tool with good data management capabilities. This allows further analysis and visualization of the data outside the platform and in a repeatable manner that may have practical benefits.
+
Despite the importance of studying food consumption at scale, there is little data about what people actually eat over long periods of time. Our analysis will examine the food consumption data for areas of Greater London using both aspatial and geospatial methods. We will attempt to analyze the eating habits of Londoners through a non-biased, non-personalized lens, in contrast to the biased, personalized view that is prevalent in current web data from social media and geo-referenced media.
  
[[File:GaAudienceOverview.png|500px|thumb|none|Fig1: GA Audience Overview with metrics and basic pie chart]]
+
== Project Objectives ==
[[File:GaUserInsights.png|500px|thumb|none|Fig2: GA Audience insights]]
+
The project aims to deliver an R-Shiny app that provides:
 +
# Interactive user interface design
 +
# Nutritional information interfaced with a visual map representation
 +
# Clustering techniques through both aspatial and geospatial methods
 +
# Geographically weighted regression (GWR) of nutritional data and obesity
  
== Project Objectives ==
 
The project aims to deliver a R-Shiny app that provides:
 
# better interactivity in user interface design;
 
# visualization of key audience, behaviour and performance insights;
 
# statistical analysis and inferences on key audience, behaviour and performance data; and
 
# workflow for the export and import of Google Analytics data.
 
  
 
== Proposed Scope and Methodology ==
 
== Proposed Scope and Methodology ==
# Analysis of Google Analytics schema to understand the data structure, metadata and table relationships
+
# Analysis of Tesco Grocery dataset with background research
# Analysis of Google Analytics data management features to support export of data
+
# Exploratory Data Analysis (EDA) methods in R
# Analysis of R data management features and packages to support import of data
+
# Exploratory Spatial Data Analysis (ESDA) methods in R
# Sourcing of sample data for analysis and testing
+
# Clustering methods for aspatial and geospatial information in R
# Analysis of existing Google Analytics features and shortfalls for enhancements
+
# Analysis of geographically weighted regression (GWR) in R
# Design of enhanced UI, visualizations, statistical analysis and workflow
+
# R Markdown development for functionality checks
# R-Shiny app development and testing
+
# R-Shiny app development for user interactivity
# Demonstration of R-Shiny app
 
# Pilot run with live data
 
 
 
== Project Timeline ==
 
Challenging...
 
 
 
* limited statistics knowledge
 
* limited R knowledge
 
* difficult dataset
 
 
 
== Visualisation Features ==
 
* calendar view - overview to detect any yearly, monthly, weekly, daily, hourly patterns
 
* cycle plot - analysis of trend and cyclical patterns
 
* visitor/customer segmentation - k-means and/or LCA clustering
 
  
== Data Source & Preparation ==
 
In September 2019, Google BigQuery published a Google Analytics sample with twelve months (Aug 2016 to Aug 2017) of obfuscated Google Analytics 360 data on the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise. The data is typical of what an ecommerce website would see and includes the following information:
 
  
* Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic
+
A generalized development timeframe for this project is shown below. <br />
* Content data: information about the behaviour of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc.
 
* Transactional data: information about the transactions on the Google Merchandise Store website.
 
  
However, data for some fields is obfuscated, such as fullVisitorId, or removed such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.
+
[[File:Gen Gantt2.png|1000px|frameless]]
  
The dataset is huge with a daily rate of approximately 25MB for 1,500 sessions and 40,000 detailed records. It can be exported to AVRO or JSON formats.
+
== Storyboard & Visualization Features ==
 +
There will be five sections in the final app. Data exploration will be done in the first two sections using scatterplots, correlation plots, and Local Indicators of Spatial Association (LISA).
 +
The next two sections will be the clustering methods and geographically weighted regression.
 +
The last section will show the 4 transformed final data tables used in the application. <br />
 +
<p>Exploratory Data Analysis <br />[[File:G11 Stb A.jpg|800px|frameless|EDA]]
 +
<p>Exploratory Spatial Data Analysis <br />[[File:G11 Stb B.jpg|800px|frameless|ESDA]]
 +
<p>Clustering <br />[[File:G11 Stb C.jpg|800px|frameless|Clustering]]
 +
<p>Geographically weighted regression <br />[[File:G11 Stb D.jpg|800px|frameless|GWR]]
 +
<p>Data Table <br />[[File:Stb E.jpg|800px|frameless|Data Table]]
  
* [https://console.cloud.google.com/marketplace/details/obfuscated-ga360-data/obfuscated-ga360-data?filter=solution-type:dataset&q=analytics&id=45f150ac-81d3-4796-9abf-d7a4f98eb4c6&pli=1 Google Analytics Sample]
 
* [https://support.google.com/analytics/answer/3437719?hl=en BigQuery Export schema]
 
* [https://your.googlemerchandisestore.com/Index Google Official Merchandise Store]
 
  
 
== Software Tools ==
 
== Software Tools ==
Line 82: Line 78:
  
 
== R Packages ==
 
== R Packages ==
* rjson: https://cran.r-project.org/web/packages/rjson
 
 
* shiny: https://shiny.rstudio.com
 
* shiny: https://shiny.rstudio.com
* shinydashboard: https://cran.r-project.org/web/packages/shinydashboard
+
* shinythemes: https://cran.r-project.org/web/packages/shinythemes
* ggplot2: https://cran.r-project.org/web/packages/ggplot2
+
* shinyWidgets: https://cran.r-project.org/web/packages/shinyWidgets
* plotly: https://plot.ly/r
+
* RColorBrewer: https://cran.r-project.org/web/packages/RColorBrewer
* poLCA: https://cran.r-project.org/web/packages/poLCA
 
 
* tidyverse: https://www.tidyverse.org
 
* tidyverse: https://www.tidyverse.org
* trelliscope: https://www.rdocumentation.org/packages/trelliscope/versions/0.9.7
+
* leaflet: https://cran.r-project.org/web/packages/leaflet
 +
* tmap: https://cran.r-project.org/web/packages/tmap
 +
* spdep: https://cran.r-project.org/web/packages/spdep
 +
* rgeos: https://cran.r-project.org/web/packages/rgeos
 +
* sf: https://cran.r-project.org/web/packages/sf
 +
* sp: https://cran.r-project.org/web/packages/sp
 +
* rgdal: https://cran.r-project.org/web/packages/rgdal
 +
* GWmodel: https://cran.r-project.org/web/packages/GWmodel
 +
* plotly: https://cran.r-project.org/web/packages/plotly
 +
* ClustGeo: https://cran.r-project.org/web/packages/ClustGeo
 +
* dendextend: https://cran.r-project.org/web/packages/dendextend
 +
* GGally: https://cran.r-project.org/web/packages/GGally
 +
* ggdendro: https://cran.r-project.org/web/packages/ggdendro
 +
* corrplot: https://cran.r-project.org/web/packages/corrplot
 +
* DT: https://cran.r-project.org/web/packages/DT
  
 
== Team Members ==
 
== Team Members ==
Line 97: Line 105:
  
 
== References ==
 
== References ==
 +
*[https://www.nature.com/articles/s41597-020-0397-7 Tesco Grocery 1.0, a large-scale dataset of grocery purchases in London]
 +
*[https://figshare.com/collections/Tesco_Grocery_1_0/4769354/2 Tesco Grocery 1.0 dataset]
 +
*[https://springernature.figshare.com/articles/Metadata_record_for_Tesco_Grocery_1_0_a_large-scale_dataset_of_grocery_purchases_in_London/11799765 Metadata record for: Tesco Grocery 1.0]
 +
*[https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-019-0191-y Large-scale and high-resolution analysis of food purchases and health outcomes]
 +
*[https://geoportal.statistics.gov.uk/datasets/guide-to-presenting-statistics-for-super-output-areas-june-2018 Guide to presenting statistics for Super Output Areas (June 2018)]
 +
*[https://webarchive.nationalarchives.gov.uk/20170110165409/https://www.noo.org.uk/visualisation Data on child obesity and excess weight at small area level]
 +
*[https://en.wikipedia.org/wiki/Greater_London Wikipedia: Greater London]
 +
*[https://geoportal.statistics.gov.uk/datasets/regions-december-2019-boundaries-en-bgc Regions (December 2019) Boundaries EN BGC]
 +
*[https://geoportal.statistics.gov.uk/datasets/local-authority-districts-december-2019-boundaries-uk-bgc Local Authority Districts (December 2019) Boundaries UK BGC]
 +
*[https://geoportal.statistics.gov.uk/datasets/wards-december-2019-boundaries-ew-bgc Wards (December 2019) Boundaries EW BGC]
 +
*[https://geoportal.statistics.gov.uk/datasets/middle-layer-super-output-areas-december-2011-boundaries-ew-bgc Middle Layer Super Output Areas (December 2011) Boundaries EW BGC]
 +
*[https://geoportal.statistics.gov.uk/datasets/lower-layer-super-output-areas-december-2011-boundaries-ew-bgc Lower Layer Super Output Areas (December 2011) Boundaries EW BGC]
 +
*[https://sk.sagepub.com/reference/geography/n406.xml Exploratory Spatial Data Analysis - Jin Chen]
 +
*[https://sk.sagepub.com/reference/geoinfoscience/n64.xml Exploratory Spatial Data Analysis (ESDA) - Chris Brunsdon]
 +
*[https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/how-geographicallyweightedregression-works.htm How Geographically Weighted Regression (GWR) works]
 +
*[https://arxiv.org/abs/1306.0413 GWmodel: an R Package for Exploring Spatial Heterogeneity using Geographically Weighted Models]
 +
*[https://arxiv.org/abs/1312.2753 The GWmodel R package: Further Topics for Exploring Spatial Heterogeneity using Geographically Weighted Models]
 +
*[https://arxiv.org/abs/1905.00266 Scalable GWR: A linear-time algorithm for large-scale geographically weighted regression with polynomial kernels]
 +
*[http://mural.maynoothuniversity.ie/7850/1/MC_Minkowski.pdf The Minkowski approach for choosing the distance metric in geographically weighted regression ]
 +
*[https://wiki.smu.edu.sg/1819t3isss608/Group09_Methodology UK's access to health assets and hazards]
 +
*[https://wiki.smu.edu.sg/18191isss608g1/ISSS608_Group07_Proposal Corn: The A-maize-ing Crop]

Latest revision as of 09:06, 3 May 2020

SGSAS


Background

G11 MapUK.png
Grocery data from in-store purchases at 411 Tesco shops in the Greater London area are used in this R Shiny application. In this project, we focus on the nutrient information in this dataset at 4 different spatial granularities: Lower Layer Super Output Areas (LSOA), Middle Layer Super Output Areas (MSOA), wards and Local Authority Districts (LAD).

The analysis is organized into four sections:

  1. Exploratory Data Analysis (EDA)
  2. Exploratory Spatial Data Analysis (ESDA)
  3. Clustering (Hierarchical, GeoSpatial, Skater Clustering)
  4. Geographically weighted regression (GWR)
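
As a rough illustration of the aspatial part of the clustering section, the following base R sketch runs hierarchical clustering on made-up data. The data frame, its column names (sugar, fibre), and the choice of Ward linkage with two clusters are assumptions for illustration only, not the project's actual variables or settings. The spatially constrained variants named above (GeoSpatial, Skater) additionally use neighbourhood information, which this sketch omits.

```r
# Hypothetical toy data: 6 areas described by two nutrient shares (made-up values).
nutrients <- data.frame(
  area  = paste0("A", 1:6),
  sugar = c(0.10, 0.11, 0.12, 0.30, 0.31, 0.29),
  fibre = c(0.05, 0.06, 0.05, 0.01, 0.02, 0.01)
)

# Standardise so both variables contribute on a comparable scale.
x <- scale(nutrients[, c("sugar", "fibre")])
rownames(x) <- nutrients$area

# Ward hierarchical clustering on Euclidean distances (assumed settings).
d  <- dist(x, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# Cut the dendrogram into two groups and attach the labels to the data.
nutrients$cluster <- cutree(hc, k = 2)
print(nutrients)
```

Calling plot(hc) afterwards draws the dendrogram, which is one way to judge how many clusters are reasonable before cutting the tree.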


Motivation

The recent availability of this dataset gives us an opportunity to work with current information. The dataset also combines geospatial data with aspatial information, which allows us to apply geospatial regression and geospatial clustering techniques to understand nutrition and obesity at different geographic granularities.

Despite the importance of studying food consumption at scale, there is little data about what people actually eat over long periods of time. Our analysis will examine the food consumption data for areas of Greater London using both aspatial and geospatial methods. We will attempt to analyze the eating habits of Londoners through a non-biased, non-personalized lens, in contrast to the biased, personalized view that is prevalent in current web data from social media and geo-referenced media.

Project Objectives

The project aims to deliver an R-Shiny app that provides:

  1. Interactive user interface design
  2. Nutritional information interfaced with a visual map representation
  3. Clustering techniques through both aspatial and geospatial methods
  4. Geographically weighted regression (GWR) of nutritional data and obesity


Proposed Scope and Methodology

  1. Analysis of Tesco Grocery dataset with background research
  2. Exploratory Data Analysis (EDA) methods in R
  3. Exploratory Spatial Data Analysis (ESDA) methods in R
  4. Clustering methods for aspatial and geospatial information in R
  5. Analysis of geographically weighted regression (GWR) in R
  6. R Markdown development for functionality checks
  7. R-Shiny app development for user interactivity
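
To make the GWR step concrete, here is a minimal base R sketch of the underlying idea: fit a separate weighted regression at each location, with weights that decay with distance. The simulated coordinates, the Gaussian kernel, and the fixed bandwidth of 0.3 are all assumptions chosen for illustration and are not taken from the project. A dedicated package would normally also select the bandwidth and report diagnostics.

```r
set.seed(1)

# Simulated data (hypothetical): 50 area centroids on a unit square.
n  <- 50
xy <- cbind(runif(n), runif(n))
x  <- rnorm(n)                           # stand-in for a nutrient predictor
beta_true <- 1 + 2 * xy[, 1]             # true slope drifts from west to east
y  <- 0.5 + beta_true * x + rnorm(n, sd = 0.1)

bandwidth <- 0.3                         # assumed fixed kernel bandwidth
D <- as.matrix(dist(xy))                 # pairwise distances between centroids

# For each location i, run weighted least squares with Gaussian kernel weights.
local_coef <- t(sapply(seq_len(n), function(i) {
  w <- exp(-0.5 * (D[i, ] / bandwidth)^2)
  coef(lm(y ~ x, weights = w))
}))
colnames(local_coef) <- c("intercept", "slope")

# The local slopes should follow the west-to-east drift built into the data.
print(cor(local_coef[, "slope"], xy[, 1]))
```

Mapping the columns of local_coef back onto the area polygons is what turns the output into a picture of how the relationship varies across space.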


A generalized development timeframe for this project is shown below.

Gen Gantt2.png

Storyboard & Visualization Features

There will be five sections in the final app. Data exploration will be done in the first two sections using scatterplots, correlation plots, and Local Indicators of Spatial Association (LISA). The next two sections will cover the clustering methods and geographically weighted regression. The last section will show the 4 transformed final data tables used in the application.
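
The LISA statistic mentioned above can be sketched in a few lines of base R using local Moran's I. The five values, the chain-shaped neighbour structure, and the row-standardised weights are assumptions made up for illustration. The sketch omits the permutation-based significance test that a spatial statistics package would normally provide.

```r
# Hypothetical values for 5 areas arranged in a line (made-up numbers).
x <- c(10, 12, 11, 30, 32)
n <- length(x)

# Binary contiguity: each area neighbours the one before and after it.
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}
W <- W / rowSums(W)                      # row-standardise the weights

z     <- x - mean(x)                     # deviations from the mean
m2    <- sum(z^2) / n                    # second moment of the deviations
lag_z <- as.vector(W %*% z)              # spatially lagged deviations

local_I  <- (z / m2) * lag_z             # local Moran's I for each area
global_I <- sum(z * lag_z) / sum(z^2)    # global Moran's I (row-standardised W)

# Classify each area by its own value and its neighbours' average.
quadrant <- ifelse(z > 0 & lag_z > 0, "High-High",
            ifelse(z < 0 & lag_z < 0, "Low-Low",
            ifelse(z > 0, "High-Low", "Low-High")))

print(data.frame(x, local_I, quadrant))
```

With row-standardised weights the local values average to the global Moran's I, which is a quick consistency check on the calculation.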

Exploratory Data Analysis
EDA

Exploratory Spatial Data Analysis
ESDA

Clustering
Clustering

Geographically weighted regression
GWR

Data Table
Data Table

Software Tools

R Packages

Team Members

  • LI Junyi Darren
  • Muhammad Jufri Bin RAMLI
  • TEO Lip Peng Raymond

References