Difference between revisions of "Group11 proposal"

From ISSS608-Visual Analytics and Applications

Revision as of 11:07, 20 April 2020

Group 11: Google Analytics - Power Up!

Proposal



Background

Google Analytics is a suite of analytical tools that provides insights on website traffic to aid business decisions. It allows businesses to profile their website visitors and see how they interact with the content. It provides Analytics Intelligence for quick answers on common metrics; numerous online reports on audience, advertising, acquisition, behaviour, conversion and user flow; and data analysis with filtering, manipulation, segmentation and visualization features. A paid version, "Google Analytics 360", provides more advanced eCommerce features, such as identifying which users are likely to convert to customers and how best to use marketing dollars.

Motivation

Google Analytics delivers a wealth of insights into the users visiting a website. However, its visualizations are limited to line charts, bar charts, pie charts, highlight tables and geo maps. In addition, beyond totals, averages and proportions, it offers no statistical analysis or inference on the data. One possible reason is the tool's mass target audience: advanced visualizations and statistical analysis could be difficult for common users to understand.

The third motivation stems from the fact that Google Analytics is a hugely popular tool with good data management capabilities. These allow the data to be analysed and visualized outside the platform in a repeatable manner, which may have practical benefits.

Fig1: GA Audience Overview with metrics and basic pie chart
Fig2: GA Audience insights

Project Objectives

The project aims to deliver an R-Shiny app that provides:

  1. better interactivity in user interface design;
  2. visualization of key audience, behaviour and performance insights;
  3. statistical analysis and inferences on key audience, behaviour and performance data; and
  4. reproducible workflow for the export and import of Google Analytics data.

Proposed Scope and Methodology

  1. Analysis of Google Analytics schema to understand the data structure, metadata and table relationships
  2. Analysis of Google Analytics data management features to support export of data
  3. Analysis of R data management features and packages to support import of data
  4. Sourcing of sample data for analysis and testing
  5. Analysis of existing Google Analytics features and shortfalls for enhancements
  6. Design of enhanced UI, visualizations, statistical analysis and workflow
  7. R-Shiny app development and testing
  8. Demonstration of R-Shiny app
  9. Pilot run with live data

Storyboard & Visualization Features

  • Data Import and Manipulation
  • EDA – Distribution, Heatmap, Choropleth
  • Analytical – k-means, LCA, hierarchical clustering
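
To make the k-means item concrete, here is a minimal sketch of the standard assign-and-update loop (Lloyd's algorithm). It is written in Python purely for illustration; the app itself is planned in R, and nothing here is the team's implementation. The `kmeans` helper, the two session features and the six data points are all hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical sessions: (pageviews, minutes on site). Not real GA data.
sessions = [(1, 0.5), (2, 1.0), (1, 1.5), (12, 9.0), (15, 11.0), (14, 10.0)]

# Assumption: seed with the first and last point so the run is deterministic.
labels, centroids = kmeans(sessions, [sessions[0], sessions[-1]])
print(labels)
print(centroids)
```

With these toy points the three short sessions and the three long sessions end up in separate clusters. On real data the features would normally be standardized first, since k-means is sensitive to scale.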

Data Source & Preparation

In January 2018, Google BigQuery published a Google Analytics sample with twelve months (Aug 2016 to Aug 2017) of obfuscated Google Analytics 360 data on the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise. The data is typical of what an ecommerce website would see and includes the following information:

  • Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic
  • Content data: information about the behaviour of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc.
  • Transactional data: information about the transactions on the Google Merchandise Store website.

However, some fields are obfuscated, such as fullVisitorId, and others are removed, such as clientId, adWordsClickInfo and geoNetwork. When fields containing no data are queried, “Not available in demo dataset” is returned for STRING values and “null” for INTEGER values.
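
Because these placeholders would otherwise be counted as real categories, they probably need to be mapped to missing values at import time. The sketch below shows one way to do that on a CSV export. It is written in Python for illustration only (the app is planned in R). The column names `city` and `pageviews`, the three sample rows, and the assumption that nulls appear as the literal text `null` in the export are all hypothetical.

```python
import csv
import io

# Placeholder text returned for STRING fields with no data (quoted from the text above).
PLACEHOLDER = "Not available in demo dataset"


def clean_value(value):
    """Map the demo placeholder, the literal "null", and empty strings to None."""
    if value is None:
        return None
    text = value.strip()
    if text in ("", "null", PLACEHOLDER):
        return None
    return text


def clean_rows(csv_text):
    """Parse CSV text and return a list of dicts with missing values as None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: clean_value(val) for key, val in row.items()} for row in reader]


# Hypothetical three-row export; column names are illustrative only.
SAMPLE = """fullVisitorId,city,pageviews
1001,Not available in demo dataset,12
1002,Not available in demo dataset,null
1003,Not available in demo dataset,3
"""

rows = clean_rows(SAMPLE)
missing_pageviews = sum(1 for r in rows if r["pageviews"] is None)
print(rows[0])
print("rows with missing pageviews:", missing_pageviews)
```

After cleaning, a column that is entirely `None` (like `city` here) can be dropped before any distribution plot or clustering step.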

This is a large dataset of 400+ variables that grows by approximately 25 MB per day, covering about 1,500 sessions and 40,000 detailed records. It can be exported in AVRO, JSON or CSV format.
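
As a rough sizing check, the daily figures above can be scaled up to the full sample period. The short Python calculation below is a back-of-envelope sketch only; the 366-day span is an assumption based on the stated Aug 2016 to Aug 2017 range, and the daily rate is treated as constant.

```python
# Figures quoted in the text above.
DAILY_MB = 25
DAILY_SESSIONS = 1_500
DAILY_RECORDS = 40_000

# Assumption: the sample spans roughly 366 days (Aug 2016 to Aug 2017).
DAYS = 366

kb_per_session = DAILY_MB * 1024 / DAILY_SESSIONS
records_per_session = DAILY_RECORDS / DAILY_SESSIONS
total_gb = DAILY_MB * DAYS / 1024
total_sessions = DAILY_SESSIONS * DAYS
total_records = DAILY_RECORDS * DAYS

print(f"{kb_per_session:.1f} KB per session")
print(f"{records_per_session:.1f} records per session")
print(f"{total_gb:.1f} GB over {DAYS} days")
print(f"{total_sessions:,} sessions and {total_records:,} records")
```

Under these assumptions the sample comes to roughly 17 KB and 27 records per session, and about 9 GB in total, which suggests that filtering or aggregating before loading into the app is worth planning for.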

Software Tools

R Packages

Team Members

  • LI Junyi Darren
  • Muhammad Jufri Bin RAMLI
  • TEO Lip Peng Raymond

References