ANLY482 AY2016-17 T2 Group10 Project Overview: Methodology
Data Collection
The data given by GSK are mainly in the form of flat files (Excel). Each contains 1 or more sheets with multiple columns. Hence the data is very high in dimensionality. Metadata is not yet available, but from column headers and the conversation with the sponsor, we have an idea on which ones will be more relevant to us. Such data include sales information, competency and results of sale staff, and data on the methods of the salespeople. These data have been promised to us. To discover potential insights through spatial clustering analysis of sale territories, we also intend to collect spatial data from its vertical industries: hospitals, clinics and retail pharmacies. This can be easily collected from Singapore’s public data website, Data.gov.sg, in SHP or KML formats.
Data Preparation
The stage of data preparation (or data wrangling, newly termed as data preparation taken to the next level ) would involve employing techniques of ETL (Extract, Transform, Load) to form an Analytics Sandbox used for further exploratory analysis purposes. To better facilitate future analysis, we will be conducting ETL process and exploratory data analysis cyclically such that if the latter is not satisfactory, we will go back to revise the former. The entire process of data preparation will be done using JMP Pro 13, which supersedes its predecessor SAS Enterprise Guide and Miner and has capabilities in the fields of descriptive and predictive modelling required by our team.