ANLY482 AY2017-18 T1 Group03/Final Progress

From Analytics Practicum
Jump to navigation Jump to search

Logo3.PNG

Home About Us Project Overview Midterm Progress Final Progress Project Management Documentation


1.0 Methodology


1.1 Data Collection
There were 6 main files sent to us by client out of which 5 of them have been extensively used. For our research, we obtained a full year data of 2016 and 2017 data till August.

File Name (Edited for confidentiality) Description Number of Rows
Customer Interaction Information on interaction details ~213,000(FTF) & ~421,000(PW)
Employee File Information on employee teams ~2000
Customer (Organisations) Information on healthcare organisations ~117,000
Customer (non-Organisation) Information on healthcare professionals ~205,000
Invoice Information on transactions ~9,000,000



1.2 Tools Used
Our research has extensively used SAS JMP Pro 13 for data cleaning and transformation and for confirmatory analysis because of its intuitive system and simplistic approach. During the initial period, there has been considerable usage of Tableau Software for exploratory data analysis because of its easy visualisations for understanding the data.




1.3 Data Cleaning & Transformation

1.3.1 Files Required for Cleaning & Transformation

File Name (Edited for confidentiality) Description
Customer Interaction (Type A) This file contains information about face-to-face interactions that the client has done with the customers. We have included information such as customer code and allocated duration (i.e to remove double counting of duration issue in the original file) in the file.
Customer Interaction (Type B) This file contains information about phone web interactions that the client has done with the customers. We have included customer code in the file.
Clean Invoice (2015-2017) This file contains information about sales invoices that occurred between 2015 to the latest month in 2017. We have previously cleaned and mapped in Therapy Area to the file.


1.3.2 Customer Interaction

We have performed a summary function for both Customer 360 (F2F) and Customer 360 (Phone Web) based on each product code sold and total interaction duration to each customer on a monthly basis.

Essentially, the summarized files will show us that per month, how many interactions were made and how much time were spent to sell a product to each customer. The N counts that appear in each file represents interaction counts for face-to-face (F2F Interaction Count) and phone web (Call Count) interactions. We have saved both files as Summarized Customer 360 (F2F) and Summarized Customer 360 (PW).


1.3.3 Invoice File

We performed a summary function for the invoice file to extract out relevant information for us to analyse sales and to eventually link the summarized file to interaction files. Essentially, we summarized the invoice based on products sold to individual customer and how much revenue was earned per month. We also classified the products sold by their therapy areas, which distribution channel was used to sell the products, and the sales representative who was involved in making the sales. We have saved the summarized file as Summarized Invoice.


1.3.4 Linking both files

We used the update function to pull in interaction details from both summarized Customer 360 files into the summarized invoice file, using date of invoice and interaction as a linking point.

The purpose of doing this was that we could not find a direct linkage between customer interactions and sales invoice. To close a sales invoice, there could be multiple interactions done before the customer decides to buy a product. Therefore, we have chosen this method to account for the number of interactions done per month and tie that to sales invoices per month based on customer code and product code.

We will then have information about sales and interaction to each customer by a particular product the company made sales to. The file was saved as temporarily as Invoice Interaction.


1.3.5 Cleaning up Invoice Interaction

With the new file on hand, we have renamed both N rows for face-to-face interactions and phone web interactions to F2F Interaction Count and Call Count respectively. We then recoded missing value in F2F Interaction Count and Call Count columns with “0”. There were sales transactions done that did not contain both face-to-face and phone web interactions. As our objective was to analyze sales based on interactions, we have decided to remove those transactions that do not contain any interactions.


1.3.6 Standardizing Invoice Interaction file data

Before we came up with statistical model to analyze sales and interaction, we first checked if the data in the total sales amount (i.e Sum(Amount$)) contained a normal distribution. Initial analysis by running a distribution check indicated that the total sales amount was right-skewed as shown below.

Summary Distribution of original transformed sales revenue amounts
Mean 2172.2607 Std Dev 8785.3315 Std Err Mean 78.534427 Upper 95% Mean 2326.2002
N 12514 Median 515 Mode 173.55 Lower 95% Mean 2018.3211


We had transform such data into normal distribution so that robust parametric tests can be applied for our analysis. By performing data transformation, it will be easier for us to do comparisons and interpretations of results. With the help of JMP Pro software, we were able to identify the best transformation test for the sales amount by using the Johnson Su test as shown below:

Summary Distribution of Johnson Su transformed sales revenue amounts
Mean -6.53e-17 Std Dev 1.00004 Std Err Mean 0.0089396 Upper 95% Mean 0.017523
N 12514 Median -0.029004 Mode -0.99218 Lower 95% Mean -0.017523

The results above showed us that the transformed sales amount is now more normally distributed. Therefore, we created a new column for the transformed sales amount as Johnson Su Transform Sum(Amount$).


1.3.7 Binning of interaction counts

According to client’s requirement, the binning of interaction counts for both face-to-face and phone web interactions should be as follows:

1) Low: 0 – 1 interactions
2) Medium: 2 – 3 interactions
3) High: 4 interactions and above

Thus, we used the interactive binning function to bin up the interaction counts into their respective bins and created new columns for them as Call Count Groups and F2F Interaction Groups. With the new updated Invoice Interaction file, we have extracted out 2016 and 2017 data using the “Filter” function and saved them separately as Invoice Interaction (2016) and Invoice Interaction (2017).




1.4 Data Analysis

1.4.1 Exploratory Data Analysis

As we had substantial amount of dataset, our initial period was mainly based on data cleaning and analysing each file. Tableau was mainly used for visualisation purposes where we simulated different graphs by choosing different parameters to better understand the data. Tree-maps, Bar charts, Histograms were mainly used for visualisation purposes as we had a mixture of categorical and continuous variables in our dataset.


1.4.2 Confirmatory Data Analysis

The 2 main independent variables for our analysis are the two-different interaction counts: Call Count and F2F Interaction Count. Both variables are of ordinal type and after binning our interaction counts, the variable type changed to nominal. Our dependent variable is the sales revenue: Johnson Su Transform Sum(Amount$) which is of continuous nature.

Given the nature of the variables, we decided to run One-way Analysis or ANOVA (Analysis of Variance) test to check for sensitivity in each stage of our analysis. The null hypothesis for ANOVA is that the difference in group means equals to 0 while rejecting the null hypothesis means that the difference in group means is not equal to 0. As ANOVA test only tells us if the results are significant or not, we later run a post-hoc test called Tukey HSD (Tukey’s Honest Significant Difference) test to find out which specific group’s means (compared to each other) is different or also a between group comparison for the different levels of interactions. The null hypothesis of Tukey HSD states that the two or more group means should be equal while rejecting the null hypothesis means that the two or more group means are not equal. Tukey HSD test should only be used after rejecting the null hypothesis of ANOVA.

This step is further verified by a non-parametric test called Wilcoxon’s test which statistically compares the means of the different levels of interactions and assesses for significant differences. As it doesn’t require a normal distribution of the dependent variable, it helps us to cross verify our results against the Tukey HSD’s test. We also use another non-parametric test called Moods Median test when there are significant outliers in our sample. The Median test compares the medians of the different levels of interactions and this is essentially helpful when the sample size as heavy outliers which might alter our results from our previous tests.

After running each of the tests, we further streamline our sample data through channels to get even more accurate analysis on the sensitivity of each product. We follow similar procedures mentioned above to get the results.


2.0 Exploratory Data Analysis


2.1 Distribution of Revenue


Since the objective of our analysis is to determine the drivers behind GSK’s sales revenue, it is essential to understand how sales revenue is currently distributed within the entity. Thus, before conducting statistical analysis on the data, we performed exploratory data analysis to derive insights about the distribution of GSK’s sales revenue. We used Tableau, an intelligent software that allows data to be transformed into actionable insights via data visualization. In addition, Tableau will allow us to create dashboards and enable us to share our findings with the client.

We used “Invoice Interaction” as our Data Source in Tableau and started analysing the data by channels and therapy areas for a general overview of the distribution. For this, we used a tree-map, which is ideal for displaying large amounts hierarchically structured data.

EDA Tree Map of Sales

The size of the rectangles in the tree-map represents the amount of sales revenue each channel generates GSK. We have also used colour to represent the sales quantity of each rectangle. The darker the shade, the higher the quantity of sales. Each channel is broken down to the therapy areas contributing to the channel’s revenue.

We observed that Restructured Hospital, General Practitioner, and Specialist are the top 3 channels that generate the most sales amount for the company. Further analysis also shows that most of Restructured Hospital’s purchases are CNS and HIV products, most of the General Practitioner’s purchases are Anti Infective products and most of the Specialist’s purchases are Paediatric Vaccine products.


2.2 Pareto Analysis of Revenue

Next, to decide which products to focus on for further analysis, we studied the contribution of each product total sales revenue. For this visualisation, we used a Pareto chart, a chart that contains both bars and a line graph.

Pareto Chart of Sales Revenue

The left vertical axis represents the sales revenue generated by each product, while the right vertical axis is the cumulative percentage of the total sales revenue. The individual sales revenue of each product is represented in descending order by bars with the cumulative total represented by the line. The purpose of the Pareto chart is to highlight products that are the highest contributors to GSK’s sales revenue. We filtered out the top products that make up 80% of total sales revenue. This is marked out by the dotted reference line on the graph shown above:

1. Seretide/ Advair 2. Infanrix 3. Zyrtec 4. Keppra 5. Veramyst/ Avamys 6. Rotarix
7. Augmentin Tab 8. Avodart/ Avolve Others 9. Dudart/ Jalyn 10. Ventolin 11. Twinrix/ Ambirix 12. Augmentin Syr
13. Varilrix 14. Priorix-Tetra (MMRV) 15. Engerix/ Fendrix 16. Epzicom/ Kivexa 17. Epivir 18. Flixotide/ Flovent
19. Zeffix 20. Boostrix 21. Cervarix 22. Hepsera- Adefovir 23. Lacimictal Others 24. Fluarix

We compared the list with the power brands that the company provided us with and suggested to include the products listed above to conduct a more comprehensive analysis. After consultation with the sales manager, we were instructed to include 4 other products in our analysis: Relva, Anoro, Incruse, Havrix.


3.0 Confirmatory Analysis


3.1 Overview


After cleaning and preparing the data, we decided to work on our confirmatory analysis. The main objective of our client was to assess the sensitivity of the sales of the top 24 products to different levels of interactions (face to face or phone web).

To achieve said results, our first step was to analyse the therapy areas and measure the sensitivity of each therapy area’s sales to the different levels of interactions. For this step, all the products in each therapy area was used.

After analysing each therapy area, our second step was to focus on the top 24 products which fell into either of the 9 different therapy areas. If the sales of a product, was sensitive to different levels of interactions, we then proceeded to dig deeper by focusing on the different channels offered by each product. This ultimately gave us a coherent understanding of the sensitivity of each of the products and provided clearer business solutions for our client.

For each interaction type, we went from annual > therapy area > Product > Channel.


3.2 Case Study (Seretide | 335)


3.2.1 Analysis of Therapy Areas

For this paper, we will be focusing on one therapy area and assess the sensitivity of sales of a product to get a finer understanding of the steps involved in our analysis. As our client has mentioned before, Respiratory is one of their key therapy areas in generating sales. Also, we will be focusing on the year of 2016 for our analysis, as we have the full year data compared to 2017 where we only have the latest data till August 2017. As such, we use the 2017 results as a validation for our 2016 results.

We first run our sample containing Respiratory as the therapy area through a box plot analysis to have a general outline of the data we are dealing with. As we can see, our exploratory data analysis for face to face interactions shows that the low interaction bin (0-1 interaction counts) has the highest sales compared to the other two bins. Also, there is a lot of variation among the three different interaction bins.

EDA for F2F interaction, Respiratory

We run a similar box plot for phone web interactions for Respiratory for the year of 2016. We see similar results with low interaction bin having highest sales but with a lot of variation among the three different interaction bins. We also see that low interaction bin has the lowest sales as well.

EDA for Phone Web interaction, Respiratory

After the exploratory analysis, we move to the confirmatory analysis via the ANOVA test. We use Tukey HSD test to compare the means of the three different interaction levels for each type of interactions.

For Face to Face interactions, from the ordered differences report we see that there is a significant difference between high and low bin and high and medium bin (with p value less than 0.05). This is also confirmed from the circle comparison graph below which shows that the high circle is significantly different from the low and medium bin.

One-way Analysis for Face-to-Face Interaction

This is further verified by the non-parametric Wilcoxon’s test which shows a significant value(with ChiSq. Value less than 0.05) which asserts our evaluation that face to face interactions does have a significant impact on Respiratory Therapy Area.

For Phone Web interactions, from the ordered difference report we see that there is no significant difference between any of the different levels of interactions with p value for each result being higher than 0.05. This is also confirmed from the comparison circles shown below in the graph.

One-way Analysis for Phone Web Interaction

This is further verified by the non-parametric Wilcoxon’s test which doesn’t show significant result for the comparison of the means.


3.2.1.1 Interpretation

The above results can be interpreted as such:

For Face to Face interactions, for the year of 2016, for the therapy area “Respiratory”, there is a significant difference between high interaction level and the other two interaction levels (medium and low) which states that sales is sensitive to higher level of interactions on a monthly basis. On the other hand, there is not enough statistical evidence to show that sales is sensitive for any level of phone web interactions.

From a business perspective, this means that on a general overview, our client should focus on shifting their low face to face interactions to higher face to face interactions for Respiratory therapy area, to generate more sales. This should only be done if the profits earned from such an action is higher than the costs needed to redirect all their sales representatives to higher face to face interactions.


3.2.2 Analysis by Products

We consider Seretide (335) which is the top product of our client and falls under Respiratory Therapy Area. Seretide also has one of the bigger sample sizes due to its popularity in the market and has generated the maximum revenue as gathered from our initial exploratory data analysis. Like before, we look only into the 2016 data and later verify through the 2017 data.

We first run a box plot analysis for both face to face and phone web interactions and we see that medium level of interaction has the highest sales for face to face while the high level of interaction for phone web has an insignificant amount of data. Also the low level of interaction for phone web has the highest sales.

We proceed with the ANOVA and Tukey HSD test for Seretide for 2016. We see that the circle comparison graph shows us that high level of interaction is significantly different from the other two levels of interactions for face to face which is further reconfirmed by the ordered difference report from Tukey HSD test. The ordered difference report also tells us that high interaction is significantly different from low and medium bins (with a p value less than 0.05).

We do similar analysis for phone web interactions for Seretide and we see that there is no significant difference among any of the interaction levels according to the ordered difference report and the circle comparison graph.

Following which, we translate the significant differences back to absolute sales revenue.


3.2.3 Analysis by Channels

We dig deeper by considering all the channels for the product, Seretide. Only “Specialist” had significant results with similar results to our product analysis. The ordered report tells us that high interaction is significantly different from the other two levels of interactions with a p-value less than 0.05. Our analysis indicates that a shift from low to high interaction counts would result in an increase in sales.


3.2.4 Confirmatory Analysis for Seretide

To validate our 2016 results, we run the similar analysis for 2017 for the same product.


3.2.4.1 Analysis by Channels


We see findings which are consistent with our previous results. High interaction level is still significantly different than the other two for Seretide. Consistent with 2016, Specialist as a channel is the only one with significant differences between High interaction bin mean and the other two interaction bin means.




3.3 Summarised Table for 2016


Names have been ommitted for confidentiality
Product Code Face-to-Face Phone Web Product Code Face-to-Face Phone Web Product Code Face-to-Face Phone Web |
335 Sensitive - AEL Sensitive - AGK Sensitive -
AAH Sensitive Sensitive 287 Sensitive - ABT Sensitive -
AAP Sensitive - AAT - - AAZ Sensitive -
AAO Sensitive - 362 - - 474 - -
304 Sensitive - ABA Sensitive Sensitive ABW Sensitive -
312 - -
4.0 Limitations


4.1 Analytical Findings

There was a lack of information to tie interaction entries with the customers to the actual sales invoices issued due to (i) the rotation of Sales Representatives, (ii) the usage of HCP’s account code vs HCO’s parent account code (differing fields). The different names given to this customer identification in each file, though unique in the values, rendered us unable to efficiently combine the interactions and the invoices without having to manually look through every unique customer identification in invoice and match to either the list from HCO or HCP. For the sake of this project, we settled for analysing based on monthly basis (aggregate and group by months).

Interaction duration was not accurately logged in the Customer 360 SG file. As such, we could not carry out any analysis regarding the interaction duration. Even for interaction counts, there were many cases of exceeding the maximum intensity stipulated by Client (2 products in one interaction sitting), where the rationale for limiting Sales Representatives to such a rule was not explained clearly either, from a business point of view.

Data for 2017 did not consist of a whole year’s data. Generally lacking in data, our project team could only analyse 2016 and utilise 2017 as validation and confirmatory. We could not look at trends in sensitivities across years.




4.2 Managerial Recommendations


We recommend for the Client to track not just instances of unique customer to Sales Representative, but also instances of each unique customer-sales rep-product or customer-sales rep-interaction sitting format. This would allow for easier aggregation of data without overlap. Also, the Client could consider having a field specifying if a certain entry from Sales Invoice and Customer 360 SG were taken from HCO or HCP list. At present there were cases where “Primary Parent” was input, but not for all belonging to HCO, while also the customer field in Customer 360 SG (interaction) was linked to the customer account name in HCP, but for HCO it was the representative of the HCO. This meant that a similar customer code was conflicted in two different files.

We recommend for the Client to approach the project Supervisor for a team to look at backend software interfaces, especially the Sales Representatives’ logging of data, to see if they can change the way the data is entered into the system. Also recommend to look into Data Management.

Lastly, we recommend for the Client to continually follow up with the analysis to validate the findings for each year. The Client may also wish to look at different time frames, other than just a year, since some of the sales take place on an annual tender renewal basis (Client should also consider removing these from the analysis since the tender is an extraneous factor affecting the sales, and not interaction sensitive).


5.0 Conclusion


Going forward, future research could look at the most highly sensitive product out of the list of those that undergo phone web interaction. In this project we could only manage to analyse sensitivity based on interaction counts. With proper linkage of sales invoices and customer interactions, deeper analysis could take place to provide more meaningful recommendations.


6.0 References

Aluregowda, A. (2013). Organized Retail Strategy – A Study at Loyal World Supermarket. IOSR Journal of Business and Management, 7(1), 75-80. doi:10.9790/487x-0717580


Anic, I., & Radas, S. (2006). The Relationships Between Shopping Trip Type, Purchases Made on Promotion, and Unplanned Purchases for a High/Low Hypermarket Retailer. Privredna kretanja i ekonomska politika , 107 , 27-45. Retrieved December 1, 2017.


Breierova L.,& Choudhari M. (2001). An Introduction to Sensitivity Analysis.


Tan K., Yee M., & Xiang S. (2016) Using Analysis of Variance(ANOVA) method in the Identification of Unequal Means Resulting from Sales Interactions.