AY1516 T2 Team13 Natasha Studio Findings LogReg

From Analytics Practicum
Revision as of 14:42, 17 April 2016 by Amy.tan.2012 (talk | contribs)
Jump to navigation Jump to search

HOME

TEAM

PROJECT OVERVIEW

FINDINGS & ANALYSIS

PROJECT MANAGEMENT

DOCUMENTATION

EXPLORATORY DATA ANALYSIS OTHER ANALYSIS DATABASE CREATION ASSOCIATION RULE MINING LOGISTIC REGRESSION

Data Transformation

Process Flow of Logistic Regression Analysis

Before conducting logistic regression, data transformation is required to obtain the members’ package utilization rate. Figure 33 shows the process of data transformation before logistic regression can be performed. Microsoft Excel is the primary tool for such data transformation and utilization analysis.

Firstly, the Attendance data and Purchase data are aligned in terms of time period. The alignment is necessary as the variable “Total Classes Attended” is obtained from the Attendance dataset, while the “Total Classes Purchased” is obtained from the Purchases dataset. As the Attendance data provided by Natasha Studio runs from March 2014 to March 2015, the Purchases data involved in this analysis is thus scoped accordingly.

Furthermore, in the Purchase data, the “DateStart” and “Date End” variables are missing for some of the records. “DateStart” indicate the dates which the members activate their purchased packages, while “Date End” indicates the dates where the package validity would end. These two dates are different from the “DatePurchased” variable which indicates the date when the member purchase the packages. This means that the package purchasing date is independent of the package utilization dates.

During the data preparation stage earlier, it was found that there are significant amount of missing data for the “DateStart” and “Date End” variables. A clear package start date and end date is important so as to calculate the total number of lessons purchased by members accurately. As such, members whose “DateStart” and “Date End” variables are missing are excluded from the model. Next, the Purchase data is filtered into “Course Purchase”, “Open Class Purchase” and “Unlimited 1 Month Purchase”. For instance, “Open Class Purchase” contains all open class course purchases within the period from March 2014 to March 2015. We focused the analysis for unlimited package on the unlimited 1 month package as it makes up 96% of all unlimited packages bought within this time period Next, members’ lists are created for each of the package type based on the “COUNTIF” function to identify members who purchased each of the three packages.

Open Class

Utilization Rate

OpenClassStat

After obtaining the utilization rate, an analysis for utilization rate is shown in the table above. The table shows that the average open class utilization rate is 65%. Meanwhile, it can also be observed that the range of open class utilization rate is high, as the maximum and minimum utilization rate is 100% and 0% respectively. The open class utilization rate is also skewed towards the higher end, where the 80th percentile is also 100%.

Logistic Regression

In order to ascertain if open class package utilization rate can explain the open class package repurchase, a logistic regression is performed. The table below lists the variables involved in the logistics regression model and its hypotheses. The setting of SAS EG is also adjusted where the probability output is modelled for “Repurchase = 1”.

LogReg for Open Class

Results

LogRegResults for Open Class

The logistic regression result obtained from SAS EG is shown the table above. Firstly, looking at the test for global null hypothesis, the highest p-value is 0.0075. Using a significance level of 5%, the p-value is lower than the significance level and hence null hypothesis is rejected. We conclude that open class utilization rate does explain the probability of open class repurchase and beta is not equal to 0.

LogRegEqn for Open Class

This gives us the equation above, which shows that for a 1% increase in utilization rate, we see a 3% increase in odds of repurchasing the open class package.

Evaluating fit of model

Next, the fit of the logistic regression model is evaluated based on the following measures:

  • Area under ROC curve
  • Odds Ratio Estimate
  • Pair concordant and pair disconcordant
Area under ROC Curve

ROC Curve for Open Class

A model that has area under ROC curve higher than 0.7 are considered reasonable and considered strong if the value exceeds 0.8. The area under ROC curve is 0.6987 as shown the figure above, which is very close to the criteria of 0.7. Hence, this indicate that the fit of the model is reasonable.

Odds Ratio Estimate

Odds Ratio for Open Class

A model that has a 95% confidence interval of odds ratio which contains 1 is not statistically significant that there is an association between the dependent variable and the independent variable. This is because the odds ratio of 1 means that there is the independent variable has no effect on the dependent variable. As we see in the table above, the odds ratio 95% confidence interval does not contain 1 and hence can conclude that the open class utilization rate has an effect on the open class repurchase variable.

Pair Concordant and Pair Disconcordant

Pairs Concordant for Open Class

Percent concordant is the percentage of pairs in which the model estimates a higher probability for the member with the higher response (i.e. member who have higher utilization rate has a higher probability of repurchasing the package). Percent discordant is the contrary of percent concordant. This means that a high percentage of concordant pairs indicates a good-fit for the model. As our model has a 64.9% of concordant pairs as seen the table above, which is a moderately high percentage, this once again indicates that the fit of model is reasonable.