AY1516 T2 Team13 Natasha Studio Findings LogReg

From Analytics Practicum
Revision as of 14:34, 17 April 2016 by Amy.tan.2012 (talk | contribs)
Jump to navigation Jump to search

HOME

TEAM

PROJECT OVERVIEW

FINDINGS & ANALYSIS

PROJECT MANAGEMENT

DOCUMENTATION

EXPLORATORY DATA ANALYSIS OTHER ANALYSIS DATABASE CREATION ASSOCIATION RULE MINING LOGISTIC REGRESSION

Data Transformation

Process Flow of Logistic Regression Analysis

Before conducting logistic regression, data transformation is required to obtain the members’ package utilization rate. Figure 33 shows the process of data transformation before logistic regression can be performed. Microsoft Excel is the primary tool for such data transformation and utilization analysis.

Firstly, the Attendance data and Purchase data are aligned in terms of time period. The alignment is necessary as the variable “Total Classes Attended” is obtained from the Attendance dataset, while the “Total Classes Purchased” is obtained from the Purchases dataset. As the Attendance data provided by Natasha Studio runs from March 2014 to March 2015, the Purchases data involved in this analysis is thus scoped accordingly.

Furthermore, in the Purchase data, the “DateStart” and “Date End” variables are missing for some of the records. “DateStart” indicate the dates which the members activate their purchased packages, while “Date End” indicates the dates where the package validity would end. These two dates are different from the “DatePurchased” variable which indicates the date when the member purchase the packages. This means that the package purchasing date is independent of the package utilization dates.

During the data preparation stage earlier, it was found that there are significant amount of missing data for the “DateStart” and “Date End” variables. A clear package start date and end date is important so as to calculate the total number of lessons purchased by members accurately. As such, members whose “DateStart” and “Date End” variables are missing are excluded from the model. Next, the Purchase data is filtered into “Course Purchase”, “Open Class Purchase” and “Unlimited 1 Month Purchase”. For instance, “Open Class Purchase” contains all open class course purchases within the period from March 2014 to March 2015. We focused the analysis for unlimited package on the unlimited 1 month package as it makes up 96% of all unlimited packages bought within this time period Next, members’ lists are created for each of the package type based on the “COUNTIF” function to identify members who purchased each of the three packages.

Open Class

Utilization Rate

OpenClassStat

After obtaining the utilization rate, an analysis for utilization rate is shown in the table above. The table shows that the average open class utilization rate is 65%. Meanwhile, it can also be observed that the range of open class utilization rate is high, as the maximum and minimum utilization rate is 100% and 0% respectively. The open class utilization rate is also skewed towards the higher end, where the 80th percentile is also 100%.

Logistic Regression

In order to ascertain if open class package utilization rate can explain the open class package repurchase, a logistic regression is performed. The table below lists the variables involved in the logistics regression model and its hypotheses. The setting of SAS EG is also adjusted where the probability output is modelled for “Repurchase = 1”.

LogReg for Open Class

Results

LogRegResults for Open Class

The logistic regression result obtained from SAS EG is shown the table above. Firstly, looking at the test for global null hypothesis, the highest p-value is 0.0075. Using a significance level of 5%, the p-value is lower than the significance level and hence null hypothesis is rejected. We conclude that open class utilization rate does explain the probability of open class repurchase and beta is not equal to 0.

LogRegEqn for Open Class

This gives us the equation above, which shows that for a 1% increase in utilization rate, we see a 3% increase in odds of repurchasing the open class package.

Evaluating fit of model

Next, the fit of the logistic regression model is evaluated based on the following measures:

  • Area under ROC curve
  • Odds Ratio Estimate
  • Pair concordant and pair disconcordant