Come back after 30 days!/Findings

From Analytics Practicum




Summary of findings by week

This page summarizes the key pointers and major decisions the team has taken in this data mining project. Visit our documentation page for the slides and more information.

Week 2

  • Literature review: Learned that the methodologies commonly used to predict a binary outcome include decision trees and logistic regression.
  • Literature review: Learned about the commonly used predictors (e.g. age, and proxies of socioeconomic status such as the ward admitted to) and the contexts in which researchers arrived at their findings.

Week 3

  • Consultation with Prof. Kam: Reached a consensus to employ 2 models (i.e. decision trees and logistic regression) and to evaluate their predictive power in this project. Prompted to think about project storyboarding for the eventual deliverables, and about the implications for the data should we make drastic changes to the dataset (e.g. omitting variables such as drugs taken simply because we think they have no correlation with the response variable).
  • Exploratory data analysis: One possible predictor variable has missing values for 40% of the records, and another for 2.3%. Considered imputing the missing values with the most frequently occurring value (the mode).
  • Literature review: Learned about data transformation methods (e.g. the Box-Cox transformation, which brings a continuous variable closer to a normal distribution) through a SAS conference paper.
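
The Box-Cox transformation mentioned above can be sketched in plain Python. This is only an illustration, not the procedure from the SAS paper: the function names and the coarse grid of candidate lambda values are assumptions made for this example. For a strictly positive value x, the transform is (x^lambda - 1) / lambda, or ln(x) when lambda is 0, and lambda is chosen here by maximizing the usual profile log-likelihood under a normal model.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def profile_log_likelihood(values, lam):
    """Log-likelihood of lambda (up to a constant) assuming normal transformed data."""
    n = len(values)
    transformed = box_cox(values, lam)
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    log_sum = sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + (lam - 1.0) * log_sum


def best_lambda(values, grid=None):
    """Pick lambda from a grid (assumed: -2.00 to 2.00 in steps of 0.01)."""
    if grid is None:
        grid = [i / 100 for i in range(-200, 201)]
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))


# Toy right-skewed data (illustrative only, not taken from the project dataset).
skewed = [math.exp(k / 2) for k in range(-4, 5)]
chosen = best_lambda(skewed)
normalized = box_cox(skewed, chosen)
```

A lambda close to 0 means a log transform is appropriate, while a lambda close to 1 means the variable needs little or no transformation.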

Week 4

  • Consultation with Prof. Kam: Understood the need to conduct univariate analysis to understand each variable's distribution (e.g. skewed or approximately normal for continuous variables), and bivariate analysis of continuous variables to determine the extent of multicollinearity.
  • Exploratory data analysis:
    • Found the percentage of unique patients (and hence the total number of readmitted patients), because each row of the dataset is a discrete hospital encounter with a patient, not the record of a patient. The number of unique patients encountered by the hospital is 70.28% of the number of rows in the dataset.
    • The dataset has a three-level response variable (i.e. not readmitted, readmitted within 30 days, readmitted after 30 days). The team decided to focus on a binary outcome by grouping "readmitted after 30 days" with "not readmitted", because our project focuses only on the possibility of readmission within 30 days. Besides, "readmitted after 30 days" is open-ended and could just as well mean a readmission a year later. We were informed that a patient could be readmitted on the 31st or 32nd day, which in reality is similar to a readmission within 30 days. However, we decided to keep to the dataset's definition for simplicity of analysis.
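
The two Week 4 steps above can be sketched as follows. This is a minimal illustration in plain Python: the label strings, the `patient_id` and `readmitted` field names and the toy rows are all hypothetical, since the dataset's actual coding is not given on this page.

```python
# Hypothetical labels for the three-level response; the real dataset may code them differently.
WITHIN_30 = "within_30_days"
AFTER_30 = "after_30_days"
NOT_READMITTED = "not_readmitted"


def to_binary_outcome(label):
    """Return 1 only for readmission within 30 days; both other levels become 0."""
    if label not in (WITHIN_30, AFTER_30, NOT_READMITTED):
        raise ValueError(f"unexpected readmission label: {label!r}")
    return 1 if label == WITHIN_30 else 0


def unique_patient_share(encounters):
    """Number of distinct patients as a percentage of the number of encounter rows."""
    patients = {row["patient_id"] for row in encounters}
    return 100.0 * len(patients) / len(encounters)


# Toy encounter-level rows (illustrative only): patient 1 appears twice.
encounters = [
    {"patient_id": 1, "readmitted": WITHIN_30},
    {"patient_id": 1, "readmitted": NOT_READMITTED},
    {"patient_id": 2, "readmitted": AFTER_30},
    {"patient_id": 3, "readmitted": NOT_READMITTED},
]

share = unique_patient_share(encounters)
binary = [to_binary_outcome(row["readmitted"]) for row in encounters]
```

Raising an error on an unknown label is a deliberate choice in this sketch, so that an unexpected coding is not silently counted as "not readmitted".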

Week 5

  • Exploratory data analysis: Considered ways to handle the 2 variables that have missing values. Proposed regression imputation for the variable with 40% of its values missing, but were cautioned that it may introduce errors. Literature was obtained mostly from David Howell's page.
  • Consultation with Prof. Kam:
    • For the variable with 40% missing values, we were advised to take a two-pronged approach (i.e. one model built without the 40% of records that are missing the variable, and one model built without the variable entirely). Comparing the predictive power of the two resulting models lets us judge whether the variable should be considered a predictor. We were advised that this judgment can be presented as one of our eventual recommendations.
    • For the variable with 2.23% missing values, we decided to remove the affected 2.23% of records, which was agreed on the basis of minimal impact.
    • Apprised of the next steps in the data mining process, which the team will embark on: dummy variable creation, selection of variables for model construction, data sampling and the two-pronged approach.
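
The two-pronged approach described above can be sketched as a data preparation step. This is an illustration under stated assumptions: the column names `var_sparse` (the variable with about 40% missing values) and `var_minor` (the variable with about 2.23% missing values), the use of `None` as the missing-value marker and the toy rows are all hypothetical.

```python
MISSING = None  # assumed marker for a missing value


def drop_rows_missing(rows, column):
    """Keep only the rows where the given column is present."""
    return [row for row in rows if row[column] is not MISSING]


def drop_column(rows, column):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def two_pronged_datasets(rows, sparse_column, minor_column):
    """Build the two modelling datasets that are later compared on predictive power."""
    # Records missing the lightly affected variable are removed in both prongs.
    cleaned = drop_rows_missing(rows, minor_column)
    # Prong 1: keep the sparse variable, lose the records where it is missing.
    with_variable = drop_rows_missing(cleaned, sparse_column)
    # Prong 2: keep every remaining record, lose the sparse variable entirely.
    without_variable = drop_column(cleaned, sparse_column)
    return with_variable, without_variable


# Toy rows (illustrative only).
rows = [
    {"id": 1, "var_sparse": "A", "var_minor": "x"},
    {"id": 2, "var_sparse": None, "var_minor": "y"},
    {"id": 3, "var_sparse": "B", "var_minor": None},
    {"id": 4, "var_sparse": None, "var_minor": "x"},
    {"id": 5, "var_sparse": "A", "var_minor": "y"},
]

with_variable, without_variable = two_pronged_datasets(rows, "var_sparse", "var_minor")
```

The same model type would then be fitted on each dataset, and the variable kept as a predictor only if the first model is clearly more predictive than the second.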

Week 6

  • Exploratory data analysis: Informed of the need to test for statistical independence. There are 2 areas of concern:
    • Patients who have more than 2 encounters at the hospital: Patients in this category push the number of readmissions within 30 days higher, as their earlier admissions are already marked "Readmitted". Using Pearson's chi-square test, the p-value obtained is effectively 0 (χ² = 184035, degrees of freedom = 1), indicating that the null hypothesis that encounters of patients admitted more than once are statistically independent should be rejected. Therefore, we use only 1 encounter per patient, whose Readmitted variable indicates whether the patient was subsequently readmitted.
    • 14 variables deemed insignificant are removed from model construction because nearly all of their values fall into a single class (e.g. acetohexamide has "No" for 99.9% of its values).
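
The Week 6 checks above can be sketched in plain Python. The page does not say how the team ran the test, which encounter was kept per patient or what cut-off defined a one-class variable. The sketch therefore assumes a 2x2 contingency table without continuity correction, keeps the first encounter seen for each patient and uses a 99.9% threshold. All function and field names are hypothetical.

```python
import math
from collections import Counter


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def pearson_chi_square_2x2(table):
    """Pearson's chi-square statistic and p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, p_value_df1(stat)


def one_encounter_per_patient(rows):
    """Keep the first encounter seen for each patient (assumed selection rule)."""
    seen = set()
    kept = []
    for row in rows:
        if row["patient_id"] not in seen:
            seen.add(row["patient_id"])
            kept.append(row)
    return kept


def near_constant_columns(rows, threshold=0.999):
    """Columns whose most common value covers at least `threshold` of the rows."""
    flagged = []
    for column in rows[0]:
        counts = Counter(row[column] for row in rows)
        top_share = counts.most_common(1)[0][1] / len(rows)
        if top_share >= threshold:
            flagged.append(column)
    return flagged


# Toy 2x2 table (illustrative counts only, not the project's data).
stat, p_value = pearson_chi_square_2x2([[10, 20], [20, 10]])
```

With a statistic as large as the reported 184035, `p_value_df1` underflows to 0.0 in double precision, which is consistent with the p-value of 0 quoted above.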