Come back after 30 days!/Findings

From Analytics Practicum
Revision as of 13:49, 7 February 2015 by Nicholaslee.2011
Summary of findings by week

Week 2

  • Literature review: Learned that decision trees and logistic regression are the methods commonly used to predict a binary outcome.
  • Literature review: Learned the commonly used predictors (e.g. age, and proxies of socioeconomic status such as ward admitted) and the contexts in which researchers arrived at their findings.

Week 3

  • Consultation with Prof. Kam: Reached a consensus to build 2 models (a decision tree and a logistic regression) and compare their predictive power in this project. Prompted to think about storyboarding the project for the eventual deliverables, and about the implications for the data should we make drastic changes to the dataset (e.g. omitting a variable such as drugs taken merely because we think it is uncorrelated with the response variable).
  • Exploratory data analysis: One possible predictor variable has missing values in 40% of the records, and another in 2.3%. Considered imputing the missing values with the most frequently occurring value (the mode).
  • Literature review: Learned about data transformation methods (e.g. the Box-Cox transformation, which brings a continuous variable closer to a normal distribution) through a SAS conference paper.
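
The two data-preparation ideas from Week 3 can be sketched as follows. This is a minimal illustration in plain Python, not the team's actual tooling: the column values, the "?" missing-value marker and the function names are assumptions made for the example, and the Box-Cox parameter lambda is passed in by hand rather than estimated from the data.

```python
import math
from collections import Counter


def impute_mode(values, missing=None):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v != missing]
    if not observed:
        # Nothing to learn the mode from; return the data unchanged.
        return list(values)
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]


def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


# Hypothetical categorical column; "?" marks a missing entry.
payer = ["MC", "?", "HM", "MC", "?", "MC"]
imputed = impute_mode(payer, missing="?")

# Hypothetical continuous values transformed with lambda = 0.5.
transformed = [box_cox(x, 0.5) for x in [1.0, 4.0, 9.0]]
```

Mode imputation keeps every record but inflates the share of the most common category, which matters more for the variable with 40% missing values than for the one with 2.3%.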

Week 4

  • Consultation with Prof. Kam: Understood the need to conduct univariate analysis to understand each variable's distribution (skewed or normal for continuous variables, etc.), and bivariate analysis of the continuous variables to determine the extent of multicollinearity.
  • Exploratory data analysis:
    • Computed the percentage of unique values, i.e. unique patients, to gauge how many patients were readmitted in total, because each row of the dataset is a discrete hospital encounter with a patient, not a record of the patient. Unique patients account for 70.28% of the rows in the dataset.
    • The dataset has a three-level response variable (not readmitted, readmitted within 30 days, readmitted after 30 days). The team decided to work with a binary outcome by grouping "readmitted after 30 days" with "not readmitted", because the project focuses only on the possibility of readmission within 30 days. Besides, "readmitted after 30 days" also covers a patient readmitted as much as a year later. We were informed that a patient readmitted on the 31st or 32nd day is, in reality, similar to one readmitted within 30 days. However, we decided to keep the dataset's strict 30-day cut-off and define the outcome accordingly, for simplicity of analysis.
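
As a rough sketch of the two Week 4 steps, the snippet below computes the share of rows that belong to distinct patients and collapses the three-level response into a binary target. The field names (`patient_id`, `readmitted`) and the label strings are hypothetical stand-ins, since the findings do not spell out the dataset's actual column names or codes.

```python
def unique_patient_share(encounters, id_key="patient_id"):
    """Percentage of encounter rows that correspond to distinct patients."""
    if not encounters:
        return 0.0
    unique_ids = {row[id_key] for row in encounters}
    return 100.0 * len(unique_ids) / len(encounters)


def to_binary(label):
    """1 for readmission within 30 days, 0 for every other outcome."""
    return 1 if label == "within_30_days" else 0


# Hypothetical encounter-level rows: one row per hospital visit.
encounters = [
    {"patient_id": 1, "readmitted": "not_readmitted"},
    {"patient_id": 1, "readmitted": "within_30_days"},
    {"patient_id": 2, "readmitted": "after_30_days"},
    {"patient_id": 3, "readmitted": "not_readmitted"},
]

share = unique_patient_share(encounters)  # 3 distinct patients over 4 rows
targets = [to_binary(row["readmitted"]) for row in encounters]
```

Here "after_30_days" maps to 0, which mirrors the team's decision to treat late readmissions as not readmitted.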

Week 5

  • Exploratory data analysis: Considered ways to handle the 2 variables that have missing values. Proposed regression imputation for the variable with 40% of its values missing, but was cautioned that this may introduce errors.
  • Consultation with Prof. Kam:
    • We were advised to take a two-pronged approach: build one model without the 40% of records where the variable is missing, and another model without the variable entirely. Comparing the predictive power of the two models lets us judge whether the variable is a useful predictor. We were also advised that this judgment can be presented as an eventual recommendation.
    • Learned the next steps in the data mining process, which the team will embark on next: creation of dummy variables, selection of variables for model construction, data sampling and the two-pronged approach.
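
The data side of the two-pronged approach, together with dummy variable creation, might look like the sketch below. It only prepares the two datasets and does not fit the decision tree or logistic regression. The column names, category values and helper names are hypothetical, and dropping the first category level as the reference is an assumption of this example, not something stated in the consultation.

```python
def drop_missing_rows(rows, column, missing=None):
    """Prong A: keep the variable, drop the records where it is missing."""
    return [row for row in rows if row.get(column) != missing]


def drop_column(rows, column):
    """Prong B: keep every record, drop the variable entirely."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def add_dummies(rows, column):
    """Replace a categorical column with 0/1 indicator columns.

    The first level (in sorted order) is omitted and acts as the reference.
    """
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels[1:]:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Hypothetical records; None marks a missing value of the sparse variable.
rows = [
    {"age_band": "60-70", "payer": "A", "target": 1},
    {"age_band": "70-80", "payer": None, "target": 0},
    {"age_band": "60-70", "payer": "B", "target": 0},
]

prong_a = add_dummies(drop_missing_rows(rows, "payer"), "payer")
prong_b = drop_column(rows, "payer")
```

Fitting the same model type on `prong_a` and `prong_b` and comparing predictive power on held-out data would then indicate whether the sparse variable earns its place as a predictor.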