R2Z - Methodology

From Analytics Practicum

Tools Used

The cleaned data was loaded into JMP Pro for data preparation and exploration. JMP Pro handles large volumes of data efficiently, which is essential given the size of our sponsor's data (approximately 2.2 million).

Model Building

Line of Best Fit - The line of best fit passes through the middle of the data points when they are plotted. The goal is to find the parameter values for which the model most closely matches the experimental data. To perform the fit, we define a function that measures how far the model is from the data. This function is then minimized with respect to the parameters, and the parameter values that minimize it give the best fit.
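As an illustration of the minimization described above, the following is a minimal Python sketch. It assumes the function being minimized is the sum of squared residuals (ordinary least squares), which the text does not specify. The function names and sample values are hypothetical and are not taken from the project data or from JMP Pro.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a straight line.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """The function being minimized: squared distance between data and line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
best_error = sum_squared_residuals(xs, ys, slope, intercept)
```

Moving the slope or intercept away from the fitted values increases `sum_squared_residuals`, which is what makes the fitted parameters "best" under this assumed criterion.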

Funnel Plot - A funnel plot is a form of scatter plot in which observed area rates are plotted against area populations. Control limits, which represent the expected variation in rates assuming that the only source of variation is stochastic, are then overlaid on the plot. The control limits are computed similarly to confidence limits, and they take on the distinctive funnel shape because expected variability is smaller in larger populations. If rate variation is purely random, an appropriate proportion of the points representing area rates will fall within the funnel. Such a process is considered to be “under control,” and the model fit is considered adequate. However, when many rates fall outside the funnel, the plot can be described as “overdispersed,” and the process is said to be out of control or the model is said to fit the data poorly.
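The control limits can be sketched in Python as follows. This sketch assumes the rates are proportions and uses a normal approximation to the binomial with z = 1.96 (roughly 95% limits). The text says only that the limits are computed similarly to confidence limits, so the exact formula is an assumption. The area names and counts are hypothetical.

```python
import math


def funnel_limits(overall_rate, population, z=1.96):
    """Approximate control limits for a proportion at a given population size.

    Assumes a normal approximation to the binomial distribution.
    """
    se = math.sqrt(overall_rate * (1 - overall_rate) / population)
    lower = max(0.0, overall_rate - z * se)
    upper = min(1.0, overall_rate + z * se)
    return lower, upper


# Hypothetical areas: (name, events, population).
areas = [("A", 12, 100), ("B", 95, 1000), ("C", 1400, 10000)]

total_events = sum(events for _, events, _ in areas)
total_population = sum(pop for _, _, pop in areas)
overall_rate = total_events / total_population

# Flag each area whose observed rate falls outside the funnel.
outside_funnel = {}
for name, events, pop in areas:
    lower, upper = funnel_limits(overall_rate, pop)
    rate = events / pop
    outside_funnel[name] = not (lower <= rate <= upper)
```

The limits narrow as `population` grows, which produces the funnel shape. Areas flagged in `outside_funnel` are the ones that would plot outside it.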

Data Binning - Although scatter plots are widely used, intuitive and easy to understand, they often contain a large number of overlapping data points, which makes them less useful for deriving meaningful insights. Moreover, for a very dense dataset a single scatter plot is too coarse, and dense areas need to be subdivided further. Binning reduces the complexity of large volumes of multi-dimensional data by dividing the plot area into groups of value ranges. Variable binned scatter plots allow large amounts of data to be visualized without overlap by grouping data in a two-dimensional space based on the densities of pairs of variables.
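The basic idea of binning can be sketched in Python as below. For simplicity this sketch uses fixed-width rectangular bins rather than the density-based variable bins described above, so it illustrates the grouping step only. The function name, bin widths and points are hypothetical.

```python
from collections import Counter


def bin_points(points, x_width, y_width):
    """Count how many (x, y) points fall into each rectangular cell."""
    counts = Counter()
    for x, y in points:
        # Floor division maps each coordinate to the index of its bin.
        cell = (int(x // x_width), int(y // y_width))
        counts[cell] += 1
    return counts


# Hypothetical points, for illustration only.
points = [(0.2, 0.3), (0.4, 0.1), (0.9, 0.8),
          (1.5, 0.2), (1.7, 0.4), (1.6, 1.9)]

counts = bin_points(points, x_width=1.0, y_width=1.0)
```

Each cell can then be drawn once and shaded by its count, which avoids plotting overlapping points individually.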

Logistic Regression - Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome that has only two possible values. The goal of logistic regression is to find the best-fitting model to describe the relationship between the dichotomous dependent variable and the independent variables.
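A minimal sketch of fitting such a model in Python is shown below. It assumes a single independent variable and fits by plain gradient descent on the log-loss. This is an illustrative choice and not necessarily the procedure JMP Pro uses. The data, learning rate and iteration count are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Hypothetical data: one independent variable and a 0/1 outcome.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
prob_low = sigmoid(w * 0.5 + b)
prob_high = sigmoid(w * 4.0 + b)
```

With this data the fitted model assigns a low probability of the outcome at small `x` and a high probability at large `x`.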

Decision Tree - A decision tree is a graph that uses a branching method to illustrate the possible outcomes of a decision. It works with both categorical and continuous variables and is mostly used in classification problems. In data mining, decision trees are used to simplify complex strategic challenges and to evaluate the cost-effectiveness of business decisions.
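The branching step can be sketched in Python as a single split on one continuous variable. This sketch assumes binary 0/1 labels and uses Gini impurity as the split criterion, which is one common choice. The text does not say which criterion the project used. The names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity for a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the threshold on one variable with the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Hypothetical data: one continuous variable and a 0/1 class label.
values = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(values, labels)
```

A full tree would apply `best_split` recursively to the left and right groups until the groups are pure or a stopping rule is reached.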