Revision as of 17:03, 23 April 2017
Click here to return to AY16/17 T2 Group List
Multiple Linear Regression Model
What makes a good Facebook post? This section outlines the explanatory model built on the article dataset from Facebook Insights, supplemented with our crawled variables to form a more complete article dataset.
Response / Dependent Variables
We use “Total Engagement” as the response (dependent) variable. “Total Engagement” for each post is the sum of the total number of reactions (like, love, wow, haha, angry, sad), comments and shares of that post as of the data retrieval date. Reactions are similar to ‘likes’ on Facebook, but provide the additional option of reacting with one of five animated emoji rather than a simple ‘like’.
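As a minimal sketch of the definition above, the response variable can be computed per post as follows. The field names (`like`, `comments`, `shares`, and so on) are assumptions for illustration and may not match the column labels in the Facebook Insights export; the counts are made up.

```python
# Hypothetical field names; the real Facebook Insights export may label them differently.
REACTION_TYPES = ("like", "love", "wow", "haha", "angry", "sad")


def total_engagement(post):
    """Sum reactions, comments and shares for one post (missing counts are treated as 0)."""
    reactions = sum(post.get(name, 0) for name in REACTION_TYPES)
    return reactions + post.get("comments", 0) + post.get("shares", 0)


# Made-up counts, for illustration only.
example_post = {
    "like": 120, "love": 15, "wow": 3, "haha": 7, "angry": 1, "sad": 2,
    "comments": 30, "shares": 12,
}
print(total_engagement(example_post))
```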
Other possible response variables include the comment sentiment score measures and the individual engagement metrics, but they are ruled out for reasons such as their non-normal distributions and their utility to our sponsor.
Explanatory / Independent Variables
Article Dataset Metadata for Analysis
Header | Description
Post Message Sentiment | Crawled Variable: Sentiment score calculated using a Python script (run in PyCharm), the AFINN sentiment word list and an emoji package
Article Text Sentiment | Derived Variable: Sentiment score calculated using a Python script (run in PyCharm), the AFINN sentiment word list and an emoji package
Number of Images | Crawled Variable: Number of images in the article
Number of Videos | Crawled Variable: Number of videos in the article
Number of Links | Crawled Variable: Number of embedded links in the article
Number of syllables | Crawled Variable: Number of syllables within the text
Word count | Crawled Variable: Total word count
Sentence count | Crawled Variable: Total sentence count
Words per Sentence | Crawled Variable: Number of words per sentence in the body of text
Flesch reading ease | Crawled Variable: Readability index value of Flesch reading ease
Flesch kincaid grade | Crawled Variable: Readability index value of Flesch kincaid grade
Gunning fog | Crawled Variable: Readability index value of Gunning fog
Smog index | Crawled Variable: Readability index value of Smog index
Automated readability index | Crawled Variable: Readability index value of Automated readability index
Coleman liau index | Crawled Variable: Readability index value of Coleman liau index
Linsear write formula | Crawled Variable: Readability index value of Linsear write formula
Dale chall readability score | Crawled Variable: Readability index value of Dale chall readability score
Difficult words count | Crawled Variable: Total count of difficult words
Article Category | Crawled Variable: The category of the article; categorical, 9 levels
Day of Week | Derived Variable: The day of the week from the (adjusted) posted column of the article; categorical, 7 levels
Time Interval (Hour) | Derived Variable: The time interval of the article, derived from recursive splitting of the hour in the time of day column to coincide with morning, afternoon, evening and night; categorical, 4 levels
Article Authors | Crawled Variable: The author of the article. Authors who wrote fewer than 9 articles are collectively grouped into "Others"; categorical, 20 levels
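The author grouping described in the table can be sketched as below. This is an illustration under stated assumptions rather than the script actually used: the function name `group_rare_authors`, the label `"Others"` and the sample author names are hypothetical; only the threshold of 9 articles comes from the table.

```python
from collections import Counter


def group_rare_authors(authors, min_articles=9, other_label="Others"):
    """Replace every author with fewer than `min_articles` articles by `other_label`."""
    counts = Counter(authors)
    return [a if counts[a] >= min_articles else other_label for a in authors]


# Made-up author column: A wrote 9 articles, B wrote 8, C wrote 1.
authors = ["A"] * 9 + ["B"] * 8 + ["C"]
grouped = group_rare_authors(authors)
print(Counter(grouped))
```

Grouping sparse levels this way keeps the categorical variable to a manageable number of levels, so that each level has enough observations to estimate its effect.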
Data Transformation / Excluding Outliers
We transform the variables to make them more suitable for regression analysis. We apply both a square root transformation and a natural logarithm transformation to every response and explanatory variable whose distribution is not normal, in order to reduce skewness and yield a more nearly normal distribution.
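The effect of the two transformations on skewness can be checked with a short sketch like the one below. It assumes strictly positive values (the natural logarithm is undefined at zero, and how zero counts were handled is not stated here), uses the simple moment-based skewness coefficient, and runs on made-up numbers; the helper names are hypothetical.

```python
import math


def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def candidate_transforms(values):
    """Return the raw, square-root and natural-log versions of a positive-valued variable."""
    return {
        "raw": list(values),
        "sqrt": [math.sqrt(v) for v in values],
        "ln": [math.log(v) for v in values],
    }


# Made-up, right-skewed counts, for illustration only.
engagement = [5, 8, 12, 15, 20, 30, 45, 80, 150, 400]
skews = {name: skewness(vals) for name, vals in candidate_transforms(engagement).items()}
for name, value in skews.items():
    print(name, round(value, 3))
```

A skewness closer to zero indicates a more symmetric distribution, which is one informal way to compare the candidate transformations before inspecting the histograms.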
Transforming the Response Variables and removing the outliers
Outliers among the explanatory variables are identified from the distribution of each independent variable as well as from the scatterplots of the response variable against the explanatory variables. We remove the data points circled in the figure as outliers.
Transforming the Explanatory Variables and removing the outliers
Bivariate Fit
We also conduct bivariate analysis of the response variable against each transformed explanatory variable to review the linearity of the fit. This step helps us decide whether transforming the variable is necessary, and we pick the transformation that gives the highest R² value.
Bivariate fit of difficult words count. We select the SQRT transformation instead of the Ln transformation.
This is repeated across all the explanatory variables, and we find that all the readability indices have very poor R² values (close to zero). We then examine whether the stepwise model still picks these measures even though, individually, the variables do not have strong explanatory power.
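The selection rule described above (fit the response against each candidate transformation and keep the one with the highest R²) can be sketched as follows. This is an assumption-laden illustration, not the tool actually used for the analysis: the helper names are hypothetical, the explanatory variable is assumed to be strictly positive, and the data are synthetic, constructed so that the response is exactly linear in the square root of x.

```python
import math


def r_squared(x, y):
    """R^2 of the least-squares line y = a + b*x (the squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def best_transform(x, y):
    """Score the raw, sqrt and ln versions of x against y and return the best name and all scores."""
    candidates = {
        "raw": list(x),
        "sqrt": [math.sqrt(v) for v in x],
        "ln": [math.log(v) for v in x],
    }
    scores = {name: r_squared(vals, y) for name, vals in candidates.items()}
    return max(scores, key=scores.get), scores


# Synthetic data: y is exactly linear in sqrt(x). Illustration only.
x = [1, 4, 9, 16, 25, 36, 49, 64]
y = [2 + 3 * math.sqrt(v) for v in x]
name, scores = best_transform(x, y)
print(name, {k: round(v, 4) for k, v in scores.items()})
```

For a readability index with an R² close to zero under every transformation, this comparison would show all three scores near zero, which matches the observation that these variables have little explanatory power individually.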
Multi-collinearity
Stepwise Regression
Evaluation of Model Fit
Model Assumptions
Interpretation and Managerial insights