Difference between revisions of "Jarvis Video"

Latest revision as of 22:39, 23 April 2017


Multiple Linear Regression Model

What makes a good Facebook post? This section outlines the explanatory model built on the video dataset from Facebook Insights, supplemented with our crawled variables to form a complete video dataset.

Response / Dependent Variables

We use “Total Engagement” as the response/dependent variable. “Total Engagement” for each post is the sum of the total number of reactions (like, love, wow, haha, angry, sad), comments and shares of that post as of the data retrieval date. Reactions are similar to the ‘likes’ on Facebook, but provide the additional option of reacting with five animated emoji rather than a simple ‘like’ reaction.
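The definition above can be sketched in a few lines of Python. This is only an illustration: the field names (`like`, `comments`, `shares` and so on) are hypothetical and may not match the column names in the Facebook Insights export.

```python
# Hypothetical field names; the real Facebook Insights columns may differ.
REACTION_TYPES = ("like", "love", "wow", "haha", "angry", "sad")


def total_engagement(post: dict) -> int:
    """Sum reactions, comments and shares for a single post record."""
    reactions = sum(post.get(name, 0) for name in REACTION_TYPES)
    return reactions + post.get("comments", 0) + post.get("shares", 0)


# Made-up numbers, purely for illustration.
example_post = {
    "like": 120, "love": 15, "wow": 3, "haha": 7, "angry": 1, "sad": 0,
    "comments": 24, "shares": 10,
}
print(total_engagement(example_post))  # 180
```

Missing fields are treated as zero here, which is a design choice of this sketch rather than something stated for the dataset.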

Other possible response variables include the comment sentiment score measures and the individual engagement metrics, but they were ruled out for reasons such as their non-normal distributions and their limited utility for our sponsor.

Explanatory / Independent Variables

**Video Dataset Metadata for Analysis**
Header	Description
Post Message Sentiment	Crawled Variable: Sentiment score calculated using a Python script (written in PyCharm) with the AFINN sentiment word list and an emoji package
Video Duration in Seconds	Derived Variable: The duration in seconds of the video
Number of actors	Derived Variable: Total Number of actors inside a video
Video Category	Crawled Variable: The categories of the video, 7 levels
Day of Week	Derived Variable: The day of the week from the adjusted posted column of the video, categorical 7 levels
Time Interval (Hour)	Derived Variable: The time intervals of the videos, derived from recursive splitting of the hour from the time of day column to coincide with morning, afternoon, evening and night, categorical 4 levels
Video thumbnail includes Words	Derived Categorical Dummy Variable: 1 being video thumbnail includes words and 0 being video thumbnail does not include words
Video thumbnail includes Actor faces	Derived Categorical Dummy Variable: 1 being video thumbnail includes actor faces and 0 being video thumbnail does not include actor faces
Video thumbnail includes subject matter	Derived Categorical Dummy Variable: 1 being video thumbnail includes subject matter and 0 being video thumbnail does not include subject matter
Actors' names	Derived Variable: Created with JMP's Make Indicator Columns from the Actor multiple-response column. Actors from sponsored companies are marked as external. Actors who appear in fewer than 4 videos are marked as others. There are 101 different actors.
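To make the actor coding in the last row concrete, here is a minimal Python sketch of the same idea. It is not the JMP Make Indicator Columns feature itself; the function name, its parameters and the sample names are all hypothetical.

```python
from collections import Counter


def make_actor_indicators(videos, min_videos=4, sponsored=frozenset()):
    """Build 0/1 indicator columns from a list of per-video actor lists.

    Actors in `sponsored` are relabelled "external"; actors appearing in
    fewer than `min_videos` videos are relabelled "others".
    """
    # Count each actor once per video.
    counts = Counter(actor for cast in videos for actor in set(cast))

    def label(actor):
        if actor in sponsored:
            return "external"
        if counts[actor] < min_videos:
            return "others"
        return actor

    relabelled = [{label(actor) for actor in cast} for cast in videos]
    columns = sorted(set().union(*relabelled))
    rows = [{col: int(col in cast) for col in columns} for cast in relabelled]
    return columns, rows


# Made-up casts, purely for illustration.
videos = [["Ann", "Ben"], ["Ann"], ["Ann", "Cy"], ["Ann", "Dee"], ["Ben"]]
columns, rows = make_actor_indicators(videos, min_videos=4, sponsored={"Dee"})
print(columns)   # ['Ann', 'external', 'others']
print(rows[0])   # {'Ann': 1, 'external': 0, 'others': 1}
```

In this sketch the sponsored check takes priority over the frequency check, which is an assumption about how the two rules combine.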

Data Transformation / Excluding Outliers

We performed the transformation and the exclusion of outliers for the video dataset in the same fashion as for the article dataset, for both the response variable and the explanatory variables.

Bivariate Fit

We also conduct bivariate analysis on the response variable against each transformed explanatory variable to review the linearity of fit. This step helps us decide whether a transformation of the variable is necessary, and we pick the transformation that provides the highest R² value. The video dataset has only the three numerical variables shown below.
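The selection rule described above (fit the response against each candidate transformation and keep the one with the highest R²) can be sketched as follows. This is an illustration in plain Python rather than the JMP Bivariate platform, the candidate set is assumed to be untransformed, square root and natural log, and the data are made up.

```python
import math


def r_squared(x, y):
    """R² of a simple linear fit of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Assumed candidate transformations for a positive-valued variable.
TRANSFORMS = {"none": lambda v: v, "sqrt": math.sqrt, "ln": math.log}


def best_transform(x, y):
    """Return the name of the transformation with the highest R², plus all scores."""
    scores = {name: r_squared([f(v) for v in x], y) for name, f in TRANSFORMS.items()}
    return max(scores, key=scores.get), scores


# Made-up data in which y is linear in ln(x).
x = [1, 2, 4, 8, 16, 32]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
name, scores = best_transform(x, y)
print(name)  # ln
```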

Bivariate correlation scatterplot and matrix for the video model

Checking for Multi-collinearity

Parameter Estimates with VIF statistics

As a result, we have the list of continuous numerical explanatory variables to explain the variation of our response variable for the video regression model, in preparation for the next step, the stepwise regression.
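For readers unfamiliar with the VIF statistic used in this check, the sketch below computes it in plain Python. It relies on the standard identity that the VIF of each variable is the corresponding diagonal entry of the inverse of the correlation matrix of the explanatory variables. The helper names and the sample data are hypothetical, and this is not the JMP output shown in the figure.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centred = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [sum(v * v for v in col) ** 0.5 for col in centred]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centred, norms)]
        for ci, ni in zip(centred, norms)
    ]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Variance inflation factor for each explanatory variable."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Made-up data: the two columns have correlation 0.8, so VIF = 1 / (1 - 0.64).
print(vif([[1, 2, 3, 4], [1, 3, 2, 4]]))
```

A VIF of 1 means a variable is uncorrelated with the others; larger values indicate stronger multicollinearity.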

Stepwise Regression

We proceed with the creation of our explanatory model by running stepwise regression within the Fit Model platform on JMP Pro 13 on the variables from the steps above, with the inclusion of categorical variables (which are dummy coded by JMP). We use a p-value threshold of 5%, which gives the best R² and adjusted R² values, indicating the best model fit given the available data. We ran the regression in the forward, backward and mixed directions and found that the R² value for the mixed direction is the highest, so we use the mixed direction to run our model. The AICc and BIC measures are not used since we are building an explanatory model rather than a predictive model.

The regression equation and parameter estimates are shown below:

Video Regression equation for Ln(Total engagement)

Video Regression Parameter Estimates for Ln(Total engagement)

Model Fit and Model Assumptions

Video Regression model fit results

The goodness of fit is represented by the R² value. R² is a statistical measure known as the coefficient of determination, which measures how close the data points are to the line generated by the model. The R² value for the video model is 0.20, meaning that 20% of the variation in Ln(Total Engagement) for videos is explained by the model.

To gauge the explanatory power of each additional explanatory variable added, we also consider the adjusted R² value, which adjusts for the number of explanatory variables in the model. That is, it only increases if an added explanatory variable improves the model more than would be expected by chance. The adjusted R² value for the video model is 0.17, meaning that 17% of the variation in Ln(Total Engagement) for videos is explained by those explanatory variables that affect the response variable.
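The two measures can be written out as a short sketch using the standard textbook formulas, R² = 1 - SSE/SST and adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. The data below are made up and are not taken from our video dataset.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean) ** 2 for o in observed)
    return 1 - sse / sst


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R² for the number of explanatory variables in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Made-up observations and fitted values, purely for illustration.
observed = [1, 2, 3, 4]
predicted = [1.5, 1.5, 3.5, 3.5]
r2 = r_squared(observed, predicted)
print(round(r2, 3))                                               # 0.8
print(round(adjusted_r_squared(r2, n_obs=4, n_predictors=1), 3))  # 0.7
```

Because the penalty grows with the number of predictors, the adjusted value is never higher than R² itself, which is consistent with the 0.17 against 0.20 reported above.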

We then move on to the model assumptions to validate our regression model findings. There are several assumptions of linear regression models which need to be met, as seen below:

Relationship between the dependent variable and independent variables is linear
Expected mean error of the regression model is zero
Errors/Residuals have constant variance (Homoscedastic)
Errors/Residuals are independent of each other
Errors/Residuals are normally distributed and have a population mean of zero

Assumption 1: Linearity

Residual by predicted plot

The points are distributed fairly symmetrically around the line with no visible pattern, which indicates that the residuals are random and hence that the linearity assumption is fulfilled.

Assumption 2: Zero expected mean error

Distribution of residuals

The residuals largely follow a normal distribution with a mean close to zero and a standard deviation close to one.

Assumption 3: Homoscedasticity

Residual by predicted plot

The distribution of the points in the plot is rather symmetrical, with no sign of the residuals increasing as the predicted values increase (the plot is not funnel shaped). This indicates that the residuals have constant variance and are hence homoscedastic.

Assumption 4: Independent Residuals

Residual by row plot

The scatter plot shows that the residuals are randomly distributed around the line and hence shows that they are time independent. This also suggests that residuals are not autocorrelated.

Durbin-Watson test of no autocorrelation

The Durbin-Watson statistic is d = 1.6, which lies between the two threshold values, 1.5 < d < 2.5. Therefore, we can assume that there is no first-order linear autocorrelation in the residuals of our multiple linear regression.
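For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch, using made-up residuals rather than those of our model:

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences over sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Made-up residuals, in row order.
print(durbin_watson([1, -1, 1, -1]))  # 3.0, strong negative autocorrelation
print(durbin_watson([1, 1, 1, 1]))    # 0.0, strong positive autocorrelation
```

Values near 2 indicate no first-order autocorrelation, which is why the range of 1.5 to 2.5 is used as the acceptance band above.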

Assumption 5: Residuals are normally distributed

Studentized Residual distribution

The residuals largely follow a normal distribution with a mean close to zero and a standard deviation close to one, hence fulfilling this assumption.

Interpretation and Managerial insights

A stepwise multiple linear regression was run to explain Ln(Total Engagement) for video performance from the post message sentiment score, Ln(duration of video in seconds), video category, hourly time interval of the post and video actors. These variables statistically significantly explained Ln(Total Engagement), F(12.63, 1.47) = 8.57, p < 0.0001***, adjusted R² = 0.17. All selected variables added statistically significantly to the explanation, p < .05. The video regression model has met all 5 assumptions highlighted above, and we believe that our sponsor can benefit from knowing the determinants of their social media engagement, based on the regression equation for their video performance.

While our video explanatory regression model explains only about 17-20% of the variation in the posts’ engagement performance, insights can still be gleaned from it. Below are the points that can be drawn from the video regression model:

Positive-sounding post messages added to the description of the video can help increase engagement.
Video duration matters, and longer videos tend to perform better based on our results. However, we believe that there is an ideal video length, as overly lengthy videos could deter engagement.
Videos in categories A, B, C, D and E are significantly more popular, so more emphasis should be placed on creating content in these categories.
The best time to post is in the late afternoon, evening and night, between 4pm and 11pm.
Actors A, B and C are performing well and are suited for such videos, whereas actors D, E, F, G, H and I do not perform as well, suggesting the need for either improvement or adjustment of assignments.
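Because the response is Ln(Total Engagement), the coefficients behind these points act multiplicatively on engagement. The sketch below shows the standard conversion for a dummy variable and for a log-transformed predictor such as Ln(duration). The function names are hypothetical and the coefficient values are made up, since the actual estimates appear only in the parameter estimates figure.

```python
import math


def percent_change_for_dummy(coef):
    """Percent change in the response when a 0/1 dummy switches from 0 to 1."""
    return (math.exp(coef) - 1) * 100


def percent_change_for_log_predictor(coef, pct_increase):
    """Percent change in the response when a log-transformed predictor
    increases by `pct_increase` percent (log-log model)."""
    return ((1 + pct_increase / 100) ** coef - 1) * 100


# Made-up coefficients, purely for illustration.
print(round(percent_change_for_dummy(0.25), 1))              # 28.4
print(round(percent_change_for_log_predictor(0.5, 300), 1))  # 100.0
```

In other words, a dummy coefficient of 0.25 would correspond to roughly 28% higher engagement, all other variables held constant.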

@@ Line 92: / Line 92: @@
 |-
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=5 face="Century Gothic">Data Transformation / Excluding Outliers</font></div><br/>
-<p>We perform the transformation on the variables to make them more suitable for regression analysis. We perform a square root transformation as well as a natural logarithm transformation on all response and explanatory variables whose distributions are not normal to reduce skewness and yield a more normal distribution.</p><br>
+<p>We have performed transformation and exclusion of outliers for the video dataset in a similar fashion as the article dataset for both the response variables and the explanatory variables.</p><br>
-[[File:Article_transformation.png|700px|center]]
-{|style="width:100%;vertical-align:top;margin-top:20px;"
-|-
-|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Transforming the Response Variables and removing the outliers</div>
-<br>
-<p>The outliers for the explanatory variables are judged by the independent variable distributions as well as the scatterplots of the response variable against the explanatory variables. We remove the following data points (as circled in the figure) as outliers. </p><br>
-[[File:outlier.png|700px|center]]
-{|style="width:100%;vertical-align:top;margin-top:20px;"
-|-
-|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Transforming the Explanatory Variables and removing the outliers</div>
-<br>
 {| style="width:100%; vertical-align:top; margin-top:5px;"
@@ Line 113: / Line 99: @@
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=5 face="Century Gothic">Bivariate Fit</font></div><br/>
-<p>We also conduct bivariate analysis on the response variable against each transformed explanatory variable to review the linearity of fit. This step helps us to decide if the transformation of the variable is necessary, and we pick the transformation that provides the highest R<sup>2</sup> value.
+<p>We also conduct bivariate analysis on the response variable against each transformed explanatory variable to review the linearity of fit. This step helps us to decide if the transformation of the variable is necessary, and we pick the transformation that provides the highest R<sup>2</sup> value. The video dataset only has the three numerical variables below
 </p>
-[[File:Bivfit.png|700px|center]]
+[[File:vidbivfit.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
-|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Bivariate fit of difficult words count. we select the SQRT transformation instead of the Ln transformation</div>
+|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Bivariate correlation scatterplot and matrix the video model</div>
-<p>
-This is repeated across all the explanatory variables, and we realise that all the readability indices have very poor R<sup>2</sup> values (close to zero). We then examine further if the stepwise model will pick these measures even though individually the variables do not have strong explanatory power.
-</p>
 {| style="width:100%; vertical-align:top; margin-top:5px;"
@@ Line 130: / Line 111: @@
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=5 face="Century Gothic"> Checking for Multi-collinearity</font></div><br/>
-<p>We also ran bivariate fit against all the 18 numerical explanatory variables to test for multicollinearity. The figure below shows the bivariate correlation scatterplot.</p>
+[[File:vidparamest.png|700px|center]]
-[[File:Bivfitscattermatrix.png|700px|center]]
-{|style="width:100%;vertical-align:top;margin-top:20px;"
-|-
-|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Bivariate correlation scatterplot matrix for all 18 numerical variables for the article model</div>
-<p>Using this scatterplot together with the bivariate correlation matrix, we eliminated 8 variables that are highly correlated. We ran Standard Least Squares regression on continuous numerical variables to verify the absence of multicollinearity in our remaining variables.</p>
-[[File:vifparamest.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
@@ Line 146: / Line 117: @@
-<p>As a result, we have the narrowed down version of our final list of numerical continuous explanatory variables to explain the variation of our response variables for the article regression model in preparation for the next step which is the stepwise regression.</p>
+<p>As a result we have the list of numerical continuous explanatory variables to explain the variation of our response variables for the video regression model in preparation for the next step which is the stepwise regression.</p>
 {| style="width:100%; vertical-align:top; margin-top:5px;"
@@ Line 152: / Line 123: @@
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=5 face="Century Gothic">Stepwise Regression</font></div><br/>
-<p>We proceed with the creation of our explanatory model by running stepwise regression within the Fit Model platform on JMP Pro 13 on the variables filtered from the steps above with the inclusion of categorical variables (that will be dummy coded by JMP). We conduct a p-value threshold regression at 5% which gives the best R<sup>2</sup> and adjusted R<sup>2</sup> values, indicating the best model fit given the available data. We ran the regression for the forward, backward and mixed directions and realised that the R<sup>2</sup> values for the three different directions are the same. We then select the mixed direction to run our model with. AICC and BICC measures are not used since we are looking at an explanatory model instead of a predictive model.</p>
+<p>We proceed with the creation of our explanatory model by running stepwise regression within the Fit Model platform on JMP Pro 13 on the variables from the steps above with the inclusion of categorical variables (that will be dummy coded by JMP). We conduct a p-value threshold regression at 5% which gives the best R<sup>2</sup> and adjusted R<sup>2</sup> values, indicating the best model fit given the available data. We ran the regression for the forward, backward and mixed directions and realised that the R<sup>2</sup> values for the mixed direction is the highest, and we will be using it to run our model with. AICC and BICC measures are not used since we are looking at an explanatory model instead of a predictive model.</p>
 <br>
 <p>The regression equation and parameter estimates are shown below:</p>
-[[File:artregeqn.png|700px|center]]
+[[File:videqn.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
-|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Article Regression equation for Ln(Total engagement)</div>
+|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Video Regression equation for Ln(Total engagement)</div>
-[[File:artparam.png|700px|center]]
+[[File:vidparam.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
-|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Article Regression Parameter Estimates for Ln(Total engagement)</div>
+|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Videp Regression Parameter Estimates for Ln(Total engagement)</div>
 {| style="width:100%; vertical-align:top; margin-top:5px;"
@@ Line 170: / Line 141: @@
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=5 face="Century Gothic">Model Fit and Model Assumptions</font></div><br/>
-[[File:artmodfit.png|700px|center]]
+[[File:vidparam.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
-|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Article Regression Model Fit</div>
+|style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Video Regression model fit results</div>
 <p>
 The goodness of fit is represented by the R<sup>2</sup> value. R<sup>2</sup> is a statistical measure known as the coefficient of determination which measures how close data points are to the line generated by the model.
-The R<sup>2</sup> value here for the articles model is 0.18 and represents that the variation in Ln Total Engagement for articles is 18% explained by the model.
+The R<sup>2</sup>  value here for the articles model is 0.20 and represents that the variation in Ln Total Engagement for articles is 20% explained by the model.
 <br><br>
 To gauge the explanatory power of each additional explanatory variable added, we also consider the adjusted R<sup>2</sup> value, which adjusts for the number of explanatory variables in the model – that is, it would only increase if each explanatory variable added improves the model more than what is expected by chance.
 The adjusted R<sup>2</sup> value here for the articles model is 0.17 and represents that the variation in Ln Total Engagement for articles is 17% explained by those explanatory variables that affect the response variable.
@@ Line 197: / Line 167: @@
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=3 face="Century Gothic">Assumption 1: Linearity</font></div><br/>
-[[File:Assumption_1n3.png|700px|center]]
+[[File:Assumption_1n3v.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
@@ Line 208: / Line 178: @@
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=3 face="Century Gothic">Assumption 2: Zero expected mean error</font></div><br/>
-[[File:Assumption_2.png|700px|center]]
+[[File:Assumption_2v.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
@@ Line 218: / Line 188: @@
 |-
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=3 face="Century Gothic">Assumption 3: Homoscedasticity</font></div><br/>
-[[File:Assumption_1n3.png|700px|center]]
+[[File:Assumption_1n3v.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
@@ Line 229: / Line 199: @@
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=3 face="Century Gothic">Assumption 4: Independent Residuals</font></div><br/>
-[[File:Assumption_4a.png|700px|center]]
+[[File:Assumption_4v.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
@@ Line 237: / Line 207: @@
 </p>
-[[File:Assumption_4b.png|400px|center]]
+[[File:Assumption_4bv.png|400px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
 |style="vertical-align:top;width:30%;" | <div style="background: #ffffff; text-align:center; line-height: wrap_content; text-align: center;font-size:12px">Durbin-Watson test of no autocorrelation</div>
-<p>The Durbin-Watson d = 2.15, which is between the two critical values of 1.5 < d < 2.5. Therefore, we can assume that there is no first order linear auto-correlation in our multiple linear regression data</p>
+<p>The Durbin-Watson d = 1.6, which is between the two critical values of 1.5 < d < 2.5. Therefore, we can assume that there is no first order linear auto-correlation in our multiple linear regression data</p>
 {| style="width:100%; vertical-align:top; margin-top:5px;"
@@ Line 248: / Line 218: @@
 | style="vertical-align:top;width:20%;" | <div style="none: solid; border-width:2px; background: #FFFFFF; padding: 10px; font-weight:bold; text-align:center; line-height: wrap_content; text-indent: 20px; font-size:18px"><font color="#b1260e" size=3 face="Century Gothic">Assumption 5: Residuals are normally distributed</font></div><br/>
-[[File:Assumption_5.png|700px|center]]
+[[File:Assumption_5v.png|700px|center]]
 {|style="width:100%;vertical-align:top;margin-top:20px;"
 |-
@@ Line 261: / Line 231: @@
-<p>A multiple stepwise linear regression was run to explain Ln(Total Engagement) for article performance from post message sentiment score, number of links, SQRT(Number of images) and article authors. These variables statistically significantly explained Ln(Total Engagement), F(33.79, 1.06) = 31.96, p < 0.0001***, adjusted R2 = 0.17. All selected variables provided statistically significantly to the explanation, p < .05. The article regression model has met all 5 assumptions highlighted above, and we believe that our sponsor can benefit from the knowledge of the different determinants of their different social media engagement performance based on the regression equation on their article performance.</p>
+<p>A multiple stepwise linear regression was run to explain Ln(Total Engagement) for video performance from post message sentiment score, Ln(duration of video in seconds), video category, hourly time interval to post and video actors. These variables statistically significantly explained Ln(Total Engagement), F(12.63, 1.47) = 8.57, p < 0.0001***, adjusted R<sup>2</sup> = 0.17. All selected variables added statistically significantly to the explanation, p < .05. The video regression model has met all 5 assumptions highlighted above, and we believe that our sponsor can benefit from the knowledge of the determinants of their different social media engagement based on the regression equation on their video performance.</p>
 <br><br>
 <p>
-While our article explanatory regression models can explain up to 17-18% of the variation in the post’s engagement performance, insights can still be gleaned from it. Below are the points that can be drawn for the article regression model:
+While our video explanatory regression models can explain up to 17-18% of the variation in the post’s engagement performance, insights can still be gleaned from it. Below are the points that can be drawn for the video regression model:
 <br><br>
-*	A positive sounding post message to accompany the article can help increase engagement.
+*	Positive sounding post messages when added to the description of the video can help increase engagement.
-*	Articles that contain too many embedded links may not perform well in terms of engagement. This could suggest possibly that viewer tend not to read the article or are referred elsewhere as a result.
+*	Video duration matters and longer videos tend to perform based on our results. However, we believe that there is an ideal video length as overtly lengthy videos could deter engagement.
-*	The number of images used in an article matters and more images can help improve the engagement level of the article. This is applicable for categories that require visually appealing information
+*	A, B, C, D, and E videos since they are significantly more popular and should place more emphasis in its content creation.
-*	Authors A, B, C, D, E, F, G, H, I, and J are performing well and can be considered suited for writing their relevant categories whereas authors K, L, M, N, O, P, Q, R, S, and T are performing poorly, suggesting the need for either improvement or adjustment of assignments.
+*	Best time to post is in the late afternoons, evenings, and nights between 4pm to 11pm
+*	Actors A, B and C are performing well and can be suited for such videos whereas actors D, E, F, G, H and I do not perform that well, suggesting the need for either improvement or adjustment of assignments.
 </p>
