Regression Analysis

What is Regression?

The statistical technique for finding the best-fitting straight line for a set of data is called regression, and the resulting straight line is called the regression line.

  • The goal of regression is to find the best-fitting straight line for a set of data.
  • The line takes the form Ŷ = bX + a, where b and a are constants that determine the slope and Y-intercept of the line; "best fit" is defined precisely to achieve this goal.

Prerequisite

  • The sum of squares (SS)
  • Computational formula
  • Definitional formula
  • z-scores
  • Analysis of variance
  • MS values and F-ratios
  • Pearson correlation
  • The sum of products (SP)

Least-Squares concept

Y = bX + a

  • For every X value in the data, the linear equation determines a Y value on the line.
  • This value is the predicted Y and is called Ŷ (Y hat).
  • The distance between this predicted value and the actual Y value in the data is the error, (Y – Ŷ).

  • Some of the distances will be positive and some will be negative
  • Square each distance to obtain a positive measure of error.
  • To get the total error between the line and the data, add the squared errors for all of the data points.

  • The best-fitting line is the one that has the smallest total squared error.

The slope and Y-intercept of the best-fitting line are:

b = SP / SSX

or

b = r (sY / sX)

and

a = MY – b MX
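As a quick sketch, the slope and intercept formulas above can be computed directly from raw scores; the X and Y values here are made up for illustration:

```python
# Illustrative data (not from the text)
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 7]

n = len(X)
MX = sum(X) / n
MY = sum(Y) / n

# SP = sum of products of deviations, SSX = sum of squared deviations for X
SP = sum((x - MX) * (y - MY) for x, y in zip(X, Y))
SSX = sum((x - MX) ** 2 for x in X)

b = SP / SSX      # slope
a = MY - b * MX   # Y-intercept

print(b, a)
```

For this data the best-fitting line works out to b = 1.0 and a = 1.4.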


Standardization with z score

  • Transform the X and Y values into z-scores before finding the regression equation.
  • The resulting equation is often called the standardized form of the regression equation
  • z-scores have a mean of zero and a standard deviation that is always 1. The standardized form of the regression equation becomes ẑY = (beta)zX, using:
    • the z-score for each X value (zX)
    • the z-score for the corresponding Y value (zY)

  • The slope constant b is now identified as beta.
  • Because both sets of z-scores have a mean of zero, the constant a disappears from the regression equation.
  • When one variable, X, is used to predict a second variable, Y, the value of beta is equal to the Pearson correlation for X and Y.
  • The standardized form of the regression equation becomes:

ẑY = (beta)zX = r zX
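To see numerically that the standardized slope equals the Pearson correlation, one can fit the regression on z-scores directly; the data are illustrative, and using the population or the sample standard deviation gives the same beta, since the divisor cancels:

```python
# Illustrative data (not from the text)
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 7]
n = len(X)

MX, MY = sum(X) / n, sum(Y) / n
sX = (sum((x - MX) ** 2 for x in X) / n) ** 0.5  # population SD
sY = (sum((y - MY) ** 2 for y in Y) / n) ** 0.5

zX = [(x - MX) / sX for x in X]
zY = [(y - MY) / sY for y in Y]

# Slope of the regression of zY on zX (the intercept vanishes
# because both sets of z-scores have mean zero)
beta = sum(p * q for p, q in zip(zX, zY)) / sum(p ** 2 for p in zX)

# Pearson correlation computed from the raw scores
SP = sum((x - MX) * (y - MY) for x, y in zip(X, Y))
SSX = sum((x - MX) ** 2 for x in X)
SSY = sum((y - MY) ** 2 for y in Y)
r = SP / (SSX * SSY) ** 0.5

print(beta, r)  # the two values agree
```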


Standard Error of Estimate for Regression

  • The standard error of estimate gives a measure of the standard distance between the predicted Y values on the regression line and the actual Y values in the data.
  • The standard error of the estimate is based on a sum of squared deviations (SS).
  • Each deviation measures the distance between the actual Y value (from the data) and the predicted Y value (from the regression line).
  • This sum of squares is commonly called SSresidual: SSresidual = Σ(Y – Ŷ)²

  • The degrees of freedom for the standard error of estimate are df = n – 2.
  • To obtain the standard error of estimate, the SS value is divided by its degrees of freedom and the square root is taken:

standard error of estimate = √(SSresidual / df) = √(SSresidual / (n – 2))


Standard Error and the Correlation

  • The regression equation simply describes the best-fitting line and is used for making predictions.
  • Squaring the correlation provides a measure of the accuracy of prediction.
  • The squared correlation, r2, is called the coefficient of determination because it determines what proportion of the variability in Y is predicted by the relationship with X.
  • Because r2 measures the predicted portion of the variability in the Y scores, expression (1 – r2) is used to measure the unpredicted portion.
  • r2 and the standard error of estimate indicate the accuracy of these predictions.

Predicted variability = SSregression = r2 SSY

Unpredicted variability = SSresidual = (1 – r2 )SSY

  • if r = 0.70, then r2 = 0.49 (or 49%) of the variability for the Y is predicted by the relationship with X and the remaining 51% (1 – r2 ) is the unpredicted portion.
  • r = 1.00, the prediction is perfect and there are no residuals.
  • As r approaches zero, the data points move farther from the regression line and the residuals grow larger.
  • In terms of the correlation, the standard error of estimate becomes:

standard error of estimate = √(SSresidual / df) = √((1 – r²)SSY / (n – 2))
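The two routes to SSresidual, summing the squared residuals directly or using the shortcut (1 – r²)SSY, give the same standard error of estimate; a sketch with made-up data:

```python
# Illustrative data (not from the text)
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 7]
n = len(X)
MX, MY = sum(X) / n, sum(Y) / n

SP = sum((x - MX) * (y - MY) for x, y in zip(X, Y))
SSX = sum((x - MX) ** 2 for x in X)
SSY = sum((y - MY) ** 2 for y in Y)
r = SP / (SSX * SSY) ** 0.5

b = SP / SSX
a = MY - b * MX

# Direct route: SSresidual = sum of squared (Y - Yhat) distances
ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(X, Y))
se_direct = (ss_residual / (n - 2)) ** 0.5

# Shortcut route: SSresidual = (1 - r^2) * SSY
se_shortcut = ((1 - r ** 2) * SSY / (n - 2)) ** 0.5

print(se_direct, se_shortcut)  # the two values agree
```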


Analysis of Regression

  • Whenever the sample r is nonzero, numerical values can be computed for the regression equation (a and b).
  • If there is no real relationship in the population, however, both r and the regression equation are of no use.
  • A significance test is conducted for the regression equation to assess whether a real relationship exists or whether the sample correlation is simply due to sampling error.
  • The purpose of the test is to determine whether the sample correlation represents a real relationship in the population.

Null hypothesis :

States that there is no relationship between the two variables in the population.

  • For a correlation H0 : the population correlation is ρ = 0
  • For the regression equation, H0 : the slope of the regression equation (b or beta) is zero

F-Ratio

  • The numerator of the F-ratio is MSregression, which is the variance in the Y scores that is predicted by the regression equation.
  • This variance measures the systematic changes in Y that occur when the value of X increases or decreases.
  • The denominator is MSresidual, which is the unpredicted variance in the Y scores. This variance measures the changes in Y that are independent of changes in X.

MSregression = SSregression / dfregression, with df = 1

MSresidual = SSresidual / dfresidual, with df = n – 2

F = MSregression / MSresidual


Example:

The data consist of n = 10 pairs of scores with a correlation of r = 0.812 and SSY = 112. Determine whether the sample correlation represents a real relationship.

Critical F values from the table

For the regression F-ratio, df = (1, n – 2) = (1, 8)

At α = .05, F (1, 8) = 5.32

At α = .01, F (1, 8) = 11.26


F-ratio from data analysis

Y = bX + a

Null hypothesis H0 :

There is no relationship between X and Y, the regression equation has b = 0

r = 0.812 and SSY = 112

Predicted variability = SSregression = r² SSY = (0.812)² × 112 = 73.85

Unpredicted variability = SSresidual = (1 – r²) SSY = (1 – 0.812²) × 112 = 38.15

MSregression = SSregression / 1 = 73.85

MSresidual = SSresidual / (n – 2) = 38.15 / 8 = 4.77

F = MSregression / MSresidual = 73.85 / 4.77 ≈ 15.49
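The numbers in this example follow directly from n, r, and SSY:

```python
# Worked example from the text: n = 10, r = 0.812, SSY = 112
n, r, SSY = 10, 0.812, 112

ss_regression = r ** 2 * SSY          # predicted variability
ss_residual = (1 - r ** 2) * SSY      # unpredicted variability

ms_regression = ss_regression / 1     # df_regression = 1
ms_residual = ss_residual / (n - 2)   # df_residual = n - 2 = 8

F = ms_regression / ms_residual
print(round(ss_regression, 2), round(ss_residual, 2), round(F, 2))
```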


Conclusion:

  • The calculated F-ratio, 15.49, exceeds both critical values: F(1, 8) = 5.32 at α = .05 and F(1, 8) = 11.26 at α = .01.
  • Therefore H0 is rejected.
  • The regression equation accounts for a significant portion of the variance in the Y scores.


Significance: Correlation vs. Regression

  • Testing the significance of the regression equation is equivalent to testing the significance of the Pearson correlation.
  • If the correlation between two variables is significant, then the regression equation is also significant.
  • If a correlation is not significant, the regression equation is also not significant.
  • The t statistic for a correlation is:

t = r / √((1 – r²) / (n – 2)), with df = n – 2


Null hypothesis H0 :

There is no relationship between X and Y; for the population, ρ = 0

Squaring the t statistic gives:

t² = r² / ((1 – r²) / (n – 2))

Multiply the numerator and the denominator by SSY:

t² = (r² SSY / 1) / ((1 – r²) SSY / (n – 2)) = (SSregression / dfregression) / (SSresidual / dfresidual) = MSregression / MSresidual = F

Thus F = t²: testing the correlation with t and testing the regression equation with F are equivalent.
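For the same example (n = 10, r = 0.812, SSY = 112), squaring the t statistic reproduces the F-ratio:

```python
# Same example values as in the F-ratio computation above
n, r, SSY = 10, 0.812, 112

# t statistic for the correlation, df = n - 2
t = r / ((1 - r ** 2) / (n - 2)) ** 0.5

# F-ratio for the analysis of regression (SSY cancels algebraically)
F = (r ** 2 * SSY / 1) / ((1 - r ** 2) * SSY / (n - 2))

print(t ** 2, F)  # t squared equals F
```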


Multiple Regression with Two Predictor Variables

  • The process of using several predictor variables to obtain more accurate predictions is called multiple regression.
  • It is possible to combine a large number of predictor variables in a single multiple-regression equation.
  • Here, two predictor variables, X1 and X2, are used to predict the value of Y.

The regression equation with two predictors is:

Ŷ = b1X1 + b2X2 + a


If zY, zX1, and zX2 are z-score transformations of Y, X1, and X2, then the standardized form is:

ẑY = (beta1)zX1 + (beta2)zX2


  • SSX1 is the sum of squared deviations for X1
  • SSX2 is the sum of squared deviations for X2
  • SPX1Y is the sum of products of deviations for X1 and Y
  • SPX2Y is the sum of products of deviations for X2 and Y
  • SPX1X2 is the sum of products of deviations for X1 and X2

b1 = (SPX1Y · SSX2 – SPX1X2 · SPX2Y) / (SSX1 · SSX2 – (SPX1X2)²)

b2 = (SPX2Y · SSX1 – SPX1X2 · SPX1Y) / (SSX1 · SSX2 – (SPX1X2)²)

a = MY – b1MX1 – b2MX2
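The two-predictor formulas can be sketched as a small helper; the function name and the data in the call are made up for illustration and are not the table from the example:

```python
def multiple_regression(Y, X1, X2):
    """Two-predictor least-squares coefficients from SS and SP values."""
    n = len(Y)
    MY, M1, M2 = sum(Y) / n, sum(X1) / n, sum(X2) / n

    SSX1 = sum((x - M1) ** 2 for x in X1)
    SSX2 = sum((x - M2) ** 2 for x in X2)
    SP1Y = sum((x - M1) * (y - MY) for x, y in zip(X1, Y))
    SP2Y = sum((x - M2) * (y - MY) for x, y in zip(X2, Y))
    SP12 = sum((p - M1) * (q - M2) for p, q in zip(X1, X2))

    denom = SSX1 * SSX2 - SP12 ** 2
    b1 = (SP1Y * SSX2 - SP12 * SP2Y) / denom
    b2 = (SP2Y * SSX1 - SP12 * SP1Y) / denom
    a = MY - b1 * M1 - b2 * M2
    return b1, b2, a

# Hypothetical data: here Y = 2*X1 + 1 exactly, so b1 = 2, b2 = 0, a = 1
b1, b2, a = multiple_regression(Y=[3, 5, 7, 9], X1=[1, 2, 3, 4], X2=[2, 1, 4, 3])
print(b1, b2, a)
```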


Example: Compute the coefficients, b1 and b2, the constant, a, and the regression equation for the table below.

Solution :

  • Calculate the means MY, MX1, and MX2 of Y, X1, and X2
  • Calculate SSY, SSX1, and SSX2
  • Calculate SPX1Y, SPX2Y, and SPX1X2


Calculate b1

b1 = (SPX1Y · SSX2 – SPX1X2 · SPX2Y) / (SSX1 · SSX2 – (SPX1X2)²) = 0.779


Calculate b2

b2 = (SPX2Y · SSX1 – SPX1X2 · SPX1Y) / (SSX1 · SSX2 – (SPX1X2)²) = 0.280


Calculate a

a = MY – b1MX1 – b2MX2

a = 1.74


Regression equation: Ŷ = 0.779X1 + 0.28X2 + 1.74


Standard Error of Estimate for Multiple Regression

  • The standard error of estimate for a linear regression equation is the standard distance between the regression line and the actual data points.
  • The standard error of the estimate can be defined as the standard distance between the predicted Y values (from the regression equation) and the actual Y values (in the data).

Standard error of estimate for linear regression: SSresidual = (1 – r²)SSY, with df = n – 2

Standard error of estimate for multiple regression with two predictors (X1 and X2):

SSresidual = (1 – R²)SSY, with df = n – 3

standard error of estimate = √(SSresidual / df)


Significance of the Multiple Regression Equation: Analysis of Regression

  • The significance of a multiple-regression equation is evaluated with an F-ratio.
  • The F-ratio determines whether the equation predicts a significant portion of the variance for the Y scores. The total variability of the Y scores is partitioned into two components:

SSregression = R² SSY

SSresidual = (1 – R²) SSY

F = MSregression / MSresidual

dfregression = 1 for one predictor (X1); dfregression = 2 for two predictors (X1, X2)

dfresidual = n – 2 for one predictor; dfresidual = n – 3 for two predictors


Contribution of Individual Predictor Variable

  • In the standardized form of the regression equation, the relative size of the beta values indicates the relative contribution of the two variables.

The standardized regression equation is

ẑY = (beta1)zX1 + (beta2)zX2

ẑY = 0.558zX1 + 0.247zX2

  • Both betas are positive, indicating that both X1 and X2 are positively related to Y.
  • The multiple-regression equation with both X1 and X2 predicts R² = 55.62% of the variance for the Y scores.
  • To determine how much is predicted by X1 alone, begin with the correlation between X1 and Y:

r = SPX1Y / √(SSX1 · SSY) = 0.7229

r² = 52.26%

  • The additional contribution made by adding X2 to the regression equation can be computed as:

= (% with both X1 and X2 ) − (% with X1 alone)

= 55.62% − 52.26%

= 3.36%


  • With SSY = 90, the additional variability from adding X2 as a predictor amounts to

SSadditional = 3.36% of 90

= 0.0336(90)

= 3.02


  • This SS value has df = 1, so:

MSadditional = SSadditional / df = 3.02 / 1 = 3.02


  • F-ratio to evaluate the significance of the additional contribution of X2, using MSresidual = (1 – R²)SSY / (n – 3) = 5.71 from the two-predictor model:

F = MSadditional / MSresidual = 3.024 / 5.71 = 0.529


  • df = 1, 7

  • At α = 0.05, F(1, 7) = 5.59
  • At α = 0.01, F(1, 7) = 12.25
  • The F-ratio is not significant.

It can be concluded that adding X2 to the regression equation does not significantly improve the prediction compared to using X1 as a single predictor.
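The contribution analysis can be reproduced from the summary values alone (R² = 0.5562 with both predictors, r² = 0.5226 with X1 alone, SSY = 90, n = 10):

```python
# Summary values from the example above
n, SSY = 10, 90
R2_both, r2_x1 = 0.5562, 0.5226

# Additional variability predicted by adding X2, with df = 1
ss_additional = (R2_both - r2_x1) * SSY
ms_additional = ss_additional / 1

# Residual MS from the two-predictor model, df = n - 3
ms_residual = (1 - R2_both) * SSY / (n - 3)

F = ms_additional / ms_residual
print(round(ss_additional, 2), round(F, 3))
```

With F well below the critical value F(1, 7) = 5.59, the added predictor does not significantly improve the prediction.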

