What is Regression?
The statistical technique for finding the best-fitting straight line for a set of data is called regression, and the resulting straight line is called the regression line.
- The goal of regression is to find the best-fitting straight line for a set of data.
- The line has the equation Y = bX + a, where b and a are constants that determine the slope and the Y-intercept of the line. "Best fit" is defined precisely so as to achieve this goal.
Prerequisites
- The sum of squares (SS)
- Computational formula
- Definitional formula
- z-scores
- Analysis of variance
- MS values and F-ratios
- Pearson correlation
- The sum of products (SP)
Least-Squares Concept
Y = bX + a
- For every X value in the data, the linear equation determines a Y value on the line.
- This value is the predicted Y and is called Ŷ (Y-hat).
- The distance between this predicted value and the actual Y value in the data is: distance = Y – Ŷ.
- Some of the distances will be positive and some will be negative
- Square each distance to obtain a positive measure of error.
- To get the total error between the line and the data, add the squared errors for all of the data points.
- The best-fitting line is the one that has the smallest total squared error.
The least-squares solutions are:
b = SP / SSX or, equivalently, b = r(sY / sX)
and
a = MY – b·MX
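To make these formulas concrete, here is a minimal Python sketch; the data values are made up for illustration and do not come from the text.

```python
# Minimal sketch of the least-squares formulas above (made-up data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx = sum(x) / n   # M_X
my = sum(y) / n   # M_Y

ss_x = sum((xi - mx) ** 2 for xi in x)                   # SS_X
sp = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # SP

b = sp / ss_x     # slope: b = SP / SS_X
a = my - b * mx   # Y-intercept: a = M_Y - b * M_X
print(f"Y-hat = {b:.2f}X + {a:.2f}")  # Y-hat = 0.60X + 2.20
```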
Standardization with z-Scores
- Transform the X and Y values into z-scores before finding the regression equation.
- The resulting equation is often called the standardized form of the regression equation.
- z-scores have a mean of zero and a standard deviation of exactly 1. The standardized equation is written in terms of:
- z-score for each X value (zX)
- z-score for the corresponding Y value (zY )
- Slope constant b is now identified as beta.
- Because both sets of z-scores have a mean of zero, the constant a disappears from the regression equation.
- When one variable, X, is used to predict a second variable, Y, the value of beta is equal to the Pearson correlation for X and Y.
- The standardized form of the regression equation becomes:
ẑY = (beta)zX = r·zX
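A short sketch of this fact, again with made-up data (statistics.correlation requires Python 3.10+): the least-squares slope fitted to z-scores equals the Pearson correlation, and the intercept drops out.

```python
import statistics as st

# Regressing z-scores on z-scores: slope = Pearson r, intercept = 0.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

zx = [(v - st.mean(x)) / st.stdev(x) for v in x]
zy = [(v - st.mean(y)) / st.stdev(y) for v in y]

beta = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
print(beta, st.correlation(x, y))  # the two values match
```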
Standard Error of Estimate for Regression
- The standard error of estimate gives a measure of the standard distance between the predicted Y values on the regression line and the actual Y values in the data.
- The first step in computing the standard error of the estimate is to find a sum of squared deviations (SS).
- Each deviation measures the distance between the actual Y value (from the data) and the predicted Y value (from the regression line).
- This sum of squares is commonly called SSresidual.
- The degrees of freedom for the standard error of estimate are df = n – 2.
- standard error of estimate = √(SSresidual / df) = √(Σ(Y – Ŷ)² / (n – 2))
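A minimal sketch of this computation, reusing the made-up data and the coefficients (b = 0.6, a = 2.2) from the earlier sketch:

```python
from math import sqrt

# Standard error of estimate computed directly from the residuals.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, a = 0.6, 2.2  # coefficients from the least-squares sketch above

ss_residual = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
se = sqrt(ss_residual / (len(x) - 2))  # sqrt(SS_residual / df)
print(se)  # about 0.894
```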
Standard Error and the Correlation
- The regression equation simply describes the best-fitting line and is used for making predictions.
- Squaring the correlation provides a measure of the accuracy of prediction.
- The squared correlation, r2, is called the coefficient of determination because it determines what proportion of the variability in Y is predicted by the relationship with X.
- Because r2 measures the predicted portion of the variability in the Y scores, expression (1 – r2) is used to measure the unpredicted portion.
- r2 and the standard error of estimate indicate the accuracy of these predictions.
Predicted variability = SSregression = r2 SSY
Unpredicted variability = SSresidual = (1 – r2 )SSY
- If r = 0.70, then r2 = 0.49, so 49% of the variability in the Y scores is predicted by the relationship with X, and the remaining 51% (1 – r2) is the unpredicted portion.
- When r = 1.00, the prediction is perfect and there are no residuals.
- As r approaches zero, the data points move farther from the regression line and the residuals grow larger.
- standard error of estimate = √(SSresidual / df) = √((1 – r²)SSY / (n – 2))
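The following sketch (same made-up data) checks that the shortcut (1 – r²)SSY matches SSresidual computed directly from the residuals, so either route gives the same standard error of estimate:

```python
import statistics as st
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mx, my = st.mean(x), st.mean(y)
ss_x = sum((v - mx) ** 2 for v in x)
ss_y = sum((v - my) ** 2 for v in y)
sp = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = sp / ss_x               # slope
a = my - b * mx             # intercept
r = sp / sqrt(ss_x * ss_y)  # Pearson correlation

direct = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
shortcut = (1 - r ** 2) * ss_y
print(direct, shortcut)          # both 2.4
print(sqrt(shortcut / (n - 2)))  # standard error of estimate
```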
Analysis of Regression
- Whenever the sample correlation is non-zero, the calculations produce numerical values for the regression equation (a and b).
- If there is no real relationship in the population, however, the sample r and the regression equation are of no use.
- A significance test is therefore conducted for the regression equation, to assess whether a real relationship exists or the sample result is due to sampling error.
- The purpose of the test is to determine whether the sample correlation represents a real relationship.
Null hypothesis:
States that there is no relationship between the two variables in the population.
- For a correlation, H0: the population correlation is ρ = 0.
- For the regression equation, H0: the slope of the regression equation (b or beta) is zero.
F-Ratio
- The numerator of the F-ratio is MSregression, which is the variance in the Y scores that is predicted by the regression equation.
- This variance measures the systematic changes in Y that occur when the value of X increases or decreases.
- The denominator is MSresidual, which is the unpredicted variance in the Y scores. This variance measures the changes in Y that are independent of changes in X.

F = MSregression / MSresidual

where MSregression = SSregression / dfregression with dfregression = 1, and MSresidual = SSresidual / dfresidual with dfresidual = n – 2.
Example:
The data consist of n = 10 pairs of scores with a correlation of r = 0.812 and SSY = 112. Determine whether the sample correlation represents a real relationship.
Critical F-ratios from the table:
For regression, df = (1, n – 2) = (1, 8)
At α = .05, F (1, 8) = 5.32
At α = .01, F (1, 8) = 11.26
F-ratio from data analysis
Y = bX + a
Null hypothesis H0: there is no relationship between X and Y, so the regression equation has slope b = 0.
r = 0.812 and SSY = 112
Predicted variability = SSregression = r²·SSY = (0.812)² × 112 = 73.85
Unpredicted variability = SSresidual = (1 – r²)·SSY = (1 – 0.812²) × 112 = 38.15
MSregression = 73.85 / 1 = 73.85
MSresidual = 38.15 / 8 = 4.77
F-ratio = MSregression / MSresidual = 15.49
Conclusion:
- The calculated F-ratio, 15.49, exceeds both critical values: F(1, 8) = 5.32 at α = .05 and 11.26 at α = .01.
- Therefore, H0 is rejected.
- The regression equation accounts for a significant portion of the variance in the Y scores.
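A short sketch reproducing the arithmetic of this example:

```python
# Worked example: n = 10, r = 0.812, SS_Y = 112.
n, r, ss_y = 10, 0.812, 112

ss_regression = r ** 2 * ss_y        # 73.85
ss_residual = (1 - r ** 2) * ss_y    # 38.15
ms_regression = ss_regression / 1    # df_regression = 1
ms_residual = ss_residual / (n - 2)  # df_residual = 8
f_ratio = ms_regression / ms_residual
print(f_ratio)  # about 15.48; the text's 15.49 comes from rounding
                # intermediate values
```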
Significance: Correlation vs. Regression
- Testing the significance of the regression equation is equivalent to testing the significance of the Pearson correlation.
- If the correlation between two variables is significant, then the regression equation is also significant.
- If a correlation is not significant, the regression equation is also not significant.
- The t statistic for a correlation is:
t = r / √((1 – r²) / (n – 2))
Null hypothesis H0 :
There is no relationship between X and Y or for population ρ = 0
Squaring the t statistic gives t² = r² / ((1 – r²) / (n – 2)). Multiplying the numerator and the denominator by SSY gives:
t² = (r²·SSY / 1) / ((1 – r²)·SSY / (n – 2)) = MSregression / MSresidual = F
So the t test for the correlation and the F test for the regression are equivalent: t² = F.
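A sketch confirming t² = F numerically, with the values from the earlier example (n = 10, r = 0.812, SSY = 112):

```python
from math import sqrt

n, r, ss_y = 10, 0.812, 112

t = r / sqrt((1 - r ** 2) / (n - 2))
f = (r ** 2 * ss_y / 1) / ((1 - r ** 2) * ss_y / (n - 2))
print(t ** 2, f)  # identical: about 15.48
```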
Multiple Regression with Two Predictor Variables
- The process of using several predictor variables to obtain more accurate predictions is called multiple regression.
- It is possible to combine a large number of predictor variables in a single multiple-regression equation.
- Here, two predictor variables, X1 and X2, are used to predict the value of Y.
The regression equation with two predictors is:
Ŷ = b1X1 + b2X2 + a
If zY, zX1, and zX2 are the z-score transformations of Y, X1, and X2, then the standardized form is:
ẑY = (beta1)zX1 + (beta2)zX2
- SSX1 is the sum of squared deviations for X1
- SSX2 is the sum of squared deviations for X2
- SPX1Y is the sum of products of deviations for X1 and Y
- SPX2Y is the sum of products of deviations for X2 and Y
- SPX1X2 is the sum of products of deviations for X1 and X2
The least-squares solutions for the two-predictor equation are:
b1 = (SPX1Y·SSX2 – SPX1X2·SPX2Y) / (SSX1·SSX2 – (SPX1X2)²)
b2 = (SPX2Y·SSX1 – SPX1X2·SPX1Y) / (SSX1·SSX2 – (SPX1X2)²)
a = MY – b1·MX1 – b2·MX2
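As a sketch, these formulas can be wrapped in a small helper function; the function and argument names are mine, chosen to mirror the SS and SP symbols above:

```python
# Two-predictor least-squares coefficients from SS and SP values.
def two_predictor_fit(ss_x1, ss_x2, sp_x1y, sp_x2y, sp_x1x2,
                      m_y, m_x1, m_x2):
    denom = ss_x1 * ss_x2 - sp_x1x2 ** 2
    b1 = (sp_x1y * ss_x2 - sp_x1x2 * sp_x2y) / denom
    b2 = (sp_x2y * ss_x1 - sp_x1x2 * sp_x1y) / denom
    a = m_y - b1 * m_x1 - b2 * m_x2  # constant
    return b1, b2, a
```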
Example: Compute the coefficients b1 and b2, the constant a, and the regression equation for the data in the table below.
Solution:
- Calculate the means MY, MX1, and MX2 of Y, X1, and X2
- Calculate SSY , SSX1 , SSX2
- Calculate SPX1Y , SPX2Y and SPX1X2
Calculate b1:
b1 = (SPX1Y·SSX2 – SPX1X2·SPX2Y) / (SSX1·SSX2 – (SPX1X2)²) = 0.779
Calculate b2:
b2 = (SPX2Y·SSX1 – SPX1X2·SPX1Y) / (SSX1·SSX2 – (SPX1X2)²) = 0.280
Calculate a:
a = MY – b1·MX1 – b2·MX2
a = 7.8 – 0.779(5.3) – 0.280(6.9)
a = 1.74
Regression equation: Ŷ = 0.779X1 + 0.280X2 + 1.74
Standard Error of Estimate for Multiple Regression
- The standard error of estimate for a linear regression equation is the standard distance between the regression line and the actual data points.
- The standard error of the estimate can be defined as the standard distance between the predicted Y values (from the regression equation) and the actual Y values (in the data).
Standard error of estimate for linear regression (one predictor):
SSresidual = (1 – r²)SSY, with df = n – 2
Standard error of estimate for multiple regression (two predictors X1 and X2):
SSresidual = (1 – R²)SSY, with df = n – 3
In both cases, standard error of estimate = √(SSresidual / df).
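A minimal sketch, using the two-predictor example values that appear later in this section (R² = 0.5562, SSY = 90, n = 10):

```python
from math import sqrt

n, r_squared, ss_y = 10, 0.5562, 90

ss_residual = (1 - r_squared) * ss_y  # 39.94
df = n - 3                            # two predictors
print(sqrt(ss_residual / df))         # about 2.39
```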
Significance of the Multiple Regression Equation: Analysis of Regression
- The significance of a multiple-regression equation is evaluated with an F-ratio.
- The F-ratio determines whether the equation predicts a significant portion of the variance in the Y scores. The total variability of the Y scores is partitioned into two components, SSregression and SSresidual.

F = MSregression / MSresidual, where
dfregression = the number of predictors (1 for one predictor X1; 2 for two predictors X1 and X2)
dfresidual = n – 2 for one predictor; n – 3 for two predictors
Contribution of Individual Predictor Variable
- In the standardized form of the regression equation, the relative size of the beta values is an indication of the relative contribution of the two variables.
The standardized regression equation is:
ẑY = (beta1)zX1 + (beta2)zX2
ẑY = 0.558zX1 + 0.247zX2
- Both betas are positive indicating that both X1 and X2 are positively related to Y.
- The multiple-regression equation with both X1 and X2 predicts R² = 55.62% of the variance in the Y scores.
- To determine how much is predicted by X1 alone, we begin with the correlation between X1 and Y:
r = SPX1Y / √(SSX1·SSY) = 0.7229
r² = 0.5226, or 52.26%
- The additional contribution made by adding X2 to the regression equation can be computed as:
= (% with both X1 and X2 ) − (% with X1 alone)
= 55.62% − 52.26%
= 3.36%
With SSY = 90, the additional variability from adding X2 as a predictor amounts to:
SSadditional = 3.36% of 90 = 0.0336(90) = 3.02
This SS value has df = 1, so:
MSadditional = 3.02 / 1 = 3.02
The F-ratio evaluating the significance of this additional contribution is:
F = MSadditional / MSresidual = 3.024 / 5.71 = 0.529
where MSresidual = (1 – R²)SSY / (n – 3) = (1 – 0.5562)(90) / 7 = 5.71
df = 1, 7
- At α = 0.05, F(1, 7) = 5.59
- At α = 0.01, F(1, 7) = 12.2
- The F-ratio is not significant.
It can be concluded that adding X2 to the regression equation does not significantly improve the prediction compared to using X1 as a single predictor.
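A short sketch reproducing this added-predictor test:

```python
# Does X2 add anything beyond X1? (R^2 = 0.5562 with both predictors,
# r^2 = 0.5226 with X1 alone, SS_Y = 90, n = 10.)
n, ss_y = 10, 90
r2_both, r2_x1 = 0.5562, 0.5226

ss_additional = (r2_both - r2_x1) * ss_y      # 3.02
ms_additional = ss_additional / 1             # df = 1
ms_residual = (1 - r2_both) * ss_y / (n - 3)  # 5.71
print(ms_additional / ms_residual)            # about 0.53 < F(1, 7) = 5.59
```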