What is Regression?
The statistical technique for finding the best-fitting straight line for a set of data is called regression, and the resulting straight line is called the regression line.
- The goal of regression is to find the best-fitting straight line for a set of data.
- The line has the equation Y = bX + a, where b and a are constants that determine the slope and the Y-intercept of the line. "Best fit" is defined precisely so as to achieve this goal.
Prerequisites
- The sum of squares (SS)
- Computational formula
- Definitional formula
- z-scores
- Analysis of variance
- MS values and F-ratios
- Pearson correlation
- The sum of products (SP)
Least-Squares Concept
Y = bX + a
- For every X value in the data, the linear equation determines a Y value on the line.
- This value is the predicted Y and is called Ŷ (Y-hat).
- The distance between this predicted value and the actual Y value in the data is: distance = Y – Ŷ.
- Some of the distances will be positive and some will be negative
- Square each distance to obtain a positive measure of error.
- To get the total error between the line and the data, add the squared errors for all of the data points.
- The best-fitting line is the one that has the smallest total squared error.
The least-squares solutions are:
b = SP / SSX or, equivalently, b = r(sY / sX)
and
a = MY – b·MX
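To make these formulas concrete, here is a minimal Python sketch; the data values are made up for illustration and do not come from the text.

```python
# Minimal sketch of the least-squares formulas above (made-up data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx = sum(x) / n   # M_X
my = sum(y) / n   # M_Y

ss_x = sum((xi - mx) ** 2 for xi in x)                   # SS_X
sp = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # SP

b = sp / ss_x     # slope: b = SP / SS_X
a = my - b * mx   # Y-intercept: a = M_Y - b * M_X
print(f"Y-hat = {b:.2f}X + {a:.2f}")  # Y-hat = 0.60X + 2.20
```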
Standardization with z-Scores
- Transform the X and Y values into z-scores before finding the regression equation.
- The resulting equation is often called the standardized form of the regression equation.
- z-scores have a mean of zero and a standard deviation of exactly 1. The standardized equation is written in terms of:
- z-score for each X value (zX)
- z-score for the corresponding Y value (zY )
- Slope constant b is now identified as beta.
- Because both sets of z-scores have a mean of zero, the constant a disappears from the regression equation.
- When one variable, X, is used to predict a second variable, Y, the value of beta is equal to the Pearson correlation for X and Y.
- The standardized form of the regression equation becomes:
ẑY = (beta)zX = r·zX
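A short sketch of this fact, again with made-up data (statistics.correlation requires Python 3.10+): the least-squares slope fitted to z-scores equals the Pearson correlation, and the intercept drops out.

```python
import statistics as st

# Regressing z-scores on z-scores: slope = Pearson r, intercept = 0.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

zx = [(v - st.mean(x)) / st.stdev(x) for v in x]
zy = [(v - st.mean(y)) / st.stdev(y) for v in y]

beta = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
print(beta, st.correlation(x, y))  # the two values match
```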
Standard Error of Estimate for Regression
- The standard error of estimate gives a measure of the standard distance between the predicted Y values on the regression line and the actual Y values in the data.
- The first step in computing the standard error of the estimate is to find a sum of squared deviations (SS).
- Each deviation measures the distance between the actual Y value (from the data) and the predicted Y value (from the regression line).
- This sum of squares is commonly called SSresidual.
- The degrees of freedom for the standard error of estimate are df = n – 2.
- standard error of estimate = √(SSresidual / df) = √(Σ(Y – Ŷ)² / (n – 2))
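A minimal sketch of this computation, reusing the made-up data and the coefficients (b = 0.6, a = 2.2) from the earlier sketch:

```python
from math import sqrt

# Standard error of estimate computed directly from the residuals.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, a = 0.6, 2.2  # coefficients from the least-squares sketch above

ss_residual = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
se = sqrt(ss_residual / (len(x) - 2))  # sqrt(SS_residual / df)
print(se)  # about 0.894
```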
Standard Error and the Correlation
- The regression equation simply describes the best-fitting line and is used for making predictions.
- Squaring the correlation provides a measure of the accuracy of prediction.
- The squared correlation, r2, is called the coefficient of determination because it determines what proportion of the variability in Y is predicted by the relationship with X.
- Because r2 measures the predicted portion of the variability in the Y scores, expression (1 – r2) is used to measure the unpredicted portion.
- r2 and the standard error of estimate indicate the accuracy of these predictions.
Predicted variability = SSregression = r2 SSY
Unpredicted variability = SSresidual = (1 – r2 )SSY
- If r = 0.70, then r2 = 0.49, so 49% of the variability in the Y scores is predicted by the relationship with X, and the remaining 51% (1 – r2) is the unpredicted portion.
- When r = 1.00, the prediction is perfect and there are no residuals.
- As r approaches zero, the data points move farther from the regression line and the residuals grow larger.
- standard error of estimate = √(SSresidual / df) = √((1 – r²)SSY / (n – 2))
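The following sketch (same made-up data) checks that the shortcut (1 – r²)SSY matches SSresidual computed directly from the residuals, so either route gives the same standard error of estimate:

```python
import statistics as st
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mx, my = st.mean(x), st.mean(y)
ss_x = sum((v - mx) ** 2 for v in x)
ss_y = sum((v - my) ** 2 for v in y)
sp = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = sp / ss_x               # slope
a = my - b * mx             # intercept
r = sp / sqrt(ss_x * ss_y)  # Pearson correlation

direct = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
shortcut = (1 - r ** 2) * ss_y
print(direct, shortcut)          # both 2.4
print(sqrt(shortcut / (n - 2)))  # standard error of estimate
```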
Analysis of Regression
- Whenever the sample correlation is non-zero, the calculations produce numerical values for the regression equation (a and b).
- If there is no real relationship in the population, however, the sample r and the regression equation are of no use.
- A significance test is therefore conducted for the regression equation, to assess whether a real relationship exists or the sample result is due to sampling error.
- The purpose of the test is to determine whether the sample correlation represents a real relationship.
Null hypothesis:
States that there is no relationship between the two variables in the population.
- For a correlation, H0: the population correlation is ρ = 0.
- For the regression equation, H0: the slope of the regression equation (b or beta) is zero.
F-Ratio
- The numerator of the F-ratio is MSregression, which is the variance in the Y scores that is predicted by the regression equation.
- This variance measures the systematic changes in Y that occur when the value of X increases or decreases.
- The denominator is MSresidual, which is the unpredicted variance in the Y scores. This variance measures the changes in Y that are independent of changes in X.

F = MSregression / MSresidual

where MSregression = SSregression / dfregression with dfregression = 1, and MSresidual = SSresidual / dfresidual with dfresidual = n – 2.
Example:
The data consist of n = 10 pairs of scores with a correlation of r = 0.812 and SSY = 112. Determine whether the sample correlation represents a real relationship.
Critical F-ratios from the table:
For regression, df = (1, n – 2) = (1, 8)
At α = .05, F (1, 8) = 5.32
At α = .01, F (1, 8) = 11.26
F-ratio from data analysis
Y = bX + a
Null hypothesis H0: there is no relationship between X and Y, so the regression equation has slope b = 0.
r = 0.812 and SSY = 112
Predicted variability = SSregression = r²·SSY = (0.812)² × 112 = 73.85
Unpredicted variability = SSresidual = (1 – r²)·SSY = (1 – 0.812²) × 112 = 38.15
MSregression = 73.85 / 1 = 73.85
MSresidual = 38.15 / 8 = 4.77
F-ratio = MSregression / MSresidual = 15.49
Conclusion:
- The calculated F-ratio, 15.49, exceeds both critical values: F(1, 8) = 5.32 at α = .05 and 11.26 at α = .01.
- Therefore, H0 is rejected.
- The regression equation accounts for a significant portion of the variance in the Y scores.
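A short sketch reproducing the arithmetic of this example:

```python
# Worked example: n = 10, r = 0.812, SS_Y = 112.
n, r, ss_y = 10, 0.812, 112

ss_regression = r ** 2 * ss_y        # 73.85
ss_residual = (1 - r ** 2) * ss_y    # 38.15
ms_regression = ss_regression / 1    # df_regression = 1
ms_residual = ss_residual / (n - 2)  # df_residual = 8
f_ratio = ms_regression / ms_residual
print(f_ratio)  # about 15.48; the text's 15.49 comes from rounding
                # intermediate values
```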
Significance: Correlation vs. Regression
- Testing the significance of the regression equation is equivalent to testing the significance of the Pearson correlation.
- If the correlation between two variables is significant, then the regression equation is also significant.
- If a correlation is not significant, the regression equation is also not significant.
- The t statistic for a correlation is:
t = r / √((1 – r²) / (n – 2))
Null hypothesis H0 :
There is no relationship between X and Y or for population ρ = 0
Squaring the t statistic gives t² = r² / ((1 – r²) / (n – 2)). Multiplying the numerator and the denominator by SSY gives:
t² = (r²·SSY / 1) / ((1 – r²)·SSY / (n – 2)) = MSregression / MSresidual = F
So the t test for the correlation and the F test for the regression are equivalent: t² = F.
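A sketch confirming t² = F numerically, with the values from the earlier example (n = 10, r = 0.812, SSY = 112):

```python
from math import sqrt

n, r, ss_y = 10, 0.812, 112

t = r / sqrt((1 - r ** 2) / (n - 2))
f = (r ** 2 * ss_y / 1) / ((1 - r ** 2) * ss_y / (n - 2))
print(t ** 2, f)  # identical: about 15.48
```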
Multiple Regression with Two Predictor Variables
- The process of using several predictor variables to obtain more accurate predictions is called multiple regression.
- It is possible to combine a large number of predictor variables in a single multiple-regression equation.
- Here, two predictor variables, X1 and X2, are used to predict the value of Y.
The regression equation with two predictors is:
Ŷ = b1X1 + b2X2 + a
If zY, zX1, and zX2 are the z-score transformations of Y, X1, and X2, then the standardized form is:
ẑY = (beta1)zX1 + (beta2)zX2
- SSX1 is the sum of squared deviations for X1
- SSX2 is the sum of squared deviations for X2
- SPX1Y is the sum of products of deviations for X1 and Y
- SPX2Y is the sum of products of deviations for X2 and Y
- SPX1X2 is the sum of products of deviations for X1 and X2
The least-squares solutions for the two-predictor equation are:
b1 = (SPX1Y·SSX2 – SPX1X2·SPX2Y) / (SSX1·SSX2 – (SPX1X2)²)
b2 = (SPX2Y·SSX1 – SPX1X2·SPX1Y) / (SSX1·SSX2 – (SPX1X2)²)
a = MY – b1·MX1 – b2·MX2
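As a sketch, these formulas can be wrapped in a small helper function; the function and argument names are mine, chosen to mirror the SS and SP symbols above:

```python
# Two-predictor least-squares coefficients from SS and SP values.
def two_predictor_fit(ss_x1, ss_x2, sp_x1y, sp_x2y, sp_x1x2,
                      m_y, m_x1, m_x2):
    denom = ss_x1 * ss_x2 - sp_x1x2 ** 2
    b1 = (sp_x1y * ss_x2 - sp_x1x2 * sp_x2y) / denom
    b2 = (sp_x2y * ss_x1 - sp_x1x2 * sp_x1y) / denom
    a = m_y - b1 * m_x1 - b2 * m_x2  # constant
    return b1, b2, a
```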
Example: Compute the coefficients b1 and b2, the constant a, and the regression equation for the data in the table below.
Solution:
- Calculate the means MY, MX1, and MX2 of Y, X1, and X2
- Calculate SSY , SSX1 , SSX2
- Calculate SPX1Y , SPX2Y and SPX1X2
Calculate b1:
b1 = (SPX1Y·SSX2 – SPX1X2·SPX2Y) / (SSX1·SSX2 – (SPX1X2)²) = 0.779
Calculate b2:
b2 = (SPX2Y·SSX1 – SPX1X2·SPX1Y) / (SSX1·SSX2 – (SPX1X2)²) = 0.280
Calculate a:
a = MY – b1·MX1 – b2·MX2
a = 7.8 – 0.779(5.3) – 0.280(6.9)
a = 1.74
Regression equation: Ŷ = 0.779X1 + 0.280X2 + 1.74
Standard Error of Estimate for Multiple Regression
- The standard error of estimate for a linear regression equation is the standard distance between the regression line and the actual data points.
- The standard error of the estimate can be defined as the standard distance between the predicted Y values (from the regression equation) and the actual Y values (in the data).
Standard error of estimate for linear regression (one predictor):
SSresidual = (1 – r²)SSY, with df = n – 2
Standard error of estimate for multiple regression (two predictors X1 and X2):
SSresidual = (1 – R²)SSY, with df = n – 3
In both cases, standard error of estimate = √(SSresidual / df).
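A minimal sketch, using the two-predictor example values that appear later in this section (R² = 0.5562, SSY = 90, n = 10):

```python
from math import sqrt

n, r_squared, ss_y = 10, 0.5562, 90

ss_residual = (1 - r_squared) * ss_y  # 39.94
df = n - 3                            # two predictors
print(sqrt(ss_residual / df))         # about 2.39
```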
Significance of the Multiple Regression Equation: Analysis of Regression
- The significance of a multiple-regression equation is evaluated with an F-ratio.
- The F-ratio determines whether the equation predicts a significant portion of the variance in the Y scores. The total variability of the Y scores is partitioned into two components, SSregression and SSresidual.

F = MSregression / MSresidual, where
dfregression = the number of predictors (1 for one predictor X1; 2 for two predictors X1 and X2)
dfresidual = n – 2 for one predictor; n – 3 for two predictors
Contribution of Individual Predictor Variable
- In the standardized form of the regression equation, the relative size of the beta values is an indication of the relative contribution of the two variables.
The standardized regression equation is:
ẑY = (beta1)zX1 + (beta2)zX2
ẑY = 0.558zX1 + 0.247zX2
- Both betas are positive indicating that both X1 and X2 are positively related to Y.
- The multiple-regression equation with both X1 and X2 predicts R² = 55.62% of the variance in the Y scores.
- To determine how much is predicted by X1 alone, we begin with the correlation between X1 and Y:
r = SPX1Y / √(SSX1·SSY) = 0.7229
r² = 0.5226, or 52.26%
- The additional contribution made by adding X2 to the regression equation can be computed as:
= (% with both X1 and X2 ) − (% with X1 alone)
= 55.62% − 52.26%
= 3.36%
With SSY = 90, the additional variability from adding X2 as a predictor amounts to:
SSadditional = 3.36% of 90 = 0.0336(90) = 3.02
This SS value has df = 1, so:
MSadditional = 3.02 / 1 = 3.02
The F-ratio evaluating the significance of this additional contribution is:
F = MSadditional / MSresidual = 3.024 / 5.71 = 0.529
where MSresidual = (1 – R²)SSY / (n – 3) = (1 – 0.5562)(90) / 7 = 5.71
df = 1, 7
- At α = 0.05, F(1, 7) = 5.59
- At α = 0.01, F(1, 7) = 12.2
- The F-ratio is not significant.
It can be concluded that adding X2 to the regression equation does not significantly improve the prediction compared to using X1 as a single predictor.
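A short sketch reproducing this added-predictor test:

```python
# Does X2 add anything beyond X1? (R^2 = 0.5562 with both predictors,
# r^2 = 0.5226 with X1 alone, SS_Y = 90, n = 10.)
n, ss_y = 10, 90
r2_both, r2_x1 = 0.5562, 0.5226

ss_additional = (r2_both - r2_x1) * ss_y      # 3.02
ms_additional = ss_additional / 1             # df = 1
ms_residual = (1 - r2_both) * ss_y / (n - 3)  # 5.71
print(ms_additional / ms_residual)            # about 0.53 < F(1, 7) = 5.59
```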