What is Correlation?
- The Pearson correlation measures the degree and the direction of the linear relationship between two variables.
Prerequisites
- The sum of squares (SS)
- Computational formula
- Definitional formula
- z-scores
- Hypothesis testing
- Correlation is a statistical technique used to measure and describe the relationship between two variables.
Characteristics of a relationship
- The sign of the correlation, positive or negative, describes the direction of the relationship.
- In a positive correlation, the two variables tend to change in the same direction: as the value of the X variable increases from one individual to another, the Y variable also tends to increase; when the X variable decreases, the Y variable also decreases.
- In a negative correlation, the two variables tend to go in opposite directions: as the X variable increases, the Y variable decreases. That is, it is an inverse relationship.
Form of the Relationship
The Strength or Consistency of the Relationship
- A perfect correlation is identified by a correlation of 1.00 and indicates a perfectly consistent relationship.
- For a correlation of 1.00 (or –1.00), each change in X is accompanied by a perfectly predictable change in Y.
- At the other extreme, a correlation of 0 indicates no consistency at all.
- For a correlation of 0, the data points are scattered randomly with no clear trend. Intermediate values between 0 and 1 indicate the degree of consistency.
Pearson correlation
- The Pearson correlation for a sample is identified by the letter r. The corresponding correlation for the entire population is identified by the Greek letter rho (ρ), which is the Greek equivalent of the letter r. Conceptually, this correlation is computed as

r = (degree to which X and Y vary together) / (degree to which X and Y vary separately)
Sum of squares
The sum of Products of Deviation
The definitional formula for the sum of products is

SP = Σ(X − MX)(Y − MY)

where
- MX is the mean for the X scores
- MY is the mean for the Y scores
OR, using the equivalent computational formula,

SP = ΣXY − (ΣX)(ΣY)/n
Formula for the Pearson Correlation

r = SP / √(SSX · SSY)
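As a concrete illustration, here is a minimal Python sketch of the definitional SS/SP calculation (the data and the function name `pearson_r` are hypothetical, not from the notes):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r computed from the definitional SS and SP formulas."""
    n = len(xs)
    mx = sum(xs) / n          # MX, mean of the X scores
    my = sum(ys) / n          # MY, mean of the Y scores
    ss_x = sum((x - mx) ** 2 for x in xs)                   # SS for X
    ss_y = sum((y - my) ** 2 for y in ys)                   # SS for Y
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # sum of products
    return sp / sqrt(ss_x * ss_y)

# Perfectly linear data should give r = 1.00
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 1.0
```

For perfectly linear data every deviation product is positive and SP reaches its maximum, √(SSX · SSY), so the ratio is exactly 1.00.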
The Pearson Correlation and z-Scores
- z-scores identify the location of each individual score within a distribution.
- Each X value can be transformed into a z-score, zX, using the mean and standard deviation for the set of Xs
- Each Y value can be transformed into zY
- After the transformation, the formula for the Pearson correlation can be expressed in terms of z-scores.
For a sample,

r = ΣzXzY / (n − 1)

For a population,

ρ = ΣzXzY / N
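A short sketch of the sample z-score form (assuming sample standard deviations with n − 1 in the denominator; the data and the name `pearson_from_z` are hypothetical):

```python
from math import sqrt

def pearson_from_z(xs, ys):
    """Sample Pearson r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))  # sample SD of X
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))  # sample SD of Y
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

print(pearson_from_z([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 1.0
```

The z-score form gives exactly the same value as the SS/SP form, because ΣzXzY works out to SP divided by (sX · sY).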
Correlations Interpretation
Correlation simply describes a relationship between two variables. It does not explain why the two variables are related.
- One of the most common errors in interpreting correlations is to assume that a correlation necessarily implies a cause-and-effect relationship between the two variables
- To establish a cause-and-effect relationship, it is necessary to conduct an experiment in which one variable is manipulated and other variables are purposely controlled.
- A correlation by itself cannot be interpreted as proof of a cause-and-effect relationship between the two variables.
The value of a correlation can be affected greatly by the range of scores represented in the data.
- Be cautious whenever a correlation is computed from scores that do not represent the full range of possible values.
- The correlation within a restricted range could be completely different from the correlation that would be obtained from a full range.
- Correlation should not be generalized beyond the range of data represented in the sample.
- For a correlation to provide an accurate description of the general population, there should be a wide range of X and Y values in the data.
One or two extreme data points, often called outliers, can have a dramatic effect on the value of a correlation
- An outlier is an individual with X and/or Y values that are substantially different (larger or smaller) from the values obtained for the other individuals in the data set. The data point of a single outlier can have a dramatic influence on the value obtained for the correlation.
- If you only “go by the numbers,” you might overlook the fact that one extreme data point inflated the size of the correlation.
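The outlier effect is easy to demonstrate numerically. In this sketch (hypothetical data), a moderate correlation of 0.50 jumps to nearly 1.00 after a single extreme point is added:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from SS and SP."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)

base_x, base_y = [1, 2, 3, 4, 5], [3, 1, 4, 2, 5]    # weak relationship
r_without = pearson_r(base_x, base_y)
r_with = pearson_r(base_x + [20], base_y + [20])     # one extreme outlier
print(r_without, r_with)  # the outlier sharply inflates the correlation
```

Always inspect a scatter plot before trusting the number: the five original points show no strong trend, yet the single outlier drags the correlation toward 1.00.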
Strength of the relationship
A correlation measures the degree of relationship between two variables on a scale from 0 to 1.00 (in absolute value). Squaring the correlation gives a measure of the strength of the relationship.
- A correlation of 1.00 does mean that there is a 100% perfectly predictable relationship between X and Y.
- A correlation of .5 does not mean that you can make predictions with 50% accuracy.
- The value r2 is called the coefficient of determination because it measures the proportion of variability in one variable that can be determined from the relationship with the other variable.
- For a correlation of r = 0.90, r2 = 0.81 (or 81%), meaning 81% of the variability in the Y scores can be predicted from the relationship with X.
- r2 measures how much of the variance in the dependent variable is accounted for by the independent variable.
Partial correlation
A partial correlation measures the relationship between two variables while controlling the influence of a third variable by holding it constant.
With three variables, X, Y, and Z, it is possible to compute three individual Pearson correlations:
- rXY measuring the correlation between X and Y
- rXZ measuring the correlation between X and Z
- rYZ measuring the correlation between Y and Z
- The partial correlation between X and Y, holding Z constant, is

rXY·Z = (rXY − rXZ·rYZ) / √[(1 − rXZ²)(1 − rYZ²)]
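A minimal sketch of the partial-correlation calculation from the three pairwise correlations (the function name `partial_corr` and the example values are hypothetical):

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation between X and Y, holding Z constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# If Z is unrelated to both X and Y, controlling Z changes nothing:
print(partial_corr(0.5, 0.0, 0.0))   # 0.5
# If X and Y are related only through Z, the partial correlation vanishes:
print(partial_corr(0.6, 0.8, 0.75))  # 0.0
```

The second example shows the main use of partial correlation: an apparent relationship of 0.60 between X and Y disappears entirely once the shared third variable Z is held constant.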
Null Hypotheses
- The null hypothesis says "No":
- There is no correlation in the population.
- The population correlation is zero.
H0 : ρ = 0 (There is no population correlation.)
- If there is a specific prediction about the direction of the correlation, e.g. a positive relationship:
H0 : ρ ≤ 0 (The population correlation is not positive.)
Alternative hypothesis
- The alternative hypothesis says "Yes":
- There is a real, nonzero correlation in the population.
H1 : ρ ≠ 0 (There is a real correlation.)
- If there is a specific prediction about the direction of the correlation, e.g. a positive relationship:
H1 : ρ > 0 (The population correlation is positive.)
t statistic
- The t statistic for a correlation is

t = r / √[(1 − r²)/(n − 2)]
Degrees of Freedom for the t Statistic
- df = n – 2
- With only n = 2 data points, there are no degrees of freedom. Specifically, two points always fit perfectly on a straight line, so the sample produces a perfect correlation of 1.00 (or −1.00) no matter what the population correlation is.
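A short sketch of the t calculation (the function name `t_for_correlation` and the sample values are hypothetical):

```python
from math import sqrt

def t_for_correlation(r, n):
    """t statistic for testing H0: rho = 0, with df = n - 2."""
    df = n - 2
    return r / sqrt((1 - r**2) / df)

# Example: r = 0.50 from a sample of n = 27 (df = 25)
print(t_for_correlation(0.50, 27))  # about 2.89
```

The resulting t value would then be compared to a critical value from the t distribution with df = n − 2 at the chosen alpha level.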
Spearman Correlation
- When the Pearson correlation formula is used with data from an ordinal scale (ranks), the result is called the Spearman correlation.
- The Spearman correlation is used in two situations. First, it is used to measure the relationship between X and Y when both variables are measured on ordinal scales; that is, when the original X and Y values are ranks. Second, even when the original scores are numerical, they can be converted to ranks and the Spearman correlation used to measure the consistency of the relationship, independent of its specific form.
- Spearman correlation measures consistency, rather than form, and comes from a simple observation:
- If there is a consistently one-directional relationship between two variables, the relationship is said to be monotonic.
- Spearman correlation measures the degree of a monotonic relationship between two variables
- Spearman correlation is identified by the symbol rS to differentiate it from the Pearson correlation
Formula
- After the original X values and Y values have been ranked, calculate SS and SP for the ranks and apply the Pearson formula, or use the special formula below.
- The ranks are simply the integers 1, 2, …, n, so their mean is

M = (n + 1)/2

and the SS for this series of integers is

SS = n(n² − 1)/12

- Special formula (no tied ranks):

rS = 1 − 6ΣD² / [n(n² − 1)]

where D is the difference between the X rank and the Y rank for each individual.
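A minimal sketch of the rank-then-difference calculation (assuming no tied ranks; the data and the name `spearman_rs` are hypothetical):

```python
def spearman_rs(xs, ys):
    """Spearman rS via the special D-difference formula (no tied ranks)."""
    def ranks(values):
        # Assign rank 1 to the smallest value, rank n to the largest.
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * sum_d2) / (n * (n**2 - 1))

# A monotonic but clearly nonlinear relationship still gives rS = 1.00
print(spearman_rs([1, 2, 3, 4], [1, 4, 9, 100]))  # 1.0
```

Note that the Y values grow wildly nonlinearly, yet rS = 1.00 because the relationship is perfectly consistent in one direction, which is exactly what Spearman measures.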
Point-Biserial Correlation
- The point-biserial correlation is used to measure the relationship between two variables in situations in which one variable consists of regular, numerical scores, but the second variable has only two values.
- A variable with only two values is called a dichotomous variable or a binomial variable.
- Examples of dichotomous variables:
- Pass / Fail
- Good / Bad
- Success / Failure
- Dichotomous variables are first converted to numerical values by assigning a value of 0 to one category and a value of 1 to the other category. Then the regular Pearson correlation formula is used with the converted data.
- The same t statistic (with df = n − 2) is used to test the point-biserial correlation for statistical significance.
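A sketch of the conversion step with hypothetical pass/fail data (the grouping and scores are invented for illustration):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Regular Pearson formula, applied to the converted data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)

# Dichotomous variable coded 0/1 (e.g., fail = 0, pass = 1),
# paired with a regular numerical score for each individual.
group = [0, 0, 0, 1, 1, 1]
scores = [2, 3, 4, 8, 9, 10]
r_pb = pearson_r(group, scores)
print(r_pb)  # strongly positive: the 1-group scores much higher
```

Because the two groups barely overlap in their scores, the point-biserial correlation comes out close to 1.00.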
Phi-Coefficient
- When both variables (X and Y) measured for each individual are dichotomous, the correlation between the two variables is called the phi-coefficient.
- To compute phi (Φ) :
- Convert each of the dichotomous variables to numerical values by assigning a 0 to one category and a 1 to the other category for each of the variables.
- Use the Pearson formula with the converted scores.
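The phi-coefficient is the same two-step recipe with both variables coded 0/1. A sketch with hypothetical data:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Regular Pearson formula, applied to the 0/1-converted data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)

# Both variables dichotomous, coded 0/1 (hypothetical data:
# five of six individuals fall in matching categories)
x = [0, 0, 1, 1, 0, 1]
y = [0, 0, 1, 1, 1, 1]
phi = pearson_r(x, y)
print(phi)  # about 0.71
```

With perfect agreement between the two dichotomous variables, phi would be exactly 1.00; the single mismatched individual pulls it down to roughly 0.71.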
Also Read
- https://matistics.com/statistics-data-variables/
- https://matistics.com/descriptive-statistics/
- https://matistics.com/1-1-measurement-scale/
- https://matistics.com/point-biserial-correlation-and-biserial-correlation/
- https://matistics.com/2-0-statistics-distributions/
- https://matistics.com/1-2-statistics-population-and-sample/
- https://matistics.com/7-hypothesis-testing/
- https://matistics.com/8-errors-in-hypothesis-testing/
- https://matistics.com/9-one-tailed-hypothesis-test/
- https://matistics.com/10-statistical-power/
- https://matistics.com/11-t-statistics/
- https://matistics.com/12-hypothesis-t-test-one-sample/
- https://matistics.com/13-hypothesis-t-test-2-sample/
- https://matistics.com/14-t-test-for-two-related-samples/
- https://matistics.com/15-analysis-of-variance-anova-independent-measures/
- https://matistics.com/16-anova-repeated-measures/
- https://matistics.com/17-two-factor-anova-independent-measures/
- https://matistics.com/18-correlation/
- https://matistics.com/19-regression/
- https://matistics.com/20-chi-square-statistic/
- https://matistics.com/21-binomial-test/