Correlation Coefficient
The (Pearson) correlation coefficient, r, measures the strength and direction of the linear relationship between two variables. It is standardized, so it has no units and always lies between -1 and 1.
Formula:
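$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}$$
where:
- $r$ is the correlation coefficient
- $n$ is the number of paired observations $(x, y)$
- $\sum x$ and $\sum y$ are the sums of the x values and of the y values
- $\sum xy$ is the sum of the products of each pair
- $\sum x^2$ and $\sum y^2$ are the sums of the squared x values and of the squared y values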
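As a minimal sketch, Pearson's r can be computed directly from running sums over the paired data, using the raw-sum form r = (nΣxy - ΣxΣy) / sqrt((nΣx² - (Σx)²)(nΣy² - (Σy)²)). The helper name `pearson_r` and the sample data are assumptions made for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson's r from raw sums (hypothetical helper, for illustration)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    # Raises ZeroDivisionError if x or y is constant: r is undefined in that case
    return numerator / denominator

# Illustrative data (assumed, not from a real study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 4))  # prints 0.7746
```

For large values or long series, the mean-centered form is usually numerically safer than raw sums, which is one reason to prefer a library routine in practice.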
Interpretation:
- r = 1: Perfect positive linear relationship
- r = 0: No linear relationship
- r = -1: Perfect negative linear relationship
- 0 < |r| < 1: Imperfect linear relationship. The closer |r| is to 1, the stronger the relationship, and the sign of r gives its direction.
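The three boundary cases can be reproduced with small made-up datasets. This is only a sketch: the data are assumptions chosen so that the relationship is exactly increasing, exactly decreasing, or has no linear trend.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_pos = 2 * x + 1                                # rises exactly with x
y_neg = -3 * x + 10                              # falls exactly with x
y_none = np.array([1.0, -1.0, 0.0, -1.0, 1.0])   # symmetric pattern, no linear trend

r_pos = np.corrcoef(x, y_pos)[0, 1]    # approximately 1
r_neg = np.corrcoef(x, y_neg)[0, 1]    # approximately -1
r_none = np.corrcoef(x, y_none)[0, 1]  # approximately 0

print(r_pos, r_neg, r_none)
```

The printed values may differ from 1, -1, and 0 in the last digits because of floating-point rounding.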
Factors Affecting Correlation Coefficient:
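- Sample size: With few observations, r varies widely from sample to sample; larger samples give more stable estimates.
- Data distribution: Strongly skewed variables or a restricted range of values can distort r. A restricted range typically makes the relationship look weaker than it is.
- Outliers: A single extreme point can substantially inflate or deflate r, and can even reverse its sign.
- Non-linear relationships: r only captures linear association. Two variables can be strongly related (for example, y = x² over a range symmetric about zero) and still have r close to 0.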
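Two of these effects are easy to see numerically. The sketch below uses invented data: a nearly linear series to which one extreme point is added, and a quadratic relationship over a range symmetric about zero.

```python
import numpy as np

# Nearly linear data: y is x plus a small alternating offset
x = np.arange(1, 11, dtype=float)
noise = np.array([0.5, -0.5] * 5)
y = x + noise
r_clean = np.corrcoef(x, y)[0, 1]

# Add a single extreme point (assumed outlier) and recompute
x_out = np.append(x, 11.0)
y_out = np.append(y, -20.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

# A perfect but non-linear (quadratic) relationship
x_sym = np.arange(-5, 6, dtype=float)
y_quad = x_sym ** 2
r_quad = np.corrcoef(x_sym, y_quad)[0, 1]

print("clean:", r_clean)
print("with one outlier:", r_outlier)
print("quadratic:", r_quad)
```

With these numbers the single outlier is enough to turn a strong positive r into a negative one, and the quadratic case gives r of essentially zero even though y is fully determined by x.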
Uses:
- Assessing the strength and direction of linear relationships.
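- Testing whether the relationship is statistically significant, that is, whether r differs from 0 by more than chance alone would explain.
- Supporting prediction of one variable from the other, typically through a linear regression fitted to the same data.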
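A rough sketch of the last two uses, under stated assumptions: the data are invented, significance is checked with the standard t statistic t = r·sqrt((n - 2) / (1 - r²)) with n - 2 degrees of freedom, and prediction uses an ordinary least-squares line from `np.polyfit`. A full analysis would normally also report a p-value.

```python
import math
import numpy as np

# Illustrative data (assumed): y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

n = len(x)
r = np.corrcoef(x, y)[0, 1]

# t statistic for H0: no linear correlation, with n - 2 degrees of freedom
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-sided 5% critical value of the t distribution for 4 degrees of freedom
T_CRITICAL_DF4 = 2.776
significant = abs(t_stat) > T_CRITICAL_DF4

# Least-squares line y = slope * x + intercept, then predict at a new x
slope, intercept = np.polyfit(x, y, 1)
x_new = 7.0
y_pred = slope * x_new + intercept

print("r:", r)
print("t statistic:", t_stat, "significant at 5%:", significant)
print("predicted y at x = 7:", y_pred)
```

Predictions of this kind are only as trustworthy as the linear model itself, and extrapolating far beyond the observed x range is risky.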
Example:
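```python
import numpy as np

# Paired observations; here y = x - 2 for every pair
x = np.array([10, 12, 14, 16, 18])
y = np.array([8, 10, 12, 14, 16])

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r for x and y
r = np.corrcoef(x, y)[0, 1]
print("Correlation coefficient:", r)
```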
Output:
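Correlation coefficient: 1.0
Because every y value equals x - 2, the points lie exactly on a straight line with positive slope, so r is 1: a perfect positive linear relationship. Depending on floating-point rounding, the printed value may appear as 0.9999999999999999 rather than exactly 1.0.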