Module IX·Article III·~6 min read

Correlation and Regression Analysis

Quantitative Data Analysis


Concept of Correlation

Correlation is a statistical measure describing the degree and direction of the linear relationship between two variables. Correlation analysis answers the question: do two variables change together in a systematic way?

Types of Correlational Relationships

  • Positive correlation — when one variable increases, the other also increases (for example, number of hours of preparation and exam results)
  • Negative correlation — when one variable increases, the other decreases (for example, stress level and sleep quality)
  • No correlation — changes in one variable are not systematically related to changes in the other (the correlation coefficient is close to zero)

Pearson's Correlation Coefficient (r)

Pearson's coefficient (Pearson's r) measures the strength and direction of the linear relationship between two interval or ratio variables. The values of r range from −1 to +1.

When to use: both variables are measured on an interval/ratio scale; the relationship between the variables is linear; the data are approximately normally distributed; there are no significant outliers.

Steps in SPSS: Analyze → Correlate → Bivariate → transfer the variables into the Variables list → make sure Pearson is checked → choose the test type (Two-tailed or One-tailed) → OK.

Interpretation of output: SPSS displays a correlation matrix. For each pair of variables, the following are shown: the correlation coefficient (r), the Sig. (2-tailed) — p-value, and the number of observations (N). If p < 0.05, the correlation is statistically significant.
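SPSS computes r from the standard covariance formula. As an illustration outside SPSS, here is a minimal pure-Python sketch (the hours/scores data are hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance numerator and the two sum-of-squares (spread) terms
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)

hours = [2, 4, 5, 7, 9]          # hypothetical preparation hours
scores = [55, 60, 62, 70, 80]    # hypothetical exam scores
r = pearson_r(hours, scores)
print(round(r, 3), round(r ** 2, 3))  # r and the coefficient of determination
```

Note that the code only returns r; the p-value that SPSS reports additionally depends on the sample size n.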

Spearman's Rank Correlation (ρ)

The Spearman coefficient (Spearman's rho, ρ) is a nonparametric measure of correlation based on the ranks of observations. It assesses a monotonic (not necessarily linear) relationship between variables.

When to use: one or both variables are measured on an ordinal scale; the data distribution deviates significantly from normal; the relationship is monotonic but not linear; there are significant outliers.

In SPSS: in the Bivariate Correlations window, check Spearman instead of (or in addition to) Pearson. Interpretation is similar: value of ρ, p-value, and N.
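Spearman's ρ is simply Pearson's r computed on the ranks of the data, with tied values sharing the average of their rank positions. A self-contained sketch:

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based rank
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho = Pearson's r applied to the ranked data."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

A perfectly monotonic but nonlinear pair (for example y = x²) gives ρ = 1 even though Pearson's r would be below 1 — exactly the situation where Spearman is preferred.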

Interpreting the Strength of Correlation

Cohen's recommendations (Cohen, 1988) for interpreting the absolute value of the correlation coefficient:

| \|r\| or \|ρ\| | Strength of relationship |
|---|---|
| 0.10 – 0.29 | Weak correlation |
| 0.30 – 0.49 | Moderate correlation |
| 0.50 – 1.00 | Strong correlation |

Important: these boundaries are approximate. In some research fields (for example, psychology) a correlation of r = 0.30 may be considered practically significant, while in physics such a value would be considered negligible.

The coefficient of determination shows the proportion of variance in one variable explained by the other. For example, r = 0.50 means r² = 0.25, that is, 25% of the variation in one variable is explained by the relationship with the other.

Correlation Does Not Imply Causation

Finding a correlation between two variables does not prove that one is the cause of the other. Possible explanations for correlation:

  1. Variable A influences B (direct causality)
  2. Variable B influences A (reverse causality)
  3. A third variable C influences both (spurious correlation, confounding variable)
  4. Random coincidence (especially with a large number of tested relationships)

Example: there is a correlation between ice cream sales and the number of drownings. This does not mean that ice cream causes drownings — both variables are associated with a third: hot weather.

Chi-Square Test (χ²) for Categorical Data

When both variables are categorical (nominal), the chi-square test is used to check the independence of the variables.

Hypotheses: H₀: variables are independent (no relationship); H₁: variables are associated.

Steps in SPSS: Analyze → Descriptive Statistics → Crosstabs → transfer variables into Row(s) and Column(s) → click Statistics → check Chi-square → OK.

Interpretation: in the Chi-Square Tests table, look at the “Pearson Chi-Square” row. If Asymp. Sig. (2-sided) < 0.05, the variables are statistically significantly associated. Applicability condition: at least 80% of cells should have expected frequencies ≥ 5, and no cell should have an expected frequency below 1.
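The χ² statistic compares each observed cell count with the count expected under independence (row total × column total / grand total). A minimal sketch for a contingency table given as a list of rows, with hypothetical counts:

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under H0 (independence)
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical 2x2 table: gender (rows) x preferred learning format (columns)
observed = [[30, 20],
            [15, 35]]
print(round(chi_square(observed), 2))
```

For this table χ² ≈ 9.09 with df = 1, well above the 3.84 critical value at α = 0.05, so the null hypothesis of independence would be rejected.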

Simple Linear Regression

Regression analysis not only establishes the presence of a relationship, but also allows us to predict the value of one variable based on another. Simple linear regression models the relationship between one independent (predictor, X) and one dependent (Y) variable.

Regression equation: Y = a + bX, where a is the constant (intercept, the value of Y when X = 0), b is the regression coefficient (slope, the change in Y when X increases by one unit).

Coefficient of determination R² shows the proportion of variance in the dependent variable explained by the model. R² = 0.45 means that 45% of the variation in Y is explained by the predictor X.

Conducting Regression in SPSS

Steps: Analyze → Regression → Linear → transfer the dependent variable into Dependent → transfer the predictor into Independent(s) → OK.

Interpretation of SPSS output:

  • Model Summary: R, R², Adjusted R² — show the quality of the model
  • ANOVA table: F-statistic and Sig. — tests the significance of the model as a whole. If Sig. < 0.05, the model is statistically significant
  • Coefficients table: values of a (Constant) and b (predictor coefficient), standard errors, t-statistic, and p-value for each coefficient. The B column contains unstandardized coefficients; the Beta column contains standardized ones.

Example interpretation: if b = 2.3 for the variable "hours of preparation," this means that each additional hour of preparation is associated with an increase in exam results by an average of 2.3 points (all else being equal).
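The OLS estimates have closed forms: b = cov(X,Y) / var(X) and a = ȳ − b·x̄, with R² computed from the residual and total sums of squares. A sketch on made-up preparation-hours data:

```python
def fit_line(x, y):
    """Ordinary least squares for Y = a + bX; returns (a, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Slope: covariance of X and Y divided by variance of X
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x  # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot  # share of variance explained by the model
    return a, b, r_squared

# Hypothetical data: preparation hours vs. exam score
hours = [1, 2, 3, 5, 6, 8]
score = [48, 55, 53, 63, 61, 72]
a, b, r2 = fit_line(hours, score)
print(a + b * 4)  # predicted score for 4 hours of preparation
```

This mirrors what the SPSS Coefficients and Model Summary tables report, minus the standard errors and significance tests.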

Multiple Regression: Introduction

Multiple linear regression expands simple regression by including several predictors: Y = a + b₁X₁ + b₂X₂ + ... + bₖXₖ.
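With several predictors the coefficients are obtained by solving the normal equations (XᵀX)b = Xᵀy, where X carries a leading column of ones for the intercept. A compact library-free sketch using Gauss–Jordan elimination:

```python
def solve(A, b):
    """Solve the linear system A.x = b by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def multiple_regression(X, y):
    """OLS coefficients [a, b1, ..., bk] via the normal equations (X'X)b = X'y."""
    Z = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(XtX, Xty)
```

For real analyses SPSS (or a statistics library) also supplies standard errors, t-tests, and diagnostics; the sketch only recovers the point estimates.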

Advantages: allows you to control for the influence of other variables; evaluates the unique contribution of each predictor; increases the accuracy of prediction.

Standardized coefficients (Beta) allow comparison of the relative importance of predictors measured in different units. The larger the absolute value of Beta, the stronger the predictor's unique contribution to the model.

Multicollinearity — a problem arising when there is a high correlation between predictors. Checked in SPSS via Statistics → Collinearity diagnostics. A value of VIF (Variance Inflation Factor) > 10 indicates multicollinearity.
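In the special case of exactly two predictors, VIF has a simple closed form: VIF = 1 / (1 − r²), where r is the correlation between the two predictors. A sketch (the predictor correlation of 0.95 is hypothetical):

```python
def vif_two_predictors(r):
    """VIF for the two-predictor case, where the R^2 of one predictor
    regressed on the other is simply r squared."""
    return 1 / (1 - r ** 2)

print(round(vif_two_predictors(0.95), 2))  # exceeds the common VIF > 10 cutoff
```

With more predictors, R² for each predictor must be obtained by regressing it on all the others, which is what SPSS's collinearity diagnostics do internally.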

Practical Tasks

Task 1. The researcher obtained r = 0.42 (p = 0.003, n = 48) between the number of hours of sleep and academic performance. Interpret the result: determine the strength of the correlation according to Cohen, calculate r² and explain its meaning. Solution: moderate positive correlation; r² = 0.176, that is, about 17.6% of the variation in performance is explained by the number of hours of sleep. The relationship is statistically significant (p < 0.05).
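The arithmetic in Task 1 can be checked in a few lines, applying the Cohen boundaries from the table above:

```python
r = 0.42
r_squared = r ** 2  # proportion of variance explained
strength = ("weak" if abs(r) < 0.30 else
            "moderate" if abs(r) < 0.50 else "strong")
print(strength, round(r_squared, 3))  # moderate 0.176
```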

Task 2. In the study, a strong positive correlation (r = 0.78) was found between the number of fire engines arriving at the site and the amount of fire damage. Can we conclude that fire engines increase the damage? Solution: no, this is an example of spurious correlation. A third variable — the scale of the fire — influences both: large fires require more engines and cause more damage.

Task 3. According to regression analysis, the equation obtained is: Score = 45.2 + 3.1 × Preparation_Hours (R² = 0.38, p < 0.001). Interpret: (a) the value of the constant, (b) the regression coefficient, (c) R². Solution: (a) with zero preparation, the expected score is 45.2 points; (b) each additional hour of preparation is associated with an increase of 3.1 points in score; (c) the model explains 38% of the variation in scores.

Task 4. The researcher wants to study the relationship between gender (male/female) and preference for learning format (online/offline). Which statistical test should be used and why? Solution: chi-square test, since both variables are categorical (nominal). Pearson or Spearman correlation are not suitable for nominal data.
