Module 5

Regression & Correlation Analysis

Explore relationships between variables using correlation, simple and multiple regression, and predictive modeling for banking operations.


Understanding Relationships Between Variables

In the previous module, you learned to test whether differences between groups are statistically significant. Regression and correlation analysis take a different approach — they explore the relationships between continuous variables. Instead of asking "is group A different from group B?", you ask "as variable X changes, how does variable Y change?"

In banking operations, this capability is powerful:

  • Does trade volume predict the number of reconciliation breaks?
  • Is there a relationship between analyst experience (months on the job) and case processing time?
  • Can you predict next week's exception volume based on expected market activity?
  • How much does each additional system change deployment contribute to operational errors?

Understanding these relationships helps Green Belts identify root causes (what drives defects), predict future performance (capacity planning), and prioritize improvements (focus on the factors with the largest impact).

Correlation: Measuring Linear Relationships

Correlation measures the strength and direction of the linear relationship between two continuous variables. The Pearson correlation coefficient (r) ranges from -1 to +1:

  • r = +1 — Perfect positive linear relationship (as X increases, Y increases proportionally)
  • r = 0 — No linear relationship
  • r = -1 — Perfect negative linear relationship (as X increases, Y decreases proportionally)
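The Pearson coefficient can be computed directly from its definition (covariance divided by the product of the standard deviations). A minimal sketch in plain Python; in practice you would use a statistics package such as scipy.stats.pearsonr or numpy.corrcoef:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# A perfectly proportional pair of series gives r = +1
print(round(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]), 4))  # 1.0
```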

Interpreting Correlation Strength

  • |r| 0.00 - 0.19 — Very weak or no correlation
  • |r| 0.20 - 0.39 — Weak correlation
  • |r| 0.40 - 0.59 — Moderate correlation
  • |r| 0.60 - 0.79 — Strong correlation
  • |r| 0.80 - 1.00 — Very strong correlation
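The bands above can be expressed as a small helper function. The name and return labels are illustrative choices of my own; only the cut-off values come from the table:

```python
def correlation_strength(r):
    """Map |r| to the descriptive bands in the table above (illustrative)."""
    magnitude = abs(r)
    if magnitude < 0.20:
        return "very weak or none"
    elif magnitude < 0.40:
        return "weak"
    elif magnitude < 0.60:
        return "moderate"
    elif magnitude < 0.80:
        return "strong"
    return "very strong"

print(correlation_strength(-0.65))  # strong
```

Note that the bands apply to the magnitude of r, so a correlation of -0.65 is just as strong as +0.65, only in the opposite direction.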

Scatter Plots: Visualizing Relationships

Always create a scatter plot before calculating correlation. The scatter plot reveals:

  • Direction: Positive slope (upward) or negative slope (downward)
  • Strength: How tightly the points cluster around a line
  • Linearity: Whether the relationship is linear or curved
  • Outliers: Individual points that deviate significantly from the pattern

In banking, you might plot:

  • Trade volume (X) vs. settlement failures (Y) — expecting a positive correlation
  • Analyst tenure in months (X) vs. error rate (Y) — expecting a negative correlation
  • Number of system changes in a week (X) vs. exception count (Y) — expecting a positive correlation

Correlation vs. Causation

This distinction is critical and frequently misunderstood. Correlation does not prove causation. Two variables may be correlated because:

  1. X causes Y — Higher trade volume causes more settlement failures (plausible)
  2. Y causes X — More settlement failures cause higher trade volume (unlikely in this case)
  3. A third variable (Z) causes both — Market volatility drives both higher trade volumes and more settlement failures. The relationship between volume and failures may partially (or entirely) disappear when you control for volatility.
  4. Coincidence — Two variables may be correlated by chance, especially with small samples or when fishing through many variable pairs

In banking operations, be especially cautious about:

  • Time-based confounding: Many banking metrics trend upward over time (volume, complexity, regulatory requirements). Two metrics may be correlated simply because both are increasing over time, not because one drives the other.
  • Volume effects: Almost every defect metric correlates positively with volume. More trades mean more breaks; more payments mean more exceptions. This does not mean volume "causes" defects — it simply means there are more opportunities for defects. Normalize by volume (defect rate per 1,000 transactions) before drawing conclusions.
  • Seasonality: End-of-month, end-of-quarter, and year-end patterns create correlations that may be misleading if not recognized.

Simple Linear Regression

While correlation tells you that a relationship exists, regression models the relationship mathematically, allowing you to:

  • Quantify how much Y changes for each unit change in X
  • Predict Y values for given X values
  • Assess how well X explains the variation in Y

The Regression Equation

Simple linear regression fits the equation:

Y = β₀ + β₁X + ε

Where:

  • Y = dependent variable (what you are trying to predict or explain)
  • X = independent variable (the predictor)
  • β₀ = intercept (the value of Y when X = 0)
  • β₁ = slope (the change in Y for each one-unit increase in X)
  • ε = error term (the variation in Y that X does not explain)
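The least-squares estimates of β₀ and β₁ have closed-form solutions, sketched below in plain Python. In practice you would fit the model in Minitab, statsmodels, or scikit-learn:

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Noise-free data built from y = 12.4 + 0.023x recovers those coefficients
volumes = [8000, 12000, 15000, 20000, 22000]
breaks = [12.4 + 0.023 * v for v in volumes]
b0, b1 = fit_simple_ols(volumes, breaks)
print(round(b0, 1), round(b1, 3))  # 12.4 0.023
```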

Banking Example: Trade Volume and Reconciliation Breaks

A Green Belt in the securities operations team wants to understand the relationship between daily trade volume and the number of reconciliation breaks. They collect 60 days of data.

After fitting the regression model:

Breaks = 12.4 + 0.023 × Trade Volume

Interpreting the coefficients:

  • Intercept (12.4): Even with zero trades, the model predicts 12.4 breaks. This represents a baseline level of breaks from non-trade sources (corporate actions, standing instruction changes, system issues). Whether this intercept is meaningful depends on the context — if trade volumes never approach zero, the intercept is an extrapolation.
  • Slope (0.023): For every additional 1,000 trades, the model predicts approximately 23 additional breaks. This is the marginal break rate.

Example prediction: On a day when the bank processes 15,000 trades:

Predicted breaks = 12.4 + 0.023 × 15,000 = 12.4 + 345 = 357.4 breaks

This prediction allows the operations team to staff accordingly. If tomorrow's expected trade volume is 20,000 (perhaps due to a major index rebalancing), the model predicts:

Predicted breaks = 12.4 + 0.023 × 20,000 = 12.4 + 460 = 472.4 breaks

The team can proactively allocate additional analysts for break resolution.

Interpreting R-Squared (R²)

R² (coefficient of determination) tells you what proportion of the variation in Y is explained by the regression model.

R² = 1 - (Sum of Squared Residuals / Total Sum of Squares)

  • R² = 0.85 means 85% of the variation in break volume is explained by trade volume. The remaining 15% is due to other factors not in the model.
  • R² = 0.35 means trade volume explains only 35% of the variation in breaks. Other factors (counterparty mix, system stability, day of week) likely play significant roles.
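The formula above translates directly into code. A minimal sketch:

```python
def r_squared(actual, predicted):
    """R² = 1 - (sum of squared residuals / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ssr = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    sst = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ssr / sst

# Perfect predictions give R² = 1; always predicting the mean gives R² = 0
print(r_squared([10, 20, 30], [10, 20, 30]))  # 1.0
print(r_squared([10, 20, 30], [20, 20, 20]))  # 0.0
```

A model that predicts worse than the mean can even produce a negative R², which is a clear sign something is wrong.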

What Makes a "Good" R²?

There is no universal threshold. Context matters:

  • In highly controlled processes (automated payments matching), R² > 0.80 may be achievable
  • In complex processes influenced by many factors (KYC case cycle time), R² of 0.40-0.60 from a single predictor may be excellent — it means you have found a meaningful driver
  • An R² of 0.95+ is rare and should be verified — it may indicate overfitting or a definitional relationship (e.g., total breaks = equity breaks + FI breaks + derivatives breaks is a perfect relationship, not a predictive model)

Regression Model Evaluation

  • R² (simple model): 0.85 — explains 85% of the variation in breaks
  • Adjusted R²: 0.82 — accounts for the number of predictors
  • p-value: <0.001 — the model is statistically significant
  • Key predictor: trade volume (coefficient: 0.023 breaks per trade)

Statistical Significance of the Model

R² alone does not tell you whether the relationship is statistically significant. Check:

  • F-test for overall model significance (p-value for the entire regression)
  • t-test for each coefficient (is β₁ significantly different from zero?)
  • Confidence interval for the slope (does the interval include zero? If so, the relationship may not be significant)

Residual Analysis: Checking Model Validity

The residuals (ε = actual Y - predicted Y) should be examined to verify that the regression model is appropriate. Plot residuals against predicted values and look for:

Patterns That Indicate Problems

Random scatter (good): Residuals are randomly distributed around zero with constant spread. This confirms the model assumptions are met.

Funnel shape (heteroscedasticity): Residuals spread wider as predicted values increase. This means the model is less accurate for larger predictions. Common in banking when low-volume days have tight break counts but high-volume days have highly variable break counts. Solution: Consider log-transforming the dependent variable or using weighted regression.

Curved pattern (non-linearity): Residuals show a systematic curve, suggesting the true relationship is not linear. Solution: Consider adding a quadratic term (X²) or transforming variables.

Clusters: Distinct groups of residuals suggest a missing categorical variable. For example, if breaks cluster differently on system-change days vs. non-system-change days, adding a system-change indicator variable would improve the model.

Time patterns: If residuals show a trend or cyclical pattern over time, there may be autocorrelation — observations are not independent. This is common in daily banking data and may require time-series techniques.

Additional Diagnostic Checks

  • Normality of residuals: Create a histogram or normal probability plot of residuals. They should be approximately normally distributed for inference (confidence intervals, p-values) to be valid.
  • Influential observations: Check for individual data points that disproportionately influence the regression line. Use Cook's distance or leverage values. In banking, an influential observation might be a day with an unusual event (system outage, massive market correction) that should be investigated and potentially excluded or modeled separately.
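One of these checks, the funnel shape, can be roughed out numerically by comparing residual spread in the upper and lower halves of the predicted values. This is a crude stand-in of my own devising; a residual plot, or a formal test such as Breusch-Pagan, remains the proper diagnostic:

```python
def funnel_ratio(predicted, residuals):
    """Crude heteroscedasticity check: average absolute residual in the
    upper half of predicted values divided by that in the lower half.
    Ratios well above 1 suggest a funnel shape (spread growing with the
    prediction); a residual plot or Breusch-Pagan test is the real check."""
    pairs = sorted(zip(predicted, residuals))
    half = len(pairs) // 2
    low = [abs(r) for _, r in pairs[:half]]
    high = [abs(r) for _, r in pairs[half:]]
    return (sum(high) / len(high)) / (sum(low) / len(low))

# Residuals that grow with the prediction produce a ratio well above 1
preds = list(range(1, 11))
resids = [0.1 * p * (-1) ** p for p in preds]
print(round(funnel_ratio(preds, resids), 2))  # 2.67
```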

Multiple Regression: Adding More Predictors

Simple regression uses one predictor. Multiple regression uses two or more predictors, which is almost always more realistic in banking operations where outcomes are influenced by many factors simultaneously.

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ... + ε

Banking Example: Predicting Daily Reconciliation Breaks

The Green Belt extends the model to include three predictors:

  • X₁: Daily trade volume (continuous)
  • X₂: Number of new counterparties active that day (continuous)
  • X₃: Number of system change deployments in the past 48 hours (continuous, often 0 or 1)

Results:

Variable             Coefficient   Std Error   t-statistic   p-value
Intercept            8.7           3.2         2.72          0.009
Trade volume         0.019         0.003       6.33          <0.001
New counterparties   3.8           1.1         3.45          0.001
System deployments   45.3          13.7        3.31          0.002

Model statistic      Value
R²                   0.82
Adjusted R²          0.81
F-statistic          84.6
p-value (model)      <0.001

Interpretation:

  • Trade volume: Each additional 1,000 trades is associated with 19 more breaks (p < 0.001). Still the primary driver.
  • New counterparties: Each new counterparty active on a given day adds approximately 3.8 breaks (p = 0.001). New counterparties bring unfamiliar SSIs, untested connectivity, and higher exception rates.
  • System deployments: Each system change deployment in the past 48 hours is associated with 45.3 additional breaks (p = 0.002). This is a high-impact finding — system changes are a major but intermittent source of operational disruption.
  • Adjusted R² = 0.81: The three-variable model explains 81% of the variation in daily break volume, up from approximately 72% with trade volume alone. The additional predictors provide meaningful explanatory power.
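A fit of this kind can be sketched with NumPy's least-squares solver. The data below are synthetic, generated from the example's coefficients (8.7, 0.019, 3.8) plus noise purely to show that ordinary least squares recovers them; this is not the module's 60-day dataset, and only two of the three predictors are included for brevity:

```python
import numpy as np

# Synthetic data generated from the example's coefficients (illustrative only)
rng = np.random.default_rng(42)
n = 200
trade_volume = rng.uniform(8000, 22000, n)
new_cptys = rng.integers(0, 6, n).astype(float)
noise = rng.normal(0, 10, n)
breaks = 8.7 + 0.019 * trade_volume + 3.8 * new_cptys + noise

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), trade_volume, new_cptys])
beta, *_ = np.linalg.lstsq(X, breaks, rcond=None)
print(np.round(beta, 3))  # approximately [8.7, 0.019, 3.8]
```

In practice a package such as statsmodels would also report the standard errors, t-statistics, and p-values shown in the table above.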

Adjusted R²

Adjusted R² is the correct metric for multiple regression. Unlike R², which always increases when you add variables (even random noise), Adjusted R² penalizes for additional variables and only increases if the new variable genuinely improves the model. If Adjusted R² decreases when you add a variable, that variable is not contributing useful explanatory power and should be removed.

Multicollinearity

When two or more independent variables are highly correlated with each other, the model suffers from multicollinearity. Symptoms include:

  • Coefficients change dramatically when variables are added or removed
  • Variables that should be significant have high p-values
  • Coefficient signs are opposite to what you expect

In banking, multicollinearity is common. For example, trade volume and settlement value are highly correlated — including both in a model creates instability. Check the Variance Inflation Factor (VIF) for each variable:

  • VIF < 5 — Acceptable
  • VIF 5-10 — Concerning, investigate
  • VIF > 10 — Severe multicollinearity, remove one of the correlated variables
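The VIF calculation itself is just a series of auxiliary regressions: regress each predictor on the others and convert the resulting R² into VIF = 1 / (1 - R²). A NumPy sketch (statsmodels provides variance_inflation_factor for real work):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factors for the columns of predictor matrix X
    (no intercept column): VIF_j = 1 / (1 - R²_j), where R²_j comes from
    regressing column j on the remaining columns plus an intercept."""
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        centered = target - target.mean()
        r2 = 1 - (resid @ resid) / (centered @ centered)
        factors.append(1.0 / (1.0 - r2))
    return factors

# Two near-duplicate predictors (e.g. trade volume and settlement value)
rng = np.random.default_rng(0)
volume = rng.normal(15000, 4000, 300)
value = volume * 1.02 + rng.normal(0, 100, 300)  # almost a copy of volume
high_vifs = vif(np.column_stack([volume, value]))  # both far above 10
```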

Using Regression for Prediction and Decision-Making

Regression models in banking operations serve two primary purposes:

1. Root Cause Validation

The regression results above provide statistical evidence that trade volume, new counterparty activity, and system changes are all significant drivers of reconciliation breaks. This validates (or challenges) the hypotheses generated during the Analyze phase.

The system change finding is particularly actionable. Each deployment is associated with 45 additional breaks. If the bank deploys system changes twice per week on average, that is 90 extra breaks per week — approximately 4,680 per year. If each break costs an average of $15 to investigate and resolve, system change-related breaks cost roughly $70,200 per year. This quantification helps build the business case for better change management, pre-deployment testing, or deployment timing optimization.

2. Capacity Planning and Prediction

The model can predict tomorrow's break volume based on known or expected inputs:

Scenario: Next Friday, the bank expects 18,000 trades, 5 new counterparties will go live, and a system deployment is planned.

Predicted breaks = 8.7 + (0.019 × 18,000) + (3.8 × 5) + (45.3 × 1) = 8.7 + 342 + 19 + 45.3 = 415 breaks

Compared to a normal day with 12,000 trades, no new counterparties, and no system changes:

Predicted breaks = 8.7 + (0.019 × 12,000) + (3.8 × 0) + (45.3 × 0) = 8.7 + 228 = 236.7 breaks

The difference (415 vs. 237) tells the operations team they need approximately 75% more break resolution capacity that day. They can plan accordingly by rescheduling non-urgent work, arranging overtime, or requesting the system deployment be moved to a lower-volume day.
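The two scenario calculations above amount to plugging inputs into the fitted equation, which is easy to wrap in a small helper (the function name is mine; the coefficients are the ones from the model in this section):

```python
# Coefficients from the three-predictor model in this section
B0, B_VOL, B_CPTY, B_DEPLOY = 8.7, 0.019, 3.8, 45.3

def predicted_breaks(trade_volume, new_counterparties, deployments):
    """Point prediction of daily reconciliation breaks from the fitted model."""
    return B0 + B_VOL * trade_volume + B_CPTY * new_counterparties + B_DEPLOY * deployments

busy = predicted_breaks(18000, 5, 1)    # the heavy Friday scenario
normal = predicted_breaks(12000, 0, 0)  # a typical day
print(round(busy, 1), round(normal, 1), round(busy / normal - 1, 2))
# 415.0 236.7 0.75
```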

Prediction Intervals

Point predictions are useful, but they should include uncertainty estimates. A prediction interval provides a range within which the actual value is expected to fall. For example:

Predicted breaks = 415, 95% prediction interval: [340, 490]

This tells the operations manager: "Expect about 415 breaks, but plan for anywhere between 340 and 490." The width of the prediction interval reflects the model's uncertainty and should be factored into staffing decisions.
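A simplified interval of this kind can be approximated from the model's residuals. The sketch below uses a normal approximation (point estimate ± z times the residual standard deviation), which ignores parameter-estimation uncertainty; exact prediction-interval formulas, or statsmodels' get_prediction, account for that as well:

```python
import math

def approx_prediction_interval(point_prediction, residuals, z=1.96):
    """Rough 95% prediction interval: point ± z * s, where s is the residual
    standard deviation (n - 2 degrees of freedom for simple regression).
    A normal approximation that understates the width slightly, since it
    ignores uncertainty in the estimated coefficients."""
    n = len(residuals)
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return point_prediction - z * s, point_prediction + z * s

# 60 days of illustrative residuals around a 415-break point prediction
lo, hi = approx_prediction_interval(415, [10, -10] * 30)
```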

Limitations and Cautions

  • Extrapolation: Do not use the model to predict outside the range of observed data. If trade volumes in your data ranged from 8,000 to 22,000, do not use the model to predict breaks for a 50,000-trade day — the relationship may not be linear at that scale.
  • Causation: Regression identifies associations, not causes. The system deployment coefficient does not prove that system changes cause breaks — there may be confounding factors. However, combined with process knowledge and temporal analysis, regression provides strong evidence for investigation.
  • Stationarity: The model assumes relationships are stable over time. If the bank implements a new reconciliation system, the relationship between trade volume and breaks may change. Periodically revalidate your model.
  • Outlier influence: A single unusual day (e.g., a market crash that generated 5x normal volume) can distort the entire model. Investigate outliers, understand their cause, and consider whether they should be included or modeled separately.

Key Takeaways

  • Correlation measures the strength and direction of linear relationships; regression models the relationship mathematically
  • Correlation does not prove causation — always consider confounding variables, especially time trends and volume effects
  • R² tells you what proportion of variation is explained; Adjusted R² is the correct measure for multiple regression
  • Residual analysis validates model assumptions: check for patterns, heteroscedasticity, and non-linearity
  • Multiple regression allows you to quantify the individual contribution of each factor while controlling for others
  • In banking, regression is valuable for both root cause validation and operational capacity planning
  • Always report prediction intervals alongside point predictions to communicate uncertainty
  • Beware of multicollinearity when using correlated predictors in the same model

In the next module, we will explore Design of Experiments (DOE) — a structured approach to testing multiple factors simultaneously when you need to go beyond observational analysis and actively experiment with process changes.

Module Quiz

5 questions — Pass mark: 60%

Q1. What is the fundamental difference between correlation and causation?

Q2. A Green Belt finds that R² = 0.72 in a regression model predicting reconciliation break volume from trade volume. What does this mean?

Q3. When examining residual plots from a regression model, what pattern would indicate a problem?

Q4. In a multiple regression model, adding a new independent variable always increases R². Why can this be misleading?

Q5. A regression model predicts daily reconciliation breaks using trade volume, new counterparty count, and system change deployments. The coefficient for system change deployments is 45.3 with a p-value of 0.002. What does this mean?