Module 5

Regression & Correlation Analysis

Explore relationships between variables using correlation, simple and multiple regression, and predictive modeling for banking operations.


Understanding Relationships Between Variables

In the previous module, you learned to test whether differences between groups are statistically significant. Regression and correlation analysis take a different approach — they explore the relationships between continuous variables. Instead of asking "is group A different from group B?", you ask "as variable X changes, how does variable Y change?"

In banking operations, this capability is powerful:

  • Does trade volume predict the number of reconciliation breaks?
  • Is there a relationship between analyst experience (months on the job) and case processing time?
  • Can you predict next week's exception volume based on expected market activity?
  • How much does each additional system change deployment contribute to operational errors?

Understanding these relationships helps Green Belts identify root causes (what drives defects), predict future performance (capacity planning), and prioritize improvements (focus on the factors with the largest impact).

Correlation: Measuring Linear Relationships

Correlation measures the strength and direction of the linear relationship between two continuous variables. The Pearson correlation coefficient (r) ranges from -1 to +1:

  • r = +1 — Perfect positive linear relationship (as X increases, Y increases proportionally)
  • r = 0 — No linear relationship
  • r = -1 — Perfect negative linear relationship (as X increases, Y decreases proportionally)
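The Pearson coefficient can be computed directly from its definition (covariance divided by the product of the standard deviations). A minimal sketch in plain Python; in practice you would use a statistics package such as scipy.stats.pearsonr or numpy.corrcoef:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# A perfectly proportional pair of series gives r = +1
print(round(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]), 4))  # 1.0
```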

Interpreting Correlation Strength

  • |r| 0.00 - 0.19 — Very weak or no correlation
  • |r| 0.20 - 0.39 — Weak correlation
  • |r| 0.40 - 0.59 — Moderate correlation
  • |r| 0.60 - 0.79 — Strong correlation
  • |r| 0.80 - 1.00 — Very strong correlation
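The bands above can be expressed as a small helper function. The name and return labels are illustrative choices of my own; only the cut-off values come from the table:

```python
def correlation_strength(r):
    """Map |r| to the descriptive bands in the table above (illustrative)."""
    magnitude = abs(r)
    if magnitude < 0.20:
        return "very weak or none"
    elif magnitude < 0.40:
        return "weak"
    elif magnitude < 0.60:
        return "moderate"
    elif magnitude < 0.80:
        return "strong"
    return "very strong"

print(correlation_strength(-0.65))  # strong
```

Note that the bands apply to the magnitude of r, so a correlation of -0.65 is just as strong as +0.65, only in the opposite direction.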

Scatter Plots: Visualizing Relationships

Always create a scatter plot before calculating correlation. The scatter plot reveals:

  • Direction: Positive slope (upward) or negative slope (downward)
  • Strength: How tightly the points cluster around a line
  • Linearity: Whether the relationship is linear or curved
  • Outliers: Individual points that deviate significantly from the pattern

In banking, you might plot:

  • Trade volume (X) vs. settlement failures (Y) — expecting a positive correlation
  • Analyst tenure in months (X) vs. error rate (Y) — expecting a negative correlation
  • Number of system changes in a week (X) vs. exception count (Y) — expecting a positive correlation

Correlation vs. Causation

This distinction is critical and frequently misunderstood. Correlation does not prove causation. Two variables may be correlated because:

  1. X causes Y — Higher trade volume causes more settlement failures (plausible)
  2. Y causes X — More settlement failures cause higher trade volume (unlikely in this case)
  3. A third variable (Z) causes both — Market volatility drives both higher trade volumes and more settlement failures. The relationship between volume and failures may partially (or entirely) disappear when you control for volatility.
  4. Coincidence — Two variables may be correlated by chance, especially with small samples or when fishing through many variable pairs

In banking operations, be especially cautious about:

  • Time-based confounding: Many banking metrics trend upward over time (volume, complexity, regulatory requirements). Two metrics may be correlated simply because both are increasing over time, not because one drives the other.
  • Volume effects: Almost every defect metric correlates positively with volume. More trades mean more breaks; more payments mean more exceptions. This does not mean volume "causes" defects — it simply means there are more opportunities for defects. Normalize by volume (defect rate per 1,000 transactions) before drawing conclusions.
  • Seasonality: End-of-month, end-of-quarter, and year-end patterns create correlations that may be misleading if not recognized.

Simple Linear Regression

While correlation tells you that a relationship exists, regression models the relationship mathematically, allowing you to:

  • Quantify how much Y changes for each unit change in X
  • Predict Y values for given X values
  • Assess how well X explains the variation in Y

The Regression Equation

Simple linear regression fits the equation:

Y = β₀ + β₁X + ε

Where:

  • Y = dependent variable (what you are trying to predict or explain)
  • X = independent variable (the predictor)
  • β₀ = intercept (the value of Y when X = 0)
  • β₁ = slope (the change in Y for each one-unit increase in X)
  • ε = error term (the variation in Y that X does not explain)
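The least-squares estimates of β₀ and β₁ have closed-form solutions, sketched below in plain Python. In practice you would fit the model in Minitab, statsmodels, or scikit-learn:

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Noise-free data built from y = 12.4 + 0.023x recovers those coefficients
volumes = [8000, 12000, 15000, 20000, 22000]
breaks = [12.4 + 0.023 * v for v in volumes]
b0, b1 = fit_simple_ols(volumes, breaks)
print(round(b0, 1), round(b1, 3))  # 12.4 0.023
```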

Banking Example: Trade Volume and Reconciliation Breaks

A Green Belt in the securities operations team wants to understand the relationship between daily trade volume and the number of reconciliation breaks. They collect 60 days of data.

After fitting the regression model:

Breaks = 12.4 + 0.023 × Trade Volume

Interpreting the coefficients:

  • Intercept (12.4): Even with zero trades, the model predicts 12.4 breaks. This represents a baseline level of breaks from non-trade sources (corporate actions, standing instruction changes, system issues). Whether this intercept is meaningful depends on the context — if trade volumes never approach zero, the intercept is an extrapolation.
  • Slope (0.023): For every additional 1,000 trades, the model predicts approximately 23 additional breaks. This is the marginal break rate.

Example prediction: On a day when the bank processes 15,000 trades:

Predicted breaks = 12.4 + 0.023 × 15,000 = 12.4 + 345 = 357.4 breaks

This prediction allows the operations team to staff accordingly. If tomorrow's expected trade volume is 20,000 (perhaps due to a major index rebalancing), the model predicts:

Predicted breaks = 12.4 + 0.023 × 20,000 = 12.4 + 460 = 472.4 breaks

The team can proactively allocate additional analysts for break resolution.

Interpreting R-Squared (R²)

R² (coefficient of determination) tells you what proportion of the variation in Y is explained by the regression model.

R² = 1 - (Sum of Squared Residuals / Total Sum of Squares)

  • R² = 0.85 means 85% of the variation in break volume is explained by trade volume. The remaining 15% is due to other factors not in the model.
  • R² = 0.35 means trade volume explains only 35% of the variation in breaks. Other factors (counterparty mix, system stability, day of week) likely play significant roles.
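The formula above translates directly into code. A minimal sketch:

```python
def r_squared(actual, predicted):
    """R² = 1 - (sum of squared residuals / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ssr = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    sst = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ssr / sst

# Perfect predictions give R² = 1; always predicting the mean gives R² = 0
print(r_squared([10, 20, 30], [10, 20, 30]))  # 1.0
print(r_squared([10, 20, 30], [20, 20, 20]))  # 0.0
```

A model that predicts worse than the mean can even produce a negative R², which is a clear sign something is wrong.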

What Makes a "Good" R²?

There is no universal threshold. Context matters:

  • In highly controlled processes (automated payments matching), R² > 0.80 may be achievable
  • In complex processes influenced by many factors (KYC case cycle time), R² of 0.40-0.60 from a single predictor may be excellent — it means you have found a meaningful driver
  • An R² of 0.95+ is rare and should be verified — it may indicate overfitting or a definitional relationship (e.g., total breaks = equity breaks + FI breaks + derivatives breaks is a perfect relationship, not a predictive model)

Regression Model Evaluation

  • R² (simple model): 0.85 — explains 85% of the variation in breaks
  • Adjusted R²: 0.82 — accounts for the number of predictors
  • p-value: <0.001 — the model is statistically significant
  • Key predictor: trade volume (coefficient: 0.023 breaks per trade)

Statistical Significance of the Model

R² alone does not tell you whether the relationship is statistically significant. Check:

  • F-test for overall model significance (p-value for the entire regression)
  • t-test for each coefficient (is β₁ significantly different from zero?)
  • Confidence interval for the slope (does the interval include zero? If so, the relationship may not be significant)

Residual Analysis: Checking Model Validity

The residuals (ε = actual Y - predicted Y) should be examined to verify that the regression model is appropriate. Plot residuals against predicted values and look for:

Patterns That Indicate Problems

Random scatter (good): Residuals are randomly distributed around zero with constant spread. This confirms the model assumptions are met.

Funnel shape (heteroscedasticity): Residuals spread wider as predicted values increase. This means the model is less accurate for larger predictions. Common in banking when low-volume days have tight break counts but high-volume days have highly variable break counts. Solution: Consider log-transforming the dependent variable or using weighted regression.

Curved pattern (non-linearity): Residuals show a systematic curve, suggesting the true relationship is not linear. Solution: Consider adding a quadratic term (X²) or transforming variables.

Clusters: Distinct groups of residuals suggest a missing categorical variable. For example, if breaks cluster differently on system-change days vs. non-system-change days, adding a system-change indicator variable would improve the model.

Time patterns: If residuals show a trend or cyclical pattern over time, there may be autocorrelation — observations are not independent. This is common in daily banking data and may require time-series techniques.

Additional Diagnostic Checks

  • Normality of residuals: Create a histogram or normal probability plot of residuals. They should be approximately normally distributed for inference (confidence intervals, p-values) to be valid.
  • Influential observations: Check for individual data points that disproportionately influence the regression line. Use Cook's distance or leverage values. In banking, an influential observation might be a day with an unusual event (system outage, massive market correction) that should be investigated and potentially excluded or modeled separately.
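One of these checks, the funnel shape, can be roughed out numerically by comparing residual spread in the upper and lower halves of the predicted values. This is a crude stand-in of my own devising; a residual plot, or a formal test such as Breusch-Pagan, remains the proper diagnostic:

```python
def funnel_ratio(predicted, residuals):
    """Crude heteroscedasticity check: average absolute residual in the
    upper half of predicted values divided by that in the lower half.
    Ratios well above 1 suggest a funnel shape (spread growing with the
    prediction); a residual plot or Breusch-Pagan test is the real check."""
    pairs = sorted(zip(predicted, residuals))
    half = len(pairs) // 2
    low = [abs(r) for _, r in pairs[:half]]
    high = [abs(r) for _, r in pairs[half:]]
    return (sum(high) / len(high)) / (sum(low) / len(low))

# Residuals that grow with the prediction produce a ratio well above 1
preds = list(range(1, 11))
resids = [0.1 * p * (-1) ** p for p in preds]
print(round(funnel_ratio(preds, resids), 2))  # 2.67
```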

Multiple Regression: Adding More Predictors

Simple regression uses one predictor. Multiple regression uses two or more predictors, which is almost always more realistic in banking operations where outcomes are influenced by many factors simultaneously.

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ... + ε

Banking Example: Predicting Daily Reconciliation Breaks

The Green Belt extends the model to include three predictors:

  • X₁: Daily trade volume (continuous)
  • X₂: Number of new counterparties active that day (continuous)
  • X₃: Number of system change deployments in the past 48 hours (continuous, often 0 or 1)

Results:

Variable             Coefficient   Std Error   t-statistic   p-value
Intercept            8.7           3.2         2.72          0.009
Trade volume         0.019         0.003       6.33          <0.001
New counterparties   3.8           1.1         3.45          0.001
System deployments   45.3          13.7        3.31          0.002

Model statistic      Value
R²                   0.82
Adjusted R²          0.81
F-statistic          84.6
p-value (model)      <0.001

Interpretation:

  • Trade volume: Each additional 1,000 trades is associated with 19 more breaks (p < 0.001). Still the primary driver.
  • New counterparties: Each new counterparty active on a given day adds approximately 3.8 breaks (p = 0.001). New counterparties bring unfamiliar SSIs, untested connectivity, and higher exception rates.
  • System deployments: Each system change deployment in the past 48 hours is associated with 45.3 additional breaks (p = 0.002). This is a high-impact finding — system changes are a major but intermittent source of operational disruption.
  • Adjusted R² = 0.81: The three-variable model explains 81% of the variation in daily break volume, up from approximately 72% with trade volume alone. The additional predictors provide meaningful explanatory power.
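A fit of this kind can be sketched with NumPy's least-squares solver. The data below are synthetic, generated from the example's coefficients (8.7, 0.019, 3.8) plus noise purely to show that ordinary least squares recovers them; this is not the module's 60-day dataset, and only two of the three predictors are included for brevity:

```python
import numpy as np

# Synthetic data generated from the example's coefficients (illustrative only)
rng = np.random.default_rng(42)
n = 200
trade_volume = rng.uniform(8000, 22000, n)
new_cptys = rng.integers(0, 6, n).astype(float)
noise = rng.normal(0, 10, n)
breaks = 8.7 + 0.019 * trade_volume + 3.8 * new_cptys + noise

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), trade_volume, new_cptys])
beta, *_ = np.linalg.lstsq(X, breaks, rcond=None)
print(np.round(beta, 3))  # approximately [8.7, 0.019, 3.8]
```

In practice a package such as statsmodels would also report the standard errors, t-statistics, and p-values shown in the table above.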

Adjusted R²

Adjusted R² is the correct metric for multiple regression. Unlike R², which always increases when you add variables (even random noise), Adjusted R² penalizes for additional variables and only increases if the new variable genuinely improves the model. If Adjusted R² decreases when you add a variable, that variable is not contributing useful explanatory power and should be removed.

Multicollinearity

When two or more independent variables are highly correlated with each other, the model suffers from multicollinearity. Symptoms include:

  • Coefficients change dramatically when variables are added or removed
  • Variables that should be significant have high p-values
  • Coefficient signs are opposite to what you expect

In banking, multicollinearity is common. For example, trade volume and settlement value are highly correlated — including both in a model creates instability. Check the Variance Inflation Factor (VIF) for each variable:

  • VIF < 5 — Acceptable
  • VIF 5-10 — Concerning, investigate
  • VIF > 10 — Severe multicollinearity, remove one of the correlated variables
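The VIF calculation itself is just a series of auxiliary regressions: regress each predictor on the others and convert the resulting R² into VIF = 1 / (1 - R²). A NumPy sketch (statsmodels provides variance_inflation_factor for real work):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factors for the columns of predictor matrix X
    (no intercept column): VIF_j = 1 / (1 - R²_j), where R²_j comes from
    regressing column j on the remaining columns plus an intercept."""
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        centered = target - target.mean()
        r2 = 1 - (resid @ resid) / (centered @ centered)
        factors.append(1.0 / (1.0 - r2))
    return factors

# Two near-duplicate predictors (e.g. trade volume and settlement value)
rng = np.random.default_rng(0)
volume = rng.normal(15000, 4000, 300)
value = volume * 1.02 + rng.normal(0, 100, 300)  # almost a copy of volume
high_vifs = vif(np.column_stack([volume, value]))  # both far above 10
```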

Using Regression for Prediction and Decision-Making

Regression models in banking operations serve two primary purposes:

1. Root Cause Validation

The regression results above provide statistical evidence that trade volume, new counterparty activity, and system changes are all significant drivers of reconciliation breaks. This validates (or challenges) the hypotheses generated during the Analyze phase.

The system change finding is particularly actionable. Each deployment is associated with 45 additional breaks. If the bank deploys system changes twice per week on average, that is 90 extra breaks per week — approximately 4,680 per year. If each break costs an average of $15 to investigate and resolve, system change-related breaks cost roughly $70,200 per year. This quantification helps build the business case for better change management, pre-deployment testing, or deployment timing optimization.

2. Capacity Planning and Prediction

The model can predict tomorrow's break volume based on known or expected inputs:

Scenario: Next Friday, the bank expects 18,000 trades, 5 new counterparties will go live, and a system deployment is planned.

Predicted breaks = 8.7 + (0.019 × 18,000) + (3.8 × 5) + (45.3 × 1) = 8.7 + 342 + 19 + 45.3 = 415 breaks

Compared to a normal day with 12,000 trades, no new counterparties, and no system changes:

Predicted breaks = 8.7 + (0.019 × 12,000) + (3.8 × 0) + (45.3 × 0) = 8.7 + 228 = 236.7 breaks

The difference (415 vs. 237) tells the operations team they need approximately 75% more break resolution capacity that day. They can plan accordingly by rescheduling non-urgent work, arranging overtime, or requesting the system deployment be moved to a lower-volume day.
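The two scenario calculations above amount to plugging inputs into the fitted equation, which is easy to wrap in a small helper (the function name is mine; the coefficients are the ones from the model in this section):

```python
# Coefficients from the three-predictor model in this section
B0, B_VOL, B_CPTY, B_DEPLOY = 8.7, 0.019, 3.8, 45.3

def predicted_breaks(trade_volume, new_counterparties, deployments):
    """Point prediction of daily reconciliation breaks from the fitted model."""
    return B0 + B_VOL * trade_volume + B_CPTY * new_counterparties + B_DEPLOY * deployments

busy = predicted_breaks(18000, 5, 1)    # the heavy Friday scenario
normal = predicted_breaks(12000, 0, 0)  # a typical day
print(round(busy, 1), round(normal, 1), round(busy / normal - 1, 2))
# 415.0 236.7 0.75
```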

Prediction Intervals

Point predictions are useful, but they should include uncertainty estimates. A prediction interval provides a range within which the actual value is expected to fall. For example:

Predicted breaks = 415, 95% prediction interval: [340, 490]

This tells the operations manager: "Expect about 415 breaks, but plan for anywhere between 340 and 490." The width of the prediction interval reflects the model's uncertainty and should be factored into staffing decisions.
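A simplified interval of this kind can be approximated from the model's residuals. The sketch below uses a normal approximation (point estimate ± z times the residual standard deviation), which ignores parameter-estimation uncertainty; exact prediction-interval formulas, or statsmodels' get_prediction, account for that as well:

```python
import math

def approx_prediction_interval(point_prediction, residuals, z=1.96):
    """Rough 95% prediction interval: point ± z * s, where s is the residual
    standard deviation (n - 2 degrees of freedom for simple regression).
    A normal approximation that understates the width slightly, since it
    ignores uncertainty in the estimated coefficients."""
    n = len(residuals)
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return point_prediction - z * s, point_prediction + z * s

# 60 days of illustrative residuals around a 415-break point prediction
lo, hi = approx_prediction_interval(415, [10, -10] * 30)
```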

Limitations and Cautions

  • Extrapolation: Do not use the model to predict outside the range of observed data. If trade volumes in your data ranged from 8,000 to 22,000, do not use the model to predict breaks for a 50,000-trade day — the relationship may not be linear at that scale.
  • Causation: Regression identifies associations, not causes. The system deployment coefficient does not prove that system changes cause breaks — there may be confounding factors. However, combined with process knowledge and temporal analysis, regression provides strong evidence for investigation.
  • Stationarity: The model assumes relationships are stable over time. If the bank implements a new reconciliation system, the relationship between trade volume and breaks may change. Periodically revalidate your model.
  • Outlier influence: A single unusual day (e.g., a market crash that generated 5x normal volume) can distort the entire model. Investigate outliers, understand their cause, and consider whether they should be included or modeled separately.

Key Takeaways

  • Correlation measures the strength and direction of linear relationships; regression models the relationship mathematically
  • Correlation does not prove causation — always consider confounding variables, especially time trends and volume effects
  • R² tells you what proportion of variation is explained; Adjusted R² is the correct measure for multiple regression
  • Residual analysis validates model assumptions: check for patterns, heteroscedasticity, and non-linearity
  • Multiple regression allows you to quantify the individual contribution of each factor while controlling for others
  • In banking, regression is valuable for both root cause validation and operational capacity planning
  • Always report prediction intervals alongside point predictions to communicate uncertainty
  • Beware of multicollinearity when using correlated predictors in the same model

In the next module, we will explore Design of Experiments (DOE) — a structured approach to testing multiple factors simultaneously when you need to go beyond observational analysis and actively experiment with process changes.

Module Quiz

5 questions — Pass mark: 60%

Q1. What is the fundamental difference between correlation and causation?

Q2. A Green Belt finds that R² = 0.72 in a regression model predicting reconciliation break volume from trade volume. What does this mean?

Q3. When examining residual plots from a regression model, what pattern would indicate a problem?

Q4. In a multiple regression model, adding a new independent variable always increases R². Why can this be misleading?

Q5. A regression model predicts daily reconciliation breaks using trade volume, new counterparty count, and system change deployments. The coefficient for system change deployments is 45.3 with a p-value of 0.002. What does this mean?