Understanding Relationships Between Variables
In the previous module, you learned to test whether differences between groups are statistically significant. Regression and correlation analysis take a different approach — they explore the relationships between continuous variables. Instead of asking "is group A different from group B?", you ask "as variable X changes, how does variable Y change?"
In banking operations, this capability is powerful:
- Does trade volume predict the number of reconciliation breaks?
- Is there a relationship between analyst experience (months on the job) and case processing time?
- Can you predict next week's exception volume based on expected market activity?
- How much does each additional system change deployment contribute to operational errors?
Understanding these relationships helps Green Belts identify root causes (what drives defects), predict future performance (capacity planning), and prioritize improvements (focus on the factors with the largest impact).
Correlation: Measuring Linear Relationships
Correlation measures the strength and direction of the linear relationship between two continuous variables. The Pearson correlation coefficient (r) ranges from -1 to +1:
- r = +1 — Perfect positive linear relationship (as X increases, Y increases proportionally)
- r = 0 — No linear relationship
- r = -1 — Perfect negative linear relationship (as X increases, Y decreases proportionally)
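The coefficient can be computed directly from its definition. A minimal sketch in Python (standard library only), using small hypothetical volume-versus-breaks numbers:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Numerator: sum of cross-products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square roots of the sums of squared deviations
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical data: daily trade volume (thousands) vs. reconciliation breaks
volume = [8, 10, 12, 15, 18, 20]
breaks = [190, 240, 280, 350, 420, 470]
r = pearson_r(volume, breaks)  # close to +1: very strong positive relationship
```

Because the toy data rise almost perfectly in step, r lands just under +1; real operational data would show more scatter and a lower r.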
Interpreting Correlation Strength
| \|r\| | Interpretation |
|---|---|
| 0.00 - 0.19 | Very weak or no correlation |
| 0.20 - 0.39 | Weak correlation |
| 0.40 - 0.59 | Moderate correlation |
| 0.60 - 0.79 | Strong correlation |
| 0.80 - 1.00 | Very strong correlation |
Scatter Plots: Visualizing Relationships
Always create a scatter plot before calculating correlation. The scatter plot reveals:
- Direction: Positive slope (upward) or negative slope (downward)
- Strength: How tightly the points cluster around a line
- Linearity: Whether the relationship is linear or curved
- Outliers: Individual points that deviate significantly from the pattern
In banking, you might plot:
- Trade volume (X) vs. settlement failures (Y) — expecting a positive correlation
- Analyst tenure in months (X) vs. error rate (Y) — expecting a negative correlation
- Number of system changes in a week (X) vs. exception count (Y) — expecting a positive correlation
Correlation vs. Causation
This distinction is critical and frequently misunderstood. Correlation does not prove causation. Two variables may be correlated because:
- X causes Y — Higher trade volume causes more settlement failures (plausible)
- Y causes X — More settlement failures cause higher trade volume (unlikely in this case)
- A third variable (Z) causes both — Market volatility drives both higher trade volumes and more settlement failures. The relationship between volume and failures may partially (or entirely) disappear when you control for volatility.
- Coincidence — Two variables may be correlated by chance, especially with small samples or when fishing through many variable pairs
In banking operations, be especially cautious about:
- Time-based confounding: Many banking metrics trend upward over time (volume, complexity, regulatory requirements). Two metrics may be correlated simply because both are increasing over time, not because one drives the other.
- Volume effects: Almost every defect metric correlates positively with volume. More trades mean more breaks; more payments mean more exceptions. This does not mean volume "causes" defects — it simply means there are more opportunities for defects. Normalize by volume (defect rate per 1,000 transactions) before drawing conclusions.
- Seasonality: End-of-month, end-of-quarter, and year-end patterns create correlations that may be misleading if not recognized.
Simple Linear Regression
While correlation tells you that a relationship exists, regression models the relationship mathematically, allowing you to:
- Quantify how much Y changes for each unit change in X
- Predict Y values for given X values
- Assess how well X explains the variation in Y
The Regression Equation
Simple linear regression fits the equation:
Y = β₀ + β₁X + ε
Where:
- Y = dependent variable (what you are trying to predict or explain)
- X = independent variable (the predictor)
- β₀ = intercept (the value of Y when X = 0)
- β₁ = slope (the change in Y for each one-unit increase in X)
- ε = error term (the variation in Y that X does not explain)
Banking Example: Trade Volume and Reconciliation Breaks
A Green Belt in the securities operations team wants to understand the relationship between daily trade volume and the number of reconciliation breaks. They collect 60 days of data.
After fitting the regression model:
Breaks = 12.4 + 0.023 × Trade Volume
Interpreting the coefficients:
- Intercept (12.4): Even with zero trades, the model predicts 12.4 breaks. This represents a baseline level of breaks from non-trade sources (corporate actions, standing instruction changes, system issues). Whether this intercept is meaningful depends on the context — if trade volumes never approach zero, the intercept is an extrapolation.
- Slope (0.023): Each additional trade is associated with an estimated 0.023 more breaks; equivalently, for every additional 1,000 trades, the model predicts approximately 23 additional breaks. This is the marginal break rate.
Example prediction: On a day when the bank processes 15,000 trades:
Predicted breaks = 12.4 + 0.023 × 15,000 = 12.4 + 345 = 357.4 breaks
This prediction allows the operations team to staff accordingly. If tomorrow's expected trade volume is 20,000 (perhaps due to a major index rebalancing), the model predicts:
Predicted breaks = 12.4 + 0.023 × 20,000 = 12.4 + 460 = 472.4 breaks
The team can proactively allocate additional analysts for break resolution.
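The fit and the predictions above can be reproduced with the closed-form least-squares estimates. A sketch on synthetic, noise-free data generated from the fitted line (real data would of course carry scatter, so the recovered coefficients would only approximate these values):

```python
def fit_simple_ols(x, y):
    """Closed-form ordinary least squares: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic data lying exactly on Breaks = 12.4 + 0.023 * Volume
volumes = [8_000, 12_000, 15_000, 18_000, 20_000]
breaks = [12.4 + 0.023 * v for v in volumes]

b0, b1 = fit_simple_ols(volumes, breaks)

def predict(volume):
    return b0 + b1 * volume

predict(15_000)  # ~357.4 breaks
predict(20_000)  # ~472.4 breaks
```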
Interpreting R-Squared (R²)
R² (coefficient of determination) tells you what proportion of the variation in Y is explained by the regression model.
R² = 1 - (Sum of Squared Residuals / Total Sum of Squares)
- R² = 0.85 means 85% of the variation in break volume is explained by trade volume. The remaining 15% is due to other factors not in the model.
- R² = 0.35 means trade volume explains only 35% of the variation in breaks. Other factors (counterparty mix, system stability, day of week) likely play significant roles.
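The formula translates directly to code. A toy sketch with made-up actual and predicted values:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSR/SST."""
    mean_y = sum(actual) / len(actual)
    ssr = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))  # residual sum of squares
    sst = sum((y - mean_y) ** 2 for y in actual)                      # total sum of squares
    return 1 - ssr / sst

# Toy numbers: predictions track the actuals closely, so R² is near 1
actual = [1.0, 2.0, 3.0]
predicted = [1.1, 2.0, 2.9]
r2 = r_squared(actual, predicted)  # 0.99
```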
What Makes a "Good" R²?
There is no universal threshold. Context matters:
- In highly controlled processes (automated payments matching), R² > 0.80 may be achievable
- In complex processes influenced by many factors (KYC case cycle time), R² of 0.40-0.60 from a single predictor may be excellent — it means you have found a meaningful driver
- An R² of 0.95+ is rare and should be verified — it may indicate overfitting or a definitional relationship (e.g., total breaks = equity breaks + FI breaks + derivatives breaks is a perfect relationship, not a predictive model)
Regression Model Evaluation
Statistical Significance of the Model
R² alone does not tell you whether the relationship is statistically significant. Check:
- F-test for overall model significance (p-value for the entire regression)
- t-test for each coefficient (is β₁ significantly different from zero?)
- Confidence interval for the slope (does the interval include zero? If so, the relationship may not be significant)
Residual Analysis: Checking Model Validity
The residuals (ε = actual Y - predicted Y) should be examined to verify that the regression model is appropriate. Plot residuals against predicted values and look for:
Patterns That Indicate Problems
Random scatter (good): Residuals are randomly distributed around zero with constant spread. This confirms the model assumptions are met.
Funnel shape (heteroscedasticity): Residuals spread wider as predicted values increase. This means the model is less accurate for larger predictions. Common in banking when low-volume days have tight break counts but high-volume days have highly variable break counts. Solution: Consider log-transforming the dependent variable or using weighted regression.
Curved pattern (non-linearity): Residuals show a systematic curve, suggesting the true relationship is not linear. Solution: Consider adding a quadratic term (X²) or transforming variables.
Clusters: Distinct groups of residuals suggest a missing categorical variable. For example, if breaks cluster differently on system-change days vs. non-system-change days, adding a system-change indicator variable would improve the model.
Time patterns: If residuals show a trend or cyclical pattern over time, there may be autocorrelation — observations are not independent. This is common in daily banking data and may require time-series techniques.
Additional Diagnostic Checks
- Normality of residuals: Create a histogram or normal probability plot of residuals. They should be approximately normally distributed for inference (confidence intervals, p-values) to be valid.
- Influential observations: Check for individual data points that disproportionately influence the regression line. Use Cook's distance or leverage values. In banking, an influential observation might be a day with an unusual event (system outage, massive market correction) that should be investigated and potentially excluded or modeled separately.
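One algebraic property is worth knowing when you compute residuals yourself: with an intercept in the model, OLS residuals always sum to (numerically) zero, so a markedly non-zero mean residual signals a calculation error rather than a model problem. A sketch on small made-up data:

```python
def fit_simple_ols(x, y):
    """Closed-form ordinary least squares: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

# Small made-up dataset with some scatter
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
b0, b1 = fit_simple_ols(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
# Plot residuals against fitted values to check the patterns described above;
# their sum should be ~0 by construction when an intercept is included.
```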
Multiple Regression: Adding More Predictors
Simple regression uses one predictor. Multiple regression uses two or more predictors, which is almost always more realistic in banking operations where outcomes are influenced by many factors simultaneously.
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ... + ε
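In matrix form, estimating the coefficients reduces to a least-squares solve. A NumPy sketch on synthetic, noise-free data built from hypothetical coefficients (chosen to match the example below), so the solver recovers them exactly:

```python
import numpy as np

# Hypothetical true coefficients: intercept, trade volume, new cptys, deployments
beta_true = np.array([8.7, 0.019, 3.8, 45.3])

# Synthetic design: (volume, new counterparties, deployments) for six days
X = np.array([
    [8_000, 0, 0],
    [12_000, 2, 0],
    [15_000, 5, 1],
    [20_000, 1, 1],
    [10_000, 3, 0],
    [18_000, 0, 1],
], dtype=float)
X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
y = X1 @ beta_true                          # noise-free response

# Least-squares estimate of the coefficient vector
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
```

With noise-free data the estimates match the true coefficients to floating-point precision; with real data they would only approximate them, which is why the standard errors and p-values in the table below matter.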
Banking Example: Predicting Daily Reconciliation Breaks
The Green Belt extends the model to include three predictors:
- X₁: Daily trade volume (continuous)
- X₂: Number of new counterparties active that day (continuous)
- X₃: Number of system change deployments in the past 48 hours (a count, often 0 or 1)
Results:
| Variable | Coefficient | Std Error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 8.7 | 3.2 | 2.72 | 0.009 |
| Trade volume | 0.019 | 0.003 | 6.33 | <0.001 |
| New counterparties | 3.8 | 1.1 | 3.45 | 0.001 |
| System deployments | 45.3 | 13.7 | 3.31 | 0.002 |

| Model Statistic | Value |
|---|---|
| R² | 0.82 |
| Adjusted R² | 0.81 |
| F-statistic | 84.6 |
| p-value (model) | <0.001 |
Interpretation:
- Trade volume: Each additional 1,000 trades is associated with 19 more breaks (p < 0.001). Still the primary driver.
- New counterparties: Each new counterparty active on a given day adds approximately 3.8 breaks (p = 0.001). New counterparties bring unfamiliar SSIs, untested connectivity, and higher exception rates.
- System deployments: Each system change deployment in the past 48 hours is associated with 45.3 additional breaks (p = 0.002). This is a high-impact finding — system changes are a major but intermittent source of operational disruption.
- Adjusted R² = 0.81: The three-variable model explains 81% of the variation in daily break volume, up from approximately 72% with trade volume alone. The additional predictors provide meaningful explanatory power.
Adjusted R²
Adjusted R² is the correct metric for multiple regression. Unlike R², which always increases when you add variables (even random noise), Adjusted R² penalizes for additional variables and only increases if the new variable genuinely improves the model. If Adjusted R² decreases when you add a variable, that variable is not contributing useful explanatory power and should be removed.
Multicollinearity
When two or more independent variables are highly correlated with each other, the model suffers from multicollinearity. Symptoms include:
- Coefficients change dramatically when variables are added or removed
- Variables that should be significant have high p-values
- Coefficient signs are opposite to what you expect
In banking, multicollinearity is common. For example, trade volume and settlement value are highly correlated — including both in a model creates instability. Check the Variance Inflation Factor (VIF) for each variable:
- VIF < 5 — Acceptable
- VIF 5-10 — Concerning, investigate
- VIF > 10 — Severe multicollinearity, remove one of the correlated variables
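The VIF for a predictor is 1 / (1 - R²) from regressing that predictor on all the other predictors. A NumPy sketch using two deliberately near-collinear made-up predictors:

```python
import numpy as np

def vif(X, j):
    """Variance Inflation Factor of column j of predictor matrix X."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # intercept + other predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

# x2 is almost exactly 2 * x1 -> severe multicollinearity
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.1, 3.9, 6.1, 7.9, 10.1, 11.9])
X = np.column_stack([x1, x2])
vif(X, 0)  # far above the VIF > 10 danger threshold
```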
Using Regression for Prediction and Decision-Making
Regression models in banking operations serve two primary purposes:
1. Root Cause Validation
The regression results above provide statistical evidence that trade volume, new counterparty activity, and system changes are all significant drivers of reconciliation breaks. This validates (or challenges) the hypotheses generated during the Analyze phase.
The system change finding is particularly actionable. Each deployment is associated with 45 additional breaks. If the bank deploys system changes twice per week on average, that is 90 extra breaks per week — approximately 4,680 per year. If each break costs an average of $15 to investigate and resolve, system change-related breaks cost roughly $70,200 per year. This quantification helps build the business case for better change management, pre-deployment testing, or deployment timing optimization.
2. Capacity Planning and Prediction
The model can predict tomorrow's break volume based on known or expected inputs:
Scenario: Next Friday, the bank expects 18,000 trades, 5 new counterparties will go live, and a system deployment is planned.
Predicted breaks = 8.7 + (0.019 × 18,000) + (3.8 × 5) + (45.3 × 1) = 8.7 + 342 + 19 + 45.3 = 415 breaks
Compared to a normal day with 12,000 trades, no new counterparties, and no system changes:
Predicted breaks = 8.7 + (0.019 × 12,000) + (3.8 × 0) + (45.3 × 0) = 8.7 + 228 = 236.7 breaks
The difference (415 vs. 237) tells the operations team they need approximately 75% more break resolution capacity that day. They can plan accordingly by rescheduling non-urgent work, arranging overtime, or requesting the system deployment be moved to a lower-volume day.
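The two staffing scenarios are just the fitted equation evaluated at different inputs; a quick sketch using the example coefficients:

```python
def predicted_breaks(volume, new_cptys, deployments):
    """Point prediction from the example multiple-regression coefficients."""
    return 8.7 + 0.019 * volume + 3.8 * new_cptys + 45.3 * deployments

busy = predicted_breaks(18_000, 5, 1)   # high-volume Friday with a deployment
quiet = predicted_breaks(12_000, 0, 0)  # typical day
uplift = busy / quiet - 1               # ~0.75, i.e. ~75% more capacity needed
```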
Prediction Intervals
Point predictions are useful, but they should include uncertainty estimates. A prediction interval provides a range within which the actual value is expected to fall. For example:
Predicted breaks = 415, 95% prediction interval: [340, 490]
This tells the operations manager: "Expect about 415 breaks, but plan for anywhere between 340 and 490." The width of the prediction interval reflects the model's uncertainty and should be factored into staffing decisions.
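For simple regression, the standard prediction-interval formula is easy to implement. A sketch in pure standard-library Python, with the t critical value hardcoded as an assumption (roughly 2.00 for ~60 observations at 95%; a statistics package would compute the exact quantile):

```python
import math

def prediction_interval(x, y, x0, t_crit=2.00):
    """(lower, point, upper) 95% prediction interval at x0 under simple OLS.

    t_crit is an assumed two-sided t critical value for n-2 df; look up
    the exact quantile for your sample size in practice.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))  # residual std error
    # New-observation interval widens with distance from the mean of x
    margin = t_crit * s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    point = intercept + slope * x0
    return point - margin, point, point + margin

# Made-up data with scatter; interval is symmetric around the point prediction
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
lo, point, hi = prediction_interval(x, y, 3.5)
```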
Limitations and Cautions
- Extrapolation: Do not use the model to predict outside the range of observed data. If trade volumes in your data ranged from 8,000 to 22,000, do not use the model to predict breaks for a 50,000-trade day — the relationship may not be linear at that scale.
- Causation: Regression identifies associations, not causes. The system deployment coefficient does not prove that system changes cause breaks — there may be confounding factors. However, combined with process knowledge and temporal analysis, regression provides strong evidence for investigation.
- Stationarity: The model assumes relationships are stable over time. If the bank implements a new reconciliation system, the relationship between trade volume and breaks may change. Periodically revalidate your model.
- Outlier influence: A single unusual day (e.g., a market crash that generated 5x normal volume) can distort the entire model. Investigate outliers, understand their cause, and consider whether they should be included or modeled separately.
Key Takeaways
- Correlation measures the strength and direction of linear relationships; regression models the relationship mathematically
- Correlation does not prove causation — always consider confounding variables, especially time trends and volume effects
- R² tells you what proportion of variation is explained; Adjusted R² is the correct measure for multiple regression
- Residual analysis validates model assumptions: check for patterns, heteroscedasticity, and non-linearity
- Multiple regression allows you to quantify the individual contribution of each factor while controlling for others
- In banking, regression is valuable for both root cause validation and operational capacity planning
- Always report prediction intervals alongside point predictions to communicate uncertainty
- Beware of multicollinearity when using correlated predictors in the same model
In the next module, we will explore Design of Experiments (DOE) — a structured approach to testing multiple factors simultaneously when you need to go beyond observational analysis and actively experiment with process changes.