Module 4

Hypothesis Testing

Apply t-tests, chi-square, and ANOVA to validate root causes and measure improvement in banking operations with statistical rigor.

Module 4 — 90-second video overview

From Observation to Evidence

In the Analyze phase of DMAIC, you have identified potential root causes using tools like Fishbone diagrams, Pareto charts, and process analysis. But how do you know if a suspected root cause is real? How do you determine whether an observed difference is a genuine pattern or just random noise?

Hypothesis testing provides the statistical framework to answer these questions with quantified confidence. Instead of relying on experience, intuition, or the loudest voice in the room, you use data and probability to make decisions. This is the core of what makes a Green Belt analysis more rigorous than a Yellow Belt analysis.

In banking operations, hypothesis testing answers questions like:

  • Did the process change actually reduce settlement cycle times, or was the improvement just natural variation?
  • Is the error rate really different between the London and New York operations centers, or does the difference disappear when you account for volume differences?
  • Is there a relationship between analyst experience level and case processing quality?
  • After implementing a new routing algorithm, did the straight-through processing rate genuinely increase?

The Logic of Hypothesis Testing

Every hypothesis test follows the same logical structure:

Step 1: State the Hypotheses

  • Null Hypothesis (H₀): The default position — there is no difference, no effect, no relationship. The status quo holds.
  • Alternative Hypothesis (H₁ or Hₐ): What you are trying to demonstrate — there is a difference, an effect, or a relationship.

Example: A bank implemented a new exception routing process to reduce settlement cycle times.

  • H₀: The mean settlement cycle time after the change equals the mean cycle time before the change (μ_after = μ_before)
  • H₁: The mean settlement cycle time after the change is different from the mean cycle time before the change (μ_after ≠ μ_before)

Note: This is a two-tailed test because we are testing for any difference (increase or decrease). If we specifically predicted the direction ("the new process reduces cycle time"), we would use a one-tailed test: H₁: μ_after < μ_before.

Step 2: Choose the Significance Level (α)

The significance level is the probability of rejecting H₀ when it is actually true — a false positive or Type I error. By convention, α = 0.05 (5%) is standard in most business applications, including banking.

This means: if there truly is no difference, you accept a 5% chance of incorrectly concluding that there is one. For high-stakes decisions (e.g., changes to regulatory reporting processes), you might use α = 0.01 for greater stringency.

Step 3: Collect Data and Calculate the Test Statistic

Based on the type of data and the question being asked, calculate the appropriate test statistic (t-statistic, chi-square statistic, F-statistic, etc.).

Step 4: Determine the P-Value

The p-value is the probability of observing your data (or something more extreme) if the null hypothesis is true. In plain English: how likely is it that you would see this result just by chance?

  • Small p-value (e.g., p = 0.02) → Your data is unlikely under H₀ → Evidence against H₀ → Consider rejecting H₀
  • Large p-value (e.g., p = 0.45) → Your data is quite plausible under H₀ → No evidence against H₀ → Fail to reject H₀

Step 5: Make a Decision

  • If p ≤ α → Reject H₀. There is statistically significant evidence of a difference/effect.
  • If p > α → Fail to reject H₀. There is insufficient evidence to conclude a difference/effect exists.

Critical note: "Fail to reject H₀" is not the same as "H₀ is true." It means you did not find enough evidence to disprove it. The difference might exist but be too small for your sample to detect (a power issue).

Hypothesis Testing: Five-Step Process

  1. State Hypotheses: define H₀ and H₁
  2. Choose α: set the significance level (typically 0.05)
  3. Collect & Calculate: gather data and compute the test statistic
  4. Determine the P-Value: the probability of the data under H₀
  5. Make a Decision: if p ≤ α, reject H₀

Type I and Type II Errors

Understanding error types is essential for Green Belts making decisions that affect banking operations:

                  | H₀ is actually TRUE                          | H₀ is actually FALSE
Reject H₀         | Type I Error (False Positive), probability α | Correct Decision, probability 1 − β (Power)
Fail to reject H₀ | Correct Decision                             | Type II Error (False Negative), probability β

Type I Error (False Positive): You conclude there is a difference when there is not. In banking: you declare that a process change improved settlement times, implement it permanently, and then discover the "improvement" was just normal variation. The cost is wasted implementation effort and potentially disrupted processes.

Type II Error (False Negative): You conclude there is no difference when there actually is one. In banking: you test whether a new routing algorithm improves STP rates, conclude it does not, and abandon a change that would actually have saved millions. The cost is a missed improvement opportunity.

Statistical Power (1 - β) is the probability of correctly detecting a real difference. Power depends on:

  • Sample size — larger samples detect smaller differences
  • Effect size — larger differences are easier to detect
  • Significance level — more lenient α (e.g., 0.10) increases power but also increases false positive risk
  • Variability — less variation in the data makes differences easier to detect

In banking projects, aim for power of at least 0.80 (80% probability of detecting a real difference).
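The effect of sample size on power can be seen in a small Monte Carlo sketch: simulate many experiments with a known true difference and count how often a t-test detects it. All the numbers below (a true 0.5-hour cycle-time reduction, standard deviation of 1.2 hours, the baseline mean of 6.8 hours) are illustrative assumptions, not figures from this module.

```python
# Monte Carlo sketch of statistical power for a two-sample t-test.
# Assumed scenario: a true 0.5-hour cycle-time reduction, sd = 1.2 hours.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def estimate_power(true_diff, sd, n_per_group, alpha=0.05, n_sims=2000):
    """Fraction of simulated experiments in which the t-test detects the difference."""
    hits = 0
    for _ in range(n_sims):
        before = rng.normal(6.8, sd, n_per_group)            # baseline mean is arbitrary
        after = rng.normal(6.8 - true_diff, sd, n_per_group)  # true improvement applied
        _, p = stats.ttest_ind(before, after)
        hits += p <= alpha
    return hits / n_sims

print(estimate_power(0.5, 1.2, 50))    # roughly 0.5-0.6: underpowered
print(estimate_power(0.5, 1.2, 120))   # larger sample pushes power past 0.80
```

With 50 transactions per group, a real 0.5-hour improvement is missed almost half the time; roughly doubling the sample clears the 0.80 power target.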

T-Tests: Comparing Means

T-tests are the workhorse of Six Sigma hypothesis testing. They compare means to determine whether observed differences are statistically significant.

One-Sample T-Test

Question: Is the population mean different from a specified value?

Banking application: "Is our average payment processing time different from the SLA target of 4 hours?"

  • H₀: μ = 4 hours
  • H₁: μ ≠ 4 hours

You collect a random sample of 50 payments and find a sample mean of 4.3 hours with a standard deviation of 1.2 hours. The t-test calculates whether this difference (0.3 hours) is statistically significant given the sample size and variability.

Assumptions: Data is approximately normally distributed (or sample size is large enough — n > 30 — for the Central Limit Theorem to apply), observations are independent, and data is continuous.
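The SLA example above can be checked from the summary statistics alone. A minimal sketch, using the numbers in the text (n = 50, sample mean 4.3 hours, sd 1.2 hours, target 4.0 hours) and scipy's t distribution; note that with this sample size the 0.3-hour gap turns out not to be statistically significant:

```python
# One-sample t-test from summary statistics (SLA example above):
# n = 50 payments, sample mean 4.3 h, sample sd 1.2 h, target 4.0 h.
import math
from scipy import stats

n, xbar, s, mu0 = 50, 4.3, 1.2, 4.0

t_stat = (xbar - mu0) / (s / math.sqrt(n))       # ≈ 1.77
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-tailed p-value

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# p is above 0.05: with this sample size and variability, the
# 0.3-hour difference is not statistically significant.
```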

Two-Sample T-Test

Question: Are the means of two independent groups different?

Banking application: "Is the average settlement cycle time different between the period before the process change and the period after?"

  • H₀: μ_before = μ_after
  • H₁: μ_before ≠ μ_after

Example walkthrough: A bank implemented a new exception routing process for trade settlement. The Green Belt collects data:

Metric             | Before (n=120) | After (n=95)
Mean cycle time    | 6.8 hours      | 5.2 hours
Standard deviation | 2.4 hours      | 2.1 hours

The observed difference is 1.6 hours. But is this statistically significant, or could it be due to random variation?

Running a two-sample t-test (pooled variances):

  • t-statistic ≈ 5.13
  • Degrees of freedom = 213 (120 + 95 − 2)
  • p-value < 0.001

Since p < 0.05, we reject H₀. There is statistically significant evidence that settlement cycle times differ between the two periods. The 1.6-hour reduction is real, not just random fluctuation.

Assumptions: Both groups are approximately normally distributed (or large enough samples), the two groups are independent (different time periods, different transactions), and variances are reasonably similar (if not, use Welch's t-test, which does not assume equal variances).
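The walkthrough above can be reproduced without the raw data: scipy accepts summary statistics (means, standard deviations, sample sizes) directly. A minimal sketch using the figures from the table:

```python
# Two-sample t-test computed directly from summary statistics.
from scipy import stats

result = stats.ttest_ind_from_stats(
    mean1=6.8, std1=2.4, nobs1=120,  # before the process change
    mean2=5.2, std2=2.1, nobs2=95,   # after the process change
    equal_var=True,                  # pooled variances; set False for Welch's test
)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.2e}")
# p is far below 0.05, so we reject H0: the reduction is statistically significant.
```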

Paired T-Test

Question: Is there a difference in means for paired or matched observations?

Banking application: "Did the training program improve analyst accuracy? Compare each analyst's error rate before and after training."

The key difference from a two-sample t-test is that paired data involves the same subjects measured twice (before and after), not two independent groups. The test analyzes the differences within each pair.

  • H₀: The mean difference (d̄) = 0
  • H₁: The mean difference (d̄) ≠ 0

Example: 15 KYC analysts complete a training program. Their first-time-right rates are measured before and after:

Analyst | Before | After | Difference
1       | 71%    | 88%   | +17%
2       | 68%    | 82%   | +14%
3       | 75%    | 91%   | +16%
...     | ...    | ...   | ...
15      | 70%    | 85%   | +15%
Mean    | 71.3%  | 86.7% | +15.4%

The paired t-test analyzes the column of differences. If the p-value is less than 0.05, you can conclude the training had a statistically significant effect on accuracy.

When to use paired vs. two-sample: Use the paired test when each observation in one group has a natural match in the other group (same person measured twice, same account processed by two methods, same day's volume measured by two systems). The paired test is more powerful because it controls for individual variation.
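A paired test can be sketched as follows. The full 15-analyst table is abbreviated above, so the five before/after pairs below are hypothetical illustrations, not the study's actual values:

```python
# Paired t-test sketch on a hypothetical 5-analyst slice of the training data.
from scipy import stats

before = [71, 68, 75, 73, 70]  # first-time-right %, pre-training
after = [88, 82, 91, 86, 85]   # same analysts, post-training

result = stats.ttest_rel(after, before)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.5f}")

# Equivalent view: a one-sample t-test on the per-analyst differences.
diffs = [a - b for a, b in zip(after, before)]
print(stats.ttest_1samp(diffs, 0).pvalue)
```

The equivalence shown at the end is the key idea: the paired test is simply a one-sample test on the column of differences.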

Chi-Square Test for Categorical Data

When both your variables are categorical (not continuous), t-tests do not apply. The chi-square test of independence determines whether two categorical variables are associated.

Banking application: "Is there a relationship between the type of payment exception (missing reference, incorrect amount, wrong beneficiary, duplicate) and the payment channel (SWIFT, domestic ACH, real-time gross settlement)?"

Setting Up the Chi-Square Test

Create a contingency table showing observed frequencies:

Exception Type    | SWIFT | Domestic ACH | RTGS | Total
Missing reference | 45    | 120          | 12   | 177
Incorrect amount  | 30    | 35           | 28   | 93
Wrong beneficiary | 55    | 40           | 8    | 103
Duplicate         | 10    | 85           | 5    | 100
Total             | 140   | 280          | 53   | 473

  • H₀: Exception type and payment channel are independent (no relationship)
  • H₁: Exception type and payment channel are associated

The chi-square test compares the observed frequencies to the frequencies you would expect if the two variables were truly independent. If the observed frequencies deviate substantially from the expected frequencies, the test produces a large chi-square statistic and a small p-value.

Result: χ² = 99.2, degrees of freedom = 6, p-value < 0.001

Conclusion: There is a highly significant association between exception type and payment channel. Looking at the data, domestic ACH has a disproportionate number of "missing reference" and "duplicate" exceptions, while SWIFT has more "wrong beneficiary" exceptions. This insight directs root cause investigation to channel-specific issues.

Assumptions: Expected cell frequencies should be at least 5 (if not, consider combining categories or using Fisher's exact test). Observations are independent.
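The test can be run directly on the contingency table above with scipy, which also returns the expected frequencies so the cell-count assumption can be checked:

```python
# Chi-square test of independence on the contingency table above.
# Rows: exception types; columns: SWIFT, Domestic ACH, RTGS.
from scipy import stats

observed = [
    [45, 120, 12],  # Missing reference
    [30, 35, 28],   # Incorrect amount
    [55, 40, 8],    # Wrong beneficiary
    [10, 85, 5],    # Duplicate
]

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.2e}")
# 'expected' holds the counts expected under independence; every cell
# here exceeds 5, so the chi-square approximation is valid.
```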

ANOVA: Comparing Multiple Groups

When you need to compare means across three or more groups, Analysis of Variance (ANOVA) is the appropriate test. Running multiple t-tests (group 1 vs. 2, 1 vs. 3, 2 vs. 3, etc.) inflates the false positive rate — this is called the multiple comparisons problem.

One-Way ANOVA

Question: Are the means of three or more groups all equal, or is at least one different?

Banking application: "Do the four regional operations centers (London, New York, Singapore, Mumbai) have different average error rates?"

  • H₀: μ_London = μ_New York = μ_Singapore = μ_Mumbai
  • H₁: At least one group mean is different from the others

Banking Example: Comparing Error Rates Across Regional Centers

A bank's operations division has four regional centers processing trade confirmations. The Green Belt suspects that error rates vary by region but needs statistical evidence.

Data collected over 3 months (monthly error rate per analyst, 15+ analysts per center):

Center    | n  | Mean Error Rate | Std Dev
London    | 18 | 2.8%            | 0.9%
New York  | 22 | 3.1%            | 1.1%
Singapore | 16 | 4.5%            | 1.3%
Mumbai    | 25 | 4.2%            | 1.0%

ANOVA results:

  • F-statistic = 11.34
  • p-value = 0.0001

Conclusion: p < 0.05, so we reject H₀. There is statistically significant evidence that at least one center has a different mean error rate. But ANOVA does not tell you which centers differ. For that, you need post-hoc tests.

[Bar chart: ANOVA — Error Rates by Regional Processing Centre. Mean error rate (%): Singapore 4.5, Mumbai 4.2, New York 3.1, London 2.8.]
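A one-way ANOVA can be sketched in a few lines. The per-analyst samples below are hypothetical (a handful of monthly error rates per centre, roughly matching the group means above); the real study used 16 to 25 analysts per centre:

```python
# One-way ANOVA sketch with hypothetical monthly error rates (%) per centre.
from scipy import stats

london = [2.1, 2.9, 3.2, 2.5, 3.3, 2.8]
new_york = [3.0, 2.7, 3.5, 3.4, 2.9, 3.1]
singapore = [4.2, 4.9, 4.3, 5.0, 4.4, 4.2]
mumbai = [4.0, 4.5, 3.9, 4.4, 4.1, 4.3]

f_stat, p_value = stats.f_oneway(london, new_york, singapore, mumbai)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
# A small p-value says at least one centre differs, but not which one.
```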

Post-Hoc Analysis

After a significant ANOVA result, use a post-hoc test (such as Tukey's HSD — Honestly Significant Difference) to determine which specific groups differ:

Comparison             | Difference | p-value | Significant?
London vs. New York    | 0.3%       | 0.72    | No
London vs. Singapore   | 1.7%       | 0.001   | Yes
London vs. Mumbai      | 1.4%       | 0.004   | Yes
New York vs. Singapore | 1.4%       | 0.003   | Yes
New York vs. Mumbai    | 1.1%       | 0.02    | Yes
Singapore vs. Mumbai   | 0.3%       | 0.68    | No

Interpretation: London and New York perform similarly (no significant difference). Singapore and Mumbai perform similarly. But there is a significant gap between the London/New York group and the Singapore/Mumbai group. The Green Belt should investigate what London and New York do differently — different training, different systems, different process design — and consider transferring best practices.
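Tukey's HSD is available in SciPy (stats.tukey_hsd, SciPy 1.11 or later). A sketch using the same hypothetical per-centre samples as the ANOVA example, which shows the same London/New York vs. Singapore/Mumbai pattern:

```python
# Post-hoc Tukey HSD on hypothetical per-centre error-rate samples.
# Requires SciPy >= 1.11 for stats.tukey_hsd.
from scipy import stats

london = [2.1, 2.9, 3.2, 2.5, 3.3, 2.8]
new_york = [3.0, 2.7, 3.5, 3.4, 2.9, 3.1]
singapore = [4.2, 4.9, 4.3, 5.0, 4.4, 4.2]
mumbai = [4.0, 4.5, 3.9, 4.4, 4.1, 4.3]

res = stats.tukey_hsd(london, new_york, singapore, mumbai)
names = ["London", "New York", "Singapore", "Mumbai"]
for i in range(4):
    for j in range(i + 1, 4):
        print(f"{names[i]} vs {names[j]}: p = {res.pvalue[i, j]:.3f}")
```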

ANOVA Assumptions

  • Normality: Data within each group is approximately normally distributed (check with histograms or normality tests)
  • Homogeneity of variance: Variances across groups are roughly equal (check with Levene's test; if violated, use Welch's ANOVA)
  • Independence: Observations are independent (one analyst's error rate does not influence another's)

Practical vs. Statistical Significance

This is one of the most important concepts for Green Belts to internalize. A result can be statistically significant but practically meaningless, and conversely, a result can be practically important but fail to reach statistical significance.

Statistically Significant, Practically Meaningless

With a large enough sample, tiny differences become statistically significant. If you have data on 500,000 payments, a difference of 0.1% in STP rates between two routing methods will likely be statistically significant. But does a 0.1% improvement justify the cost and risk of changing the routing method? Probably not.

Always report effect size alongside p-values:

  • For t-tests: report the mean difference and confidence interval
  • For ANOVA: report eta-squared (η²), which shows the proportion of variation explained by the factor
  • For chi-square: report Cramér's V for the strength of association
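For a two-sample comparison, one widely used standardized effect size is Cohen's d: the mean difference divided by the pooled standard deviation. A short sketch using the summary statistics from the settlement-time walkthrough earlier in this module (Cohen's d is an addition here; the module's own recommendation is the mean difference and confidence interval):

```python
# Cohen's d for the settlement-time example: mean difference over pooled sd.
import math

n1, m1, s1 = 120, 6.8, 2.4  # before the process change
n2, m2, s2 = 95, 5.2, 2.1   # after the process change

pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m1 - m2) / pooled_sd
print(f"Cohen's d = {d:.2f}")  # ≈ 0.70, conventionally a medium-to-large effect
```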

Practically Important, Not Statistically Significant

If your sample is small, you may fail to detect a real and meaningful difference. This is a power issue. If you observe a 15% improvement in cycle time but the p-value is 0.08, the improvement may be real — you just do not have enough data to prove it at the 95% confidence level. Options:

  • Collect more data (increase power)
  • Accept α = 0.10 for a less stringent test (with appropriate justification)
  • Report the confidence interval — it may show that the true improvement is likely positive even if the test is not formally significant

The Green Belt's Responsibility

Present both statistical results and business context to stakeholders. A typical Green Belt conclusion might read:

"The two-sample t-test shows a statistically significant reduction in settlement cycle time from 6.8 hours to 5.2 hours (p < 0.001). This 1.6-hour reduction (24% improvement) translates to an estimated capacity release of 2.5 FTEs, saving approximately $225,000 annually. The 95% confidence interval for the improvement is 1.1 to 2.1 hours."

This gives stakeholders everything they need: the statistical evidence, the practical magnitude, and the uncertainty range.

Choosing the Right Test

Scenario                                   | Data Types                              | Appropriate Test
Is the mean different from a target?       | 1 continuous variable                   | One-sample t-test
Are two independent group means different? | 1 continuous, 1 categorical (2 groups)  | Two-sample t-test
Did the same subjects change?              | 1 continuous, 2 time points             | Paired t-test
Are 3+ group means different?              | 1 continuous, 1 categorical (3+ groups) | One-way ANOVA
Are two categorical variables related?     | 2 categorical variables                 | Chi-square test
Is there a linear relationship?            | 2 continuous variables                  | Correlation / Regression (next module)

Key Takeaways

  • Hypothesis testing provides the statistical rigor to validate root causes and confirm improvements
  • The null hypothesis (H₀) represents "no difference" — you need evidence (low p-value) to reject it
  • p-value < α → reject H₀ (evidence of a difference); p-value > α → fail to reject H₀ (insufficient evidence)
  • Use the right test for your data: t-tests for continuous means, chi-square for categorical relationships, ANOVA for 3+ group comparisons
  • Statistical significance (small p-value) does not equal practical significance (meaningful business impact) — always report both
  • Type I errors (false positives) and Type II errors (false negatives) both have real costs in banking
  • After a significant ANOVA result, use post-hoc tests (Tukey's HSD) to identify which specific groups differ
  • Always check test assumptions: normality, independence, and equal variances

In the next module, we will extend our statistical toolkit to Regression and Correlation Analysis — exploring relationships between variables and building predictive models for banking operations.

Module Quiz

5 questions — Pass mark: 60%

Q1. The null hypothesis (H₀) in a Six Sigma project typically states:

Q2. A Green Belt runs a two-sample t-test comparing settlement cycle times before and after a process change. The p-value is 0.03. At a significance level of 0.05, what is the correct conclusion?

Q3. When should a chi-square test be used instead of a t-test?

Q4. A bank compares error rates across 4 regional operations centers. Which test is most appropriate?

Q5. A process change reduces average trade settlement time from 4.2 hours to 4.0 hours, and the difference is statistically significant (p = 0.01). However, the SLA is 24 hours. What should the Green Belt conclude?