Why Your Data Might Be Wrong
Before you invest weeks in root cause analysis, hypothesis testing, and solution design, you need to answer a fundamental question: is your data any good?
In manufacturing, measurement systems are physical instruments — calipers, scales, pressure gauges — and their accuracy can be tested and calibrated. In banking operations, the "measurement system" is far more complex and far more prone to error. Your data is generated by:
- People classifying, categorizing, and recording information
- Systems capturing timestamps, matching records, and applying business rules
- Processes that determine what gets measured, when, and how
Each of these introduces potential error. Consider the following scenarios:
An operations team tracks "trade break causes" by having analysts select a reason code from a dropdown menu. But the reason codes are ambiguous, analysts interpret them differently, and there is no "unknown" option — so analysts pick whatever seems closest. The data shows that "counterparty SSI mismatch" is the #1 cause of breaks, but is that really true, or is it just the easiest category to select?
A KYC team measures "case cycle time" from the date a case is assigned to the date it is marked complete. But some analysts mark cases complete before all documentation is filed, then go back to finish the work. The cycle time data shows a median of 3 days, but the actual time to complete the work is 5-6 days.
A payments team measures STP rates using a system report that counts how many payments pass straight through without manual intervention. But the system counts a payment as "STP" if it passes the initial validation, even if it is subsequently caught and repaired by a downstream exception handler. The reported STP rate of 85% is actually closer to 72% when end-to-end processing is considered.
These are not hypothetical problems. They are common realities in banking operations. Measurement System Analysis (MSA) is the discipline that detects and quantifies these issues before they corrupt your analysis.
Measurement System Analysis Fundamentals
MSA evaluates the quality of your measurement system by decomposing the total observed variation into its components:
Total Observed Variation = Real Process Variation + Measurement System Variation (strictly, it is the variances that add: σ²_total = σ²_process + σ²_measurement)
If measurement system variation is large relative to process variation, your data is unreliable. You may observe differences that are not real (false signals) or fail to detect differences that are real (missed signals).
Sources of Measurement Error
Accuracy (Bias): The measurement consistently reads too high or too low. Example: a system timestamps trade settlements at the time the back-office system processes the settlement, not the actual settlement time in the market infrastructure. This introduces a consistent 45-minute delay bias.
Precision: The measurement varies excessively around the true value. Precision has two components:
- Repeatability: Variation when the same person measures the same item multiple times. Can an analyst consistently categorize the same trade break the same way if shown it on different days?
- Reproducibility: Variation when different people measure the same item. Do different analysts categorize the same trade break the same way?
Stability: The measurement system's accuracy and precision remain constant over time. Do analysts apply classification criteria the same way in January as they do in June? Do system calculations remain consistent after software updates?
Linearity: The measurement system's accuracy is consistent across the range of measured values. Does the system accurately capture cycle times for both 2-minute processes and 2-hour processes?
Resolution: The measurement system can detect meaningful differences. If you measure cycle time in whole days, you cannot detect the difference between a 4-hour process and a 7-hour process — both read as "1 day."
Gage R&R for Continuous Data
Gage R&R (Gauge Repeatability and Reproducibility) is the standard method for assessing measurement system capability for continuous data. In banking, continuous measurements include cycle times, processing durations, dollar amounts, error counts, and volumes.
Setting Up a Gage R&R Study
The standard Gage R&R study design involves:
- Appraisers: Typically 3 people who perform the measurement (e.g., 3 analysts who record processing times or assign values to trade breaks)
- Parts: Typically 10 items to be measured (e.g., 10 trade breaks, 10 KYC cases, 10 payment exceptions)
- Trials: Typically 2-3 repeated measurements per appraiser per part (each appraiser measures each item 2-3 times, without knowing their previous answers)
This gives you 60-90 data points (3 appraisers × 10 parts × 2-3 trials), which is sufficient to decompose the variation.
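The crossed design described above can be sketched as a randomized run list. This is a minimal illustration, not a prescribed tool; the analyst names and break labels are placeholders:

```python
import random
from itertools import product

# Hypothetical appraisers and parts for a 3 x 10 x 2 crossed study
appraisers = ["Analyst A", "Analyst B", "Analyst C"]
parts = [f"Break-{i:02d}" for i in range(1, 11)]   # 10 items to be measured
trials = [1, 2]                                    # 2 repeat measurements each

# Every appraiser measures every part on every trial (a "crossed" design);
# shuffling the run order helps blind appraisers to their earlier answers.
runs = list(product(appraisers, parts, trials))
random.shuffle(runs)

print(len(runs))  # 3 appraisers x 10 parts x 2 trials = 60 measurements
```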
Interpreting Gage R&R Results
The key metric is %Study Variation (also called %R&R), which tells you what percentage of the total observed variation is attributable to the measurement system:
- %R&R < 10% — Excellent measurement system. Proceed with confidence.
- %R&R between 10% and 30% — Acceptable, depending on the application. May be adequate for process characterization but insufficient for tight process control.
- %R&R > 30% — Unacceptable. The measurement system is contributing too much noise. You must improve the measurement system before collecting data for analysis.
Additionally, examine the number of distinct categories (ndc). This tells you how many distinct groups within the process variation the measurement system can distinguish:
- ndc ≥ 5 — The measurement system can discriminate adequately
- ndc < 5 — The measurement system lacks resolution; it cannot reliably detect process changes
Banking Example: Gage R&R on Reconciliation Break Valuation
Consider a bank's securities reconciliation team. When a break is identified between the bank's records and the counterparty's records, an analyst must determine the break value — the financial exposure at risk. This value drives the urgency of resolution and the escalation path.
Three senior analysts are asked to independently value the same 10 breaks (selected to represent the range from minor timing differences to material mismatches). Each analyst values each break twice, a week apart, without access to their previous answers.
Results:
| Source | %Study Variation |
|---|---|
| Repeatability | 8.2% |
| Reproducibility | 29.2% |
| Total R&R | 30.3% |
| Part-to-Part | 95.3% |
| ndc | 4 |
Interpretation: The total R&R of 30.3% is at the boundary of acceptability. The problem is primarily reproducibility (29.2%) — meaning the three analysts are valuing the same breaks differently. The repeatability is reasonable (8.2%) — each analyst is at least consistent with themselves. (Note that %Study Variation components combine in quadrature, not by simple addition: √(8.2² + 29.2²) ≈ 30.3.)
Action: The Green Belt investigates and discovers that the analysts use different approaches to value breaks involving accrued interest and corporate actions. Two analysts include accrued interest in the break value; one does not. The Green Belt works with the team to create a standardized break valuation procedure, trains all analysts, and reruns the study. The second study shows total R&R of 14.7% — within acceptable limits.
This example illustrates a critical Green Belt insight: improving the measurement system is often a project improvement in itself. The standardized valuation procedure reduces disputes, speeds up resolution, and improves reporting accuracy — even before the formal Analyze and Improve phases begin.
Attribute Agreement Analysis
Many banking measurements are attribute data — categorical classifications rather than continuous values. Examples include:
- AML alert dispositions (escalate, close, request more information)
- Trade break cause codes (SSI mismatch, trade date error, quantity discrepancy, etc.)
- KYC risk ratings (high, medium, low)
- Exception types (system error, data entry error, late receipt, etc.)
For attribute data, you cannot use Gage R&R. Instead, you use Attribute Agreement Analysis, which measures how consistently people apply categorical classifications.
Study Design
- Select 50 items (cases, alerts, breaks) that represent the full range of categories
- Have 3+ appraisers independently classify each item (blind, without discussion)
- Have each appraiser classify each item at least twice (to assess repeatability)
- Establish a "standard" or "expert" classification for each item (the "correct" answer)
Key Metrics
Within-appraiser agreement (repeatability): What percentage of the time does each appraiser classify the same item the same way across trials? Target: >90%.
Between-appraiser agreement (reproducibility): What percentage of the time do all appraisers agree with each other? Target: >80%.
Appraiser vs. standard agreement: What percentage of the time does each appraiser agree with the expert standard? Target: >90%.
Kappa statistic: Measures agreement beyond what would be expected by chance. Values range from -1 to +1:
- κ > 0.80 — Excellent agreement
- 0.60 < κ ≤ 0.80 — Good agreement
- 0.40 < κ ≤ 0.60 — Moderate agreement
- κ ≤ 0.40 — Poor agreement
Banking Example: AML Alert Disposition Consistency
A bank's AML team has 25 analysts who triage transaction monitoring alerts. Each alert must be classified as:
- SAR-eligible — escalate for Suspicious Activity Report filing
- Close — no issue — false positive, no further action
- Close — documented — unusual but explainable activity, document rationale
- Request more information — insufficient data to make a determination
The Green Belt selects 50 alerts (representing the full spectrum from obvious false positives to clear SAR cases, with many in the gray area). Five experienced analysts independently classify each alert twice.
Results:
| Metric | Result |
|---|---|
| Within-appraiser agreement | 82% |
| Between-appraiser agreement | 58% |
| Appraiser vs. standard | 71% |
| Fleiss' Kappa | 0.49 |
Interpretation: Individual analysts are reasonably consistent (82% repeatability), but they disagree with each other significantly (only 58% agreement, Kappa of 0.49). The biggest disagreement area is between "Close — documented" and "Request more information" — analysts have different thresholds for when they feel they have enough information to make a judgment.
Action: This finding has major implications. If analysts cannot consistently classify alerts, then any analysis of alert disposition patterns, workload distribution, or closure rates is built on unreliable data. The Green Belt recommends:
- Clarify the definitions and decision criteria for each category, with specific examples
- Create a decision tree that guides analysts through the classification
- Conduct calibration sessions where analysts discuss borderline cases and align on standards
- Retrain all analysts on the revised criteria
- Rerun the attribute agreement analysis to confirm improvement
Sampling Strategies
Unless you can analyze the entire population (which is sometimes possible with system data in banking), you need a sampling strategy that produces a representative sample. The wrong sampling approach leads to biased conclusions.
Simple Random Sampling
Every item in the population has an equal chance of being selected. This is the default approach and works well when the population is homogeneous.
Banking application: Sampling completed KYC cases to assess quality when all cases follow the same process and serve similar customers.
How to implement: Use a random number generator to select case IDs from the full population. Avoid "convenience sampling" (grabbing the most recent cases or the ones on your desk) — this introduces bias.
Stratified Sampling
The population is divided into subgroups (strata), and random samples are drawn from each stratum. Use stratified sampling when you expect the process to behave differently across subgroups.
Banking application: Sampling trade breaks for root cause analysis when you know that equity trades, fixed income trades, and derivatives each have different break profiles. If you sample randomly, you might over-represent equities (high volume) and miss patterns in derivatives (low volume, high complexity).
How to implement: Define your strata (product type, customer segment, region, time period), determine what proportion of the population each stratum represents, and draw samples proportionally (or equally, if you want to ensure minimum representation of small strata).
Example:
| Stratum | Population % | Proportional Sample (n=200) | Equal Sample (n=200) |
|---|---|---|---|
| Equity trades | 60% | 120 | 50 |
| Fixed income | 25% | 50 | 50 |
| Derivatives | 10% | 20 | 50 |
| FX | 5% | 10 | 50 |
The proportional approach gives you results representative of the overall population. The equal approach gives you enough data to analyze each product type individually, even the smaller ones. Choose based on your analysis objectives.
Systematic Sampling
Select every k-th item from the population. For example, if you have 10,000 payments and want a sample of 500, you would select every 20th payment (k = 10,000/500 = 20), starting from a randomly chosen point.
Banking application: Sampling transactions from a daily processing queue. Systematic sampling is faster than random sampling when working with large, ordered datasets.
Caution: If the data has a periodic pattern that aligns with your sampling interval, systematic sampling can produce biased results. For example, if payments are processed in batches of 20 (batch 1 = domestic, batch 2 = international, repeat), sampling every 20th payment would capture only one type.
Time-Based Sampling
Collect all data during specific time windows, selected to represent different conditions.
Banking application: To understand call center performance, sample all calls during one morning shift, one afternoon shift, and one evening shift per week for four weeks. This captures variation across shifts and days.
Judgmental (Expert) Sampling
The sampler deliberately selects specific items based on expertise. This is not statistically valid for making population-level inferences, but it is useful for exploring specific conditions or verifying hypotheses.
Banking application: A Green Belt suspects that trade breaks caused by corporate actions are particularly time-consuming to resolve. They deliberately pull 30 corporate-action-related breaks for detailed timing analysis. This is valid for investigating a specific hypothesis but cannot be used to estimate overall break resolution times.
Sample Size Determination
"How much data do I need?" is one of the most common questions a Green Belt faces. The answer depends on what you are trying to detect and how confident you need to be.
For Continuous Data (Means)
The sample size formula for estimating a population mean with a specified precision is:
n = (Z × σ / E)²
Where:
- n = required sample size
- Z = Z-score for desired confidence level (1.96 for 95%, 2.58 for 99%)
- σ = estimated standard deviation of the population
- E = desired margin of error (how close to the true mean you need to be)
Example: You want to estimate the average cycle time for KYC case remediation within ±0.5 days, with 95% confidence. Preliminary data suggests a standard deviation of 2.3 days.
n = (1.96 × 2.3 / 0.5)² = (4.508 / 0.5)² = (9.016)² ≈ 81.3, so round up to 82 cases
For Attribute Data (Proportions)
The sample size formula for estimating a population proportion is:
n = Z² × p × (1-p) / E²
Where:
- p = estimated proportion (defect rate)
- E = desired margin of error
Example: You want to estimate the STP rate of a payments process within ±3 percentage points, with 95% confidence. You estimate the current STP rate is about 80% (p = 0.80).
n = (1.96)² × 0.80 × 0.20 / (0.03)² = 3.8416 × 0.16 / 0.0009 = 0.6147 / 0.0009 ≈ 683 payments
For Hypothesis Testing (Comparing Groups)
When you plan to compare two groups (e.g., before vs. after, or team A vs. team B), sample size depends on:
- Effect size — how large a difference do you want to detect?
- Power — the probability of detecting a real difference (typically 80%)
- Significance level — the probability of a false positive (typically 5%)
- Variability — the standard deviation within groups
We will cover this in detail in the Hypothesis Testing module. For now, know that detecting small differences requires larger samples than detecting large differences.
Practical Considerations in Banking
- Finite population correction: If your sample is more than 5-10% of the total population, you can use a smaller sample than the formula suggests. Apply the correction: n_adjusted = n / (1 + (n-1)/N), where N is the population size.
- Data availability: In many banking processes, you have access to the full population through system data. In those cases, use all available data — there is no statistical reason to sample when you have everything.
- Cost of sampling: If data collection requires manual effort (e.g., timing observations, case reviews), balance statistical precision against practical cost. An estimate within ±5% is usually sufficient for process improvement; you do not need ±0.1% precision.
- Subgroup analysis: If you plan to analyze subgroups (by product type, region, analyst), you need sufficient data within each subgroup. A total sample of 200 that breaks into 5 subgroups gives you only 40 per subgroup — which may not be enough.
Data Integrity in Banking Systems
Beyond measurement system variation and sampling, Green Belts in banking must contend with data integrity issues that are unique to the financial services environment:
System-of-Record Conflicts
Banks often have multiple systems that should contain the same data but do not. The trade booking system, the settlement system, and the accounting system may each show a different view of the same transaction. Before analyzing data, confirm which system is the authoritative source for each data element.
Data Latency
Real-time data is rare in banking operations. Most reports are generated from batch processes that run overnight. This means "today's data" is actually yesterday's data, and any analysis must account for this lag. For time-sensitive measurements (like intraday settlement timing), you may need to extract data directly from operational systems rather than from reporting databases.
Data Definition Inconsistencies
The same term can mean different things across departments. "Cycle time" in one team might mean elapsed calendar days; in another, it means business days; in a third, it means actual processing time excluding queue time. Before combining data from multiple sources, validate that definitions are consistent.
Manual Data Entry
Any field that requires manual entry is susceptible to error. Common issues include:
- Free-text fields with inconsistent formatting
- Dropdown selections that do not match the actual situation (analysts choose the closest option)
- Timestamps that record when data was entered, not when the event occurred
- Missing values where fields are not mandatory
System Migration Artifacts
Banks frequently undergo system migrations. Historical data extracted from a migrated system may contain conversion artifacts — changed field formats, truncated values, or misaligned date ranges. Always check whether the data spans a system migration and assess the impact.
Key Takeaways
- Measurement System Analysis must be conducted before root cause analysis to ensure data trustworthiness
- Gage R&R decomposes variation into repeatability (within-appraiser) and reproducibility (between-appraiser)
- For attribute data, Attribute Agreement Analysis measures classification consistency using percent agreement and Kappa statistics
- Choose your sampling strategy based on population structure: random for homogeneous populations, stratified for populations with distinct subgroups
- Sample size depends on desired precision, confidence level, and expected variability — not on population size alone
- In banking, data integrity issues (system conflicts, latency, definition inconsistencies, manual entry errors) add a layer of complexity that does not exist in manufacturing MSA
- Improving the measurement system often delivers process improvements even before formal root cause analysis begins
In the next module, we will apply the statistical foundation to Hypothesis Testing — using data to determine whether observed differences are real or just random variation.