Why Test Multiple Factors Simultaneously?
In the previous modules, you learned to analyze observational data using hypothesis testing and regression. These tools are powerful for identifying relationships in existing data, but they have a fundamental limitation: observational data is subject to confounding. You can observe that trade volume and break rates are correlated, but you cannot be certain that reducing volume would reduce breaks — other factors that co-vary with volume might be the real drivers.
Design of Experiments (DOE) takes a fundamentally different approach. Instead of passively observing data, you actively change process factors in a structured, controlled manner and measure the impact on outcomes. This allows you to:
- Establish causation, not just correlation
- Test multiple factors simultaneously, rather than one at a time
- Detect interaction effects — situations where the impact of one factor depends on the level of another
- Optimize processes by finding the best combination of factor settings
In manufacturing, DOE is straightforward: change the temperature, pressure, and speed of a machine and measure the output quality. In banking operations, DOE requires more creativity and care, but it is absolutely applicable and highly valuable.
The Problem with One-Factor-at-a-Time (OFAT)
The intuitive approach to testing improvements is to change one thing at a time: test a new threshold level, see if it works, then test a different routing approach, see if it works. This is called OFAT (One-Factor-at-a-Time) experimentation.
OFAT has three critical weaknesses:
1. It Cannot Detect Interactions
Suppose you are optimizing AML alert triage by testing two factors: alert threshold level (low vs. high) and analyst experience tier (junior vs. senior). With OFAT:
Test 1: Low threshold + junior analyst → 45 minutes average
Test 2: High threshold + junior analyst → 38 minutes average
Conclusion: High threshold is better (saves 7 minutes)
Test 3: High threshold + junior analyst → 38 minutes average (result carried over from Test 2 as the new baseline)
Test 4: High threshold + senior analyst → 25 minutes average
Conclusion: Senior analysts are better (saves 13 minutes)
OFAT recommendation: Use high threshold + senior analysts → expect ~25 minutes
But what if you also tested low threshold + senior analyst? You might find:
- Test 5: Low threshold + senior analyst → 22 minutes average
This is an interaction effect: senior analysts are particularly effective with low thresholds because they can quickly dismiss the additional alerts that low thresholds generate, while junior analysts are overwhelmed by them. The OFAT approach missed the optimal combination.
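The gap is easy to see in a few lines of code. This sketch uses the illustrative times from the tests above (including Test 5) and shows that the OFAT search path settles on a combination that the full grid beats:

```python
# Average triage times (minutes) for every threshold x experience
# combination, using the illustrative numbers from the example above.
times = {
    ("low", "junior"): 45,
    ("high", "junior"): 38,
    ("low", "senior"): 22,
    ("high", "senior"): 25,
}

# OFAT path: only three of the four combinations are ever tested
# (low threshold + senior analyst is never tried).
ofat_tested = [("low", "junior"), ("high", "junior"), ("high", "senior")]
ofat_best = min(ofat_tested, key=times.get)

# Full factorial path: every combination is tested, so the true
# optimum cannot be missed.
factorial_best = min(times, key=times.get)
```

With these numbers, `ofat_best` is high threshold + senior analyst (25 minutes), while `factorial_best` is the untested low threshold + senior analyst combination (22 minutes).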
2. It Requires More Runs
To test 3 factors at 2 levels each, OFAT requires at least 4 runs (baseline + 3 changes). A full factorial DOE requires 8 runs but yields far more information: all main effects, all two-way interactions, and the three-way interaction. DOE delivers dramatically more information per run.
3. It Assumes Factors Are Independent
OFAT implicitly assumes that the effect of each factor is the same regardless of the settings of other factors. This assumption is often wrong, especially in complex banking processes where people, systems, and procedures interact in non-linear ways.
Full Factorial Designs
A full factorial design tests every possible combination of factor levels. For k factors with 2 levels each (coded as "low" and "high," or -1 and +1), the design is denoted 2^k and requires 2^k runs.
2² Design (2 Factors, 2 Levels)
4 experimental runs covering all combinations:
| Run | Factor A | Factor B |
|---|---|---|
| 1 | Low (-1) | Low (-1) |
| 2 | High (+1) | Low (-1) |
| 3 | Low (-1) | High (+1) |
| 4 | High (+1) | High (+1) |
2³ Design (3 Factors, 2 Levels)
8 experimental runs:
| Run | Factor A | Factor B | Factor C |
|---|---|---|---|
| 1 | -1 | -1 | -1 |
| 2 | +1 | -1 | -1 |
| 3 | -1 | +1 | -1 |
| 4 | +1 | +1 | -1 |
| 5 | -1 | -1 | +1 |
| 6 | +1 | -1 | +1 |
| 7 | -1 | +1 | +1 |
| 8 | +1 | +1 | +1 |
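Generating a two-level full factorial design matrix is straightforward in code. A minimal Python sketch (note that the run order differs from the tables above, which vary Factor A fastest, but the set of combinations is identical):

```python
from itertools import product

def full_factorial(k):
    """All 2**k runs of a k-factor, two-level design, coded -1/+1."""
    return list(product((-1, +1), repeat=k))

design = full_factorial(3)  # the eight runs of a 2^3 design
```

A quick sanity check on any two-level factorial matrix: every run is unique, and each factor column is balanced (equal numbers of -1 and +1 settings).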
2⁴ Design (4 Factors, 2 Levels)
16 experimental runs. As k increases, the number of runs grows exponentially. For k > 4, consider fractional factorial designs (running a carefully selected fraction of all combinations) to reduce the number of runs while still estimating main effects and key interactions.
Main Effects and Interactions
Main Effects
A main effect is the average change in the response (outcome) when a factor moves from its low level to its high level, averaged across all levels of the other factors.
Example: If switching from a low threshold to a high threshold reduces average processing time from 40 minutes to 32 minutes (averaged across all combinations of other factors), the main effect of threshold is -8 minutes.
Interaction Effects
An interaction effect occurs when the impact of one factor depends on the level of another factor. If the threshold effect is -12 minutes when using senior analysts but only -4 minutes when using junior analysts, there is a threshold × experience interaction.
Interactions are often the most valuable finding in a DOE because they reveal:
- Which combinations of factors produce the best results
- Why previous one-at-a-time improvements did not deliver expected benefits
- Process dynamics that are invisible in observational data
Visualizing Effects
Main effect plots show the average response at each level of a factor. A steep line indicates a large main effect; a flat line indicates no effect.
Interaction plots show how the effect of one factor changes across levels of another factor. Parallel lines indicate no interaction; non-parallel (especially crossing) lines indicate an interaction.
Practical DOE in Service Environments
DOE in banking is more challenging than in manufacturing. You cannot simply "turn dials" on a banking process the way you can adjust machine settings. Here are the key adaptations:
Constraints Unique to Banking
Ethical constraints: You cannot deliberately give some customers worse service to test the effect. If testing a new KYC workflow, you cannot intentionally slow down processing for a control group.
Regulatory constraints: Processes subject to regulatory requirements cannot be modified below the regulatory standard. You can test whether exceeding the standard in different ways produces better outcomes, but you cannot test "what happens if we skip the compliance check?"
Volume constraints: You cannot always control the number of transactions that flow through a process. Unlike manufacturing where you can set the production rate, banking volumes are driven by market activity and customer behavior.
Blinding limitations: In manufacturing, operators may not know which experimental condition they are testing. In banking, analysts almost always know they are being observed or tested, which can influence behavior (Hawthorne effect).
Strategies for Banking DOE
Natural experiments: Take advantage of naturally occurring variation. Different shifts, different days, different offices, or different teams may already operate under different conditions. You can analyze these "natural experiments" using DOE analysis techniques, even though you did not control the assignments.
Pilot-based experiments: Run the experiment in a pilot environment before full-scale deployment. Select a subset of cases, a specific team, or a single office for the experiment. Ensure the pilot is large enough for statistical significance.
Sequential experimentation: If you cannot run all combinations simultaneously, run them sequentially — one combination per week, for example. This introduces time as a potential confound, so randomize the order of experimental runs.
Simulation: For processes where live experimentation is too risky, build a simulation model and test factor combinations in the simulation before selecting the best combination for a live pilot.
A/B testing: For high-volume, low-risk processes (like alert routing or queue assignment), A/B testing is a form of DOE where transactions are randomly assigned to different process configurations. Ensure sample sizes are sufficient and monitor for adverse effects.
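One common way to implement the random assignment in such an A/B test is hash-based bucketing: the same transaction ID always lands in the same bucket, so a case is never handled under mixed conditions, while assignment across IDs is effectively random. The function below is an illustrative sketch, not a prescribed design:

```python
import hashlib

def assign_configuration(transaction_id: str, configs=("A", "B")):
    """Stable, effectively random assignment of a transaction to one
    of the experimental process configurations.

    Hashing the ID (rather than drawing a fresh random number) makes
    the assignment deterministic and reproducible for auditing.
    """
    digest = hashlib.sha256(transaction_id.encode("utf-8")).hexdigest()
    return configs[int(digest, 16) % len(configs)]
```

Because assignment is a pure function of the ID, any reviewer can reproduce exactly which configuration handled a given transaction.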
Selecting Factor Levels
Choose factor levels that are:
- Practically meaningful — the difference between low and high should represent a real operational change, not a trivial variation
- Achievable — both levels must be operationally feasible and compliant
- Safe — neither level should create unacceptable risk
For example, when testing AML alert thresholds:
- Low = 70 (current threshold — captures more alerts, including more false positives)
- High = 85 (proposed threshold — captures fewer alerts, but may miss some genuine suspicious activity)
The "high" level must be validated as safe — you cannot test a threshold of 99 that would miss virtually all genuine alerts.
Banking Example: Optimizing AML Alert Triage
A bank's AML operations team processes approximately 2,500 alerts per day. The average triage time is 35 minutes per alert, and the team is struggling to meet the 48-hour disposition deadline for regulatory purposes. The Green Belt wants to find the combination of process settings that minimizes triage time without compromising quality.
Factor Selection
After brainstorming with the team and reviewing the Analyze phase findings, the Green Belt identifies three factors to test:
Factor A — Threshold Level
- Low (-1): Current threshold of 70 (generates more alerts, many false positives)
- High (+1): Proposed threshold of 85 (fewer alerts, higher proportion of genuine cases)
Factor B — Analyst Experience Tier
- Low (-1): Tier 1 analysts (0-12 months experience)
- High (+1): Tier 2 analysts (12+ months experience)
Factor C — Case Complexity Routing
- Low (-1): Random assignment (current state — any case goes to any available analyst)
- High (+1): Complexity-based routing (simple cases to Tier 1, complex cases to Tier 2)
Experimental Design
The Green Belt designs a 2³ full factorial experiment with 3 replicates per run (to estimate experimental error), for a total of 24 experimental units. Each "unit" is a batch of 50 alerts processed under the specified conditions over one shift.
| Run | Threshold (A) | Experience (B) | Routing (C) | Rep 1 (min) | Rep 2 (min) | Rep 3 (min) | Avg (min) |
|---|---|---|---|---|---|---|---|
| 1 | Low | Tier 1 | Random | 42 | 45 | 43 | 43.3 |
| 2 | High | Tier 1 | Random | 36 | 34 | 38 | 36.0 |
| 3 | Low | Tier 2 | Random | 31 | 28 | 30 | 29.7 |
| 4 | High | Tier 2 | Random | 27 | 26 | 28 | 27.0 |
| 5 | Low | Tier 1 | Complexity | 38 | 40 | 37 | 38.3 |
| 6 | High | Tier 1 | Complexity | 33 | 35 | 32 | 33.3 |
| 7 | Low | Tier 2 | Complexity | 24 | 22 | 25 | 23.7 |
| 8 | High | Tier 2 | Complexity | 20 | 21 | 19 | 20.0 |
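The effect estimates in the next section can be recomputed directly from these raw times. A minimal sketch of the main-effect calculation (mean response at a factor's high level minus mean response at its low level):

```python
# Raw triage times (minutes) for the 2^3 experiment: coded (A, B, C)
# levels and the three replicates for each of the eight runs above.
levels = [
    (-1, -1, -1), (+1, -1, -1), (-1, +1, -1), (+1, +1, -1),
    (-1, -1, +1), (+1, -1, +1), (-1, +1, +1), (+1, +1, +1),
]
replicates = [
    [42, 45, 43], [36, 34, 38], [31, 28, 30], [27, 26, 28],
    [38, 40, 37], [33, 35, 32], [24, 22, 25], [20, 21, 19],
]

def main_effect(factor):
    """Mean response at the factor's high level minus its low level."""
    high = [t for lv, reps in zip(levels, replicates) if lv[factor] == +1 for t in reps]
    low = [t for lv, reps in zip(levels, replicates) if lv[factor] == -1 for t in reps]
    return sum(high) / len(high) - sum(low) / len(low)

effects = {name: round(main_effect(i), 1) for i, name in enumerate("ABC")}
```

Running this on the raw data gives effects of -4.7 minutes for threshold, -12.7 minutes for experience, and -5.2 minutes for routing.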
Analysis Results
Main Effects:
| Factor | Effect (minutes) | p-value | Significant? |
|---|---|---|---|
| A (Threshold) | -4.7 | <0.001 | Yes |
| B (Experience) | -12.7 | <0.001 | Yes |
| C (Routing) | -5.2 | <0.001 | Yes |
Interpretation of main effects:
- Raising the threshold from 70 to 85 reduces average triage time by 4.7 minutes (fewer false positives to investigate)
- Using Tier 2 analysts reduces average triage time by 12.7 minutes (experience is the dominant factor)
- Complexity-based routing reduces average triage time by 5.2 minutes (better matching of case difficulty to analyst capability)
Interaction Effects:
| Interaction | Effect (minutes) | p-value | Significant? |
|---|---|---|---|
| A × B (Threshold × Experience) | -1.2 | 0.14 | No |
| A × C (Threshold × Routing) | -0.5 | 0.52 | No |
| B × C (Experience × Routing) | -3.1 | 0.008 | Yes |
| A × B × C | -0.3 | 0.68 | No |
Key finding — B × C interaction: The interaction between experience and routing is significant. Let us examine it:
- With random routing: Tier 2 is 11.3 minutes faster than Tier 1 (average of runs 3,4 vs runs 1,2)
- With complexity routing: Tier 2 is 14.0 minutes faster than Tier 1 (average of runs 7,8 vs runs 5,6)
Interpretation: Complexity-based routing amplifies the advantage of experienced analysts. When experienced analysts receive complex cases (which they handle efficiently) and junior analysts receive simple cases (which they can manage), the overall system performs much better than when cases are randomly assigned. The routing strategy is particularly beneficial for the experienced analyst tier.
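The Tier 2 advantage under each routing rule can be computed directly from the raw replicate data. A minimal sketch:

```python
# Raw triage times (minutes) per run, three replicates each, from the
# results table above.
runs = {1: [42, 45, 43], 2: [36, 34, 38], 3: [31, 28, 30], 4: [27, 26, 28],
        5: [38, 40, 37], 6: [33, 35, 32], 7: [24, 22, 25], 8: [20, 21, 19]}

def tier2_advantage(tier1_runs, tier2_runs):
    """Mean Tier 1 time minus mean Tier 2 time under one routing rule."""
    t1 = [t for r in tier1_runs for t in runs[r]]
    t2 = [t for r in tier2_runs for t in runs[r]]
    return sum(t1) / len(t1) - sum(t2) / len(t2)

random_gap = tier2_advantage((1, 2), (3, 4))      # random routing
complexity_gap = tier2_advantage((5, 6), (7, 8))  # complexity routing
```

The gap widening from about 11.3 to 14.0 minutes is exactly the non-parallelism an interaction plot would show for these two factors.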
Optimal Configuration
Based on the DOE results, the optimal configuration is:
- Threshold: High (85) — reduces false positive workload
- Analyst tier: This is a staffing decision, not a setting to optimize. The finding confirms the value of investing in analyst development and retention.
- Routing: Complexity-based — the interaction with experience makes this particularly impactful
Predicted performance at optimal settings (Run 8): 20.0 minutes average triage time, down from the current 35 minutes (a 43% improvement).
Quality Validation
Before implementing, the Green Belt must verify that the higher threshold (85) does not miss genuine suspicious activity. Review a sample of alerts that would have been generated at threshold 70 but not at threshold 85. If any represent genuine SAR-eligible activity, the threshold may need to be adjusted or a secondary screening process added.
This is critical in banking DOE: you cannot optimize one metric at the expense of a regulatory or risk requirement. The Green Belt presents both the efficiency improvement and the quality validation to the sponsor and compliance stakeholder for a joint decision.
Fractional Factorial Designs
When the number of factors exceeds 4, a full factorial design requires too many experimental runs. A 2^(k-p) fractional factorial design runs a strategically selected fraction of the full factorial:
- 2^(4-1) = 8 runs instead of 16 (half-fraction)
- 2^(5-1) = 16 runs instead of 32 (half-fraction)
- 2^(5-2) = 8 runs instead of 32 (quarter-fraction)
The trade-off is that some effects become confounded (aliased) — you cannot distinguish them from each other. Careful design ensures that main effects are not confounded with each other, only with higher-order interactions that are usually negligible.
For Green Belt projects in banking, the most common designs are:
- 2² or 2³ full factorials — for 2-3 factors, run the full design
- 2^(4-1) fractional factorial — for 4 factors, run 8 runs instead of 16
- Screening designs (Plackett-Burman) — for 5+ factors, identify which factors matter before designing a detailed experiment on the significant ones
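A half-fraction is built by generating the extra factor's settings from an interaction of the others. A minimal sketch of the 2^(4-1) design with defining relation I = ABCD, so that D is aliased with the ABC interaction rather than with any main effect:

```python
from itertools import product

# 2^(4-1) half-fraction: start from a full 2^3 design in A, B, C and
# derive D from the defining relation I = ABCD (i.e. D = A*B*C).
half_fraction = [
    (a, b, c, a * b * c) for a, b, c in product((-1, +1), repeat=3)
]
```

The defining relation can be verified directly: the product A x B x C x D equals +1 in every run, and all four main-effect columns remain balanced, which is what keeps the main effects estimable from only 8 runs.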
Replication and Randomization
Replication
Running each experimental condition multiple times (replicates) allows you to estimate the experimental error — the natural variation that exists even when all factors are held constant. Without replicates, you cannot determine whether observed differences are real or just noise.
In banking, each replicate should be a genuinely independent observation — for example, a different shift, a different day, or a different batch of work. Processing 50 alerts under the same conditions on three different days gives 3 independent replicates; processing 150 alerts in a single batch gives only 1.
Randomization
Randomize the order in which experimental runs are conducted to prevent time-related confounding. If you run all "low threshold" conditions in week 1 and all "high threshold" conditions in week 2, any difference might be due to the threshold change or to something that changed between week 1 and week 2 (different volume, different staff availability, system update).
In banking, full randomization is sometimes impractical (you cannot switch analyst assignments every 30 minutes). In such cases, use blocking — group runs by shift or day and randomize within blocks. This controls for the block effect (shift differences) while allowing valid estimation of factor effects.
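For the 2³ AML experiment, one textbook blocking scheme splits the eight runs across two days on the ABC contrast, so the day effect is confounded only with the (usually negligible) three-way interaction and never with a main effect. A minimal sketch (the fixed seed is an illustrative choice for a reproducible schedule):

```python
import random

rng = random.Random(7)  # fixed seed: illustrative, for reproducibility

# Coded (A, B, C) levels for the eight runs of the 2^3 design.
levels = {1: (-1, -1, -1), 2: (+1, -1, -1), 3: (-1, +1, -1), 4: (+1, +1, -1),
          5: (-1, -1, +1), 6: (+1, -1, +1), 7: (-1, +1, +1), 8: (+1, +1, +1)}

# Block on the ABC contrast: the day effect is then confounded only
# with the three-way interaction, never with a main effect.
day1 = [r for r, (a, b, c) in levels.items() if a * b * c == +1]
day2 = [r for r, (a, b, c) in levels.items() if a * b * c == -1]

# Randomize execution order within each day-block.
schedule = {"day 1": rng.sample(day1, k=4), "day 2": rng.sample(day2, k=4)}
```

Each factor still appears at its high and low level twice within every block, so all main effects remain cleanly estimable despite the day-to-day grouping.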
Key Takeaways
- DOE tests multiple factors simultaneously, revealing interaction effects that OFAT experimentation misses
- Full factorial designs test every combination of factor levels — 2^k runs for k factors at 2 levels
- Main effects show the average impact of each factor; interactions show how factors influence each other
- In banking, DOE requires adaptation: ethical constraints, regulatory limits, and the inability to fully control experimental conditions
- Use pilot-based experiments, natural experiments, or A/B testing as DOE strategies in banking environments
- The B × C interaction in the AML example demonstrates why DOE is superior to OFAT — routing strategy matters much more for experienced analysts than for juniors
- Always validate that the optimal configuration meets quality and compliance requirements before implementing
- Replication estimates experimental error; randomization prevents time-based confounding
In the next and final module, we will cover Advanced Control and Project Closure — sustaining improvements using advanced SPC rules, process capability studies, and rigorous project closeout documentation.