Designing back from the decision
In Module 2 we covered the five characteristics of an AI-ready data layer. This module is about how to actually design schemas that meet them. The biggest single shift you have to make is in where you start.
In a reporting world, you start from the data you have and ask "what reports can we produce from this?" In an action world, you start from the decision the workflow needs to make and ask "what data shape would make this decision cheap and reliable?" Then you design the schema backwards from there.
This is a different design discipline. It produces different schemas — usually denormalised, often duplicated, structured around the unit of decision rather than the unit of analysis. It is not the discipline most data teams are trained in. This module is about teaching it.
Start with the decision, not the data
Pick the workflow you're going to support. List the decisions it makes — not the steps, not the people involved, the decisions. For each decision, ask:
- What inputs does this decision depend on?
- How fresh do those inputs need to be?
- What identifiers tie them together?
- What confidence signal does the decision produce?
- What feedback does the decision generate that should be captured for future training?
You now have a specification for the data the workflow needs. The schema is whatever shape lets the workflow read that data with one cheap operation per decision.
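One lightweight way to make the specification concrete is to write the answers down as a structured object per decision. A minimal sketch in Python — the class and every field name are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass
class DecisionSpec:
    """Specification for one decision a workflow makes."""
    name: str                     # e.g. "customer_action"
    inputs: list[str]             # the data the decision depends on
    max_staleness_seconds: int    # how fresh those inputs must be
    join_keys: list[str]          # identifiers that tie the inputs together
    confidence_output: str        # where the decision's confidence signal lands
    feedback_captured: list[str]  # signals written back for future training

# Hypothetical spec for a customer-action decision
spec = DecisionSpec(
    name="customer_action",
    inputs=["kyc_tier", "risk_score", "txn_count_24h", "complaint_flag"],
    max_staleness_seconds=60,
    join_keys=["customer_id"],
    confidence_output="decision_confidence",
    feedback_captured=["reviewer_override"],
)
```

Writing the spec down this way makes gaps obvious: a decision with no join key, no confidence output, or no captured feedback is a decision the data layer can't yet support.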
Notice what you didn't do: you didn't ask "what data do we have today?" That comes later. The point of starting from the decision is that it forces you to design for what the workflow actually needs, rather than reverse-engineering a rationalisation around what the warehouse already contains.
The wide-row pattern
The single most useful pattern for action-data schemas is the wide row (sometimes called a denormalised event row or a decision-context row). The idea is that each row contains all the precomputed context a downstream decision will need, so the workflow can read it in one operation, without joins.
A typical reporting schema would have:
- A `customers` table
- A `transactions` table
- A `kyc_status` table
- A `risk_scores` table
- A `complaints` table
…and the AI workflow would have to join all five tables to get the context for a single customer-action decision. Each join introduces latency, fragility, and the risk of stale joins across tables that update at different rates.
A wide-row action schema for the same workflow would have:
- A `customer_decision_context` table where each row is a customer at a point in time, with the relevant fields from all five source systems pre-joined and pre-computed: KYC tier, current risk score, recent transaction count, complaint flag, sanction status, etc.
The workflow reads one row to make its decision. The wide row is updated whenever any of its source values change (via change-data-capture), so it's always current. Storage is cheaper than developer time, and read performance is dramatically better.
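A sketch of what that wide row and its one-read decision might look like. The field names follow the list above; the decision thresholds are invented for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CustomerDecisionContext:
    """One wide row: all pre-joined context for a customer decision."""
    customer_id: str
    as_of: datetime        # point in time this row describes
    kyc_tier: str          # pre-joined from kyc_status
    risk_score: float      # pre-joined from risk_scores
    txn_count_24h: int     # precomputed from transactions
    complaint_flag: bool   # pre-joined from complaints
    sanction_status: str   # pre-joined from screening

def decide(row: CustomerDecisionContext) -> str:
    """The workflow reads one row -- no joins -- and decides."""
    if row.sanction_status != "clear" or row.complaint_flag:
        return "escalate"
    if row.risk_score > 0.8:
        return "request_review"
    return "approve"

row = CustomerDecisionContext(
    customer_id="c-123",
    as_of=datetime.now(timezone.utc),
    kyc_tier="standard",
    risk_score=0.2,
    txn_count_24h=4,
    complaint_flag=False,
    sanction_status="clear",
)
```

The point is the shape, not the logic: every branch of `decide` is answered by fields already on the row, so there is nothing to fetch at decision time.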
Precompute at write time
The companion pattern to the wide row is write-time precomputation: when data lands, compute all the derived values the downstream workflow will need, and store them alongside the raw data. This is the opposite of the reporting pattern, which computes derived values at query time on demand.
Examples:
- When a transaction lands, immediately compute the rolling 24-hour transaction count and 24-hour transaction value for that customer, and write them into the customer's wide row.
- When a KYC review completes, immediately compute the new KYC tier and the updated risk score, and write them into the wide row.
- When a complaint is filed, immediately update the customer's complaint flag and recompute any complaint-driven risk adjustments.
The cost of all this computation lands on writes, not reads. That is the right trade-off when the read pattern is "high frequency, low latency, predictable." Reads happen many times more often than writes, and reads are the ones that block the workflow.
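The first example above — rolling 24-hour aggregates updated when a transaction lands — can be sketched as follows. In-memory dicts stand in for the wide-row table and the CDC pipeline; a real implementation would write to a store:

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)

# Stand-ins for the wide-row table and per-customer transaction buffers.
wide_rows: dict[str, dict] = {}
txn_buffers: dict[str, deque] = {}

def on_transaction(customer_id: str, amount: float, ts: datetime) -> None:
    """Called when a transaction lands: precompute the rolling 24h
    count and value, and write them into the customer's wide row."""
    buf = txn_buffers.setdefault(customer_id, deque())
    buf.append((ts, amount))
    # Evict transactions that have aged out of the 24h window.
    while buf and buf[0][0] < ts - WINDOW:
        buf.popleft()
    row = wide_rows.setdefault(customer_id, {})
    row["txn_count_24h"] = len(buf)
    row["txn_value_24h"] = sum(amount for _, amount in buf)

now = datetime(2024, 1, 2, 12, 0)
on_transaction("c-123", 50.0, now - timedelta(hours=30))  # ages out below
on_transaction("c-123", 20.0, now - timedelta(hours=1))
on_transaction("c-123", 30.0, now)
```

Each write does a little extra work; every subsequent read gets the aggregates for free.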
Identifier discipline
In a reporting world, you can survive identifier inconsistency. If "customer_id" is an integer in one system and a UUID in another, the warehouse team writes a mapping table and life goes on. The dashboards still work.
In an action world, identifier inconsistency kills you. The workflow has to join data across systems in real time, with no human in the loop to sort it out. If two systems disagree on the customer's ID, the workflow either fails or makes the wrong decision.
This means action-data schemas have to take identifiers much more seriously. Three rules:
- Stable. A customer's identifier must not change over time, even when the customer's other attributes change. If it has to change (rare), there must be a documented chain of responsibility for the rename.
- Resolvable. Every system that touches the customer must use the same identifier, or there must be a deterministic resolver that maps between them with no ambiguity.
- Available. The identifier must be present in every record that needs to be joined. No nulls, no defaults, no "we'll fill it in later."
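The "resolvable" and "available" rules can be enforced in code: a deterministic resolver that maps system-local IDs to the canonical one and fails loudly rather than guessing. A sketch — the mapping literal is a stand-in; in practice it would be a governed, versioned table:

```python
# Hypothetical mapping from (system, local_id) to the canonical customer ID.
ID_MAP = {
    ("crm", "42"): "cust-7f3a",
    ("billing", "9b1c"): "cust-7f3a",
}

def resolve(system: str, local_id: str) -> str:
    """Deterministically map a system-local ID to the canonical ID.
    No nulls, no defaults, no silent fallbacks."""
    if not local_id:
        raise ValueError(f"{system}: record is missing its identifier")
    try:
        return ID_MAP[(system, local_id)]
    except KeyError:
        raise LookupError(f"{system}:{local_id} has no canonical mapping")
```

The design choice worth copying is that the resolver has exactly two failure modes and both raise: an unmapped ID surfaces as an incident at write time, not as a wrong decision at read time.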
This is unglamorous work. It is also the difference between an action data layer that compounds and one that produces a steady trickle of data quality incidents.
Worked example: KYC
Imagine you are building a KYC triage system that decides whether to approve a customer instantly, request additional documents, or escalate to a human reviewer.
Decisions the workflow makes:
- Approve, request docs, or escalate
- If escalated, route to which queue
- If approved, what monitoring tier to apply
Data each decision needs:
- Customer identity (verified)
- Document evidence (presence and quality)
- Risk signals (sanctions, PEP, adverse media, geographic risk)
- Internal history (existing relationships, prior KYC outcomes)
- Confidence on each input
Wide-row schema:
A `kyc_decision_context` row per applicant per attempt, containing:
- Stable applicant ID
- Identity verification results (with timestamps and source)
- Document quality scores (precomputed, per document type)
- Sanction/PEP/adverse-media flags (with last-checked timestamps)
- Geographic risk score
- Internal relationship indicator
- Triage decision and confidence
- Escalation reason if applicable
- Reviewer override (populated later, by feedback loop)
The triage workflow reads one row, makes one decision, writes one row. The model has all the context it needs without joins. The reviewer's override (if any) lands in the same row, completing the feedback loop.
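The read-one-row, decide, write-one-row loop — including the reviewer override landing back in the same row — can be sketched like this. Dicts stand in for the row store, and the triage thresholds are invented:

```python
# A dict keyed by (applicant_id, attempt) stands in for kyc_decision_context.
rows: dict[tuple, dict] = {}

def triage(applicant_id: str, attempt: int, context: dict) -> dict:
    """Read one row's worth of context, make one decision, write one row."""
    if context["sanctions_flag"] or context["pep_flag"]:
        decision, confidence, reason = "escalate", 0.95, "screening_hit"
    elif context["doc_quality"] < 0.6:
        decision, confidence, reason = "request_docs", 0.80, None
    else:
        decision, confidence, reason = "approve", 0.90, None
    row = {**context,
           "triage_decision": decision,
           "confidence": confidence,
           "escalation_reason": reason,
           "reviewer_override": None}  # populated later by the feedback loop
    rows[(applicant_id, attempt)] = row
    return row

def record_override(applicant_id: str, attempt: int, override: str) -> None:
    """A reviewer's override lands in the same row, closing the loop."""
    rows[(applicant_id, attempt)]["reviewer_override"] = override

decided = triage("app-1", 1,
                 {"sanctions_flag": False, "pep_flag": False,
                  "doc_quality": 0.9})
record_override("app-1", 1, "request_docs")
```

Because the override is written into the row that carried the decision, the training signal ("model said approve, reviewer said request docs") needs no reconstruction later.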
This is what a workflow-shaped schema looks like. Same data as your reporting world, organised completely differently.
What's next
In Module 4 we'll cover the data quality programme — what it actually takes to keep an action-data layer trustworthy under regulatory scrutiny, including SLOs, monitoring, and incident response.