Lineage as a production capability, not a wiki page
Most enterprises have lineage. They have it in PowerPoint slides from a 2019 architecture review, or in a Confluence wiki maintained by an analyst who left two years ago, or in a vendor tool that nobody opens. None of this counts. Lineage that exists as documentation is not lineage in any operationally meaningful sense — it is folklore.
Lineage as a production capability is something different. It is generated automatically by the pipelines themselves. It stays current as pipelines change. It is queryable in real time. It is used during incident response. It is wired into the data quality monitoring stack. And critically, it is granular enough — usually column-level — to answer the questions an AI engineer or a regulator is going to ask.
This module is about how to build that.
Why lineage matters more for AI than for reporting
In a reporting world, lineage is mostly an audit concern. Once a year someone asks "where does this number come from?" and the data team produces a diagram. The report itself is read by humans who can sanity-check obvious problems.
In an AI world, lineage is a load-bearing operational capability. Three reasons:
Debugging. When a model produces a strange output, the first question is "what data did it see?" If you can't answer that question quickly and concretely — which fields, from which sources, with what freshness — you cannot debug. You are reduced to guessing.
Drift detection. Models degrade because their input data changes shape over time. Detecting that requires being able to compare the data the model sees today against the data it was trained on. That comparison is impossible without lineage.
Regulatory defensibility. In financial services, the regulator increasingly expects you to be able to reconstruct any individual automated decision on demand. This means showing exactly which data inputs the model used at the time of the decision, including the lineage of those inputs back to source. PRA SS1/23 and the EU AI Act both push hard in this direction.
A documented lineage diagram cannot do any of these jobs. You need lineage as data — emitted by pipelines, stored in a queryable form, integrated with your other observability.
OpenLineage and the lineage ecosystem
The good news is that lineage tooling has matured rapidly in the last few years. The de facto open standard is OpenLineage, an open specification for emitting lineage events from data pipelines. Most major orchestration tools (Airflow, dbt, Spark, Flink, Dagster) now have OpenLineage integrations that emit lineage automatically when jobs run.
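The event model is simple: each job run emits events (START, COMPLETE, FAIL) naming its input and output datasets. A minimal sketch of such an event, built by hand to show the shape of what the integrations send automatically (the namespace, job, and producer names are illustrative):

```python
import json
import uuid
from datetime import datetime, timezone

def make_openlineage_event(event_type, job_name, inputs, outputs):
    """Build a minimal OpenLineage run event as a plain dict.

    In practice the Airflow/dbt/Spark integrations emit these
    automatically; this just shows the shape of the payload.
    """
    return {
        "eventType": event_type,  # START, COMPLETE, FAIL, ...
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "risk_pipelines", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
        "producer": "https://example.com/our-pipeline-runner",  # illustrative
    }

event = make_openlineage_event(
    "COMPLETE",
    job_name="build_risk_features",
    inputs=["raw.transactions", "raw.kyc_profiles"],
    outputs=["features.risk_inputs"],
)
print(json.dumps(event, indent=2))
# A backend such as Marquez would receive this event over HTTP
# as the job runs, and accumulate the graph run by run.
```

Because every run emits its own events, the lineage graph rebuilds itself continuously; nobody has to remember to update a diagram.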
The standard is consumed by tools like:
- Marquez — open-source reference implementation of an OpenLineage backend
- DataHub — broad metadata platform with strong lineage support
- OpenMetadata — open-source metadata platform with column-level lineage
- Atlan, Collibra, Alation — commercial metadata platforms
The point is not which specific tool you use. The point is that you adopt a model where:
- Pipelines emit lineage automatically as they run
- The lineage goes into a queryable backend
- The backend is wired into your monitoring and incident-response tooling
- People actually use it during incidents
If those four conditions are met, you have lineage as a production capability. If any of them is missing, you have documentation.
Column-level lineage
Most early lineage tools tracked table-level dependencies: "table A is built from tables B and C." That is useful but insufficient. For AI, you usually need column-level lineage: "the risk_score column in table A is derived from the transaction_count_24h column in table B and the kyc_tier column in table C, via the following SQL expression."
Why does this matter? Because:
- Models depend on specific columns, not whole tables
- Drift happens at the column level, not the table level
- Bug fixes need to be scoped to specific columns
- Regulators ask about specific decision inputs, which are columns
Column-level lineage is harder to extract — it requires SQL parsing or instrumentation of the transformation tools — but it is the level that matters operationally. Most modern lineage tools support it.
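Once extracted, column-level lineage is just a graph whose nodes are (table, column) pairs. A sketch of the upstream walk over such a graph, using the risk_score example above (the edge data is hand-written here; in a real stack it would come from your lineage backend):

```python
from collections import deque

# Each (table, column) maps to the upstream columns it is derived from.
# Hand-written for illustration; a lineage backend would serve this.
UPSTREAM = {
    ("table_a", "risk_score"): [
        ("table_b", "transaction_count_24h"),
        ("table_c", "kyc_tier"),
    ],
    ("table_b", "transaction_count_24h"): [
        ("raw.transactions", "event_time"),
    ],
}

def upstream_columns(table, column):
    """Return every (table, column) transitively upstream of the input."""
    seen, queue = set(), deque([(table, column)])
    while queue:
        node = queue.popleft()
        for parent in UPSTREAM.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(upstream_columns("table_a", "risk_score")))
```

This is the query shape behind the regulator's "where did this decision input come from?" question: start at the column, walk the edges, stop at source.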
Observability beyond lineage
Lineage is one part of a broader data observability practice. The full picture covers:
- Freshness — is the data current?
- Volume — are the row counts in expected ranges?
- Schema — has the structure changed unexpectedly?
- Distribution — have the value distributions drifted?
- Lineage — what is upstream and downstream of any given dataset, at the column level?
These are the five pillars of data observability as Monte Carlo, Bigeye, and others have framed them. They map directly to the data SLOs we covered in Module 4: SLOs are the commitments; observability is how you measure them.
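In code, the first three pillars reduce to comparing a dataset's current metadata against its SLO expectations (distribution drift needs historical baselines and is omitted here; lineage is covered above). A minimal sketch, with illustrative thresholds and field names:

```python
from datetime import datetime, timedelta, timezone

def check_pillars(snapshot, expectations):
    """Evaluate freshness, volume, and schema for one dataset snapshot.

    Both arguments are plain dicts; a real stack would pull the
    snapshot from warehouse metadata and the expectations from an
    SLO config store.
    """
    breaches = []
    age = datetime.now(timezone.utc) - snapshot["last_loaded_at"]
    if age > expectations["max_staleness"]:
        breaches.append(f"freshness: data is {age} old")
    lo, hi = expectations["row_count_range"]
    if not lo <= snapshot["row_count"] <= hi:
        breaches.append(f"volume: {snapshot['row_count']} rows out of range")
    if snapshot["columns"] != expectations["columns"]:
        breaches.append("schema: structure changed unexpectedly")
    return breaches

snapshot = {
    "last_loaded_at": datetime.now(timezone.utc) - timedelta(hours=30),
    "row_count": 120,
    "columns": ["txn_id", "amount"],  # a column has been dropped
}
expectations = {
    "max_staleness": timedelta(hours=24),
    "row_count_range": (1_000, 50_000),
    "columns": ["txn_id", "amount", "currency"],
}
print(check_pillars(snapshot, expectations))
```

Commercial observability tools do exactly this at scale, with learned rather than hand-set thresholds, but the underlying checks are this simple.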
A working observability stack lets you:
- Set SLOs per dataset
- Automatically detect breaches
- Page the on-call when breaches happen
- Show the on-call what changed (schema, volume, distribution)
- Walk the lineage to find root cause and downstream impact
- Capture the incident as a learning for future prevention
This is the operational fabric that turns "we have data" into "we have trustworthy data we can reason about under pressure."
Lineage in incident response
The single most useful test of whether your lineage is real: when a data incident happens, do people actually open the lineage tool?
If yes — if the on-call's first move is to walk the lineage from the breached dataset upstream to find what changed, and downstream to find what's affected — then lineage is part of how the team operates. If not — if the incident response runs through tribal knowledge and Slack messages and "ask Sarah, she knows" — then your lineage is documentation, regardless of how nice the tool looks.
The on-call runbook should list lineage walks as standard incident-response steps. New on-calls should be trained to use it. Post-mortems should reference what the lineage showed. The tool should be open during shadow shifts. This is how a capability becomes operational rather than ceremonial.
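The downstream half of the lineage walk is the mirror image of the upstream one: from the breached dataset, find everything that consumes it, so the on-call knows what to quarantine and who to notify. A sketch over a hand-written dependency graph (in an incident, the lineage backend serves this graph):

```python
from collections import deque

# dataset -> datasets and models that read from it (illustrative)
DOWNSTREAM = {
    "raw.transactions": ["features.risk_inputs", "reports.daily_volume"],
    "features.risk_inputs": ["models.credit_risk_v3"],
}

def blast_radius(dataset):
    """Every dataset or model transitively downstream of `dataset`."""
    seen, queue = set(), deque([dataset])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(blast_radius("raw.transactions")))
# The on-call learns in seconds that a bad transaction load
# reaches the credit risk model, not just the daily report.
```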
What regulators want to see
From a regulated-industry perspective, the questions you should be able to answer about any AI-driven decision are:
- What data inputs did the model see at decision time?
- Where did each of those inputs come from?
- What was the freshness of each input at decision time?
- What were the values of the model's monitoring signals (confidence, drift, etc.) at decision time?
- Has any of the upstream data changed since the model was trained, and if so, how?
- Who was accountable for the decision, and what was their override interface?
If your lineage and observability stack can answer all of these for any specific decision, you have built a defensible AI deployment. If it can't, you have governance debt. The work to close that gap is exactly the work this module covers.
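One way to make those questions answerable on demand is to snapshot the evidence at decision time, rather than trying to reconstruct it later. A sketch of a decision record that bundles the inputs, their sources and freshness, and the monitoring signals (all field names are illustrative, and the record would be written to immutable storage):

```python
import json
from datetime import datetime, timezone

def record_decision(decision_id, model_version, inputs, monitoring, owner):
    """Capture, at decision time, what a regulator may later ask about.

    `inputs` maps each feature to its value, source dataset, and load
    time; the lineage of each source back to origin lives in the
    lineage backend, keyed by the source name recorded here.
    """
    return {
        "decision_id": decision_id,
        "decided_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "inputs": inputs,
        "monitoring": monitoring,
        "accountable_owner": owner,
    }

record = record_decision(
    decision_id="dec-20240115-00042",
    model_version="credit_risk_v3.2",
    inputs={
        "transaction_count_24h": {
            "value": 17,
            "source": "features.risk_inputs",
            "loaded_at": "2024-01-15T08:00:00Z",
        },
    },
    monitoring={"confidence": 0.91, "drift_score": 0.04},
    owner="credit-risk-team",
)
print(json.dumps(record, indent=2))
```

The record answers the first four questions directly; the lineage backend answers the fifth via the recorded source names; the owner field answers the sixth.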
What's next
In Module 6 we'll cover the data flywheel — how the right architecture turns operational data into a moat that compounds over time and becomes very hard for competitors to close.