playbook/antigravity-awesome-skills/skills/monte-carlo-push-ingestion/references/push-lineage.md

161 lines
5.0 KiB
Markdown

# Pushing Table and Column Lineage
## Overview
Both table-level and column-level lineage use the same endpoint: `POST /ingest/v1/lineage`.
The `event_type` field distinguishes them:
- `LINEAGE` — table-level: source table → destination table
- `COLUMN_LINEAGE` — column-level: source table.column → destination table.column
(also automatically creates the parent table-level edge)
Push lineage is **typically visible in the MC lineage graph within seconds to a few minutes**
via the fast direct path (PushLineageProcessor → S3 CSVs → neo4jLineageLoaderPrivate → Neo4j).
**Expiration**:
- Pushed **table lineage does not expire** (`expire_at = 9999-12-31`).
- Pushed **column lineage expires after 10 days** (same as pulled column lineage).
**Batching**: For large numbers of lineage events, split into batches. The compressed request
body must not exceed **1MB** (Kinesis limit).
## pycarlo models
```python
from pycarlo.features.ingestion import (
IngestionService,
LineageEvent,
LineageAssetRef,
ColumnLineageField,
ColumnLineageSourceField,
)
```
## Table lineage example
```python
event = LineageEvent(
destination=LineageAssetRef(
database="analytics",
schema="public",
table="customer_orders",
),
sources=[
LineageAssetRef(database="analytics", schema="public", table="customers"),
LineageAssetRef(database="analytics", schema="public", table="orders"),
],
)
result = service.send_lineage(
resource_uuid="<your-resource-uuid>",
resource_type="data-lake",
events=[event],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)
```
## Column lineage example
```python
event = LineageEvent(
destination=LineageAssetRef(
database="analytics",
schema="public",
table="customer_orders",
),
sources=[
LineageAssetRef(database="analytics", schema="public", table="customers"),
LineageAssetRef(database="analytics", schema="public", table="orders"),
],
# column mappings: dest_col ← src_table.src_col
fields=[
ColumnLineageField(
destination_field="customer_id",
source_fields=[
ColumnLineageSourceField(
database="analytics", schema="public",
table="customers", field="customer_id",
)
],
),
ColumnLineageField(
destination_field="order_amount",
source_fields=[
ColumnLineageSourceField(
database="analytics", schema="public",
table="orders", field="amount",
)
],
),
],
)
result = service.send_lineage(
resource_uuid=resource_uuid,
resource_type="data-lake",
events=[event],
)
```
Column lineage push automatically creates a table-level edge too, so you don't need to
send separate table and column lineage events for the same relationship.
## Extracting lineage from SQL logs
For warehouses that don't expose a native lineage table, extract lineage by parsing query
history SQL for `CREATE TABLE AS SELECT`, `INSERT INTO ... SELECT`, and `MERGE INTO` patterns.
Simplified example regex:
```python
import re
CTAS_PATTERN = re.compile(
r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)\s+AS\s+SELECT",
re.IGNORECASE,
)
INSERT_PATTERN = re.compile(
r"INSERT\s+(?:OVERWRITE\s+)?(?:INTO\s+)?(\S+).*?FROM\s+(\S+)",
re.IGNORECASE | re.DOTALL,
)
```
For Snowflake, BigQuery, and Redshift the query history tables provide this SQL.
For Databricks, use `system.access.table_lineage` directly (no parsing needed).
For Hive, parse the HiveServer2 log file.
## Output manifest (include invocation_id)
```python
manifest = {
"resource_uuid": resource_uuid,
"invocation_id": service.extract_invocation_id(result), # ← save this
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"edges": [
{
"destination": {"database": e.destination.database, "table": e.destination.table},
"sources": [{"database": s.database, "table": s.table} for s in e.sources],
}
for e in events
],
}
with open("lineage_output.json", "w") as f:
json.dump(manifest, f, indent=2)
```
## How push lineage is distinguished from query-derived lineage
Push-ingested lineage nodes and edges carry `origin = push_ingest` in Neo4j and
`origin_type = DIRECT_LINEAGE` in the normalized lineage model. This prevents the lineage
DAG from overwriting them with query-log-derived edges and gives MC a clear audit trail.
## Neo4j node expiry
Push-ingested **table lineage** nodes and edges are written with `expire_at = 9999-12-31`
(never expire). This is handled internally by PushLineageProcessor — you do not need to set
this manually when using `send_lineage()`.
Push-ingested **column lineage** expires after **10 days**, same as pulled column lineage.
For custom nodes created via GraphQL mutations, you **do** need to set
`expireAt: "9999-12-31"` explicitly — see `references/custom-lineage.md`.