playbook/antigravity-awesome-skills/skills/monte-carlo-push-ingestion/references/push-metadata.md

159 lines
5.4 KiB
Markdown

# Pushing Table Metadata
## Overview
Metadata push sends three types of signals per table:
- **Schema** — column names and types
- **Volume** — row count and byte count
- **Freshness** — last update timestamp
All three travel together in a single `RelationalAsset` object via `POST /ingest/v1/metadata`.
**Expiration**: Pushed table metadata **does not expire**. Once pushed, it remains in Monte
Carlo until explicitly deleted via `deletePushIngestedTables`.
**Batching**: For large numbers of tables, split assets into batches. The compressed request
body must not exceed **1MB** (Kinesis limit).
## pycarlo models
```python
from pycarlo.features.ingestion import (
IngestionService,
RelationalAsset,
AssetMetadata,
AssetField,
AssetVolume,
AssetFreshness,
)
```
## Minimal example
```python
asset = RelationalAsset(
type="TABLE", # ONLY "TABLE" or "VIEW" — normalize warehouse-native values
metadata=AssetMetadata(
name="orders",
database="analytics",
schema="public",
description="Order transactions",
),
fields=[
AssetField(name="order_id", type="INTEGER"),
AssetField(name="amount", type="DECIMAL"),
AssetField(name="created_at", type="TIMESTAMP"),
],
volume=AssetVolume(
row_count=1_500_000,
byte_count=250_000_000,
),
freshness=AssetFreshness(
last_update_time="2024-03-01T12:00:00Z", # ISO 8601 string, NOT a datetime object
),
)
result = service.send_metadata(
resource_uuid="<your-resource-uuid>",
resource_type="data-lake", # see note below on resource_type
events=[asset],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id) # save this!
```
## resource_type
The `resource_type` value must match the type of the MC resource (warehouse connection) you
are pushing to. Use the same string that appears in the MC UI or the `connectionType` field
from `getUser { account { warehouses { connectionType } } }`.
Common values:
- `"data-lake"` — Hive, EMR, Glue, generic data lake connections
- `"snowflake"` — Snowflake
- `"bigquery"` — BigQuery
- `"databricks"` — Databricks Unity Catalog
- `"redshift"` — Redshift
## Asset type
The `type` parameter on `RelationalAsset` must be one of two values (uppercase):
- `"TABLE"` — tables, external tables, dynamic tables, materialized views, etc.
- `"VIEW"` — views, secure views
**Important**: Warehouse-native type values like `"BASE TABLE"` (Snowflake), `"MANAGED"` /
`"EXTERNAL"` (Databricks), or `"MATERIALIZED_VIEW"` (BigQuery) are **NOT accepted** by the
MC API and will cause a 400 error. Always normalize to `"TABLE"` or `"VIEW"` before pushing.
## Field types
Normalize to SQL-standard uppercase strings. Monte Carlo accepts any string but canonical
values like `INTEGER`, `BIGINT`, `VARCHAR`, `FLOAT`, `BOOLEAN`, `TIMESTAMP`, `DATE`,
`DECIMAL`, `ARRAY`, `STRUCT` work best with downstream features.
## Volume and freshness are optional
If your warehouse doesn't expose row counts or last-modified timestamps, omit `volume`
and/or `freshness` — schema-only metadata is valid.
If you send `freshness`, each push must carry a **changed** `last_update_time` to count as
a new data point for the anomaly detector (repeated identical timestamps don't advance the
training clock).
## Freshness + volume only mode (skip schema)
For periodic pushes (e.g. hourly cron), you often don't need to re-collect the full schema
on every run — field definitions rarely change. Collection scripts can support a
`--only-freshness-and-volume` flag that skips the `COLUMNS` / `INFORMATION_SCHEMA` query
and omits `fields` from the manifest. This is significantly faster on warehouses with many
tables. Use the full collection (with fields) on the first push and on a daily schedule,
and the freshness+volume only mode for hourly pushes in between. See the
[BigQuery Iceberg example](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables)
for a working implementation of this pattern.
## Batch multiple tables
`events` accepts a list. Push all tables in a single call or in batches:
```python
result = service.send_metadata(
resource_uuid=resource_uuid,
resource_type="data-lake",
events=[asset1, asset2, asset3, ...],
)
```
## Output manifest (include invocation_id)
Always write a local manifest so you can trace issues later:
```python
import json
from datetime import datetime, timezone
manifest = {
"resource_uuid": resource_uuid,
"invocation_id": service.extract_invocation_id(result), # ← critical for debugging
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"assets": [
{
"database": a.metadata.database,
"schema": a.metadata.schema,
"table": a.metadata.name,
"row_count": a.volume.row_count if a.volume else None,
"fields": [{"name": f.name, "type": f.type} for f in a.fields],
}
for a in assets
],
}
with open("metadata_output.json", "w") as f:
json.dump(manifest, f, indent=2)
```
## Push frequency for anomaly detection
To keep volume and freshness anomaly detectors active:
- Push **at most once per hour** (pushing more frequently produces unpredictable behavior)
- Push **consistently** — gaps longer than a few days will deactivate detectors
- See `references/anomaly-detection.md` for minimum sample requirements