# Pushing Table Metadata

## Overview

Metadata push sends three types of signals per table:
- **Schema**: column names and types
- **Volume**: row count and byte count
- **Freshness**: last update timestamp

All three travel together in a single `RelationalAsset` object via `POST /ingest/v1/metadata`.

**Expiration**: Pushed table metadata **does not expire**. Once pushed, it remains in Monte
Carlo until explicitly deleted via `deletePushIngestedTables`.

**Batching**: For large numbers of tables, split assets into batches. The compressed request
body must not exceed **1 MB** (Kinesis limit).

## pycarlo models

```python
from pycarlo.features.ingestion import (
    IngestionService,
    RelationalAsset,
    AssetMetadata,
    AssetField,
    AssetVolume,
    AssetFreshness,
)
```

## Minimal example

```python
# `service` is an IngestionService instance (its construction is not shown here).
asset = RelationalAsset(
    type="TABLE",  # ONLY "TABLE" or "VIEW"; normalize warehouse-native values
    metadata=AssetMetadata(
        name="orders",
        database="analytics",
        schema="public",
        description="Order transactions",
    ),
    fields=[
        AssetField(name="order_id", type="INTEGER"),
        AssetField(name="amount", type="DECIMAL"),
        AssetField(name="created_at", type="TIMESTAMP"),
    ],
    volume=AssetVolume(
        row_count=1_500_000,
        byte_count=250_000_000,
    ),
    freshness=AssetFreshness(
        last_update_time="2024-03-01T12:00:00Z",  # ISO 8601 string, NOT a datetime object
    ),
)

result = service.send_metadata(
    resource_uuid="<your-resource-uuid>",
    resource_type="data-lake",  # see note below on resource_type
    events=[asset],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)  # save this!
```

## resource_type

The `resource_type` value must match the type of the MC resource (warehouse connection) you
are pushing to. Use the same string that appears in the MC UI or the `connectionType` field
from `getUser { account { warehouses { connectionType } } }`.

Common values:
- `"data-lake"`: Hive, EMR, Glue, generic data lake connections
- `"snowflake"`: Snowflake
- `"bigquery"`: BigQuery
- `"databricks"`: Databricks Unity Catalog
- `"redshift"`: Redshift

## Asset type

The `type` parameter on `RelationalAsset` must be one of two values (uppercase):
- `"TABLE"`: tables, external tables, dynamic tables, materialized views, etc.
- `"VIEW"`: views, secure views

**Important**: Warehouse-native type values like `"BASE TABLE"` (Snowflake), `"MANAGED"` /
`"EXTERNAL"` (Databricks), or `"MATERIALIZED_VIEW"` (BigQuery) are **NOT accepted** by the
MC API and will cause a 400 error. Always normalize to `"TABLE"` or `"VIEW"` before pushing.
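
The normalization rule above can be sketched as a small helper. This is a minimal illustration, not part of pycarlo: `normalize_asset_type` and `_VIEW_TYPES` are hypothetical names, and the set of view-like values is an assumption based only on the two categories listed above.

```python
# Hypothetical helper; the set of view-like values is illustrative, not an official list.
_VIEW_TYPES = {"VIEW", "SECURE VIEW"}


def normalize_asset_type(native_type: str) -> str:
    """Collapse a warehouse-native object type into "TABLE" or "VIEW"."""
    cleaned = native_type.strip().upper().replace("_", " ")
    # Plain and secure views map to "VIEW"; everything else, including
    # materialized views and external tables, is pushed as "TABLE".
    return "VIEW" if cleaned in _VIEW_TYPES else "TABLE"
```

Defaulting unknown values to `"TABLE"` keeps the push from failing with a 400, but it can hide a view-like type this sketch does not know about, so you may prefer to log unrecognized values.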

## Field types

Normalize to SQL-standard uppercase strings. Monte Carlo accepts any string, but canonical
values like `INTEGER`, `BIGINT`, `VARCHAR`, `FLOAT`, `BOOLEAN`, `TIMESTAMP`, `DATE`,
`DECIMAL`, `ARRAY`, `STRUCT` work best with downstream features.
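
One possible way to normalize column types is sketched below. Everything here is an assumption for illustration: `normalize_field_type` is a hypothetical helper, the alias table is not an official mapping, and dropping type parameters such as `(255)` is a design choice you may not want if precision matters to you.

```python
import re

# Hypothetical alias table; extend it for the warehouse you collect from.
_TYPE_ALIASES = {
    "INT": "INTEGER",
    "INT64": "BIGINT",
    "STRING": "VARCHAR",
    "TEXT": "VARCHAR",
    "BOOL": "BOOLEAN",
    "DOUBLE": "FLOAT",
    "FLOAT64": "FLOAT",
    "NUMERIC": "DECIMAL",
}


def normalize_field_type(native_type: str) -> str:
    """Uppercase a column type, drop any parameters, and apply aliases."""
    # Keep only the base name: "varchar(255)" and "array<string>" lose their suffix.
    base = re.split(r"[(<]", native_type.strip(), maxsplit=1)[0].strip().upper()
    return _TYPE_ALIASES.get(base, base)
```

Types that are not in the alias table pass through uppercased, which matches the statement above that any string is accepted.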

## Volume and freshness are optional

If your warehouse doesn't expose row counts or last-modified timestamps, omit `volume`
and/or `freshness`; schema-only metadata is valid.

If you send `freshness`, each push must carry a **changed** `last_update_time` to count as
a new data point for the anomaly detector (repeated identical timestamps don't advance the
training clock).
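
Because `last_update_time` is an ISO 8601 string and only changed values count, a collector may want two small helpers like the ones sketched here. Both function names are hypothetical, and treating naive datetimes as UTC is an assumption you should adjust to your warehouse.

```python
from datetime import datetime, timezone
from typing import Optional


def to_iso8601_utc(dt: datetime) -> str:
    """Render a datetime as an ISO 8601 UTC string such as 2024-03-01T12:00:00Z."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def is_new_freshness_point(previous: Optional[str], current: str) -> bool:
    """True when `current` differs from the last timestamp that was pushed."""
    return previous is None or previous != current
```

A collector could store the last pushed string per table and leave out `freshness` when `is_new_freshness_point` returns `False`.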

## Freshness + volume only mode (skip schema)

For periodic pushes (e.g. hourly cron), you often don't need to re-collect the full schema
on every run, because field definitions rarely change. Collection scripts can support a
`--only-freshness-and-volume` flag that skips the `COLUMNS` / `INFORMATION_SCHEMA` query
and omits `fields` from the manifest. This is significantly faster on warehouses with many
tables. Use the full collection (with fields) on the first push and on a daily schedule,
and the freshness+volume only mode for hourly pushes in between. See the
[BigQuery Iceberg example](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables)
for a working implementation of this pattern.
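
A collection script could wire up the flag roughly as follows. This is a sketch under stated assumptions, not the linked example's code: `parse_args`, `build_manifest_entry`, and the shape of the `table` dict are all hypothetical.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the collector's command line."""
    parser = argparse.ArgumentParser(description="Collect table metadata to push")
    parser.add_argument(
        "--only-freshness-and-volume",
        action="store_true",
        help="Skip the column query and omit fields from the manifest",
    )
    return parser.parse_args(argv)


def build_manifest_entry(table: dict, only_freshness_and_volume: bool) -> dict:
    """Build one manifest entry; `table` is a hypothetical dict from a collector."""
    entry = {
        "database": table["database"],
        "schema": table["schema"],
        "table": table["name"],
        "row_count": table.get("row_count"),
        "last_update_time": table.get("last_update_time"),
    }
    if not only_freshness_and_volume:
        # Full collection: include the column definitions as well.
        entry["fields"] = table.get("fields", [])
    return entry
```

An hourly cron entry would pass `--only-freshness-and-volume`, while the daily run omits it so that schema changes are still picked up.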

## Batch multiple tables

`events` accepts a list. Push all tables in a single call or in batches:

```python
result = service.send_metadata(
    resource_uuid=resource_uuid,
    resource_type="data-lake",
    events=[asset1, asset2, asset3, ...],
)
```
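
To stay under the compressed-size limit when pushing many tables, one option is to group assets by an estimate of their compressed size. The sketch below works on plain dicts and assumes gzip-compressed JSON as a stand-in for the real request encoding, which may differ, so leave some headroom. `batch_by_compressed_size` and `MAX_COMPRESSED_BYTES` are hypothetical names.

```python
import gzip
import json
from typing import Iterator, List

# Assumption: the 1 MB limit is read conservatively as 10**6 bytes.
MAX_COMPRESSED_BYTES = 1_000_000


def compressed_size(payloads: List[dict]) -> int:
    """Size in bytes of the gzip-compressed JSON encoding of `payloads`."""
    return len(gzip.compress(json.dumps(payloads).encode("utf-8")))


def batch_by_compressed_size(
    payloads: List[dict], limit: int = MAX_COMPRESSED_BYTES
) -> Iterator[List[dict]]:
    """Yield consecutive batches whose estimated compressed size stays within `limit`."""
    batch: List[dict] = []
    for payload in payloads:
        candidate = batch + [payload]
        if batch and compressed_size(candidate) > limit:
            # Adding this payload would exceed the limit: close the current batch.
            yield batch
            batch = [payload]
        else:
            batch = candidate
    if batch:
        yield batch
```

A single payload that exceeds the limit on its own is still yielded as a one-item batch, so callers should handle that case separately. The sketch recompresses the candidate batch for every payload, which is simple but quadratic; a fixed number of tables per batch is a cheaper alternative when assets are similar in size.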

## Output manifest (include invocation_id)

Always write a local manifest so you can trace issues later:

```python
import json
from datetime import datetime, timezone

# `assets` is the list of RelationalAsset objects that was pushed;
# `result` is the return value of service.send_metadata(...).
manifest = {
    "resource_uuid": resource_uuid,
    "invocation_id": service.extract_invocation_id(result),  # ← critical for debugging
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "assets": [
        {
            "database": a.metadata.database,
            "schema": a.metadata.schema,
            "table": a.metadata.name,
            "row_count": a.volume.row_count if a.volume else None,
            # `fields` may be absent in freshness+volume only mode
            "fields": [{"name": f.name, "type": f.type} for f in (a.fields or [])],
        }
        for a in assets
    ],
}
with open("metadata_output.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

## Push frequency for anomaly detection

To keep volume and freshness anomaly detectors active:
- Push **at most once per hour** (pushing more frequently produces unpredictable behavior)
- Push **consistently**: gaps longer than a few days will deactivate detectors
- See `references/anomaly-detection.md` for minimum sample requirements
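
The hourly limit can be enforced with a simple guard in the push script. This is a minimal sketch: `should_push` and `MIN_PUSH_INTERVAL` are hypothetical names, and where the last push time is stored (a file, a database row) is left to the caller.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_PUSH_INTERVAL = timedelta(hours=1)


def should_push(last_push: Optional[datetime], now: Optional[datetime] = None) -> bool:
    """True if no push has happened yet or at least one hour has elapsed."""
    now = now or datetime.now(timezone.utc)
    return last_push is None or now - last_push >= MIN_PUSH_INTERVAL
```

Both datetimes should be timezone-aware; mixing naive and aware values raises a `TypeError` on subtraction.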