Pushing Table Metadata
Overview
Metadata push sends three types of signals per table:
- Schema — column names and types
- Volume — row count and byte count
- Freshness — last update timestamp
All three travel together in a single RelationalAsset object via POST /ingest/v1/metadata.
Expiration: Pushed table metadata does not expire. Once pushed, it remains in Monte
Carlo until explicitly deleted via deletePushIngestedTables.
Batching: For large numbers of tables, split assets into batches. The compressed request body must not exceed 1MB (Kinesis limit).
pycarlo models
from pycarlo.features.ingestion import (
IngestionService,
RelationalAsset,
AssetMetadata,
AssetField,
AssetVolume,
AssetFreshness,
)
Minimal example
asset = RelationalAsset(
type="TABLE", # ONLY "TABLE" or "VIEW" — normalize warehouse-native values
metadata=AssetMetadata(
name="orders",
database="analytics",
schema="public",
description="Order transactions",
),
fields=[
AssetField(name="order_id", type="INTEGER"),
AssetField(name="amount", type="DECIMAL"),
AssetField(name="created_at", type="TIMESTAMP"),
],
volume=AssetVolume(
row_count=1_500_000,
byte_count=250_000_000,
),
freshness=AssetFreshness(
last_update_time="2024-03-01T12:00:00Z", # ISO 8601 string, NOT a datetime object
),
)
# `service` is an IngestionService instance created earlier (construction not shown here)
result = service.send_metadata(
resource_uuid="<your-resource-uuid>",
resource_type="data-lake", # see note below on resource_type
events=[asset],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id) # save this!
resource_type
The resource_type value must match the type of the MC resource (warehouse connection) you
are pushing to. Use the same string that appears in the MC UI or the connectionType field
from getUser { account { warehouses { connectionType } } }.
Common values:
- "data-lake": Hive, EMR, Glue, generic data lake connections
- "snowflake": Snowflake
- "bigquery": BigQuery
- "databricks": Databricks Unity Catalog
- "redshift": Redshift
Asset type
The type parameter on RelationalAsset must be one of two values (uppercase):
- "TABLE": tables, external tables, dynamic tables, materialized views, etc.
- "VIEW": views, secure views
Important: Warehouse-native type values like "BASE TABLE" (Snowflake), "MANAGED" /
"EXTERNAL" (Databricks), or "MATERIALIZED_VIEW" (BigQuery) are NOT accepted by the
MC API and will cause a 400 error. Always normalize to "TABLE" or "VIEW" before pushing.
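The normalization described above could be sketched as a small lookup. The function name and the exact set of native values are illustrative assumptions based on the examples in this section, not part of pycarlo; extend the sets for your warehouse. Unknown values raise instead of being guessed.

```python
# Hypothetical helper: map warehouse-native object types to "TABLE" or "VIEW".
# The sets below are assumptions drawn from the examples in this section.
_VIEW_TYPES = {"VIEW", "SECURE VIEW"}
_TABLE_TYPES = {
    "TABLE",
    "BASE TABLE",         # Snowflake
    "MANAGED",            # Databricks
    "EXTERNAL",           # Databricks
    "EXTERNAL TABLE",
    "DYNAMIC TABLE",
    "MATERIALIZED VIEW",  # BigQuery reports MATERIALIZED_VIEW
}


def normalize_asset_type(native_type: str) -> str:
    """Return "TABLE" or "VIEW" for a warehouse-native type string."""
    # Treat underscores and spaces the same, ignore case and padding.
    key = native_type.strip().upper().replace("_", " ")
    if key in _VIEW_TYPES:
        return "VIEW"
    if key in _TABLE_TYPES:
        return "TABLE"
    raise ValueError(f"Unrecognized asset type: {native_type!r}")
```

Pass the result as the type argument of RelationalAsset, for example `type=normalize_asset_type(row["table_type"])`, where `row` stands for whatever your collection query returns.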
Field types
Normalize to SQL-standard uppercase strings. Monte Carlo accepts any string, but canonical
values such as INTEGER, BIGINT, VARCHAR, FLOAT, BOOLEAN, TIMESTAMP, DATE,
DECIMAL, ARRAY, and STRUCT work best with downstream features.
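One way to normalize field types is to uppercase the native type, drop any length, precision, or element-type suffix, and map a few common aliases onto the canonical names. The alias table and the decision to drop parameters are assumptions made for illustration; types not listed pass through uppercased.

```python
import re

# Assumed alias table; adjust to match how your warehouse names its types.
_TYPE_ALIASES = {
    "INT": "INTEGER",
    "INT64": "BIGINT",
    "STRING": "VARCHAR",
    "TEXT": "VARCHAR",
    "BOOL": "BOOLEAN",
    "DOUBLE": "FLOAT",
    "FLOAT64": "FLOAT",
    "NUMERIC": "DECIMAL",
    "NUMBER": "DECIMAL",
}


def normalize_field_type(native_type: str) -> str:
    """Uppercase a native column type and map known aliases to canonical names."""
    # Strip parameters such as "(255)" or "<string>" before the lookup.
    base = re.sub(r"[(<].*$", "", native_type.strip()).strip().upper()
    return _TYPE_ALIASES.get(base, base)
```

For example, `AssetField(name="amount", type=normalize_field_type("numeric(10, 2)"))` would send DECIMAL.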
Volume and freshness are optional
If your warehouse doesn't expose row counts or last-modified timestamps, omit volume
and/or freshness — schema-only metadata is valid.
If you send freshness, each push must carry a changed last_update_time to count as
a new data point for the anomaly detector (repeated identical timestamps don't advance the
training clock).
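Since last_update_time is sent as an ISO 8601 string rather than a datetime object, a small conversion helper can be useful. This is a minimal sketch; the function names are hypothetical, and treating naive datetimes as UTC is an assumption you should check against your warehouse.

```python
from datetime import datetime, timezone
from typing import Optional


def to_iso8601_utc(dt: datetime) -> str:
    """Format a datetime as an ISO 8601 UTC string such as 2024-03-01T12:00:00Z."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes from the warehouse are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def is_new_data_point(previous: Optional[str], current: str) -> bool:
    """True when the timestamp differs from the one sent in the previous push."""
    return previous != current
```

Storing the last pushed string per table (for instance in the local manifest) lets a script log when a push repeats an unchanged timestamp.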
Freshness + volume only mode (skip schema)
For periodic pushes (e.g. hourly cron), you often don't need to re-collect the full schema
on every run — field definitions rarely change. Collection scripts can support a
--only-freshness-and-volume flag that skips the COLUMNS / INFORMATION_SCHEMA query
and omits fields from the manifest. This is significantly faster on warehouses with many
tables. Use the full collection (with fields) on the first push and on a daily schedule,
and the freshness+volume only mode for hourly pushes in between. See the BigQuery Iceberg
example for a working implementation of this pattern.
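A collection script might wire up the flag roughly as follows. This is only a sketch: apart from the flag name, everything here (function names, manifest keys) is hypothetical and not taken from the referenced example.

```python
import argparse
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """CLI for a hypothetical collection script."""
    parser = argparse.ArgumentParser(
        description="Collect table metadata for push ingestion"
    )
    parser.add_argument(
        "--only-freshness-and-volume",
        action="store_true",
        help="skip the column query and omit fields from the manifest",
    )
    return parser


def manifest_entry(
    table: str,
    row_count: Optional[int],
    last_update_time: Optional[str],
    fields: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Build one manifest entry; the "fields" key is left out when fields is None."""
    entry: Dict[str, Any] = {
        "table": table,
        "row_count": row_count,
        "last_update_time": last_update_time,
    }
    if fields is not None:
        entry["fields"] = fields
    return entry
```

In the script body, the column query would run only when `args.only_freshness_and_volume` is false, with `fields=None` passed otherwise.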
Batch multiple tables
events accepts a list. Push all tables in a single call or in batches:
result = service.send_metadata(
resource_uuid=resource_uuid,
resource_type="data-lake",
events=[asset1, asset2, asset3, ...],
)
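To stay under the 1MB compressed-body limit mentioned in the overview, batches can be sized by estimating their compressed size before sending. The sketch below works on plain dicts and assumes gzip over compact JSON as a stand-in for whatever serialization the client actually uses, so treat the numbers as an estimate and leave headroom. All names are hypothetical.

```python
import gzip
import json
from typing import Any, Dict, Iterator, List

# Assumed reading of the 1MB limit; lower it to leave headroom.
MAX_COMPRESSED_BYTES = 1_000_000


def compressed_size(payloads: List[Dict[str, Any]]) -> int:
    """Return the gzip-compressed size of the JSON-encoded payload list."""
    raw = json.dumps(payloads, separators=(",", ":")).encode("utf-8")
    return len(gzip.compress(raw))


def batch_by_compressed_size(
    payloads: List[Dict[str, Any]],
    max_bytes: int = MAX_COMPRESSED_BYTES,
) -> Iterator[List[Dict[str, Any]]]:
    """Greedily group payloads so each multi-item batch compresses to max_bytes or less.

    A single payload that exceeds the limit on its own is still yielded alone.
    """
    batch: List[Dict[str, Any]] = []
    for payload in payloads:
        candidate = batch + [payload]
        if batch and compressed_size(candidate) > max_bytes:
            yield batch
            batch = [payload]
        else:
            batch = candidate
    if batch:
        yield batch
```

Each yielded batch would then go into its own send_metadata call. The sketch recompresses the candidate batch on every step, which is simple but quadratic; for very large table counts a fixed batch size with a conservative margin may be cheaper.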
Output manifest (include invocation_id)
Always write a local manifest so you can trace issues later:
import json
from datetime import datetime, timezone
manifest = {
"resource_uuid": resource_uuid,
"invocation_id": service.extract_invocation_id(result), # ← critical for debugging
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"assets": [
{
"database": a.metadata.database,
"schema": a.metadata.schema,
"table": a.metadata.name,
"row_count": a.volume.row_count if a.volume else None,
"fields": [{"name": f.name, "type": f.type} for f in (a.fields or [])],  # fields may be omitted
}
for a in assets
],
}
with open("metadata_output.json", "w") as f:
json.dump(manifest, f, indent=2)
Push frequency for anomaly detection
To keep volume and freshness anomaly detectors active:
- Push at most once per hour (pushing more frequently produces unpredictable behavior)
- Push consistently — gaps longer than a few days will deactivate detectors
- See references/anomaly-detection.md for minimum sample requirements
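A scheduler-side guard for the once-per-hour rule could look like the following. It is a sketch under the assumption that the script records the time of its last successful push somewhere (for example in the local manifest); the names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# At most one push per hour, per the guidance above.
MIN_PUSH_INTERVAL = timedelta(hours=1)


def should_push(
    last_push_at: Optional[datetime],
    now: datetime,
    min_interval: timedelta = MIN_PUSH_INTERVAL,
) -> bool:
    """Return True if no push has happened yet or the minimum interval has elapsed."""
    if last_push_at is None:
        return True
    return now - last_push_at >= min_interval
```

Both datetimes should be timezone-aware (or both naive) so the subtraction is valid.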