5.4 KiB

Raw Blame History

Pushing Table Metadata

Overview

Metadata push sends three types of signals per table:

Schema — column names and types
Volume — row count and byte count
Freshness — last update timestamp

All three travel together in a single RelationalAsset object via POST /ingest/v1/metadata.

Expiration: Pushed table metadata does not expire. Once pushed, it remains in Monte Carlo until explicitly deleted via deletePushIngestedTables.

Batching: For large numbers of tables, split assets into batches. The compressed request body must not exceed 1MB (Kinesis limit).

pycarlo models

from pycarlo.features.ingestion import (
    IngestionService,
    RelationalAsset,
    AssetMetadata,
    AssetField,
    AssetVolume,
    AssetFreshness,
)

Minimal example

asset = RelationalAsset(
    type="TABLE",  # ONLY "TABLE" or "VIEW" — normalize warehouse-native values
    metadata=AssetMetadata(
        name="orders",
        database="analytics",
        schema="public",
        description="Order transactions",
    ),
    fields=[
        AssetField(name="order_id", type="INTEGER"),
        AssetField(name="amount",   type="DECIMAL"),
        AssetField(name="created_at", type="TIMESTAMP"),
    ],
    volume=AssetVolume(
        row_count=1_500_000,
        byte_count=250_000_000,
    ),
    freshness=AssetFreshness(
        last_update_time="2024-03-01T12:00:00Z",  # ISO 8601 string, NOT a datetime object
    ),
)

result = service.send_metadata(
    resource_uuid="<your-resource-uuid>",
    resource_type="data-lake",   # see note below on resource_type
    events=[asset],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)   # save this!

resource_type

The resource_type value must match the type of the MC resource (warehouse connection) you are pushing to. Use the same string that appears in the MC UI or the connectionType field from getUser { account { warehouses { connectionType } } }.

Common values:

"data-lake" — Hive, EMR, Glue, generic data lake connections
"snowflake" — Snowflake
"bigquery" — BigQuery
"databricks" — Databricks Unity Catalog
"redshift" — Redshift

Asset type

The type parameter on RelationalAsset must be one of two values (uppercase):

"TABLE" — tables, external tables, dynamic tables, materialized views, etc.
"VIEW" — views, secure views

Important: Warehouse-native type values like "BASE TABLE" (Snowflake), "MANAGED" / "EXTERNAL" (Databricks), or "MATERIALIZED_VIEW" (BigQuery) are NOT accepted by the MC API and will cause a 400 error. Always normalize to "TABLE" or "VIEW" before pushing.

Field types

Normalize to SQL-standard uppercase strings. Monte Carlo accepts any string but canonical values like INTEGER, BIGINT, VARCHAR, FLOAT, BOOLEAN, TIMESTAMP, DATE, DECIMAL, ARRAY, STRUCT work best with downstream features.

Volume and freshness are optional

If your warehouse doesn't expose row counts or last-modified timestamps, omit volume and/or freshness — schema-only metadata is valid.

If you send freshness, each push must carry a changed last_update_time to count as a new data point for the anomaly detector (repeated identical timestamps don't advance the training clock).

Freshness + volume only mode (skip schema)

For periodic pushes (e.g. hourly cron), you often don't need to re-collect the full schema on every run — field definitions rarely change. Collection scripts can support a --only-freshness-and-volume flag that skips the COLUMNS / INFORMATION_SCHEMA query and omits fields from the manifest. This is significantly faster on warehouses with many tables. Use the full collection (with fields) on the first push and on a daily schedule, and the freshness+volume only mode for hourly pushes in between. See the BigQuery Iceberg example for a working implementation of this pattern.

Batch multiple tables

events accepts a list. Push all tables in a single call or in batches:

result = service.send_metadata(
    resource_uuid=resource_uuid,
    resource_type="data-lake",
    events=[asset1, asset2, asset3, ...],
)

Output manifest (include invocation_id)

Always write a local manifest so you can trace issues later:

import json
from datetime import datetime, timezone

manifest = {
    "resource_uuid": resource_uuid,
    "invocation_id": service.extract_invocation_id(result),   # ← critical for debugging
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "assets": [
        {
            "database": a.metadata.database,
            "schema": a.metadata.schema,
            "table": a.metadata.name,
            "row_count": a.volume.row_count if a.volume else None,
            "fields": [{"name": f.name, "type": f.type} for f in a.fields],
        }
        for a in assets
    ],
}
with open("metadata_output.json", "w") as f:
    json.dump(manifest, f, indent=2)

Push frequency for anomaly detection

To keep volume and freshness anomaly detectors active:

Push at most once per hour (pushing more frequently produces unpredictable behavior)
Push consistently — gaps longer than a few days will deactivate detectors
See references/anomaly-detection.md for minimum sample requirements

5.4 KiB Raw Blame History