# Pushing Table Metadata

## Overview

Metadata push sends three types of signals per table:

- **Schema**: column names and types
- **Volume**: row count and byte count
- **Freshness**: last update timestamp

All three travel together in a single `RelationalAsset` object via `POST /ingest/v1/metadata`.

**Expiration**: Pushed table metadata **does not expire**. Once pushed, it remains in Monte Carlo until explicitly deleted via `deletePushIngestedTables`.

**Batching**: For large numbers of tables, split assets into batches. The compressed request body must not exceed **1MB** (Kinesis limit).

## pycarlo models

```python
from pycarlo.features.ingestion import (
    IngestionService,
    RelationalAsset,
    AssetMetadata,
    AssetField,
    AssetVolume,
    AssetFreshness,
)
```

## Minimal example

```python
asset = RelationalAsset(
    type="TABLE",  # ONLY "TABLE" or "VIEW"; normalize warehouse-native values
    metadata=AssetMetadata(
        name="orders",
        database="analytics",
        schema="public",
        description="Order transactions",
    ),
    fields=[
        AssetField(name="order_id", type="INTEGER"),
        AssetField(name="amount", type="DECIMAL"),
        AssetField(name="created_at", type="TIMESTAMP"),
    ],
    volume=AssetVolume(
        row_count=1_500_000,
        byte_count=250_000_000,
    ),
    freshness=AssetFreshness(
        last_update_time="2024-03-01T12:00:00Z",  # ISO 8601 string, NOT a datetime object
    ),
)

# `service` is assumed to be an already-initialized IngestionService instance.
result = service.send_metadata(
    resource_uuid="",  # set this to the UUID of the target MC resource
    resource_type="data-lake",  # see note below on resource_type
    events=[asset],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)  # save this!
```

## resource_type

The `resource_type` value must match the type of the MC resource (warehouse connection) you are pushing to. Use the same string that appears in the MC UI or the `connectionType` field from `getUser { account { warehouses { connectionType } } }`.

Common values:

- `"data-lake"`: Hive, EMR, Glue, generic data lake connections
- `"snowflake"`: Snowflake
- `"bigquery"`: BigQuery
- `"databricks"`: Databricks Unity Catalog
- `"redshift"`: Redshift

## Asset type

The `type` parameter on `RelationalAsset` must be one of two values (uppercase):

- `"TABLE"`: tables, external tables, dynamic tables, materialized views, etc.
- `"VIEW"`: views, secure views

**Important**: Warehouse-native type values like `"BASE TABLE"` (Snowflake), `"MANAGED"` / `"EXTERNAL"` (Databricks), or `"MATERIALIZED_VIEW"` (BigQuery) are **NOT accepted** by the MC API and will cause a 400 error. Always normalize to `"TABLE"` or `"VIEW"` before pushing.

## Field types

Normalize to SQL-standard uppercase strings. Monte Carlo accepts any string, but canonical values like `INTEGER`, `BIGINT`, `VARCHAR`, `FLOAT`, `BOOLEAN`, `TIMESTAMP`, `DATE`, `DECIMAL`, `ARRAY`, `STRUCT` work best with downstream features.

## Volume and freshness are optional

If your warehouse doesn't expose row counts or last-modified timestamps, omit `volume` and/or `freshness`; schema-only metadata is valid. If you send `freshness`, each push must carry a **changed** `last_update_time` to count as a new data point for the anomaly detector (repeated identical timestamps don't advance the training clock).

## Freshness + volume only mode (skip schema)

For periodic pushes (e.g. hourly cron), you often don't need to re-collect the full schema on every run, since field definitions rarely change. Collection scripts can support a `--only-freshness-and-volume` flag that skips the `COLUMNS` / `INFORMATION_SCHEMA` query and omits `fields` from the manifest. This is significantly faster on warehouses with many tables.

Use the full collection (with fields) on the first push and on a daily schedule, and the freshness+volume only mode for hourly pushes in between.

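As a rough illustration of the flag described above, the sketch below wires `--only-freshness-and-volume` into a collection script using only the standard library. The helper names (`parse_args`, `build_manifest_entry`), the `fetch_fields` callable, and the plain-dict table shape are hypothetical and not part of pycarlo; the point is only that the schema query is never invoked and `fields` is left out of the entry when the flag is set.

```python
import argparse
from typing import Callable, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the collection script's command line (hypothetical script)."""
    parser = argparse.ArgumentParser(description="Collect table metadata")
    parser.add_argument(
        "--only-freshness-and-volume",
        action="store_true",
        help="Skip the schema query and omit fields from the manifest",
    )
    return parser.parse_args(argv)


def build_manifest_entry(
    table: dict,
    only_freshness_and_volume: bool,
    fetch_fields: Callable[[dict], list],
) -> dict:
    """Build one manifest entry; `fetch_fields` stands in for the schema query."""
    entry = {
        "database": table["database"],
        "schema": table["schema"],
        "table": table["name"],
        "row_count": table.get("row_count"),
        "last_update_time": table.get("last_update_time"),
    }
    if not only_freshness_and_volume:
        # Only the full collection pays the cost of the schema query.
        entry["fields"] = fetch_fields(table)
    return entry
```

An hourly cron job would then pass the flag, while the first run and the daily run would omit it so that `fields` is refreshed.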
See the [BigQuery Iceberg example](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables) for a working implementation of this pattern.

## Batch multiple tables

`events` accepts a list. Push all tables in a single call or in batches:

```python
result = service.send_metadata(
    resource_uuid=resource_uuid,
    resource_type="data-lake",
    events=[asset1, asset2, asset3, ...],
)
```

## Output manifest (include invocation_id)

Always write a local manifest so you can trace issues later:

```python
import json
from datetime import datetime, timezone

manifest = {
    "resource_uuid": resource_uuid,
    "invocation_id": service.extract_invocation_id(result),  # ← critical for debugging
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "assets": [
        {
            "database": a.metadata.database,
            "schema": a.metadata.schema,
            "table": a.metadata.name,
            "row_count": a.volume.row_count if a.volume else None,
            # `fields` may be absent in freshness+volume only mode
            "fields": [{"name": f.name, "type": f.type} for f in (a.fields or [])],
        }
        for a in assets
    ],
}

with open("metadata_output.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

## Push frequency for anomaly detection

To keep volume and freshness anomaly detectors active:

- Push **at most once per hour** (pushing more frequently produces unpredictable behavior)
- Push **consistently**: gaps longer than a few days will deactivate detectors
- See `references/anomaly-detection.md` for minimum sample requirements
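The cadence rules above could be checked by a small guard that runs before each push. This is a minimal sketch, not part of pycarlo: `push_decision` and both constants are hypothetical names, and reading "a few days" as three days is an assumption, since the exact threshold is not stated here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_PUSH_INTERVAL = timedelta(hours=1)  # "at most once per hour"
MAX_GAP = timedelta(days=3)  # ASSUMPTION: "a few days" interpreted as 3 days


def push_decision(last_push_at: Optional[datetime], now: datetime) -> str:
    """Return "push", "skip", or "push-after-gap" for the current run."""
    if last_push_at is None:
        return "push"  # first push ever
    elapsed = now - last_push_at
    if elapsed < MIN_PUSH_INTERVAL:
        return "skip"  # too soon since the previous push
    if elapsed > MAX_GAP:
        # Still push, but flag that detectors may have been deactivated.
        return "push-after-gap"
    return "push"
```

A script might persist the timestamp of its last successful push next to the manifest and log a warning whenever the guard returns `"push-after-gap"`.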