6.0 KiB
Pushing Query Logs
Overview
Query logs let Monte Carlo build table usage history, populate query lineage, and surface
query-level insights in the catalog. Push them via POST /ingest/v1/querylogs.
Important timing note: MC processes pushed query logs asynchronously. Logs pushed now
may not be visible in getAggregatedQueries for at least 15-20 minutes. This is expected
behavior, not a bug.
Expiration: Pushed query logs expire on the same schedule as pulled query logs.
Batching: For large query log sets, split events into batches. The compressed request body must not exceed 1MB (Kinesis limit). A conservative default is 250 entries per batch.
pycarlo model
from pycarlo.features.ingestion import IngestionService, QueryLogEntry
QueryLogEntry required fields:
start_time(datetime) — when the query startedend_time(datetime) — when the query finished (required, easy to miss)query_text(str) — the SQL statement
Optional fields:
query_id(str) — warehouse-assigned query IDuser(str) — user/email who ran the queryreturned_rows(int) — rows returned to the clientdefault_database(str) — default database context
Basic example
from datetime import datetime, timezone
entries = [
QueryLogEntry(
start_time=datetime(2024, 3, 1, 10, 0, 0, tzinfo=timezone.utc),
end_time=datetime(2024, 3, 1, 10, 0, 5, tzinfo=timezone.utc),
query_text="SELECT * FROM analytics.public.orders WHERE status = 'pending'",
query_id="query-abc-123",
user="analyst@company.com",
returned_rows=847,
),
]
result = service.send_query_logs(
resource_uuid="<your-resource-uuid>",
log_type="snowflake", # ← warehouse-specific! see table below
entries=entries,
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)
log_type per warehouse
Important: the query-log endpoint uses log_type, not resource_type. This is the only
push endpoint where the field name differs from metadata/lineage. The log_type value must
match what the MC normalizer expects for your warehouse. Using the wrong value causes:
ValueError: Unsupported ingest query-log log_type: <value>
| Warehouse | log_type |
|---|---|
| Snowflake | "snowflake" |
| BigQuery | "bigquery" |
| Databricks | "databricks" |
| Redshift | "redshift" |
| Hive (EMR/S3) | "hive-s3" |
| Athena | "athena" |
| Teradata | "teradata" |
| ClickHouse | "clickhouse" |
| Databricks (SQL Warehouse) | "databricks-metastore-sql-warehouse" |
| S3 | "s3" |
| Presto (S3) | "presto-s3" |
Warehouse-specific fields
Some warehouses support extra fields beyond the base QueryLogEntry. Pass them as keyword
arguments — the normalizer knows which fields are valid per warehouse.
Snowflake extras:
QueryLogEntry(
...
bytes_scanned=1024000,
warehouse_name="COMPUTE_WH",
warehouse_size="X-Small",
role_name="ANALYST",
query_tag="reporting",
execution_status="SUCCESS",
)
BigQuery extras:
QueryLogEntry(
...
total_bytes_billed=10485760,
statement_type="SELECT",
job_type="QUERY",
default_dataset="analytics.public",
)
Athena extras:
QueryLogEntry(
...
bytes_scanned=2048000,
catalog="AwsDataCatalog",
database="analytics",
output_location="s3://my-bucket/results/",
state="SUCCEEDED",
)
Collecting query logs per warehouse
Snowflake
SELECT
query_id,
query_text,
start_time,
end_time,
user_name,
database_name,
warehouse_name,
bytes_scanned,
rows_produced AS returned_rows,
execution_status
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
AND execution_status = 'SUCCESS'
ORDER BY start_time
Note: ACCOUNT_USAGE views have up to 45 minutes of latency. Don't collect the last hour.
BigQuery
from google.cloud import bigquery
client = bigquery.Client(project=project_id)
jobs = client.list_jobs(all_users=True, min_creation_time=start_dt, max_creation_time=end_dt)
for job in jobs:
if hasattr(job, 'query') and job.query:
# job.job_id, job.query, job.created, job.ended, job.user_email
Databricks
SELECT
statement_id AS query_id,
statement_text AS query_text,
start_time,
end_time,
executed_by AS user,
produced_rows AS returned_rows
FROM system.query.history
WHERE start_time >= DATEADD(HOUR, -24, NOW())
AND status = 'FINISHED'
Redshift (modern clusters)
SELECT
query_id,
query_text, -- may need text assembly from SYS_QUERYTEXT for long queries
start_time,
end_time,
user_id,
status
FROM sys_query_history
WHERE start_time >= DATEADD(hour, -24, GETDATE())
AND status = 'success'
For long queries (text > 4000 chars), assemble from SYS_QUERYTEXT:
SELECT query_id, LISTAGG(text, '') WITHIN GROUP (ORDER BY sequence) AS full_text
FROM sys_querytext
WHERE query_id = <id>
GROUP BY query_id
Hive
Parse the HiveServer2 log file (default: /tmp/root/hive.log) for lines matching:
(Executing|Starting) command\(queryId=(\S*)\): (?P<command>.*)
Output manifest (include invocation_id)
manifest = {
"resource_uuid": resource_uuid,
"invocation_id": service.extract_invocation_id(result), # ← save this
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"entry_count": len(entries),
"window_start": min(e.start_time for e in entries).isoformat(),
"window_end": max(e.end_time for e in entries).isoformat(),
"queries": [
{
"query_id": e.query_id,
"start_time": e.start_time.isoformat(),
"end_time": e.end_time.isoformat(),
"returned_rows": e.returned_rows,
"query": e.query_text[:200], # truncate for readability
}
for e in entries
],
}
with open("query_logs_output.json", "w") as f:
json.dump(manifest, f, indent=2)