playbook/antigravity-awesome-skills/skills/monte-carlo-monitor-creation/references/comparison-monitor.md

427 lines
16 KiB
Markdown

# Comparison Monitor Reference
Detailed reference for building `createComparisonMonitorMac` tool calls.
## When to Use
Use a comparison monitor when the user wants to:
- Compare data between two tables (e.g., source vs target, dev vs prod)
- Validate data consistency after migration or replication
- Check row count parity across environments
- Compare field-level metrics between tables (null counts, sums, distributions)
---
## Pre-Step: Verify Both Tables and Fields
Before constructing alert conditions, you MUST verify that both tables exist and that any referenced fields are real columns. This is the most common source of comparison monitor failures.
1. **Resolve both MCONs.** Use `search` to find the source and target tables. If the user provided `database:schema.table` format, search for each to get the MCON.
2. **Get full schemas.** Call `getTable` with `include_fields: true` on BOTH the source table and the target table. You need the column lists from both.
3. **For field-level metrics, verify fields exist on both sides.** Confirm that `sourceField` exists in the source table's column list AND `targetField` exists in the target table's column list. Field names are case-sensitive on most warehouses.
4. **Check field type compatibility.** The metric must be compatible with the column types on both sides. For example, `NUMERIC_MEAN` requires numeric columns in both the source and target tables. If the source column is numeric but the target is a string, the comparison will fail.
5. If any field does not exist or types are incompatible, stop and ask the user to clarify. Do not guess.
---
## Required Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orders_dev_prod_compare`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `source_table` | string | Source table MCON (preferred) or `database:schema.table` format. If not MCON, also pass `source_warehouse`. |
| `target_table` | string | Target table MCON (preferred) or `database:schema.table` format. If not MCON, also pass `target_warehouse`. |
| `alert_conditions` | array | List of comparison conditions (see Alert Conditions below). |
## Optional Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `source_warehouse` | string | Warehouse name or UUID for the source table. Required if `source_table` is not an MCON. |
| `target_warehouse` | string | Warehouse name or UUID for the target table. Required if `target_table` is not an MCON. |
| `segment_fields` | array of string | Fields to segment the comparison by. Must exist in BOTH tables with the same name. |
| `domain_id` | string (uuid) | Domain UUID (use `getDomains` to list). Only one domain can be assigned per monitor. |
---
## Cross-Warehouse Comparisons
When the source and target tables live in different warehouses (e.g., comparing a Snowflake staging table against a BigQuery production table), you MUST provide both `source_warehouse` and `target_warehouse` explicitly. The tool cannot auto-resolve warehouses when tables are in different environments.
Even when both tables are MCONs, if they belong to different warehouses, pass both warehouse parameters to be safe. Omitting them in cross-warehouse scenarios causes silent failures or incorrect results.
Common cross-warehouse patterns:
- **Dev vs prod:** same warehouse type, different databases or schemas
- **Migration validation:** source in old warehouse, target in new warehouse
- **Replication checks:** primary warehouse vs replica or downstream warehouse
---
## Alert Conditions
Each condition compares a metric between the source and target tables.
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `metric` | string | Yes | The metric to compare (see Metrics Reference below). |
| `sourceField` | string | For field-level metrics | Column in the source table. Required for ALL metrics except `ROW_COUNT`. |
| `targetField` | string | For field-level metrics | Column in the target table. Required for ALL metrics except `ROW_COUNT`. |
| `thresholdValue` | number | No | Threshold for acceptable difference between source and target. |
| `isThresholdRelative` | boolean | No | `false` = absolute difference (default), `true` = percentage difference. |
| `customMetric` | object | No | Custom SQL expressions for source and target (see Custom Metrics below). |
---
## ROW_COUNT and Fields: A Critical Rule
> **NEVER pass `sourceField` or `targetField` when using the `ROW_COUNT` metric.**
`ROW_COUNT` is a table-level metric -- it counts all rows in the table, not values in a column. Passing field names with `ROW_COUNT` causes the API call to fail or produce unexpected behavior.
This is the single most common mistake with comparison monitors. Before submitting any alert condition with `ROW_COUNT`, verify that `sourceField` and `targetField` are both absent from the condition object.
| Metric | Fields needed? | What happens if you pass fields? |
|--------|---------------|----------------------------------|
| `ROW_COUNT` | **No -- NEVER pass fields** | API error or undefined behavior |
| All other metrics | **Yes -- always pass both fields** | Required for the comparison to work |
---
## Metrics Reference
### Table-level metric (no fields needed)
| Metric | Description |
|--------|-------------|
| `ROW_COUNT` | Compare total row counts between source and target. |
### Field-level metrics (require `sourceField` and `targetField`)
#### Uniqueness and duplicates
| Metric | Description |
|--------|-------------|
| `UNIQUE_COUNT` | Count of distinct values. |
| `DUPLICATE_COUNT` | Count of duplicate (non-unique) values. |
| `APPROX_DISTINCT_COUNT` | Approximate distinct count (faster on large tables). |
#### Null and empty checks
| Metric | Description |
|--------|-------------|
| `NULL_COUNT` | Count of null values. |
| `NON_NULL_COUNT` | Count of non-null values. |
| `EMPTY_STRING_COUNT` | Count of empty string values. |
| `TEXT_ALL_SPACES_COUNT` | Count of values that are all whitespace. |
| `NAN_COUNT` | Count of NaN values. |
| `TEXT_NULL_KEYWORD_COUNT` | Count of values containing null-like keywords (e.g., "NULL", "None"). |
#### Numeric statistics
| Metric | Description |
|--------|-------------|
| `NUMERIC_MEAN` | Mean of numeric field. |
| `NUMERIC_MEDIAN` | Median of numeric field. |
| `NUMERIC_MIN` | Minimum value. |
| `NUMERIC_MAX` | Maximum value. |
| `NUMERIC_STDDEV` | Standard deviation. |
| `SUM` | Sum of numeric field. |
| `ZERO_COUNT` | Count of zero values. |
| `NEGATIVE_COUNT` | Count of negative values. |
#### Percentiles
| Metric | Description |
|--------|-------------|
| `PERCENTILE_20` | 20th percentile value. |
| `PERCENTILE_40` | 40th percentile value. |
| `PERCENTILE_60` | 60th percentile value. |
| `PERCENTILE_80` | 80th percentile value. |
#### Text statistics
| Metric | Description |
|--------|-------------|
| `TEXT_MAX_LENGTH` | Maximum string length. |
| `TEXT_MIN_LENGTH` | Minimum string length. |
| `TEXT_MEAN_LENGTH` | Mean string length. |
| `TEXT_STD_LENGTH` | Standard deviation of string length. |
#### Text format checks
| Metric | Description |
|--------|-------------|
| `TEXT_NOT_INT_COUNT` | Count of values not parseable as integers. |
| `TEXT_NOT_NUMBER_COUNT` | Count of values not parseable as numbers. |
| `TEXT_NOT_UUID_COUNT` | Count of values not matching UUID format. |
| `TEXT_NOT_SSN_COUNT` | Count of values not matching SSN format. |
| `TEXT_NOT_US_PHONE_COUNT` | Count of values not matching US phone format. |
| `TEXT_NOT_US_STATE_CODE_COUNT` | Count of values not matching US state codes. |
| `TEXT_NOT_US_ZIP_CODE_COUNT` | Count of values not matching US zip codes. |
| `TEXT_NOT_EMAIL_ADDRESS_COUNT` | Count of values not matching email format. |
| `TEXT_NOT_TIMESTAMP_COUNT` | Count of values not parseable as timestamps. |
#### Boolean
| Metric | Description |
|--------|-------------|
| `TRUE_COUNT` | Count of true values. |
| `FALSE_COUNT` | Count of false values. |
#### Timestamp
| Metric | Description |
|--------|-------------|
| `FUTURE_TIMESTAMP_COUNT` | Count of timestamps in the future. |
| `PAST_TIMESTAMP_COUNT` | Count of timestamps unreasonably far in the past. |
| `UNIX_ZERO_COUNT` | Count of timestamps equal to Unix epoch zero (1970-01-01). |
---
## Choosing the Right Metric
| User intent | Correct metric | Fields needed? |
|-------------|---------------|----------------|
| Row count parity | `ROW_COUNT` | **No** -- never pass fields |
| Distinct values in a column | `UNIQUE_COUNT` | Yes |
| Null values in a column | `NULL_COUNT` | Yes |
| Sum, average, min, max | `SUM`, `NUMERIC_MEAN`, `NUMERIC_MIN`, `NUMERIC_MAX` | Yes |
| Data completeness | `NON_NULL_COUNT` | Yes |
| String format validation | `TEXT_NOT_EMAIL_ADDRESS_COUNT`, `TEXT_NOT_UUID_COUNT`, etc. | Yes |
| Custom computed expressions | Use `customMetric` instead of `metric` | No (SQL handles it) |
---
## Custom Metrics
Use custom metrics when:
- **Column names differ** between source and target and you need a computed expression (not just a direct field comparison).
- **You need a derived calculation** like `SUM(quantity * unit_price)` rather than a simple column metric.
- **Standard metrics do not cover the comparison** (e.g., comparing a ratio, a conditional aggregate, or a windowed calculation).
If the columns simply have different names but you want a standard metric (e.g., compare `SUM` of `revenue` in source vs `total_revenue` in target), you do NOT need a custom metric -- just use the standard metric with different `sourceField` and `targetField` values.
Custom metric structure:
```json
{
"customMetric": {
"displayName": "Revenue Sum",
"sourceSqlExpression": "SUM(revenue)",
"targetSqlExpression": "SUM(total_revenue)"
}
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `displayName` | string | Yes | Human-readable name for the metric in alerts and dashboards. |
| `sourceSqlExpression` | string | Yes | SQL expression evaluated against the source table. |
| `targetSqlExpression` | string | Yes | SQL expression evaluated against the target table. |
When using `customMetric`, do NOT also pass `metric`, `sourceField`, or `targetField` in the same alert condition. The custom metric replaces all of those.
---
## Threshold Guidance
### Absolute thresholds (`isThresholdRelative: false` or omitted)
The `thresholdValue` is the maximum acceptable absolute difference between the source and target metric values.
- `thresholdValue: 0` -- source and target must match exactly.
- `thresholdValue: 100` -- up to 100 units of difference is acceptable.
### Relative (percentage) thresholds (`isThresholdRelative: true`)
The `thresholdValue` is the maximum acceptable percentage difference.
- `thresholdValue: 5` -- up to 5% difference is acceptable.
- `thresholdValue: 0.1` -- up to 0.1% difference is acceptable.
### When to use each
| Scenario | Recommended threshold type |
|----------|---------------------------|
| Exact replication (row counts must match) | Absolute, `thresholdValue: 0` |
| Near-real-time sync with small lag | Absolute, small value (e.g., 10-100) |
| Tables at different scales | Relative, percentage-based |
| Aggregated metrics (sums, means) | Relative, to handle floating-point differences |
---
## Examples
### Row count parity with absolute threshold
Compare row counts between dev and prod, alerting if they differ by more than 100 rows.
```json
{
"name": "orders_dev_prod_row_count",
"description": "Verify dev and prod orders tables have similar row counts",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++dev_warehouse:core.orders",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++prod_warehouse:core.orders",
"alert_conditions": [
{
"metric": "ROW_COUNT",
"thresholdValue": 100,
"isThresholdRelative": false
}
]
}
```
Note: no `sourceField` or `targetField` -- `ROW_COUNT` is table-level.
### Row count parity with percentage threshold
Alert if row counts differ by more than 5%.
```json
{
"name": "orders_replication_check",
"description": "Verify replicated orders table is within 5% of source row count",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++primary:sales.orders",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++replica:sales.orders",
"alert_conditions": [
{
"metric": "ROW_COUNT",
"thresholdValue": 5,
"isThresholdRelative": true
}
]
}
```
### Field-level comparison (different column names)
Compare the sum of `revenue` in the source table against `total_revenue` in the target table.
```json
{
"name": "revenue_source_target_sum",
"description": "Verify revenue sums match between staging and production",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++staging:finance.transactions",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++production:finance.transactions",
"alert_conditions": [
{
"metric": "SUM",
"sourceField": "revenue",
"targetField": "total_revenue",
"thresholdValue": 1,
"isThresholdRelative": true
}
]
}
```
### Segmented comparison
Compare null counts on `email` between source and target, segmented by `country`. The `country` field must exist in both tables.
```json
{
"name": "email_nulls_by_country",
"description": "Compare email null counts by country between ETL source and target",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++raw:crm.contacts",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++analytics:crm.contacts",
"segment_fields": ["country"],
"alert_conditions": [
{
"metric": "NULL_COUNT",
"sourceField": "email",
"targetField": "email",
"thresholdValue": 0,
"isThresholdRelative": false
}
]
}
```
### Cross-warehouse comparison with explicit warehouses
When source and target are in different warehouses, both warehouse parameters must be provided.
```json
{
"name": "migration_users_row_count",
"description": "Validate user row counts match after Snowflake to BigQuery migration",
"source_table": "snowflake_db:public.users",
"source_warehouse": "snowflake-prod",
"target_table": "bigquery_project:public.users",
"target_warehouse": "bigquery-prod",
"alert_conditions": [
{
"metric": "ROW_COUNT",
"thresholdValue": 0,
"isThresholdRelative": false
}
]
}
```
### Custom metric comparison
Compare a computed revenue expression when the SQL differs between source and target.
```json
{
"name": "computed_revenue_compare",
"description": "Compare total revenue computation between legacy and new schema",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:legacy.orders",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++warehouse:v2.orders",
"alert_conditions": [
{
"customMetric": {
"displayName": "Total Revenue",
"sourceSqlExpression": "SUM(quantity * unit_price)",
"targetSqlExpression": "SUM(total_amount)"
},
"thresholdValue": 0.01,
"isThresholdRelative": true
}
]
}
```
### Multiple alert conditions
Compare both row counts and field-level metrics in a single monitor.
```json
{
"name": "orders_full_comparison",
"description": "Full comparison of orders between staging and production",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++staging:core.orders",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++production:core.orders",
"domain_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
"alert_conditions": [
{
"metric": "ROW_COUNT",
"thresholdValue": 0,
"isThresholdRelative": false
},
{
"metric": "NULL_COUNT",
"sourceField": "customer_id",
"targetField": "customer_id",
"thresholdValue": 0,
"isThresholdRelative": false
},
{
"metric": "SUM",
"sourceField": "amount",
"targetField": "amount",
"thresholdValue": 0.1,
"isThresholdRelative": true
}
]
}
```
Note: the `ROW_COUNT` condition has no fields, while the field-level conditions each specify both `sourceField` and `targetField`.