Uniqueness Data Quality Rules

Uniqueness rules validate that data values are distinct and identify duplicate entries. These checks ensure data integrity by detecting unwanted duplicates and verifying that unique constraints are maintained.

POST /integration/v1/native_data_quality/monitor
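Every rule in this section is created by sending a JSON payload like the ones below to this endpoint. The following is a minimal sketch of submitting one uniqueness rule with Python's requests library; the host, the TOKEN header, and the "checks" wrapper around the rule object are assumptions made for illustration, not the documented request schema.

import requests

# Hypothetical host and API token -- replace with your deployment's values.
BASE_URL = "https://catalog.example.com"
API_TOKEN = "YOUR_API_TOKEN"

# One uniqueness rule, matching the payload format shown in this section.
rule = {
    "metric_name": "duplicate_count",
    "operator": "=",
    "threshold": "0",
    "category": "uniqueness"
}

# The "checks" list wrapper and the TOKEN header are illustrative
# assumptions -- consult the API schema for the exact request envelope.
response = requests.post(
    f"{BASE_URL}/integration/v1/native_data_quality/monitor",
    headers={"TOKEN": API_TOKEN, "Content-Type": "application/json"},
    json={"checks": [rule]},
    timeout=30,
)
response.raise_for_status()
print(response.json())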

Numeric Column Metrics

  • duplicate_count: Counts the number of rows containing duplicate values. Can check single columns or combinations of multiple columns.

Supported Operators: =, <, >, <=, >=, !=, <>, between, not between

Single Column Check:

{
    "metric_name": "duplicate_count",
    "operator": "=",
    "threshold": "0",
    "category": "uniqueness"
}

Range Check:

{
    "metric_name": "duplicate_count",
    "operator": "<=",
    "threshold": "5",
    "category": "uniqueness"
}

  • duplicate_percent: Calculates the percentage of rows containing duplicate values relative to the total row count.

Supported Operators: =, <, >, <=, >=, !=, <>, between, not between

Basic Usage:

{
    "metric_name": "duplicate_percent",
    "operator": "<=",
    "threshold": "2.0",
    "category": "uniqueness"
}

Strict Uniqueness:

{
    "metric_name": "duplicate_percent",
    "operator": "=",
    "threshold": "0",
    "category": "uniqueness"
}
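To make the two numeric metrics concrete, the sketch below reproduces them locally with pandas. It assumes duplicate_count counts every row whose value appears more than once in the column, and that duplicate_percent is that count divided by the total row count, times 100; the platform's exact counting convention (for example, counting only the extra rows beyond the first occurrence) may differ.

import pandas as pd

# Six rows; 1002 appears twice and 1003 appears three times.
df = pd.DataFrame({"order_id": [1001, 1002, 1002, 1003, 1003, 1003]})

# keep=False flags every member of a duplicated group, not just the extras.
dup_mask = df["order_id"].duplicated(keep=False)

duplicate_count = int(dup_mask.sum())                  # 5
duplicate_percent = 100.0 * duplicate_count / len(df)  # 83.33...

print(duplicate_count, round(duplicate_percent, 2))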

Text Column Metrics

  • duplicate_count: Counts duplicate text values within a column or across multiple columns.

Supported Operators: =, <, >, <=, >=, !=, <>, between, not between

Basic Usage:

{
    "metric_name": "duplicate_count",
    "operator": "=",
    "threshold": "0",
    "category": "uniqueness",
    "check_description": "Email addresses must be unique"
}

Allow Limited Duplicates:

{
    "metric_name": "duplicate_count",
    "operator": "<=",
    "threshold": "3",
    "category": "uniqueness",
    "check_description": "Allow up to 3 duplicate last names"
}

  • duplicate_percent: Calculates the percentage of duplicate text entries.

Supported Operators: =, <, >, <=, >=, !=, <>, between, not between

Basic Usage:

{
    "metric_name": "duplicate_percent",
    "operator": "<=",
    "threshold": "1.0",
    "category": "uniqueness"
}

Time Column Metrics

  • duplicate_count: Counts duplicate timestamp or date values.

Supported Operators: =, <, >, <=, >=, !=, <>, between, not between

{
    "metric_name": "duplicate_count",
    "operator": "<=",
    "threshold": "10",
    "category": "uniqueness",
    "check_description": "Allow some timestamp duplicates for batch processing"
}

  • duplicate_percent: Calculates the percentage of duplicate time entries.

Supported Operators: =, <, >, <=, >=, !=, <>, between, not between

{
    "metric_name": "duplicate_percent",
    "operator": "<=",
    "threshold": "5.0",
    "category": "uniqueness"
}

Custom Metrics (All Column Types)

  • sql (Custom SQL Query): Defines custom uniqueness metrics using SQL queries.

Supported Operators: =, <, >, <=, >=, !=, <>, between, not between

Business Rule Uniqueness:

{
    "metric_name": "sql",
    "operator": "=",
    "threshold": "0",
    "configuration_keys": {
        "custom_metric": "active_email_duplicates",
        "query value": "SELECT COUNT(*) FROM (SELECT email FROM schema.customers WHERE status = 'Active' GROUP BY email HAVING COUNT(*) > 1) dups"
    },
    "category": "uniqueness"
}
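Here the inner query returns each active email address that appears more than once, and the outer SELECT COUNT(*) collapses that into a single number the rule compares against the threshold, so "=" with a threshold of "0" fails as soon as any active email repeats. The sketch below runs the same pattern locally, using SQLite as a stand-in for the warehouse (schema qualifier dropped) with invented sample data.

import sqlite3

# In-memory stand-in for schema.customers with one duplicated active email.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (email TEXT, status TEXT);
    INSERT INTO customers VALUES
        ('a@example.com', 'Active'),
        ('a@example.com', 'Active'),
        ('b@example.com', 'Active'),
        ('c@example.com', 'Inactive'),
        ('c@example.com', 'Inactive');
""")

# Same shape as the custom metric: the inner query finds repeated active
# emails, the outer COUNT(*) yields the value compared with the threshold.
query = """
    SELECT COUNT(*) FROM (
        SELECT email FROM customers
        WHERE status = 'Active'
        GROUP BY email
        HAVING COUNT(*) > 1
    ) dups
"""

(metric_value,) = conn.execute(query).fetchone()
print(metric_value)  # 1 -> an "= 0" uniqueness check would fail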

Multi-Column Duplicate Checks

Note: For multi-column duplicate detection, specify column names in the check description or use custom SQL/CTE metrics.

Column Specification Guidance:
Include one column in the argument to check for duplicate values within that single column. Include more than one column to search for duplicate value combinations across those columns. Separate the column names with commas, with a space after each comma.

Multi-Column Example with Custom SQL:

{
    "metric_name": "sql",
    "operator": "=",
    "threshold": "0",
    "configuration_keys": {
        "custom_metric": "customer_order_uniqueness",
        "query value": "SELECT COUNT(*) FROM (SELECT customer_id, product_id, order_date FROM schema.orders GROUP BY customer_id, product_id, order_date HAVING COUNT(*) > 1) duplicates"
    },
    "category": "uniqueness",
    "check_description": "Each customer should only have one order per product per day"
}
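As a local cross-check of the same rule, the sketch below counts (customer_id, product_id, order_date) combinations that occur more than once using pandas. The column names follow the example payload, and the sample data is invented for illustration; the real check runs the custom SQL against schema.orders.

import pandas as pd

# One duplicated combination: customer 1 ordered product 10 twice on the same day.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "product_id": [10, 10, 20, 30],
    "order_date": ["2024-05-01", "2024-05-01", "2024-05-01", "2024-05-02"],
})

keys = ["customer_id", "product_id", "order_date"]
group_sizes = orders.groupby(keys).size()

# Mirrors the SQL: count groups having more than one row.
repeated_combinations = int((group_sizes > 1).sum())
print(repeated_combinations)  # 1 -> non-zero, so the "= 0" check fails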