Alation Data Quality SDK
The Alation Data Quality SDK is a production-ready Python library that enables data engineers to execute Alation-governed data quality checks directly within external data pipelines, orchestration frameworks, and CI/CD workflows.
The workflow bridges the gap between governance and engineering: First, you create an SDK-Enabled Monitor within the Alation Data Quality application to define your checks. Then, you use this SDK to execute those checks programmatically in your external environment.
Key Capabilities
- Pushdown Execution: Queries run directly in your data warehouse (for example, Snowflake, Databricks, BigQuery) using a "pushdown" model. No data is extracted or stored by Alation.
- Centralized Governance: Checks are authored and managed in the Alation UI, but fetched dynamically by the SDK at runtime. This ensures pipeline logic always reflects the latest governance policies without code changes.
- Pipeline-Native: Designed for seamless integration into Airflow DAGs, Glue jobs, and GitHub Actions with deterministic exit codes for gating pipelines.
- Zero-Config Authentication: Automatically handles OAuth token exchange and securely fetches data source credentials from Alation, eliminating the need for hardcoded secrets in your scripts.
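To make the zero-config model concrete, here is a minimal sketch of an invocation. The `DataQualityRunner` class, its `run_checks()` method, and the `exit_code`/`summary` result keys are taken from the Airflow example later on this page; the `StubRunner` class and the specific environment variable values are stand-ins so the sketch runs without a live Alation tenant.

```python
import os

# Hypothetical stand-in for data_quality_sdk.DataQualityRunner, used so this
# sketch runs offline. The real runner reads the same environment variables,
# exchanges OAuth tokens, pushes SQL down to the warehouse, and returns a
# result dict containing 'exit_code' and 'summary'.
class StubRunner:
    REQUIRED = ('ALATION_HOST', 'MONITOR_ID', 'ALATION_CLIENT_ID',
                'ALATION_CLIENT_SECRET', 'TENANT_ID')

    def __init__(self):
        missing = [k for k in self.REQUIRED if k not in os.environ]
        if missing:
            raise EnvironmentError(f"Missing SDK configuration: {missing}")

    def run_checks(self):
        return {'exit_code': 0, 'summary': 'all critical rules passed'}

# Zero-config pattern: monitor identity and credentials come from the
# environment, never from values hardcoded in the script (values illustrative).
os.environ.update({
    'ALATION_HOST': 'https://mycompany.alationcloud.com',
    'MONITOR_ID': '42',
    'ALATION_CLIENT_ID': 'client-id',
    'ALATION_CLIENT_SECRET': 'client-secret',
    'TENANT_ID': 'tenant-id',
})

runner = StubRunner()         # real code: DataQualityRunner()
result = runner.run_checks()  # pushdown: queries execute in the warehouse
print(result['exit_code'], result['summary'])
```

Because the checks themselves are fetched from Alation at runtime, this script never changes when governance teams edit the monitor's rules.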
Common Use Cases
- Pipeline Gating (Airflow/Prefect): Stop a data pipeline immediately if critical quality checks fail, preventing bad data from polluting downstream dashboards.
- CI/CD Quality Gates: Run data quality tests as part of your pull request process to validate data transformations before merging code.
- Post-Load Validation: Automatically verify data freshness and validity immediately after an ETL load completes.
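The gating pattern behind all three use cases is the same: translate the monitor's result into a raised exception or process exit code so the orchestrator (Airflow, Prefect, GitHub Actions) halts on failure. A minimal sketch, assuming the `exit_code`/`summary` result shape used in the Airflow example on this page:

```python
def gate_pipeline(result):
    """Raise on failed critical checks so the calling task or CI job fails."""
    if result['exit_code'] != 0:
        # SystemExit propagates a non-zero status, failing the task/job.
        raise SystemExit(f"DQ gate failed: {result['summary']}")
    print(f"DQ gate passed: {result['summary']}")
    return result

# In a real pipeline `result` comes from DataQualityRunner().run_checks();
# here both outcomes are simulated with hand-written result dicts.
gate_pipeline({'exit_code': 0, 'summary': '6/6 rules passed'})
try:
    gate_pipeline({'exit_code': 1, 'summary': '1 critical rule failed'})
except SystemExit as exc:
    print(exc)
```

In a CI context the same function can be called from a small script whose exit status gates the merge.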
Detailed Use Case: Hard Enforcement in a Data Pipeline
Scenario: A critical Gold layer table, gold.finance_monthly_close, is produced at the end of the Medallion pipeline. Because this table represents certified business metrics, strict contract enforcement is required before publishing to BI tools. This table feeds into critical downstream assets such as executive financial dashboards, board-level reporting, forecasting models, and external regulatory extracts.
Data Quality Monitor Configuration: The monitoring point is the final Gold output table, gold.finance_monthly_close. These rules are stored and owned by Finance Data Governance:
- `close_id` must be `NOT NULL` and `UNIQUE`.
- `total_revenue` must be `>= 0`.
- `net_profit` must be `>= 0`.
- Freshness must be < 24 hours.
- No `NULL` values in `business_unit`.
- Aggregated totals must reconcile with `silver.finance_transactions`.
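These governed rules are authored in the Alation UI, not in code, but it helps to see roughly what the pushdown model executes in the warehouse. The SQL below is illustrative only: the real queries are generated by Alation from the monitor definition, and the `loaded_at` and `amount` column names are assumptions not stated in the scenario.

```python
# Illustrative pushdown SQL per governed rule. Each query is written so that
# a result of 0 in every row means the rule passes.
PUSHDOWN_CHECKS = {
    'close_id_not_null_unique': (
        "SELECT COUNT(*) FROM gold.finance_monthly_close WHERE close_id IS NULL "
        "UNION ALL "
        "SELECT COUNT(*) - COUNT(DISTINCT close_id) FROM gold.finance_monthly_close"
    ),
    'total_revenue_non_negative':
        "SELECT COUNT(*) FROM gold.finance_monthly_close WHERE total_revenue < 0",
    'net_profit_non_negative':
        "SELECT COUNT(*) FROM gold.finance_monthly_close WHERE net_profit < 0",
    'freshness_under_24h': (
        # 'loaded_at' is a hypothetical load-timestamp column
        "SELECT COUNT(*) FROM gold.finance_monthly_close "
        "WHERE loaded_at < DATEADD('hour', -24, CURRENT_TIMESTAMP)"
    ),
    'business_unit_not_null':
        "SELECT COUNT(*) FROM gold.finance_monthly_close WHERE business_unit IS NULL",
    'reconciles_with_silver': (
        # 'amount' is a hypothetical column in the Silver transactions table
        "SELECT ABS((SELECT SUM(total_revenue) FROM gold.finance_monthly_close) - "
        "(SELECT SUM(amount) FROM silver.finance_transactions))"
    ),
}

for name, sql in PUSHDOWN_CHECKS.items():
    print(name, '->', sql[:60], '...')
```

Because execution is pushed down, only these aggregate results, never the underlying rows, leave the warehouse.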
Pipeline Flow: Monitoring executes after the Gold table is built but before promotion or exposure.
- Bronze → Silver transformations complete.
- Silver → Gold transformation builds `gold.finance_monthly_close`.
- Alation Quality SDK executes contract validation on `gold.finance_monthly_close`.
- If validation passes → the table is certified and published.
- If validation fails → publication is blocked.
Execution Flow in Airflow:
The SDK seamlessly integrates into orchestration tools like Airflow to act as a quality gate:
- Airflow DAG runs a transformation task to build `gold.finance_monthly_close`.
- SDK fetches the Monitor definition from Alation for dataset `gold.finance_monthly_close`.
- SDK runs validations directly against the Snowflake table.
- Results computed (pass/fail).
- If any Critical rule fails:
  - SDK publishes results to Alation.
  - The Airflow task raises an exception.
  - DAG fails.
  - Downstream publish tasks do not run.
- If all Critical rules pass:
  - Results upload to Alation.
  - Certification and publish step executes.
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime


def run_gold_dq_checks(sdk_env=None, **context):
    import os

    from data_quality_sdk import DataQualityRunner

    # The SDK is configured through environment variables; export the
    # templated values passed in via op_kwargs before constructing the runner.
    if sdk_env:
        os.environ.update(sdk_env)

    runner = DataQualityRunner()
    result = runner.run_checks()

    # Hard Enforcement: fail the Airflow task if any critical rule fails
    if result['exit_code'] != 0:
        raise Exception(f"Gold table certification failed! Data quality errors: {result['summary']}")

    print("Gold table certified. All critical rules passed.")
    return result


default_args = {
    'owner': 'finance_data_eng',
    'start_date': datetime(2024, 1, 1),
    'retries': 0,  # Do not retry on DQ failure to prevent bad data processing
}

with DAG('finance_monthly_close_pipeline', default_args=default_args, schedule_interval='@monthly') as dag:
    # Transformations
    bronze_to_silver = BashOperator(task_id='bronze_to_silver', bash_command='dbt run --select tag:silver_finance')
    silver_to_gold = BashOperator(task_id='silver_to_gold', bash_command='dbt run --select tag:gold_finance_close')

    # Alation Data Quality Gate. PythonOperator has no env_vars argument, so
    # the SDK configuration is passed through op_kwargs (a templated field)
    # and exported as environment variables inside the callable.
    dq_check_gold = PythonOperator(
        task_id='dq_check_gold_table',
        python_callable=run_gold_dq_checks,
        op_kwargs={
            'sdk_env': {
                'ALATION_HOST': '{{ var.value.ALATION_HOST }}',
                'MONITOR_ID': '{{ var.value.FINANCE_GOLD_MONITOR_ID }}',
                'ALATION_CLIENT_ID': '{{ var.value.ALATION_CLIENT_ID }}',
                'ALATION_CLIENT_SECRET': '{{ var.value.ALATION_CLIENT_SECRET }}',
                'TENANT_ID': '{{ var.value.TENANT_ID }}',
            }
        },
    )

    # Downstream Publishing
    publish_gold = BashOperator(task_id='publish_gold', bash_command='echo "Publishing certified data..."')

    # Dependencies
    bronze_to_silver >> silver_to_gold >> dq_check_gold >> publish_gold
```

Getting Started
The SDK is available as a standard Python package. For complete installation instructions, API reference, and detailed code examples, please visit the official package documentation.
View Alation Data Quality SDK on PyPI.
For architectural details, see the Alation Data Quality Documentation.