First, we need to add two new dbt packages, dbt-expectations and dbt-utils, that will allow us to make assertions on the schema of our sources and the accepted values.
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.8.5
Testing the data sources
Let’s start by defining a contract test for our first source. We pull data from raw_height, a table that contains height information from the users of the gym app.
We agree with our data producers that we will receive the height measurement, the units for the measurements, and the user ID. We agree on the data types and that only ‘cm’ and ‘inches’ are supported as units. With all this, we can define our first contract in the dbt source YAML file.
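A sketch of what that source definition could look like is shown below. The source, table, and column names (`gym_app`, `height`, `units`, `user_id`) and the chosen data types are illustrative assumptions based on the description above:

```yaml
# models/staging/sources.yml -- illustrative sketch; names and types are assumptions
version: 2

sources:
  - name: gym_app
    tables:
      - name: raw_height
        columns:
          - name: user_id
            tests:
              - not_null:
                  config:
                    tags: ['contract-test-source']
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: integer
                  config:
                    tags: ['contract-test-source']
          - name: height
            tests:
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: integer
                  config:
                    tags: ['contract-test-source']
              - dbt_utils.accepted_range:
                  min_value: 0
                  config:
                    tags: ['contract-test-source']
          - name: units
            tests:
              - accepted_values:
                  values: ['cm', 'inches']
                  config:
                    tags: ['contract-test-source']
```

Each test carries the same tag so the whole set can be selected and run together.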
The building blocks
Looking at the previous test, we can see several of the dbt-expectations and dbt-utils macros, alongside dbt's built-in tests, in use:
- dbt_expectations.expect_column_values_to_be_of_type: This assertion allows us to define the expected column data type.
- accepted_values: This assertion allows us to define a list of the accepted values for a specific column.
- dbt_utils.accepted_range: This assertion allows us to define a numerical range for a given column. In the example, we expect the column’s value not to be less than 0.
- not_null: Finally, built-in assertions like not_null allow us to define column constraints.
Using these building blocks, we added several tests to define the contract expectations described above. Notice also how we have tagged the tests as “contract-test-source”. This tag allows us to run all contract tests in isolation, both locally and, as we will see later, in the CI/CD pipeline:
dbt test --select tag:contract-test-source
We have seen how quickly we can create contract tests for the sources of our dbt app, but what about the public interfaces of our data pipeline or data product?
As data producers, we want to make sure we are producing data according to the expectations of our data consumers so we can satisfy the contract we have with them and make our data pipeline or data product trustworthy and reliable.
A simple way to ensure that we are meeting our obligations to our data consumers is to add contract testing for our public interfaces.
dbt recently released a new feature for SQL models, model contracts, which allows us to define the contract for a dbt model. While building your model, dbt verifies that the model’s transformation produces a dataset matching its contract; if it does not, the model fails to build.
Let’s see it in action. Our mart, body_mass_indexes, produces a BMI metric from the weight and height measurement data we get from our sources. The contract with our consumers establishes the following:
- Data types for each column.
- User IDs cannot be null.
- User IDs are always greater than 0.
Let’s define the contract of the body_mass_indexes model using dbt model contracts:
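A sketch of such a model specification is shown below. The exact column names and data types are assumptions inferred from the contract points above and the failing-row output later in this article:

```yaml
# models/marts/body_mass_indexes.yml -- illustrative sketch; column names and types are assumptions
version: 2

models:
  - name: body_mass_indexes
    config:
      contract:
        enforced: true
    columns:
      - name: user_id
        data_type: integer
        constraints:
          - type: not_null
          - type: check
            expression: "user_id > 0"
      - name: measurement_date
        data_type: date
      - name: weight
        data_type: float
      - name: height
        data_type: float
      - name: bmi
        data_type: float
```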
The building blocks
Looking at the previous model specification file, we can see several metadata properties that allow us to define the contract:
- contract.enforced: This configuration tells dbt that we want to enforce the contract every time the model is run.
- data_type: This assertion allows us to define the column type we are expecting to produce once the model runs.
- constraints: Finally, the constraints block gives us the chance to define useful constraints, such as marking a column as not null, setting primary keys, and adding custom expressions. In the example above, we defined a constraint telling dbt that user_id must always be greater than 0. You can see all the available constraints here.
A difference between the contract tests we defined for our sources and the ones defined for our marts or output ports is when the contracts are verified and enforced.
Model contracts are enforced when the model is being generated by dbt run, whereas contracts based on the dbt tests are enforced when the dbt tests run.
If one of the model contracts is not satisfied, executing dbt run will produce an error with specific details on the failure. You can see an example in the following dbt run console output.
1 of 4 START sql table model dbt_testing_example.stg_gym_app__height ........... [RUN]
2 of 4 START sql table model dbt_testing_example.stg_gym_app__weight ........... [RUN]
2 of 4 OK created sql table model dbt_testing_example.stg_gym_app__weight ...... [SELECT 4 in 0.88s]
1 of 4 OK created sql table model dbt_testing_example.stg_gym_app__height ...... [SELECT 4 in 0.92s]
3 of 4 START sql table model dbt_testing_example.int_weight_measurements_with_latest_height [RUN]
3 of 4 OK created sql table model dbt_testing_example.int_weight_measurements_with_latest_height [SELECT 4 in 0.96s]
4 of 4 START sql table model dbt_testing_example.body_mass_indexes ............. [RUN]
4 of 4 ERROR creating sql table model dbt_testing_example.body_mass_indexes .... [ERROR in 0.77s]
Finished running 4 table models in 0 hours 0 minutes and 6.28 seconds (6.28s).
Completed with 1 error and 0 warnings:
Database Error in model body_mass_indexes (models/marts/body_mass_indexes.sql)
new row for relation "body_mass_indexes__dbt_tmp" violates check constraint
"body_mass_indexes__dbt_tmp_user_id_check1"
DETAIL: Failing row contains (1, 2009-07-01, 82.5, null, null).
compiled Code at target/run/dbt_testing_example/models/marts/body_mass_indexes.sql
We now have a suite of powerful contract tests, but how and when do we run them?
We can run contract tests in two types of pipelines.
- CI/CD pipelines
- Data pipelines
For example, you can execute the source contract tests on a schedule in a CI/CD pipeline targeting the data sources available in lower environments like test or staging. You can set the pipeline to fail every time the contract is not met.
These failures provide valuable information about contract-breaking changes introduced by other teams before those changes reach production.
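As an illustration, a scheduled CI job running the source contract tests might look like the following GitHub Actions sketch. The workflow name, schedule, adapter package, and target name are all assumptions; adapt them to your own project setup:

```yaml
# .github/workflows/source-contract-tests.yml -- illustrative sketch
name: source-contract-tests
on:
  schedule:
    - cron: '0 6 * * *'  # assumption: run daily against the staging sources
jobs:
  contract-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      # assumption: a Postgres warehouse; install the adapter for your own
      - run: pip install dbt-postgres
      - run: dbt deps
      # run only the tagged contract tests against the staging target
      - run: dbt test --select tag:contract-test-source --target staging
```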