First, we need to add two new dbt packages, dbt-expectations and dbt-utils, that will allow us to make assertions on the schema of our sources and the accepted values.
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.8.5
Testing the data sources
Let’s start by defining a contract test for our first source. We pull data from raw_height, a table that contains height information from the users of the gym app.
We agree with our data producers that we will receive the height measurement, the units for the measurements, and the user ID. We agree on the data types and that only ‘cm’ and ‘inches’ are supported as units. With all this, we can define our first contract in the dbt source YAML file.
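A sketch of what that source definition could look like is shown below. The source, table, and column names (`gym_app`, `height`, `units`, `user_id`) and the chosen data types are illustrative assumptions based on the description above:

```yaml
# models/staging/sources.yml -- illustrative sketch; names and types are assumptions
version: 2

sources:
  - name: gym_app
    tables:
      - name: raw_height
        columns:
          - name: user_id
            tests:
              - not_null:
                  config:
                    tags: ['contract-test-source']
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: integer
                  config:
                    tags: ['contract-test-source']
          - name: height
            tests:
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: integer
                  config:
                    tags: ['contract-test-source']
              - dbt_utils.accepted_range:
                  min_value: 0
                  config:
                    tags: ['contract-test-source']
          - name: units
            tests:
              - accepted_values:
                  values: ['cm', 'inches']
                  config:
                    tags: ['contract-test-source']
```

Each test carries the same tag so the whole set can be selected and run together.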
The building blocks
Looking at the previous test, we can see several of the dbt-expectations and dbt-utils macros, alongside dbt's built-in tests, in use:
- dbt_expectations.expect_column_values_to_be_of_type: This assertion allows us to define the expected column data type.
- accepted_values: This assertion allows us to define a list of the accepted values for a specific column.
- dbt_utils.accepted_range: This assertion allows us to define a numerical range for a given column. In the example, we expect the column’s value not to be less than 0.
- not_null: Finally, built-in assertions like not_null allow us to define column constraints.
Using these building blocks, we added several tests to define the contract expectations described above. Notice also how we have tagged the tests as “contract-test-source”. This tag allows us to run all contract tests in isolation, both locally and, as we will see later, in the CI/CD pipeline:
dbt test --select tag:contract-test-source
We have seen how quickly we can create contract tests for the sources of our dbt app, but what about the public interfaces of our data pipeline or data product?
As data producers, we want to make sure we are producing data according to the expectations of our data consumers so we can satisfy the contract we have with them and make our data pipeline or data product trustworthy and reliable.
A simple way to ensure that we are meeting our obligations to our data consumers is to add contract testing for our public interfaces.
dbt recently released a new feature for SQL models, model contracts, which allows us to define the contract for a dbt model. While building your model, dbt verifies that the model’s transformation produces a dataset matching its contract; if it does not, the model fails to build.
Let’s see it in action. Our mart, body_mass_indexes, produces a BMI metric from the weight and height measurement data we get from our sources. The contract with our consumers establishes the following:
- Data types for each column.
- User IDs cannot be null.
- User IDs are always greater than 0.
Let’s define the contract of the body_mass_indexes model using dbt model contracts:
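A sketch of such a model specification is shown below. The exact column names and data types are assumptions inferred from the contract points above and the failing-row output later in this article:

```yaml
# models/marts/body_mass_indexes.yml -- illustrative sketch; column names and types are assumptions
version: 2

models:
  - name: body_mass_indexes
    config:
      contract:
        enforced: true
    columns:
      - name: user_id
        data_type: integer
        constraints:
          - type: not_null
          - type: check
            expression: "user_id > 0"
      - name: measurement_date
        data_type: date
      - name: weight
        data_type: float
      - name: height
        data_type: float
      - name: bmi
        data_type: float
```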
The building blocks
Looking at the previous model specification file, we can see several metadata properties that allow us to define the contract:
- contract.enforced: This configuration tells dbt that we want to enforce the contract every time the model is run.
- data_type: This assertion allows us to define the column type we are expecting to produce once the model runs.
- constraints: Finally, the constraints block gives us the chance to define useful constraints, such as marking a column as not null, setting primary keys, and adding custom expressions. In the example above, we defined a constraint telling dbt that user_id must always be greater than 0. You can see all the available constraints here.
A difference between the contract tests we defined for our sources and the ones defined for our marts or output ports is when the contracts are verified and enforced.
Model contracts are enforced when the model is being generated by dbt run, whereas contracts based on the dbt tests are enforced when the dbt tests run.
If one of the model contracts is not satisfied, executing dbt run will produce an error with specific details on the failure. You can see an example in the following dbt run console output.
1 of 4 START sql table model dbt_testing_example.stg_gym_app__height ........... [RUN]
2 of 4 START sql table model dbt_testing_example.stg_gym_app__weight ........... [RUN]
2 of 4 OK created sql table model dbt_testing_example.stg_gym_app__weight ...... [SELECT 4 in 0.88s]
1 of 4 OK created sql table model dbt_testing_example.stg_gym_app__height ...... [SELECT 4 in 0.92s]
3 of 4 START sql table model dbt_testing_example.int_weight_measurements_with_latest_height [RUN]
3 of 4 OK created sql table model dbt_testing_example.int_weight_measurements_with_latest_height [SELECT 4 in 0.96s]
4 of 4 START sql table model dbt_testing_example.body_mass_indexes ............. [RUN]
4 of 4 ERROR creating sql table model dbt_testing_example.body_mass_indexes .... [ERROR in 0.77s]
Finished running 4 table models in 0 hours 0 minutes and 6.28 seconds (6.28s).
Completed with 1 error and 0 warnings:
Database Error in model body_mass_indexes (models/marts/body_mass_indexes.sql)
new row for relation "body_mass_indexes__dbt_tmp" violates check constraint
"body_mass_indexes__dbt_tmp_user_id_check1"
DETAIL: Failing row contains (1, 2009-07-01, 82.5, null, null).
compiled Code at target/run/dbt_testing_example/models/marts/body_mass_indexes.sql
We now have a suite of powerful contract tests, but how and when do we run them?
We can run contract tests in two types of pipelines.
- CI/CD pipelines
- Data pipelines
For example, you can execute the source contract tests on a schedule in a CI/CD pipeline targeting the data sources available in lower environments like test or staging. You can set the pipeline to fail every time the contract is not met.
These failures provide valuable information about contract-breaking changes introduced by other teams before those changes reach production.
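As an illustration, a scheduled CI job running the source contract tests might look like the following GitHub Actions sketch. The workflow name, schedule, adapter package, and target name are all assumptions; adapt them to your own project setup:

```yaml
# .github/workflows/source-contract-tests.yml -- illustrative sketch
name: source-contract-tests
on:
  schedule:
    - cron: '0 6 * * *'  # assumption: run daily against the staging sources
jobs:
  contract-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      # assumption: a Postgres warehouse; install the adapter for your own
      - run: pip install dbt-postgres
      - run: dbt deps
      # run only the tagged contract tests against the staging target
      - run: dbt test --select tag:contract-test-source --target staging
```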