Enterprise Data Platform with Databricks Part 6

Introduction

Welcome back to our blog series on building modern enterprise data platforms!

In addition to meeting technical requirements, a key indicator of a data platform’s success is the confidence that business users have in the metrics it provides. In building our cloud data platform with Databricks and dbt, we rely on two key pillars to earn that confidence: automated tests for data quality (DQ) and seamless end-to-end lineage.

Data Quality (DQ)

Data quality is not an optional, after-the-fact process, but rather an integral component of a data platform that must be deeply embedded in the development process. With dbt, we use a two-step approach to ensure that only valid data reaches the higher layers of the Medallion architecture.

Generic DQ Tests

Generic DQ tests are used to verify the data structure and basic integrity of a dataset. These are purely technical tests that do not check business logic. Their major advantage is that they can be defined declaratively in YAML files; in addition, they are highly reusable and are executed automatically during every build.

The three standard tests that we define for nearly every model in the Bronze and Silver layers are:

  • unique: Ensures that a primary key or a combination of fields does not contain duplicates.
  • not_null: Ensures that critical fields (such as a customer ID or a transaction timestamp) are never left blank.
  • accepted_values: Checks whether a field contains only values from a predefined list of allowed values (e.g., status codes such as "shipped," "completed," "returned").

An example of a YAML definition for such generic tests is shown below:

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
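
Once defined, these tests are executed with dbt test as part of every build. A minimal usage sketch for the stg_orders model from above:

dbt test --select stg_orders

For each generic test, dbt compiles a SQL query that searches for violating rows; the test fails as soon as such rows are found.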

Domain-Specific DQ Tests

Not every error can be detected through the simple structural checks that generic tests provide. To verify that the content of a data record is actually correct, the implemented business logic itself must be checked. These business-specific checks can be performed using custom SQL queries that specifically search for logical inconsistencies.

In dbt, such a business test is considered to have failed if its SQL query returns one or more records; a query that comes back empty means the rule holds. Examples of such business tests might include:

  • Logic check: “Can the delivery date be earlier than the order date?”
  • Total check: “Does the total of the individual line items match the total amount on the invoice?”
  • Consistency check: “Do active contracts always have an associated payment method?”

These customized SQL tests allow us to catch business anomalies before they make their way into the reporting layer, where they could lead to incorrect decisions.
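
As an illustration, the first check could look like the following dbt singular test, a SQL file placed in the tests/ directory. This is a minimal sketch: the column names order_date and delivery_date are assumptions for this example.

-- tests/assert_delivery_not_before_order.sql
-- Returns every order whose delivery date precedes the order date;
-- dbt treats the test as failed if this query returns any rows.
-- Note: order_date and delivery_date are illustrative column names.
SELECT
    order_id,
    order_date,
    delivery_date
FROM {{ ref('stg_orders') }}
WHERE delivery_date < order_date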

End-to-End Lineage

Transparency in data flow is key to error analysis. In traditional SQL environments, people often hard-code table names, such as:

SELECT * FROM bronze_db.api_orders

The problem: if the table name or structure changes, every SQL query that references it must be updated manually. Furthermore, the system does not know the order in which the tables need to be populated, because there is no automated analysis of the dependencies between the individual SQL models.

In dbt, we therefore use the ref() construct: instead of specifying a table by name, we reference the name of the dbt model:

SELECT
    order_id,
    UPPER(status) AS status_code,
    ingested_at
FROM {{ ref('stg_orders') }}
WHERE status IS NOT NULL

In the background, several things happen at once:

  • Creating a dependency graph: Every time the code is recompiled, dbt scans all SQL files for ref() statements. Based on the individual dependencies defined in the models, dbt constructs a Directed Acyclic Graph (DAG). This allows dbt to determine the order in which the SQL statements must be executed.
  • Environment awareness: During compilation, dbt replaces the ref() statement with the actual path to the table, depending on whether you are currently working in the dev, test, or prod environment. This is controlled by the target configuration in dbt’s profile and project files.
  • Visualization: The command dbt docs generate creates an interactive lineage webpage from these relationships. At a glance, you can view the entire lineage of the data platform at any level of detail, as well as information about individual tables.
  • Selective execution: The dbt run command can now be controlled with great precision to execute individual models and their dependencies. Using dbt run --select my_model+ runs the model and all of its downstream successors, while dbt run --select +my_model runs the model and all of its upstream predecessors (see the command sketch after this list).
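
A brief, illustrative sketch of these selectors on the command line (my_model stands in for any model name):

# Run my_model and everything downstream of it
dbt run --select my_model+

# Run my_model and all upstream models it depends on
dbt run --select +my_model

# Generate the interactive lineage site and serve it locally
dbt docs generate
dbt docs serve

Because a reference like {{ ref('stg_orders') }} resolves to the fully qualified table name of the active target environment at compile time, the same selector commands work unchanged in dev, test, and prod.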

Outlook

True trust in a data platform can only be built through radical transparency and verifiable quality. By using dbt in our cloud stack, we elevate testing and lineage from tedious manual tasks to integral, automated components of the pipeline.

Generic and domain-specific DQ tests act as a filter that catches errors before they reach the reporting stage. At the same time, dynamic lineage ensures that dependencies are always up to date, visually traceable, and managed with an awareness of the surrounding context.

Stay tuned!