Data pipelines are series of tasks organised into a directed acyclic graph, or “DAG”. Historically, these have been run on open-source workflow orchestration packages like Airflow or Prefect, and require infrastructure managed by data engineers or platform teams. These data pipelines typically run on a schedule, and allow data engineers to update data in locations such as data warehouses or data lakes.
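To make the shape of a DAG concrete, here is a minimal sketch of a scheduled pipeline using Airflow (one of the orchestrators mentioned above). The DAG id, the daily schedule and the extract/load callables are illustrative placeholders rather than a real pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder task logic -- in a real pipeline these would call out to
# source systems and write to a warehouse or lake.
def extract():
    print("pulling new records from a source system")


def load():
    print("writing records to the warehouse")


# A DAG is just the tasks plus the dependencies between them, run on a schedule.
# (Older Airflow 2.x versions call the `schedule` argument `schedule_interval`.)
with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The directed edge: extract must finish before load starts.
    extract_task >> load_task
```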
This is now changing. As the data engineering industry matures, the prevailing mentality is shifting from “move data to serve the business at all costs” to a “reliability and efficiency” mindset borrowed from software engineering.
Continuous Data Integration and Delivery
I’ve written before about how data teams ship data whereas software teams ship code.
This process is called “Continuous Data Integration and Delivery”: the process of reliably and efficiently releasing data into production. There are subtle differences from the definition of “CI/CD” as used in software engineering, illustrated below.
In software engineering, Continuous Delivery is non-trivial because of the importance of having a staging environment that is a near-exact replica of production for code to operate in.
Within data engineering, this is not necessary because the good we ship is data. If there is a table of data, and we know that, as long as a few conditions are satisfied, the data is of sufficient quality to be used, then that is enough for it to be “released” into production, so to speak.
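As a rough illustration, those “few conditions” might be a handful of checks run against the staged table before it is promoted. This is a minimal sketch assuming a pandas DataFrame with made-up column names (order_id, order_date); in practice many teams express the same checks as dbt tests or Great Expectations suites.

```python
import pandas as pd


def passes_quality_checks(df: pd.DataFrame) -> bool:
    """Return True if the staged table meets the agreed release conditions.

    The specific conditions here (non-empty, unique keys, no null order
    dates) are illustrative placeholders for whatever a team agrees on.
    """
    checks = {
        "table is not empty": len(df) > 0,
        "primary key is unique": df["order_id"].is_unique,
        "no null order dates": df["order_date"].notna().all(),
    }
    for name, passed in checks.items():
        print(f"{'PASS' if passed else 'FAIL'}: {name}")
    return all(checks.values())
```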
The process of releasing data into production — the analog of Continuous Delivery — is very simple: it amounts to copying or cloning a dataset.
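Concretely, on a warehouse that supports zero-copy cloning (Snowflake is used here purely as an example), the “release” can be a single statement that clones the validated staging table over the production table. The connection details and table names below are placeholders, not a prescribed setup.

```python
import snowflake.connector

# Connection details are placeholders -- in practice these would come from a
# secrets manager or the orchestrator's connection store.
conn = snowflake.connector.connect(
    account="my_account",
    user="release_bot",
    password="***",
    warehouse="TRANSFORM_WH",
)

# Promote the validated staging table by cloning it over the production
# table. CREATE OR REPLACE ... CLONE is a metadata-only (zero-copy)
# operation in Snowflake, which is what makes the "release" cheap.
promote_sql = """
    CREATE OR REPLACE TABLE analytics.prod.orders
    CLONE analytics.staging.orders;
"""

with conn.cursor() as cur:
    cur.execute(promote_sql)

conn.close()
```

The same pattern applies on other warehouses: build and validate the data in a staging location, then make the promotion a cheap, atomic copy or pointer swap.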
Furthermore, a key pillar of data engineering is reacting to new data as it arrives or checking to see if new data exists. There is no analog for this in software engineering — software applications do not need to…