Have you ever found yourself writing the same SQL repeatedly, processing all of the historic data on every run, with no solid way of testing the results, until you eventually start losing trust in the numbers? If so, this article might be worth reading for you.
I started working on Proof of Concept (PoC) Data Engineering projects in early 2019. Initially, the typical workflow involved ingesting marketing and user-behavior data into BigQuery, an enterprise data warehouse, modeling it there, and then building reporting on top of it. Below is the very abstract pipeline flow:
It may seem like an ETL data pipeline, but it is not. It highlights an ELT flow (if you want to explore ETL vs. ELT further, this might be a good read for you). In short, ELT aims at extracting the data first, loading it into the warehouse, and then doing transformations in the warehouse rather than on the source end. For this article, I am specifically talking about SQL-centric Data Engineering.
Analytics Teams' Workflow Problems:
Analytics teams have a few workflow problems that often go unnoticed.
- Most of the time, analysts operate in isolation within an organization. Knowledge is not shared, which creates sub-optimal outcomes.
- The change in business logic over time is not documented.
- There are no data quality check-points before the data gets into the dashboard.
- Analysts typically write long, monolithic SQL queries that act like black boxes: not modular at all and extremely hard to debug.
- There’s no mechanism to get alerted about data loss when a data source breaks in the pipeline, until the client sees very unexpected figures and complains to you about it.
- If you are not materializing the results and are instead using custom queries directly in the dashboard, it is going to drive up processing costs.
Is there a solution?
The approach we can take to solve the above-mentioned problems is the typical Software Development Life Cycle (SDLC) approach. Analytics workflows should have the following features to overcome these problems:
- Should be Collaborative & Modular
Analysts should work collaboratively, sharing knowledge with each other. If one analyst has already written a SQL snippet for some use-case, others should be able to re-use it for a similar use-case at any point in time rather than re-writing it from scratch (DRY: Don’t Repeat Yourself).
- Should be Version Controlled
Analytics code is an asset, and every change in the business logic should be fully documented. Typically, we change something in a long black-boxed query, and a few days later we’re clueless about what changed and why.
- Should contain Data Quality check-points for Schema as well as data
Every single SQL snippet should be testable, and different data quality check-points should be created to ensure data accuracy before the data gets into the dashboard.
- Should support multiple environments, e.g. Staging and Production
Analytics require multiple environments. Analysts should have the freedom to experiment on the staging environments without affecting the live dashboard.
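To make the quality check-point idea concrete, tools in this space (including the one introduced below) let you declare tests alongside the model in a YAML file. Here is a sketch; the model and column names are hypothetical, not from a real project:

```yaml
# schema.yml - declarative data quality checks that run
# before the data reaches the dashboard
# (model and column names are hypothetical)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique      # no duplicate orders
          - not_null    # no missing keys
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Each declared test compiles to a SQL query that returns failing rows, so a broken assumption surfaces as a failed test instead of a wrong number on a dashboard.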
Data Build Tool aka DBT:
Data Build Tool (DBT) is an open-source command-line tool, developed by Fishtown Analytics, that speaks the preferred language of data analysts everywhere: SQL. It handles only the “T” in the ELT approach to building data pipelines, providing an abstraction layer on top of your warehouse for transformations.
Essentially, we build SQL models (each corresponding to one .sql file containing exactly one SELECT statement) on top of source tables coming from production databases. These models are then referenced by further models, acting as building blocks toward a final model that is materialized and used as a source in the BI tool. You can think of it as SQL on steroids.
Here is a cool lineage graph generated for one of our clients.
Clear value can be seen in the above diagram where each SQL model is easily testable as it moves along building a final model.
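For illustration, a minimal sketch of such a model chain (table and column names here are hypothetical, not from the client project above):

```sql
-- models/stg_orders.sql
-- Staging model: one SELECT that cleans the raw source table.
select
    order_id,
    user_id,
    cast(created_at as timestamp) as created_at,
    amount
from raw.orders
where order_id is not null
```

```sql
-- models/fct_daily_revenue.sql
-- Downstream model: refers to the staging model via ref(),
-- which is how DBT builds the dependency (lineage) graph.
select
    date(created_at) as order_date,
    sum(amount) as daily_revenue
from {{ ref('stg_orders') }}
group by 1
```

Because each model is a single SELECT, each node in the graph can be queried and tested on its own before the final model is built.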
How Does DBT Fit in the Current Analytics Workflow?
DBT integrates with any of the warehouses it currently supports. It is a compiler and a runner: users write DBT code (SQL) and invoke DBT from the command line. DBT adds some boilerplate code (DDL), compiles it, and runs that code against your configured warehouse.
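As a sketch of that compilation step: a model file containing only a SELECT gets wrapped in DDL before execution. The exact statement depends on the warehouse and the chosen materialization; the names below are hypothetical:

```sql
-- What you write (models/stg_orders.sql):
select order_id, user_id, amount
from raw.orders;

-- Roughly what DBT runs against the warehouse
-- when the model is materialized as a view:
create or replace view analytics.stg_orders as
select order_id, user_id, amount
from raw.orders;
```

This is why analysts only ever write SELECT statements: the CREATE/REPLACE boilerplate, schema naming, and dependency ordering are handled by the tool.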
After the extraction and loading of data into the warehouse, DBT handles the transformation part of the pipeline in a well-managed way until the data gets into the dashboard.
I am of the view that everything should have defined boundaries. DBT is a good tool for you if:
- Your analytics warehouse runs on separate hardware from the database used by your product.
- A copy of your source data is already loaded into your analytics warehouse.
- Your data transformation logic can be expressed in SQL.
P.S.: The power of SQL should not be underestimated. Blog Post
- Your data can be transformed in batches.
The only prerequisites I can think of before you use DBT are:
- Familiarity with writing SQL.
- Understanding how version control works with a few basic commands.
- Python 3+ and pip installed.
- A warehouse development toolkit installed locally.
What is it going to look like?
- Install DBT locally (assuming you have the prerequisites fulfilled).
pip install dbt
- Break down your transformations into modular components.
- Test & run them locally.
- Push the analytics code to your configured repository.
- Schedule the pipeline on DBT Cloud.
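Put together, the local loop might look like the following. This is a sketch assuming a fresh project; the project name is hypothetical, and the scheduling step itself happens in DBT Cloud rather than on the command line:

```shell
# Install DBT locally
pip install dbt

# Scaffold a new project (creates dbt_project.yml, models/, etc.)
dbt init my_analytics_project

# Build the models against your configured warehouse
dbt run

# Run the declared data quality tests
dbt test

# Push the analytics code to your configured repository
git add .
git commit -m "Add first chain of SQL models"
git push
```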
DBT offers a lot, and it is hard to cover everything here, so let’s break this down into different parts:
- This Part I discussed the current problems in our analytics workflow and how DBT caters to all of them, bringing a change in mindset.
- Part II will cover setting up DBT, its required local configuration, and building our first chain of SQL models.
- Part III will cover setting up data quality check-points and some other advanced features of DBT.
Stay tuned for Part II of this blog series if you want to experiment with this change in mindset.