yellowsubmarine372

Learning dbt

Overview

dbt is a tool for the T (Transform) part of ELT (Extract, Load, Transform). It’s good for querying data that’s already stored and transforming it, and can be used to centralize all your data transformations.

(1) dbt uses data models → enables code reuse
(2) automated quality tests → good for production data (helps ensure data consistency)
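
For example, a dbt model is just a SELECT statement in a .sql file, and ref() wires models together so logic gets reused instead of copy-pasted. A minimal sketch (model and column names here are made up, not from the project):

```sql
-- models/dim_example.sql (hypothetical model)
-- dbt compiles {{ ref('...') }} into the fully qualified table/view name,
-- so any downstream model can build on this one instead of repeating the SQL.
SELECT
    id,
    LOWER(email) AS email_clean
FROM {{ ref('src_example') }}
```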

background

data warehouse vs data lake vs data lakehouse

(1) data warehouse
aggregates data -> into a single data repository (any structured data: CSV, CRM, …) => ETL
-> load into the data warehouse
-(transform & curate)-> analytics & reporting purposes


(2) data lake
aggregates data => single repository (any structured, unstructured, or semi-structured data)
-> DL -> ELT
-(transform & curate)-> AI/ML use cases


(3) data lakehouse
combines the flexibility and scalability of a data lake + the data management capability of a data warehouse.

(any structured, unstructured, or semi-structured data)
-> DLH -> metadata layer (provides structure and schema)
-> (ETL/ELT) -> any purpose (analytics, ML, etc.)



ETL vs ELT

(1) ETL
extract -> transform -> load
traditional model

ex) source -> treatment facility -> pipe the clean water to the target
means: extract raw data -> transform it (through joining, cleaning) -> load into a clean, structured cloud data warehouse



(2) ELT
Extract -> Load -> and then Transform
flips traditional ETL

ex) you pipe the raw water directly into the reservoir -> treat it inside the reservoir using built-in filtration systems
means: raw data from the source -> load directly into warehouse -> transform using the compute power of that cloud system


=> so it isn’t just about architecture: it’s also about performance, cost, and compliance
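
To make the T step concrete: in ELT, the transform is just SQL run inside the warehouse after loading, using the warehouse’s own compute. A sketch (table and column names are made up):

```sql
-- ELT-style transform: the raw tables are already loaded, so the
-- cleaning/joining happens inside the warehouse itself.
CREATE OR REPLACE TABLE clean_orders AS
SELECT
    o.order_id,
    c.customer_name,
    o.amount::NUMBER(10, 2) AS amount   -- cast string amounts to numeric
FROM raw_orders o
JOIN raw_customers c ON c.customer_id = o.customer_id
WHERE o.amount IS NOT NULL;
```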


💡 modern data stack is just the productionization of the different layers of data integration flow.


My Project (dbtlearn)

What You Built                                                                             
                                                                                          
  This is a learning project using Airbnb data. You've built a complete data pipeline:                                                                                 

  Raw Data (Snowflake)                                                                       
      ↓                                                                                      
  src/ (staging - ephemeral)                                                                 
      ↓                                                                                      
  dim/ (dimensions - tables)                                                                 
      ↓                                                                                      
  fct/ (facts - incremental)                                                                 
      ↓                                                                                      
  mart/ (business layer)                                                                     
                                                                                             
  1. Layer: src/                                                                                
  Models: src_listings, src_hosts, src_reviews
  
  2. Layer: dim/                                                                                
  Models: dim_listings_cleansed, dim_hosts_cleansed, dim_listings_w_hosts
  
  3. Layer: fct/                                                                                
  Models: fct_reviews                                                                        
  
  4. Layer: mart/                                                                               
  Models: mart_fullmoon_reviews                                                              
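
The per-layer materializations above (ephemeral / table / incremental) would be configured roughly like this in dbt_project.yml — a sketch, assuming the project is named dbtlearn and the folders match the layer names:

```yaml
# dbt_project.yml (sketch): one materialization default per layer folder
models:
  dbtlearn:
    src:
      +materialized: ephemeral   # compiled away into inline CTEs
    dim:
      +materialized: table
    fct:
      +materialized: incremental
    mart:
      +materialized: table
```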
  
  
  What You've Implemented                                                                    
                                                                                             
  - Sources with freshness checks (warns after 1 hr, errors after 24 hrs)
  - 18+ tests using the dbt_expectations package (regex, quantiles, row counts)
  - Custom macros: positive_value, no_nulls_in_columns, logging examples
  - Snapshot: scd_raw_listings for tracking historical changes
  - Post-hooks: auto-grants SELECT to the REPORTER role
  - Incremental loading with date range variables
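
The source freshness thresholds above are declared in the sources .yml file. A sketch matching the 1 hr warn / 24 hr error behavior (the source, table, and column names are assumptions about this project):

```yaml
# models/sources.yml (sketch): freshness thresholds on a raw source table
sources:
  - name: airbnb
    schema: raw
    tables:
      - name: listings
        identifier: raw_listings
        loaded_at_field: updated_at       # column dbt checks for staleness
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 24, period: hour}
```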
                                                                                              
  What dbt run Does                                                                          
                                                                                             
  Yes, it makes changes in Snowflake.
                                                                                             
  It:                                                                                        
  1. Compiles your .sql model files into raw SQL                                             
  2. Executes that SQL against Snowflake to create/replace tables and views                  
  3. Creates objects in your AIRBNB.DEV schema (based on your profiles.yml)                  
                                                                                             
  For example:                                                                               
  - dim_listings_cleansed → Creates a TABLE in Snowflake                                     
  - dim_listings_w_hosts → Creates a VIEW in Snowflake                                       
  - fct_reviews → Creates/updates an INCREMENTAL TABLE                                       
  - src_* models → Nothing created (ephemeral = inline CTEs only)                                                   
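
An incremental model like fct_reviews typically guards new rows with is_incremental(), so full runs build everything and later runs only append fresh data. A sketch (the filter column is an assumption):

```sql
-- models/fct/fct_reviews.sql (sketch)
{{ config(materialized='incremental') }}

SELECT *
FROM {{ ref('src_reviews') }}
{% if is_incremental() %}
  -- on incremental runs, only pull rows newer than what's already loaded;
  -- {{ this }} refers to the existing target table in the warehouse
  WHERE review_date > (SELECT MAX(review_date) FROM {{ this }})
{% endif %}
```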



importing Airbnb data into Snowflake

1. snowflake user creation

-- Use an admin role
USE ROLE ACCOUNTADMIN; -- master role

-- Create the `transform` role
CREATE ROLE IF NOT EXISTS TRANSFORM;
GRANT ROLE TRANSFORM TO ROLE ACCOUNTADMIN;

-- Create the default warehouse if necessary
CREATE WAREHOUSE IF NOT EXISTS COMPUTE_WH;
GRANT OPERATE ON WAREHOUSE COMPUTE_WH TO ROLE TRANSFORM;

-- Create the `dbt` user and assign to role
CREATE USER IF NOT EXISTS dbt
  PASSWORD='dbtPassword123'
  LOGIN_NAME='dbt'
  MUST_CHANGE_PASSWORD=FALSE
  DEFAULT_WAREHOUSE='COMPUTE_WH'
  DEFAULT_ROLE=TRANSFORM
  DEFAULT_NAMESPACE='AIRBNB.RAW'
  COMMENT='DBT user used for data transformation';
GRANT ROLE TRANSFORM to USER dbt;

-- Create our database and schemas
CREATE DATABASE IF NOT EXISTS AIRBNB;
CREATE SCHEMA IF NOT EXISTS AIRBNB.RAW;

-- Set up permissions to role `transform`
GRANT ALL ON WAREHOUSE COMPUTE_WH TO ROLE TRANSFORM; 
GRANT ALL ON DATABASE AIRBNB TO ROLE TRANSFORM;
GRANT ALL ON ALL SCHEMAS IN DATABASE AIRBNB TO ROLE TRANSFORM;
GRANT ALL ON FUTURE SCHEMAS IN DATABASE AIRBNB TO ROLE TRANSFORM;
GRANT ALL ON ALL TABLES IN SCHEMA AIRBNB.RAW TO ROLE TRANSFORM;
GRANT ALL ON FUTURE TABLES IN SCHEMA AIRBNB.RAW TO ROLE TRANSFORM;

2. table creation

-- Set up the defaults
USE WAREHOUSE COMPUTE_WH;
USE DATABASE airbnb;
USE SCHEMA RAW;

-- Create our three tables and import the data from S3
CREATE OR REPLACE TABLE raw_listings
                    (id integer,
                     listing_url string,
                     name string,
                     room_type string,
                     minimum_nights integer,
                     host_id integer,
                     price string,
                     created_at datetime,
                     updated_at datetime);

COPY INTO raw_listings (id,
                        listing_url,
                        name,
                        room_type,
                        minimum_nights,
                        host_id,
                        price,
                        created_at,
                        updated_at)
                   FROM 's3://dbtlearn/listings.csv'
                    FILE_FORMAT = (type = 'CSV' skip_header = 1
                    FIELD_OPTIONALLY_ENCLOSED_BY = '"');
                    

CREATE OR REPLACE TABLE raw_reviews
                    (listing_id integer,
                     date datetime,
                     reviewer_name string,
                     comments string,
                     sentiment string);

COPY INTO raw_reviews (listing_id, date, reviewer_name, comments, sentiment)
                   FROM 's3://dbtlearn/reviews.csv'
                    FILE_FORMAT = (type = 'CSV' skip_header = 1
                    FIELD_OPTIONALLY_ENCLOSED_BY = '"');
                    

CREATE OR REPLACE TABLE raw_hosts
                    (id integer,
                     name string,
                     is_superhost string,
                     created_at datetime,
                     updated_at datetime);

COPY INTO raw_hosts (id, name, is_superhost, created_at, updated_at)
                   FROM 's3://dbtlearn/hosts.csv'
                    FILE_FORMAT = (type = 'CSV' skip_header = 1
                    FIELD_OPTIONALLY_ENCLOSED_BY = '"');



data flow

airbnb_data_flow

3 input tables: airbnb hosts, airbnb listings, airbnb reviews
-> make modifications in the first layer = src layer
-> clean and build dim & fact tables
-> + use external tables, build the mart table -(with tests)-> build a dashboard


git clone and running again


How the commands work in real production

  1. local development
    dbt run --select my_model to iterate quickly on SQL logic without affecting prod data
  2. pull request & CI
    dbt build --select state:modified+ — when I open a PR, an automated CI job triggers. It uses dbt build to execute only the models you changed and their downstream dependencies in a temporary “staging” schema.
  3. prod deployment
    dbt build — once code is merged into the main branch, the company’s orchestrator (like Airflow or dbt Cloud) executes dbt build against the prod target