Tame Database Complexity With SqlDBM and dbt

Serge Gershkovich
Published in SqlDBM · 13 min read · Jan 5, 2021


From complexity to control through CI/CD, automation, and best-practices.

Image credit: James Lee / FormerIP on Flickr, via Wikipedia

The tale you’ve been told about the boiling frog is a lie.

For an ectotherm, location change as a response to temperature variation is the primary means of thermoregulation. A frog would be long gone before the water even became uncomfortable.

However, another organism has evolved to be particularly vulnerable to the boiling pot scenario: the database developer.

Under the steadily rising heat of an enterprise system, a developer will begin to perspire, complain, and, on occasion, even resort to foul language. All the while, he stays complacent, continuing to monitor his environment.

Processes begin to run longer. Failures become more frequent. Documentation becomes outdated. Ad-hoc patches spring up in place of enterprise solutions.

Gradually, patches become crutches, crutches become handicaps, and the temperature keeps rising.

Get to Know CI/CD

Continuous integration and continuous deployment/delivery (CI/CD) is the coupling of shared (code) repositories and automated deployment tools.

The key idea of CI/CD is this: having code in one centralized and accessible repository and deploying it through automated means helps rein in the complexity of enterprise systems.

Centralized code and automated DB Ops provide standardization, stability, and visibility on how data is loaded and transformed by the system. CI/CD is not adding new functionality. It’s scaling existing functionality through automation. It’s keeping the heat of complexity under control.

In this article, we are going to explore two tools that, in conjunction, can help keep a system humming along at a comfortable temperature, proverbially speaking.

Get to Know the Tools

We can look at countless tools in this space, each with its own strengths and capabilities. For this article, I have chosen two that offer the greatest ROI and complement each other without much overlap.

SqlDBM

SqlDBM is a modeling tool that can diagram and maintain your entire database landscape without writing a single line of code. It can auto-generate a complete ER diagram by referencing your existing DDL, and it can write back changes made through its graphical interface as ALTER statements.

With built-in documentation and collaboration features, SqlDBM is more of an enterprise development tool masquerading as a diagramming tool.

dbt

To attempt to summarize dbt would be to misrepresent it. Its developers, Fishtown Analytics, describe it as the “T” in “ELT,” but that doesn’t even scratch the surface.

Let’s try a different approach: think of every hack, patch, best-practice, framework, and ad-hoc solution you’ve ever applied in your data warehouse. Now imagine that wrapped up in a single, elegant application.

From Complexity to Control

Perhaps the best way to explain these category-defying, feature-rich tools is to walk through the pain points of an evolving enterprise system and see how they can help restore order. Many of these may not seem obvious on day one, but if you ever want to reach day one hundred, you better plan accordingly.

Too busy to adopt best practices. Courtesy of karllhughes.com

The Map is the Territory

Before we even get to the “T” in “ELT,” we need to take stock of the vast number of tables and views in our existing database while accounting for future growth and new integrations. Without the right tools, keeping track of every entity becomes impossible, and future development incurs an unnecessary debt of analysis.

A modeling tool like SqlDBM can organize and search through your entire data landscape. Moreover, it will automatically identify related tables through primary and foreign key relationships, enabling you to visualize the layout instantly.

In the example below, we locate the employee dimension and automatically detect related master data and fact tables. What a productive fifteen seconds!

show related tables in SqlDBM

CI Before CD

Before exploring the neat, time-saving features of dbt, we need to highlight the fundamentals. For accessibility, version-control, and collaboration, code should be stored in a repository. The use of Git repositories for development teams has become a near-universal practice by this point, and dbt takes full advantage of it.

Every dbt project requires a repository connection that will store and version all of its artifacts. Great software doesn’t necessarily invent the best-practices, but it does nudge its users towards them.

Modularize Your Models

Now that we can see the location and relationships of our data, we can begin our data transformation. Actually, no, we can’t. Rushing headlong into development is exactly what creates (or has already created) the maintenance nightmare we wish to avoid.

Here is where dbt’s focus on methodology and best-practices can help steer us in the right direction.

If you set out to build a car, you don’t dump every screw and component on the factory floor and begin assembling, LEGO-style. You break the assembly process into standalone components for ease of maintenance and reusability.

Every dbt project comes with a “models” folder by default. A model can be anything expressed through a SELECT statement. However, the beauty of models comes in their reusability. Once defined, dbt can instantiate any model by reference instead of by name.

{{ ref('stg_customer') }}

instead of

"MYDB"."MY_SCHEMA"."MY_DIM_CUST"

With references, not only can we re-use the customer entity in other models, but we can also safely update it or change it entirely without breaking existing references.

Once modularized, model properties (e.g., materialized as a table or a view) can be treated as parameters instead of hard-coded designs. dbt will update your architecture without you having to rename views or tables, because the reference remains constant.
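
To make that concrete, here is a minimal sketch of a downstream model built on the staging model referenced above; the file, model, and column names are hypothetical:

-- models/dim_customer.sql (hypothetical file name)
{{ config(materialized='view') }}

select
    customer_id,
    customer_name,
    signup_date
from {{ ref('stg_customer') }}

Flipping this model from a view to a table later is a one-word config change, and nothing that selects from it by reference needs to be touched.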

Templatize Your Models

Reusing models offers tremendous flexibility, as we’ve just seen, but what about smaller components? Do you add the same couple of metadata fields to all your source tables? Perform the same join hundreds of times for currency conversion? Perform the same date conversion every time you want to go from natural year to fiscal year? dbt Macros are here to help.

Any frequently used SQL can be saved as a macro in dbt and reused anywhere. This gives us the same advantage as model references. If the underlying logic changes, we can update it in a single place instead of refactoring every reference across the entire database.

Define a Macro:

{% macro fiscal_year_from_natural_date(column_name) -%}
{# converts a date to a fiscal year integer. Our fiscal year starts in October #}
date_part('year', dateadd('month', 3, {{ column_name }}))
{%- endmacro %}

Re-use the macro:

{{ fiscal_year_from_natural_date('booking_date') }} as booking_fiscal_year,
{{ fiscal_year_from_natural_date('payment_date') }} as payment_fiscal_year,

If the company fiscal year calculation were to change in the example above, it would only take one line of SQL to update the warehouse accordingly.

Become a Jinja Warrior

The conventions we’ve been demonstrating in the previous sections are part of a templating language called Jinja. We’ve seen how useful Jinja can be for encapsulating logic, but we can also use it to extend standard SQL through loops and conditional logic. dbt even provides pre-built Jinja variables that let us write dynamic SQL based on the warehouse name or target environment.

You can do things like pivot columns with for-loops, apply limits based on warehouse size, and add conditional logic based on the environment. The possibilities are truly limitless.
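
As a small taste, here is a sketch of a Jinja for-loop that pivots payment amounts into one column per method; the list of methods and the table name are invented for the example:

{% set payment_methods = ['card', 'paypal', 'bank_transfer'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end) as {{ method }}_amount
    {% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by 1

Add a fourth payment method to the list, and the pivot updates itself on the next run.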

Suppose our DEV environment has stale data, and we’re testing a query that relies on the current date being available. Instead of commenting, testing, forgetting to uncomment, promoting, cussing, and re-doing, we can let a dynamic macro handle such eventualities.

Define a dynamic macro:

{% macro dev_filter(column_name, day_offset=100) %}
{# limits data in the dev environment, where data is stale: look back day_offset days, 100 by default #}
{% if target.name == 'DEV' %}
WHERE {{ column_name }} = dateadd('day', -{{ day_offset }}, current_date())
{% else %}
WHERE {{ column_name }} = current_date()
{% endif %}
{% endmacro %}

Implement the dynamic macro:

SELECT * FROM "MY_TABLE"
{{ dev_filter('load_date', 105) }}

Trust But Verify — Automatic Testing

We’ve all seen it. Unit tests, integration tests, acceptance tests, performance tests — all passed and documented. Two months later, bam! Unexpected values in “some column” break the process. Let the debugging begin.

If only there were a way to automate testing — to make it seamless and permanent, while functional knowledge is still fresh in our minds. With dbt, there is!

In dbt, there are two types of tests — schema tests and data tests. Schema tests are configured directly in the YAML parameters of a model and run on specific columns. Data tests are unit tests that you write as stand-alone SQL files and run against the entire model.

The following schema tests come built into dbt:

  • unique tests whether every value in a column is unique
  • not_null tests whether every value in a column is non-null
  • accepted_values tests whether every value in a column matches one of the values in a provided list
  • relationships tests whether every value in a column exists in a column of another model (i.e., referential integrity)
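
Wiring these up takes only a few lines of YAML in the model’s properties file; the model and column names below are purely illustrative:

version: 2

models:
  - name: dim_customer
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'churned', 'trial']
      - name: country_id
        tests:
          - relationships:
              to: ref('dim_country')
              field: country_id

A single dbt test run then checks every one of these assertions on every deployment.
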
The cost of software testing over time. Courtesy of karllhughes.com

Not all errors are created equal. Downstream errors may be more painful than errors at the source layer, but nothing stings quite like the error that makes it all the way to reporting and is detected by business users instead of BI.

Automated tests cost no more to write than manual ones, but they didn’t become the norm until dbt made them trivially simple to implement.

Automate Deployment

Remember “Move fast and break things”? Well, that changed to “Move fast with stable infrastructure” two years after Facebook went public. Facebook realized that reliability and accountability are crucial for user confidence, and that taking the human factor out of the equation is a big part of delivering them.

Since all the project config and artifacts have been built and organized in dbt, their deployment becomes a single step instead of a coordinated sequence. dbt will already know what needs to be compiled and the order it should be compiled in. This is more than just creating views and loading tables. The automatic tests that we discussed previously, the documentation that we’ll cover shortly, snapshot freshness — all of it can be automated from a single place.

Forget the complicated checklists and remember this command instead:

dbt run
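
In a scheduled job, that single command usually sits alongside a handful of equally simple ones; the exact sequence below is indicative and will vary by project and dbt version:

dbt deps            # install package dependencies
dbt seed            # load static reference data from CSVs, if any
dbt run             # build all models in dependency order
dbt test            # run schema and data tests
dbt docs generate   # refresh the auto-generated documentation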

Snapshots and Materializations

Views are easier to maintain; tables are easier to query. These are performance questions, not fundamental design decisions. However, changing from a view to a materialized table can be tedious and error-prone when coding by hand. But if dbt can refactor code on the fly while separating structure from configuration, can’t materialization be just another parameter?

Your query doesn’t care if the SELECT points to a table or a view, and neither should you. In dbt, as you recall, the focus is on Models, not on database objects. A model is just something represented by a select statement, and its physical properties are set like so:

{{ config(materialized='table') }}  -- or 'view', or 'ephemeral' (compiled as a CTE)

With this in mind, let’s dig deeper into materializations. We all know what it’s like to orchestrate our own full and delta loads, and we can instantly see that this, too, is a repeatable pattern. I don’t know how many lines of code it took you to implement your incremental model, but I doubt it can compete with this:

{{ config(materialized='incremental') }}

select * from raw_app_data.my_table

{% if is_incremental() %}
-- this filter will only be applied on an incremental run
where event_time > (select max(event_time) from {{ this }})
{% endif %}

Finally, we have snapshots. No, not daily full load kinds of snapshots — I mean auto-generated type-2 slowly-changing dimension kind of snapshots. Even something as complex as this can be parameterized into a few lines of dbt config and refactored as robust code upon deployment. Yeah, you could write this transformation yourself, but could you do it in under 5 lines of code?

from full loads to type-2 slowly changing dimensions
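
For reference, a type-2 snapshot in dbt is just a select wrapped in a few lines of configuration; the schema, source, and column names below are hypothetical:

{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('app_data', 'customers') }}

{% endsnapshot %}

Running dbt snapshot then tracks changes over time, adding dbt_valid_from and dbt_valid_to columns and closing out superseded rows automatically.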

Analyses

How does your team deal with recurring business questions? Ones that come up sporadically but don’t quite make it into the project requirements. Does the first person to tackle any given issue become the de-facto owner? Is there a knowledge transfer or documentation for the next time it comes up?

dbt tackles this issue through a feature called “Analyses.” Analyses are SQL queries that live in a dedicated dbt folder but do not get compiled and materialized as models do. This is more of a standardization than an actual feature. It’s a way for any team member to reference a reusable bit of code instead of writing it from scratch or hunting it down among the team.
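
Concretely, an analysis is just a SQL file saved in the project’s analyses folder; dbt compiles it, resolving refs and macros, but never materializes it. A hypothetical example:

-- analyses/customers_without_recent_orders.sql
select
    c.customer_id,
    c.customer_name
from {{ ref('dim_customer') }} c
left join {{ ref('fct_orders') }} o
    on o.customer_id = c.customer_id
    and o.order_date >= dateadd('day', -90, current_date())
where o.order_id is null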

Documentation, For Real This Time

Documenting the documenters

Why is the subject of documentation in software development such a joke? Is it that the task of documenting is by its tedious nature at odds with the thrill of coding? Or perhaps that its static nature renders even serious efforts meaningless over time?

Perhaps it’s a fundamental problem of sequencing. If documentation attempts to describe “what is,” and “what is” is subject to perpetual change, documentation will always lag. So what if we flipped this on its head by allowing “what is” to describe itself? It’s a thorny problem, so we’re going to need both tools for this one.

Earlier, we spoke about the importance of mapping a database landscape. SqlDBM will automatically lay out existing tables into a visual diagram so you can take stock of existing entities and how they relate to the overall architecture. That’s when modeling and transformation begins. What about the entities that our dbt models generate?

Update the Map

Following our earlier analogy, if the territory changes, the map should too. Unlike static documentation, SqlDBM can update your ER diagrams whenever new tables are created. SqlDBM calls this “Reverse Engineering.” It works by connecting to any of your database environments and scanning the DDL. Tables, column types, and even relationships between tables will automatically update in your diagram.

Document the Trip

Simply seeing our tables in SqlDBM doesn’t tell us how data moves between them. Since dbt acts as the vehicle for these transformations, it can also help fill in that missing piece of the puzzle. dbt will auto-generate flow diagrams and HTML documentation based on the transformations that we design there. Comments and descriptions are also considered. Always accurate, never out of sync. Incredible!
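
Generating and browsing that documentation comes down to two commands:

dbt docs generate   # compile the project and build the documentation artifacts
dbt docs serve      # serve the documentation website locally for browsing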

Auto-generated dbt documentation website

Taking Documentation Over the Finish Line

Software can help us describe what it is — generating diagrams and flowcharts — but it can never answer the question of “why.” The final piece of the documentation puzzle is not technical; it’s functional. It’s the human element of explaining what’s been built and the reasoning behind it.

For this reason, SqlDBM has built a data dictionary tool directly into the tables and diagrams that it generates. By giving functional experts, not just developers, an easy-to-use interface for defining the business meaning of metrics and dimensions, SqlDBM brings data to life through shared understanding.

Not only does the information in SqlDBM become richer once coupled with functional knowledge, it also makes the overall tool greater than the sum of its parts. A field or table definition is searchable, just like a column or table name. This means that users at any level of familiarity can browse the database landscape and find answers to their data questions.

SqlDBM data dictionary

Don’t Reinvent the Wheel — dbt Packages

Libraries exist in programming to spare developers from rebuilding common tasks so they can focus on their own business logic. Many of the analytic problems shared across organizations are already available as libraries in dbt, so your team can focus on its specific challenges. dbt calls them packages, and they can save you the work of doing things from scratch, like:

transforming data from a consistently structured SaaS dataset, for example:

  • turning Snowplow, Segment or Heap pageviews into sessions
  • transforming AdWords or Facebook Ads spend data into a consistent format.

writing dbt macros that perform similar functions, for example:

  • generating SQL to union together two relations, pivot columns, or construct a surrogate key
  • creating custom schema tests
  • writing audit queries

building models and macros for a particular tool used in your data stack, for example:

  • Models to understand Redshift privileges.
  • Macros to work with data loaded by Stitch.

Packages are not simply a matter of copying code. Packages are standalone dbt projects, built on dbt frameworks and principles, available and constantly updated in their own shared Git repositories, ensuring that even your out-of-the-box solutions stay up to date.
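
Installing one is a matter of listing it in the project’s packages.yml and running dbt deps; the package shown below, dbt_utils, is real, but treat the namespace and version as indicative since they change over time:

packages:
  - package: fishtown-analytics/dbt_utils
    version: 0.6.4

After dbt deps, the package’s macros and models are available to your project as if you had written them yourself.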

Commit, verb

  • perpetrate or carry out
  • pledge or bind (a person or an organization) to a certain course or policy.
  • (databases) to end a transaction within an RDBMS, making all changes visible to other users

Highlighted in this article are just some of the non-obvious ways in which CI/CD and automation can help development teams save time and increase efficiency. This is by no means a comprehensive list of functions and techniques that SqlDBM and dbt can offer you and your organization, to say nothing of the multitude of tools not covered in this short post.

That having been said, just like a deliberate SQL transaction, I hope these examples offer sufficient motivation for you to commit to taking a step towards increased automation and efficiency, if you and your organization haven’t done so already.

Move fast, don’t break things.


I am the author of "Data Modeling with Snowflake" and Product Success Lead at SqlDBM. I write about data modeling and cloud cost optimization.