
# Introduction

Everyone focuses on writing SQL that "works," but few check whether it keeps working tomorrow. A single new row, a changed assumption, or a refactor can break a query silently. This article walks through a complete workflow, showing how to treat SQL like software: versioned, tested, and automated. We'll use a real Amazon interview question about identifying the customers with the highest daily spending. Then we will convert the SQL into a testable component, define expected outputs, and automate testing with continuous integration and continuous deployment (CI/CD).

# Step 1: Solving an Interview-Style SQL Question

// Understanding the Problem

In this interview question from Amazon, you are asked to find the customers with the highest daily total order cost within a given date range.

// Understanding the Dataset

There are two data tables in this problem: customers and orders.

The customers table stores customer details, including an id and a first_name for each customer.

The orders table stores one row per order, with an id, the customer's cust_id, the order_date, and the total_order_cost.

This problem is perfect for illustrating how SQL can be treated like software: the query must be correct, stable, and resistant to regressions.

// Writing the SQL Solution

The logic breaks down into three parts:

- Aggregate each customer's total spending per day
- Rank customers by total spending for each date
- Return only the daily top spenders

Here is the final PostgreSQL solution:
```sql
WITH customer_daily_totals AS (
    SELECT
        o.cust_id,
        o.order_date,
        SUM(o.total_order_cost) AS total_daily_cost
    FROM orders o
    WHERE o.order_date BETWEEN '2019-02-01' AND '2019-05-01'
    GROUP BY o.cust_id, o.order_date
),
ranked_daily_totals AS (
    SELECT
        cust_id,
        order_date,
        total_daily_cost,
        RANK() OVER (
            PARTITION BY order_date
            ORDER BY total_daily_cost DESC
        ) AS rnk
    FROM customer_daily_totals
)
SELECT
    c.first_name,
    rdt.order_date,
    rdt.total_daily_cost AS max_cost
FROM ranked_daily_totals rdt
JOIN customers c ON rdt.cust_id = c.id
WHERE rdt.rnk = 1
ORDER BY rdt.order_date;
```

// Defining the Expected Output

For each date in the range, the query should return one row containing the top spender's first_name, the order_date, and that day's highest daily total as max_cost. Note that RANK() allows ties, so two customers who spend the same top amount on the same day would both be returned.

At this stage, most people stop.

# Step 2: Making the SQL Logic Reliable with Unit Tests

SQL breaks more easily than most people assume. A changed default, a renamed column, or a new data source can introduce silent errors. Testing protects you from these issues. There are three testing steps we will cover: converting the logic into a reusable component, defining expected output, and writing a unit test suite.

// Turning the Query into a Reusable Component

To test the SQL code, we begin by wrapping it in Python using a lightweight testing framework like unittest. First, we define the query that we want to test:
question = """
WITH customer_daily_totals AS (
SELECT
o.cust_id,
o.order_date,
SUM(o.total_order_cost) AS total_daily_cost
FROM orders o
WHERE o.order_date BETWEEN '2019-02-01' AND '2019-05-01'
GROUP BY o.cust_id, o.order_date
),
ranked_daily_totals AS (
SELECT
cust_id,
order_date,
total_daily_cost,
RANK() OVER (
PARTITION BY order_date
ORDER BY total_daily_cost DESC
) AS rnk
FROM customer_daily_totals
)
SELECT
c.first_name,
rdt.order_date,
rdt.total_daily_cost AS max_cost
FROM ranked_daily_totals rdt
JOIN prospects c ON rdt.cust_id = c.id
WHERE rdt.rnk = 1
ORDER BY rdt.order_date;
"""

// Defining Test Input and Expected Output

Next, we create a controlled sample dataset to test against.

```python
# Each tuple is (id, first_name)
test_customers = [
    (15, "Mia"),
    (7, "Jill"),
    (3, "Farida")
]

# Each tuple is (id, cust_id, order_date, total_order_cost)
test_orders = [
    (1, 3, "2019-03-04", 100),
    (2, 3, "2019-03-01", 80),
    (4, 7, "2019-02-01", 25),
    (6, 15, "2019-02-01", 100)
]
```

We also create the expected output:

```python
# Each tuple is (first_name, order_date, max_cost)
expected = [
    ("Mia", "2019-02-01", 100),
    ("Farida", "2019-03-01", 80),
    ("Farida", "2019-03-04", 100)
]
```

Why? Because defining expected outputs creates a benchmark: every future change to the query can be judged against a known correct result.

// Writing SQL Unit Tests

Now we have the query defined, the test inputs, and the expected outputs. We can write an actual unit test. The idea is simple:

- Create an isolated, in-memory database
- Load controlled test data
- Execute the SQL query
- Assert that the obtained result matches the expected output

Python's built-in unittest framework is highly effective because it keeps dependencies minimal while providing structure and repeatability. We start by creating an in-memory SQLite database:
```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
```

Using :memory: ensures that:

- the test database is completely isolated
- no external state can affect the result
- the database is discarded automatically once the test finishes

Next, we recreate only the tables required by the query:

```sql
CREATE TABLE customers (...)
CREATE TABLE orders (...)
```

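The column lists are elided above; a schema consistent with the rest of the code might look like the following sketch. The column types, and any column beyond those the query and debugging steps touch, are assumptions standing in for a fuller production table:

```python
cursor.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,   -- not used by the query; mirrors production
        city TEXT         -- not used by the query; mirrors production
    )
""")
cursor.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        cust_id INTEGER,
        order_date TEXT,
        total_order_cost INTEGER
    )
""")
```
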
Even though the query only uses a subset of columns, the schema mirrors a realistic production table. This reduces the risk of false confidence caused by oversimplified schemas. We then insert the controlled test data defined earlier. Naming the inserted columns keeps the statements valid even when the schema has more columns than the test tuples provide:

```python
cursor.executemany(
    "INSERT INTO customers (id, first_name) VALUES (?, ?)",
    test_customers,
)
cursor.executemany(
    "INSERT INTO orders (id, cust_id, order_date, total_order_cost) VALUES (?, ?, ?, ?)",
    test_orders,
)
conn.commit()
```

At this point, the database contains a known, deterministic state, which is essential for meaningful tests. Before executing the query, we load and print the test tables using Pandas:

```python
import pandas as pd

customers_df = pd.read_sql("SELECT id, first_name, last_name, city FROM customers", conn)
orders_df = pd.read_sql("SELECT * FROM orders", conn)
print(customers_df)
print(orders_df)
```

While this step isn't strictly required for automation, it is extremely useful during development and debugging. When a test fails, being able to immediately inspect the input data saves far more time than re-reading the SQL logic, because it lets you see exactly what the query was given to work with. Now we run the query under test:

```python
result = pd.read_sql(query, conn)
```

The result is loaded into a DataFrame, which provides:

- structured access to rows and columns
- easy comparison with expected outputs
- readable printing for debugging

Next, we verify the results row by row. The verification logic makes a manual comparison between the query output and the expected result:

```python
all_correct = True
if len(result) != len(expected):
    all_correct = False
```

The first check confirms whether the number of rows returned by the query matches what we expect. A mismatch here immediately indicates missing or extra data. Next, we iterate through the expected output and compare it to the actual query result row by row:
```python
for i, (fname, date, cost) in enumerate(expected):
    if i < len(result):
        actual = result.iloc[i]
        if not (
            actual["first_name"] == fname
            and actual["order_date"] == date
            and actual["max_cost"] == cost
        ):
            all_correct = False
```

Each row is checked on all relevant dimensions:

- customer name
- order date
- aggregated daily cost

If any value differs from what is expected, the test is marked as failed. Finally, the test result is summarized in a clear pass/fail message:
```python
if all_correct and len(result) == len(expected):
    print("ALL TESTS PASSED")
else:
    print("SOME TESTS FAILED")
```

The database connection is then closed:

```python
conn.close()
```

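As a side note, pandas can perform the whole row-by-row comparison in one call; a sketch, under the assumption that the result and expected objects above are in scope and the dtypes line up (they do for this SQLite setup):

```python
# Build the expected DataFrame with the same column names as the query output.
expected_df = pd.DataFrame(expected, columns=["first_name", "order_date", "max_cost"])

# Raises AssertionError with a readable cell-by-cell diff on mismatch.
pd.testing.assert_frame_equal(result.reset_index(drop=True), expected_df)
```
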
If the tests pass, the expected output is a single line: ALL TESTS PASSED.

This test carries some assumptions worth noting:

- a stable row order (ORDER BY order_date)
- exact matches on all values
- no tolerance for ties or duplicate winners per day

The full script, ready to be used, can be seen here.
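
For python -m unittest discover (used by the CI workflow in the next step) to pick this up as a real test, the same logic can be wrapped in a TestCase. Here is a minimal sketch, assuming the query, test_customers, test_orders, and expected objects defined above live in the same file:

```python
import sqlite3
import unittest

import pandas as pd


class TestDailyTopSpenders(unittest.TestCase):
    def setUp(self):
        # Fresh, isolated in-memory database for every run.
        self.conn = sqlite3.connect(":memory:")
        cursor = self.conn.cursor()
        cursor.execute("CREATE TABLE customers (id INTEGER, first_name TEXT)")
        cursor.execute(
            "CREATE TABLE orders (id INTEGER, cust_id INTEGER, "
            "order_date TEXT, total_order_cost INTEGER)"
        )
        cursor.executemany("INSERT INTO customers VALUES (?, ?)", test_customers)
        cursor.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", test_orders)
        self.conn.commit()

    def tearDown(self):
        self.conn.close()

    def test_daily_top_spenders(self):
        result = pd.read_sql(query, self.conn)
        # Compare the result rows as plain tuples against the expected list.
        rows = list(result.itertuples(index=False, name=None))
        self.assertEqual(rows, expected)


if __name__ == "__main__":
    unittest.main()
```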

# Step 3: Automating SQL Tests with Continuous Integration and Continuous Deployment

A test suite is only useful if it runs consistently whenever it is needed. We use CI/CD to automate testing every time a code change is made.

// Organizing the Project

A minimal repository structure can look like this:
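For example (only .github/workflows/test_sql.yml, the tests/ folder, and requirements.txt are referenced in this article; the remaining names are illustrative):

```
sql-testing-project/
├── .github/
│   └── workflows/
│       └── test_sql.yml       # CI workflow, created below
├── tests/
│   ├── __init__.py            # lets unittest discover import the package
│   └── test_sql_query.py      # the unit tests from Step 2
└── requirements.txt           # Python dependencies (pandas)
```
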
// Creating the GitHub Actions Workflow

The next step is to make sure these tests run automatically whenever the code changes. For this, we use GitHub Actions. This tool allows us to define a CI workflow that runs the SQL tests every time code is pushed or a pull request is opened.

Create the workflow file: In your repository, create the following folder structure if it doesn't already exist: .github/workflows/. Inside this folder, create a new file called test_sql.yml. The name isn't special; GitHub only cares that the file lives inside the .github/workflows/ directory. You can name it anything, but test_sql.yml keeps things clear and simple.

Define when the workflow should run: Here is the workflow file, starting with its trigger section:
```yaml
name: Run SQL Tests

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
```

This section defines when the workflow runs:

- on every push to the main branch
- on every pull request targeting main

In practice, this means:

- pushing directly to main will trigger the tests
- opening or updating a pull request will also trigger the tests

This helps catch SQL regressions before they get merged.

Define the test job: Next, we define a job called test:
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
```

This tells GitHub to:

- create a fresh Linux machine
- run all test steps inside it

Each workflow run starts from a clean environment, which prevents "it works on my machine" problems.

// Adding the Workflow Steps

Now we define the steps the machine should execute:
```yaml
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
```

This step downloads your repository's code into the runner so it can access your SQL files and tests.
```yaml
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
```

This installs Python 3.10, ensuring a consistent runtime across all runs.
```yaml
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
```

This installs all required Python libraries (such as Pandas) listed in requirements.txt.
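For this project the file can stay tiny, since pandas is the only third-party library the tests import:

```
pandas
```
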
```yaml
      - name: Run unit tests
        run: python -m unittest discover
```

Finally, this command:

- automatically discovers test files
- runs all SQL tests defined in the tests/ folder
- fails the workflow if any test fails

The full workflow can be found here.
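Assembled, the pieces above form the complete .github/workflows/test_sql.yml:

```yaml
name: Run SQL Tests

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run unit tests
        run: python -m unittest discover
```
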
Running the workflow: You don't need to run this file manually. Once committed:

- pushing to main will trigger the workflow
- opening a pull request will trigger the workflow

You can view the results directly in GitHub by navigating to your repository's Actions tab. Each run will show whether your SQL tests passed or failed.

# Step 4: Automating Data Quality

Unit tests confirm whether the logic still returns the expected output, and CI ensures those tests run automatically. But in real data environments, the input data itself can cause failures: late-arriving rows, malformed dates, missing keys, and unexpected duplicates can break queries long before the SQL logic does. This is where data quality automation comes in. Testing and versioning form a safety net for code changes; data quality automation extends that safety net to the data itself, stopping downstream issues before they impact results.

// Understanding Why Data Quality Checks Matter for SQL Workflows

In our interview problem, the following issues could make the query return incorrect results:

- A customer's first name is no longer unique.
- An order arrives with a negative cost.
- Dates fall outside the expected range.
- Daily aggregates contain duplicate rows for the same customer and date.
- A customer exists in orders but not in customers.

Without automated checks, these issues may silently distort results. Because SQL doesn't raise obvious exceptions in many of these scenarios, errors spread unnoticed. Automated data quality checks detect these issues early and prevent the pipeline from running with corrupted or incomplete data.

// Turning Data Assumptions into Automated Rules

Every SQL query relies on assumptions about the data. The problem is that these assumptions are rarely written down and almost never enforced. In our daily spenders query, correctness depends not only on SQL logic, but also on the shape and validity of the input data. Instead of trusting these assumptions implicitly, we can turn them into automated data quality rules. The idea is simple:

- express each assumption as a SQL check
- run those checks automatically
- fail fast if any assumption is violated

First names must be unique: Our query joins customers by ID, but returns first_name as an identifier. If first names are no longer unique, the output becomes ambiguous.
```sql
SELECT first_name, COUNT(*)
FROM customers
GROUP BY first_name
HAVING COUNT(*) > 1;
```

If this query returns any rows, the assumption is broken.

Order costs must be non-negative: Negative order values usually indicate ingestion or upstream transformation issues.
```sql
SELECT *
FROM orders
WHERE total_order_cost < 0;
```

Even a single row here invalidates financial aggregates.

Order dates must be valid and within expectations: Dates that are missing or wildly out of range often reveal synchronization or parsing errors.
```sql
SELECT *
FROM orders
WHERE order_date IS NULL
   OR order_date < '2010-01-01'
   OR order_date > CURRENT_DATE;
```

This protects the query from silently including bad temporal data.

Every order must reference a valid customer: If an order refers to a non-existent customer, joins will silently drop rows.
```sql
SELECT o.*
FROM orders o
LEFT JOIN customers c ON c.id = o.cust_id
WHERE c.id IS NULL;
```

This rule ensures referential integrity before the analytics logic runs.

// Converting Rules into an Automated Check

Instead of running these checks manually, we can wrap them in a single Python function that fails immediately if any rule is violated.
```python
import pandas as pd


def run_data_quality_checks(conn):
    # Each rule is a query that should return zero rows on healthy data.
    checks = {
        "Duplicate first names": """
            SELECT first_name
            FROM customers
            GROUP BY first_name
            HAVING COUNT(*) > 1;
        """,
        "Negative order costs": """
            SELECT *
            FROM orders
            WHERE total_order_cost < 0;
        """,
        "Invalid order dates": """
            SELECT *
            FROM orders
            WHERE order_date IS NULL
               OR order_date < '2010-01-01'
               OR order_date > CURRENT_DATE;
        """,
        "Orders without customers": """
            SELECT o.*
            FROM orders o
            LEFT JOIN customers c ON c.id = o.cust_id
            WHERE c.id IS NULL;
        """
    }
    for rule_name, check_query in checks.items():
        result = pd.read_sql(check_query, conn)
        if not result.empty:
            raise ValueError(f"Data quality check failed: {rule_name}")
    print("All data quality checks passed.")
```

This function:

- executes each rule
- checks whether any rows are returned
- raises an error immediately if a violation is found

At this point, data quality rules behave just like unit tests: pass or fail. If the checks pass, you will see a single line: All data quality checks passed.

Because the data quality checks run inside Python, they are automatically picked up by the existing GitHub Actions workflow:
```yaml
      - name: Run unit tests
        run: python -m unittest discover
```

The CI pipeline will stop immediately as long as:

- the function is imported or executed by your test file
- a failure raises an exception

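A minimal way to satisfy both conditions is to give the rules their own test case; a sketch, where the file name and the data_quality module the function is imported from are assumptions:

```python
# tests/test_data_quality.py -- illustrative file name
import sqlite3
import unittest

from data_quality import run_data_quality_checks  # assumed module name


class TestDataQuality(unittest.TestCase):
    def test_rules_hold(self):
        conn = sqlite3.connect(":memory:")
        cursor = conn.cursor()
        cursor.execute("CREATE TABLE customers (id INTEGER, first_name TEXT)")
        cursor.execute(
            "CREATE TABLE orders (id INTEGER, cust_id INTEGER, "
            "order_date TEXT, total_order_cost INTEGER)"
        )
        cursor.executemany(
            "INSERT INTO customers VALUES (?, ?)",
            [(15, "Mia"), (7, "Jill"), (3, "Farida")],
        )
        cursor.executemany(
            "INSERT INTO orders VALUES (?, ?, ?, ?)",
            [(1, 3, "2019-03-04", 100), (2, 3, "2019-03-01", 80),
             (4, 7, "2019-02-01", 25), (6, 15, "2019-02-01", 100)],
        )
        conn.commit()
        # Raises ValueError on any violation, which unittest reports
        # as a test error and which fails the CI run.
        run_data_quality_checks(conn)
        conn.close()


if __name__ == "__main__":
    unittest.main()
```
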
# Concluding Remarks

Most people stop once the SQL query produces a correct answer. But real data environments reward those who make their queries stable, testable, and version-controlled.

Combining the following practices ensures the query continues to deliver reliable results, even as the data changes over time:

- a clear solution
- a reusable component
- unit tests
- automated CI

Correctness is good, but reliability is essential.

Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
