Recursive queries in PostgreSQL

May 08, 2016 permanent PostgreSQL DataEngineering

For some data relationships in Postgres (or any other relational database that speaks SQL), recursive queries are near inevitable. Let's say you have some top-level product categories, like Shoes, Hats, Wigs etc, and each of those has subcategories, like Sports, Casual, Formal etc, which can have their own subcategories and so on. Something like this:

Then, you wish to construct a query where for some child category, you want to fetch the entire category hierarchy. For example, starting at Nike Air walk up the tree to find all parent categories (Sports -> Shoes). Or maybe you want to go the other way, starting at Shoes find all child categories. In either scenario, you're going to want to understand recursive queries.

Side note: in this contrived example, there are some techniques you could use to skip recursive queries completely, like denormalising your data by adding the parent id into each child category which would also bring significant performance gains, but I still think it's useful to have recursive queries in your toolkit...

As always, the Postgres docs are very comprehensive, however, they can be a little daunting when you're getting started - especially if writing SQL isn't something you're doing every day. Since I was recently tasked with solving a similar problem, the difficulties of understanding it are fresh on my mind. So I figured I'm in a good position to help others do the same.

First things first, you'll need to...

Understand `WITH` queries

WITH queries lets you build a temporary table for use in a subsequent query, also known as "Common Table Expressions" or CTEs. Here's an example where I build a table which uses the VALUES command to return a single row with a number:

WITH bottles_of_beer AS (
   VALUES (99)
)
SELECT * FROM bottles_of_beer;
 column1
---------
       99     
(1 row)

Side note: This example doesn't do justice to the usefulness of the WITH statement check out blogs posts like this for cool stuff you can do with it.

Now what if I wanted to perform an operation on the value returned in the temporary table, like taking away 1 from our bottles of beer? And, what if I then wanted to perform an operation on the output of that operation and so on? Think you can see where I'm going with this.

For that, we'll need to...

Understand `WITH RECURSIVE` queries

A WITH RECURSIVE query let's your perform an operation on the output of a WITH statement until there are no more rows to return. The query is broken down into 3 parts:

The base query: the query that runs first.
The UNION operator: which says whether we'll keep duplicate rows or not.
Then, the recursive part: which will keep operating on the last query's output until no rows are returned or we hit our end condition.

Let's see if we can use that information to rewrite a query that counts down from 99 to 1:

WITH RECURSIVE bottles_of_beer(n) AS (              -- See #1
    VALUES (99)                                     -- See #2
  UNION ALL                                         -- See #3
    SELECT n - 1 FROM bottles_of_beer WHERE n > 1   -- See #4
)
SELECT * FROM bottles_of_beer;

Let's break it down:

Create a temporary table, aka CTE, called bottles_of_beer with a single column referred to as n.
Build the base query: VALUES (99) which returns the number 99.
Tell Postgres what we want to do with duplicate rows. In this query we're keeping "all", ie not discarding duplicate rows (more on that later).
Then we define the recursive query, which will subtract 1 from n while n is greater than 1.

Finally, we can use the || string concatention operator, to do the whole bottles of beer song:

WITH RECURSIVE bottles_of_beer(n) AS (
    VALUES (99)
  UNION ALL
    SELECT n - 1 FROM bottles_of_beer WHERE n > 1
)
SELECT n::text || ' bottles of beer on the wall' from bottles_of_beer;

 column1
--------------------------------
 99 bottles of beer on the wall
 98 bottles of beer on the wall
 97 bottles of beer on the wall
 96 bottles of beer on the wall
 95 bottles of beer on the wall
# and so on...

Cool!

Armed with that knowledge, we can head back to the original problem. But first, it'll help if we...

Understand `JOIN`s in `WITH RECURSIVE` queries

Joins in WIH RECURSIVE queries can be used to join data from a table against the current output of the WITH query. We can see that in action by solving the problem described in the intro: starting with Nike Air, can we walk up the tree finding all parent categories?

Firstly let's create the table to represent our data relationship:

CREATE TABLE category (
    id integer,
    name varchar(256),
    parent_id integer,
    PRIMARY KEY(id)
);

Next, we'll add the example categories in our diagram:

INSERT INTO category (id, name, parent_id) VALUES
    (1, 'Shoes', NULL),
    (2, 'Sports', 1),
    (3, 'Nike Air', 2),
    (4, 'Reebok Pumps', 2);

Since we have learned the basics of building recursive queries, we can firstly define the non-recursive part:

SELECT id, name, parent_id FROM categories WHERE id = 3;

which simply grabs the Nike Air category by id.

Let's put that in a CTE:

WITH RECURSIVE category_tree AS (
   SELECT id, name, parent_id FROM categories WHERE id = 3
)

Now we decide on the union operator, in this example I'll use UNION to remove duplicate rows.

Now here's where it gets interesting: defining the recursive part. Here we'll join the output from the base query to the category table until there's nothing more to be joined:

    SELECT category.id, category.name, category.parent_id
        FROM category
        JOIN category_tree ON (category_tree.parent_id = category.id)

So, when the Nike Air category is first returned, we'll grab a row from the category table which matches it's parent id. Then when that comes back, we'll grab its parent until we can't find any more parents.

Now, let's put it all together:

WITH RECURSIVE category_tree AS (
   SELECT id, name, parent_id FROM category WHERE id = 3
     UNION
   SELECT category.id, category.name, category.parent_id
        FROM category
        JOIN category_tree ON (category.id = category_tree.parent_id)
)
SELECT * FROM category_tree;

Go ahead and run that. Does it work? Here's what was returned for me:

 id |     name     | parent_id
----+--------------+-----------
  3 | Nike Air     |         2
  2 | Sports       |         1
  1 | Shoes        |
(3 rows)

Cool!

Okay, challenge number 2, let's go the other way: start at the top-level return all the children.

There's nothing new here, we just simply change our base query to start by selecting our top-level parent then join any category with a parent id that matches our current result's id. Like so:

WITH RECURSIVE top_level AS (
    SELECT id, name, parent_id
        FROM category WHERE id = 1
      UNION
    SELECT category.id, category.name, category.parent_id
         FROM category
         JOIN top_level ON (category.parent_id = top_level.id)
)
SELECT * FROM top_level;

Go ahead and run that. I'll wait...

Here's what was returned for me:

SELECT * FROM top_level;
 id |     name     | parent_id
----+--------------+-----------
  1 | Shoes        |
  2 | Sports       |         1
  3 | Nike Air     |         2
  4 | Reebok Pumps |         2
(4 rows)

Choice!

Now what would happen if someone goofed up and created a top-level category with a parent id of a child category, creating a circular relationship. Like this:

This is where it'd help for us too...

Understand the `UNION` statement

As already covered, the UNION statement basically tells Postgres what to do with duplicate rows. UNION ALL is basically saying: don't discard dupes eg union all the things, where UNION discards dupes (if you are familiar with set theory at all, then you are probably familiar with union already).

So let's try creating the messed up data relationship, by creating an infinite recursive loop.

UPDATE category SET parent_id = 3 WHERE id = 1;

Then run the same query from the last section:

WITH RECURSIVE category_tree AS (
    SELECT id, name, parent_id
        FROM category WHERE id = 3
    UNION ALL
    SELECT category.id, category.name, category.parent_id
        FROM category
        JOIN category_tree ON (category.id = category_tree.parent_id)
)
SELECT * FROM category_tree;

If you're playing along at home, you'll notice that query never completes and you'll need to Ctrl + c your way of it. That's because each time the recursive part of the query is run, rows are returned. So it never reaches an end condition.

We can combat this problem by using UNION instead of UNION ALL to discards duplicate rows:

WITH RECURSIVE top_category AS (
    SELECT id, name, parent_id
        FROM category WHERE id = 3
    UNION
    SELECT category.id, category.name, category.parent_id
        FROM category
        JOIN top_category ON (top_category.parent_id = category.id)
)
SELECT * FROM top_category;

Now when the query goes to join the Shoes row with Nike Air, it finds a duplicate row, since we are discarding dupe, there's nothing left to join so we return. Voilà:

 id |     name     | parent_id
----+--------------+-----------
  3 | Rebook Pumps |         2
  2 | Sports       |         1
  1 | Shoes        |         3
(3 rows)

Now you may or may not be asking, how does it actually work under the hood? If you're in the former camp, read on...

Understand (very loosely) how it works

Let's try to understand what's going on here, according to the docs. This isn't me deep diving into the Postgres code base, so it could very well be wrong, but hopefully it's a close enough approximation.

We have already determined that the first step in evaluating recursive queries is to deal with the non-recursive term. According to the docs, the output of this query is place onto a table called the "working table". This output is also appending to our results. So we have 2 data structures that looks like this:

Results                         Working table
-----------------------------   --------------------------------

3   Reebok Pumps   2            3   Reebok Pumps   2

Next, we evaluate the recursive term against the working table, which in this case is SELECT category.id, category.name, category.parent_id FROM category JOIN top_category ON (category.id = top_category.parent_id). Effectively we're saying "get a thing from category which can be joined against our working table". So, we fetch the sports category. This time we add the results to a table called the "intermediate table". Okay, so now we have 3 structure that look like this:

Results                         Working table                     Intermediate table
-----------------------------   --------------------------------  --------------------

3   Reebok Pumps   2            3   Reebok Pumps   2              2   Sports   1

Then the contents of the intermediate table replace the working table and are appended to our results. So our tables look like this:

Results                         Working table                     Intermediate table
-----------------------------   --------------------------------  --------------------

3   Reebok Pumps   2            2   Sports     1                    
2   Sports         1

Then we keep following this process until there is nothing to replace into our working table.

Tags

Notes by Lex Toumbourou