The future of querying json in PostgreSQL

17 Jun 2017

Here I’d like to take a look at a very simple and straightforward example showing how powerful querying json objects is going to be in Postgres, some time after Postgres 10. It’s a huge leap forward for querying json, and I’m very excited about it.

Imagine the following issue at hand.

Given: a table consisting of a title column, a number column and a ranges jsonb column; ranges holds an array of json objects, where each object has two fields: min and max.

Task: select all rows such that an element with ranges[].min <= number <= ranges[].max exists. In other words, ranges has an object that satisfies the condition .min <= number <= .max.

Preparing

Let’s lay out the problem and test data in formal SQL terms.

We have a table my_table:

CREATE TABLE my_table (
  title  VARCHAR,
  number INTEGER,
  ranges JSONB
);

There’s some data in the table:

INSERT INTO my_table (title, number, ranges)
  VALUES
  ('bad', 6, '[{"min": 1, "max": 5}, {"min": 7, "max": 10}]'::JSONB),
  ('good', 7, '[{"min": 1, "max": 5}, {"min": 7, "max": 10}]'::JSONB);

The task is to write a query that results in only the row with title='good' being selected.

Sub-optimal solution

An obvious solution would be the following:

  1. for each record, unnest the jsonb array into rows that can then be operated on via pure SQL,
  2. on the resulting data set, use a WHERE clause to target rows whose object contains the desired properties.

Here is what such query would look like:

SELECT title,
       number,
       jsonb_array_elements(ranges) as object
  FROM my_table
 WHERE number BETWEEN (object->>'min')::integer AND (object->>'max')::integer;

Hmm, but that doesn’t seem to work:

ERROR:  column "object" does not exist
LINE 5:  WHERE number BETWEEN (object->>'min')::integer AND (object->>'max')::integer;

The object alias is defined in the SELECT list, and SQL evaluates the WHERE clause before the SELECT list, so the alias isn’t visible there. Let’s try another approach:

WITH our_data AS (
  SELECT title,
         number,
         jsonb_array_elements(ranges) as object
    FROM my_table
)
SELECT *
  FROM our_data
 WHERE number BETWEEN (object->>'min')::integer AND (object->>'max')::integer;
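
For the test data above, this should return just the row with title='good', paired with the matching range object (note that jsonb stores object keys in its own internal order):

 title | number |        object
-------+--------+-----------------------
 good  |      7 | {"max": 10, "min": 7}
(1 row)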

Okay, that was a little too verbose, but it works!

Let’s take a closer look at this solution:

  1. a CTE is introduced,
  2. inside the CTE the ranges column is unfolded, i.e. each object of the array becomes an SQL row of its own,
  3. the final set of rows is queried for the desired qualities (an equivalent CTE-free spelling is sketched right after this list).
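
As an aside, the same unnesting can also be written without a CTE, as a plain subquery in the FROM clause. Here’s a sketch of that equivalent spelling:

SELECT *
  FROM (SELECT title,
               number,
               jsonb_array_elements(ranges) AS object
          FROM my_table) AS our_data
 WHERE number BETWEEN (object->>'min')::integer AND (object->>'max')::integer;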

So the CTE solution works. But is it optimal? Here’s what its query plan looks like using EXPLAIN ANALYSE:

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 CTE Scan on our_data  (cost=441.38..3841.38 rows=9444 width=68) (actual time=0.060..0.062 rows=1 loops=1)
   Filter: ((number >= ((object ->> 'min'::text))::integer) AND (number <= ((object ->> 'max'::text))::integer))
   Rows Removed by Filter: 3
   CTE our_data
     ->  Seq Scan on my_table  (cost=0.00..441.38 rows=85000 width=68) (actual time=0.035..0.041 rows=4 loops=1)
 Planning time: 0.082 ms
 Execution time: 0.518 ms
(7 rows)

Apart from the suspiciously big numbers in the EXPLAIN ANALYSE result and a gut feeling, it’s hard to confidently say whether or not this approach is fast. So let’s try to solve the problem differently.

Update. User simcitymayor from reddit suggested using LATERAL, so let’s try that:

SELECT m.title,
       m.number,
       e.obj AS object
  FROM my_table m,
       LATERAL jsonb_array_elements(m.ranges) AS e(obj)
 WHERE m.number BETWEEN (e.obj->>'min')::integer AND (e.obj->>'max')::integer;

This works, and here’s what the query plan looks like:

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..2662.00 rows=9444 width=68) (actual time=0.049..0.050 rows=1 loops=1)
   ->  Seq Scan on my_table m  (cost=0.00..18.50 rows=850 width=68) (actual time=0.010..0.011 rows=2 loops=1)
   ->  Function Scan on jsonb_array_elements e  (cost=0.00..3.00 rows=11 width=32) (actual time=0.017..0.017 rows=0 loops=2)
         Filter: ((m.number >= ((obj ->> 'min'::text))::integer) AND (m.number <= ((obj ->> 'max'::text))::integer))
         Rows Removed by Filter: 2
 Planning time: 0.101 ms
 Execution time: 0.091 ms

The estimated cost is only about 30% smaller, but the actual execution is significantly faster. I’ve never used LATERAL in my queries; time to take a closer look at the docs and reconsider.
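
For the record, the comma before LATERAL in the FROM list above is an implicit cross join, so the same query can also be spelled with an explicit CROSS JOIN LATERAL. This sketch makes it clearer that the function call is allowed to reference columns of the preceding FROM item:

SELECT m.title,
       m.number,
       e.obj AS object
  FROM my_table m
       CROSS JOIN LATERAL jsonb_array_elements(m.ranges) AS e(obj)
 WHERE m.number BETWEEN (e.obj->>'min')::integer AND (e.obj->>'max')::integer;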

I’ll stick to using LATERAL while more json functions are under way.

Optimal solution

I’ll use a yet-to-be-released feature of Postgres, a function called JSON_EXISTS. The function takes a column containing json (or jsonb, in our case) as its first argument, and a jsonpath expression as its second argument.

Here’s what the new query looks like:

SELECT *
  FROM my_table
 WHERE JSON_EXISTS(ranges, '$[*] ? (@.min <= $x && $x <= @.max)' PASSING number AS x);

Note the $[*] ? (@.min <= $x && $x <= @.max) expression - this is a jsonpath: $[*] iterates over the elements of the top-level array, ? (...) filters them, @ refers to the element currently being tested, and $x is the variable bound by the PASSING number AS x clause.

Let’s see what the EXPLAIN ANALYSE result looks like:

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on my_table (cost=0.00..18.50 rows=425 width=68) (actual time=0.018..0.018 rows=1 loops=1)
   Filter: JSON_EXISTS(ranges, '$.[*]?(@."min" <= $"x" && $"x" <= @."max")' PASSING number AS x FALSE ON ERROR)
   Rows Removed by Filter: 1
 Planning time: 0.020 ms
 Execution time: 0.031 ms
(5 rows)

Wow, the numbers are much smaller than before! It looks as if Postgres didn’t have to allocate a lot of memory for unnesting rows. Instead, it performed the fairly complex matching inside the json/jsonb column in a single pass, bringing the query cost down significantly.

Conclusion

In order to play with JSON_EXISTS and jsonpath, I used the PostgreSQL 10dev+SQLJSON branch on sqlfiddle.postgrespro.ru. I was previously barely familiar with jsonpath, but I was able to quickly get up to speed using an online repl, jsonpath.com.

The ability to use jsonpath in Postgres appears to be far superior to what is currently available for querying json objects. What previously seemed hard to do now becomes possible. From what I can tell, using JSON_EXISTS is much faster than using existing workarounds to achieve the same result.

The JSON_EXISTS function is only the tip of the iceberg. There are a lot more additions to json handling in the newest SQL standard, SQL:2016, which is a huge step forward for json adoption in SQL! The standard’s SQL/JSON section is available for download from http://standards.iso.org, here: c067367_ISO_IEC_TR_19075-6_2017.zip.
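
For a taste of the rest of the family, here’s a sketch of JSON_VALUE, another SQL:2016 function included in the SQL/JSON patch: it extracts a single scalar from the json and can cast it via a RETURNING clause. The syntax below follows the standard; like JSON_EXISTS, it is not available in stock Postgres 10.

SELECT title,
       JSON_VALUE(ranges, '$[0].max' RETURNING INTEGER) AS first_max
  FROM my_table;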

Unfortunately, neither Postgres 9.6 nor the upcoming Postgres 10 release includes an implementation of the new standard’s json functions. A patch containing a subset of the standard’s json functions was submitted right before the merge window closed back in February, but it was too late: the patch was too big to be properly reviewed and included in Postgres 10.

According to Oleg Bartunov’s slides from PGConf.US in March, the plan is to push SQL/JSON into Postgres 11. Meanwhile, the desired json functionality is expected to land in the next major release of Postgres Pro (a fork maintained by the company of the same name).

The slides from Oleg’s talk contain tons of info & examples about SQL/JSON; I highly recommend them to anyone interested in working with json in SQL.

It’s heartening to see the folks behind the SQL standard working on json adoption. And of course, a huge THANKS to the patch authors from the Postgres Pro team for implementing the SQL/JSON standard in Postgres!

I hope that the “SQL/JSON support in PostgreSQL” patch, now moved to the next Commitfest, will be reviewed & accepted into Postgres core.

Like the article or have a comment? Ping me on twitter!