Why JSON in PostgreSQL is Awesome

Recently a commenter on Hacker News questioned the usefulness of the JSON data type in PostgreSQL. At the time I posted a brief response, but I have decided to take the time to write up a more detailed explanation here as I believe the JSON data type provides a significant benefit to developers.

Too long; didn’t read: The JSON data type is useful for storing multi-level object graphs. It provides better performance and the code itself is easier (and therefore cheaper) to write and maintain. The developer will be happier and the development will be cheaper.

Multi-Level Object Graph: Names

To demonstrate the difficulties of handling multi-level object graphs in relational databases, consider names as an example.

Handling names in computer systems correctly is a difficult task. Programmers tend to believe falsehoods about names. They often assume that everybody has a first name and a last name (and in that order), but, realistically, one cannot make such “brosumptions” about people’s names. Different cultures have different naming conventions. Over the years I have come to the conclusion that the best way to handle names while preserving practicality is to:

  1. store the full name,
  2. keep an ordered list of name type–name pairs.

For example:

{
    "fullName": "Charles John Huffam Dickens",
    "names":
    [
        { "type": "firstName", "value": "Charles" },
        { "type": "middleName", "value": "John" },
        { "type": "middleName", "value": "Huffam" },
        { "type": "lastName", "value": "Dickens" }
    ]
}

Book Store

Imagine an application that manages books, such as that used by a library or a book store. Books have authors, authors have names and these names have name parts (e.g. first name, last name, etc.). The application must keep track all of this data. If one follows the requirements of the third normal form, this will result in database tables such as the ones below.

Note: for the sake of simplicity, we’ll ignore the fact that the example below wouldn’t meet the requirements of the third normal form if one person could author multiple titles.

Book Table
id title
1 A Christmas Carol
Author Table
id book_id full_name
101 1 Charles John Huffam Dickens
Author Name Part Table
id author_id type value
201 101 first_name Charles
202 101 middle_name John
203 101 middle_name Huffam
204 101 last_name Dickens

To fetch a book from the database there are 3 options. An ORM will probably attempt option 2.

  1. Everything is joined in a single query. This will yield a row for every name part within the name of each author. The columns derived from the book table will have the same data repeated in every row. The book data structure must be assembled from multiple rows. Things will get very convoluted if one attempts to retrieve multiple books at the same time as every row will need to be carefully matched to the correct author and book.

  2. A separate query is run for each table. First the book is returned, then the authors, then the name parts. To retrieve multiple books, either

    • the author and name part queries must be repeated for each book (and the name part query for each author in each book) OR
    • the author and name part queries must be constructed in a manner that retrieves all of the authors of all the books (and all the name parts of all the authors of the books) in a single result at which point the matching exercise described in the previous section must be executed.

    The first variant results in a separate query for each piece of data, which will have a significantly detrimental impact on performance. While the second variant will be much faster, it is still not a very elegant solution as it requires the database to repeat the (possibly expensive) filter that was used to select the books 3 times: once for the books, once for the authors and once for the name parts themselves.

  3. There is no third option! Everybody loses. Always. This is why object–relational mapping is hard.

Serialisation to JSON

Fortunately, it is possible to serialise the entirety of this madness into a single JSON string (or XML or something similar but this post is about JSON). The serialised object can then be stored in a text column. When reading from the database, the text can be deserialised back to an object graph. While this was possible with earlier versions of the software, PostgreSQL 9.2 and 9.3 have introduced a number of features that make working with JSON data a very pleasant experience:

  • The json data type is basically the same as text in terms of behaviour and storage, but the database checks that the value is valid JSON. This was introduced in PostgreSQL 9.2.
  • With JSON functions and operators introduced in PostgreSQL 9.3 one can select authors->1->'fullName' from book to get the full name of the first author of every book. It is even possible to create indexes that traverse JSON values with indexes on expressions.

Without JSON columns, functions and operators, such queries become complex and possibly inefficient. With JSON, only a single table is required:

Book Table
id title authors
1 A Christmas Carol
[
 {
  "fullName": "Charles John Huffam Dickens",
  "names":
  [
   { "type": "firstName", "value": "Charles" },
   { "type": "middleName", "value": "John" },
   { "type": "middleName", "value": "Huffam" },
   { "type": "lastName", "value": "Dickens" }
  ]
 }
]

Less code = faster queries. It’s hard not to be excited about this.

The Downsides

No loops in the object graph. Circular references are not permitted in JSON. An author cannot belong to two books and authors cannot reference each other. The first problem can be tackled by repeating an author’s data for each book he/she has authored. This means, however, that the database will be larger and updating (renaming) authors a very expensive operation. It is up to the programmer to decide if this is compatible with the requirements of the application.

No referential integrity. While it is possible to generate pseudo unique keys by creating a unique index on a JSON expression, foreign keys cannot be created on (or reference) JSON expressions. For a field to reference another column or be referenced by another column, it needs to be promoted into an actual column or perhaps even a table.

Not portable. This is a non-standard extension to SQL specific to PostgreSQL only. It cannot be used with other databases even if they have some JSON support as the actual implementation is probably very different. This means that the application will be locked in to PostgreSQL. In my view, there are worse things than being locked in to an open source technology. However again, it is up to the judgement of the stakeholders to determine whether this meets the needs of the business.

Follow me on Twitter: @tamasczinege