Handling the “N + 1 selects” problem in SQLAlchemy

The "N + 1" selects problem is one of the most common problems one might run into when working on a database-backed web application. In this blog post, I'll describe what this problem exactly is, how ORMs (Object Relational Mappers) come into the picture, and how we can work around 2 variants of these problems when using SQLAlchemy.

ORMs

Most web application frameworks these days provide some sort of an ORM to handle database queries. ORMs can be a double-edged sword. On the one hand they take care of a lot of common database-related issues. On the other hand, using an ORM doesn't mean you can forget about how the underlying database actually works.

A personal rule of thumb: if an ORM doesn't focus on hiding the database layer from you, it's most probably good to use.

In this blog post, we'll be using SQLAlchemy, which is an excellent database toolkit and ORM for Python applications. It has a vast API - I've been working with SQLAlchemy for a bit more than 5 years now and I can't recall a single instance where it didn't solve a use case I had. On top of that, it does not attempt to hide the database layer. Instead, it adds an abstraction layer on top that makes database interactions in Python applications much easier.

To me, SQLAlchemy feels very "Pythonic".

Description

Consider an application that serves as an online book catalog, something like Goodreads. This most definitely involves having a books table and an authors table.

from sqlalchemy import BigInteger, Boolean, Column, ForeignKey, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Author(Base):
    __tablename__ = 'authors'

    id = Column(BigInteger, primary_key=True)
    name = Column(String(255), nullable=False)

    def __init__(self, name):
        self.name = name

class Book(Base):
    __tablename__ = 'books'

    id = Column(BigInteger, primary_key=True)
    title = Column(String(255), nullable=False)
    author_id = Column(BigInteger, ForeignKey('authors.id'), nullable=False)
    is_bestseller = Column(Boolean, default=False)

    author = relationship('Author', backref='books')

    def __init__(self, title, author):
        self.title = title
        self.author = author

The two models are linked via the author relationship. This means that accessing the author property on a Book object would return the associated Author object. Since we've defined a backref, accessing the books property for an Author object would return a list of Book objects that this author wrote.
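
As a quick illustration (a hypothetical sketch; the name and title are just placeholders), the relationship can be traversed from both sides, even before anything is persisted:

# Hypothetical example objects; nothing is persisted yet.
author = Author(name='Jane Doe')
book = Book(title='Some Book', author=author)

print(book.author.name)  # -> 'Jane Doe'
print(author.books)      # the backref gives us this author's books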

Variant 1 - loading objects

SQLAlchemy defines a bunch of strategies for how such related objects should be loaded from the database. The default is lazy (select) loading, which means fetching a Book does not immediately fetch the associated Author.

To see this in action, the database engine needs to first be configured to echo all the queries being executed.

from sqlalchemy.engine import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///:memory:', echo=True)

# Create the tables and a session to work with.
Base.metadata.create_all(engine)

session_factory = sessionmaker(bind=engine)
session = session_factory()
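
For the snippets below to return rows, the tables also need some data. A minimal seed might look like this (the author name and book title are, again, placeholders):

# Hypothetical seed data so the queries below have something to return.
author = Author(name='Jane Doe')
session.add(author)
session.add(Book(title='Some Book', author=author))
session.commit()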

If we now load a book from the database and check its author,

In [1]: book = session.query(Book).get(1)
2017-12-17 09:23:39,833 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2017-12-17 09:23:39,833 INFO sqlalchemy.engine.base.Engine SELECT books.id AS books_id, books.title AS books_title, books.author_id AS books_author_id, books.is_bestseller AS books_is_bestseller
FROM books
WHERE books.id = ?
2017-12-17 09:23:39,833 INFO sqlalchemy.engine.base.Engine (1,)

In [2]: author = book.author
2017-12-17 09:23:43,586 INFO sqlalchemy.engine.base.Engine SELECT authors.id AS authors_id, authors.name AS authors_name
FROM authors
WHERE authors.id = ?
2017-12-17 09:23:43,586 INFO sqlalchemy.engine.base.Engine (1,)

it can be seen that SQLAlchemy is lazy-loading the author.

This is the N + 1 problem: if related objects are lazy-loaded, the ORM has to execute an extra query to load each of them. One query fetches the original result set (say, 100 books), and then one more query runs per object to fetch its related row - N + 1 queries in total.
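
To make the numbers concrete, here is a sketch of the pattern that triggers it (assuming the seed data from above):

books = session.query(Book).all()  # 1 query for all the books

for book in books:
    # Each first access lazy-loads that book's author: up to 1 extra query per book.
    print(book.author.name)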

Fortunately, SQLAlchemy lets you specify which loading strategy to use, and all of the available strategies are extensively documented in the SQLAlchemy docs.

We can ask SQLAlchemy to load related objects using a JOIN, by modifying the author relationship definition as follows.

class Book(Base):
    # ...

    author = relationship('Author', backref='books', lazy='joined')

Defining the author relationship this way tells SQLAlchemy that every time a Book is loaded, the related Author should be loaded as well via a LEFT OUTER JOIN.

In [1]: book = session.query(Book).get(1)
2017-12-17 09:51:04,034 INFO sqlalchemy.engine.base.Engine SELECT books.id AS books_id, books.title AS books_title, books.author_id AS books_author_id, books.is_bestseller AS books_is_bestseller, authors_1.id AS authors_1_id,
authors_1.name AS authors_1_name
FROM books LEFT OUTER JOIN authors AS authors_1 ON authors_1.id = books.author_id
WHERE books.id = ?
2017-12-17 09:51:04,034 INFO sqlalchemy.engine.base.Engine (1,)

If we don't want to load the related Author on every Book fetch, joined loading can also be requested on demand with sqlalchemy.orm.joinedload when writing the database query.

from sqlalchemy.orm import joinedload

books = session.query(Book).\
    options(joinedload(Book.author)).\
    filter(Book.id == 42).\
    one()
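
This matters most when loading many books at once; with the option in place, iterating over the results issues no further queries. A sketch, reusing the models above:

from sqlalchemy.orm import joinedload

books = session.query(Book).\
    options(joinedload(Book.author)).\
    all()

for book in books:
    # The authors were fetched by the JOIN above; no extra SELECTs here.
    print(book.author.name)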

Variant 2 - loading scalars

What if we'd like to load Authors from the database and, for each author, know how many bestsellers they wrote?

We could define a property.

from sqlalchemy import func
from sqlalchemy.orm import object_session
from sqlalchemy.sql import and_

class Author(Base):
    ...

    @property
    def bestseller_count(self):
        return object_session(self).query(func.count(Book.id)).\
            filter(and_(Book.author_id == self.id,
                        Book.is_bestseller.is_(True))).\
            scalar()

Accessing the bestseller_count property on an Author object now returns the total number of bestsellers they wrote.

This works, but we still run into the N + 1 problem.

In [1]: author = session.query(Author).get(1)
2017-12-17 09:58:20,179 INFO sqlalchemy.engine.base.Engine SELECT authors.id AS authors_id, authors.name AS authors_name
FROM authors
WHERE authors.id = ?
2017-12-17 09:58:20,179 INFO sqlalchemy.engine.base.Engine (1,)

In [2]: print(author.bestseller_count)
2017-12-17 09:58:20,180 INFO sqlalchemy.engine.base.Engine SELECT count(books.id) AS count_1
FROM books
WHERE books.author_id = ? AND books.is_bestseller IS 1
2017-12-17 09:58:20,180 INFO sqlalchemy.engine.base.Engine (1,)
0

If we load 100 authors and then access the bestseller_count property on each of them, that means 100 extra SELECTs being executed.
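
In loop form, the problematic pattern looks like this (again just a sketch):

authors = session.query(Author).all()  # 1 query for the authors

for author in authors:
    # Each property access runs its own COUNT query: N extra SELECTs.
    print(author.name, author.bestseller_count)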

There are multiple ways to solve this problem, but one solution is to define a sqlalchemy.orm.column_property.

from sqlalchemy import func, select
from sqlalchemy.orm import column_property
from sqlalchemy.sql import and_

class Author(Base):
    ...

    # Note: `id` below refers to the Author.id Column defined earlier in this
    # class body, and Book must already be defined when this class is created.
    bestseller_count = column_property(
        select([func.count(Book.id)]).\
        where(and_(Book.author_id == id,
                   Book.is_bestseller.is_(True)))
    )

This asks SQLAlchemy to map bestseller_count much like it maps a normal table Column. Defined this way, the count is fetched as a correlated subquery as part of the original query.

In [1]: author = session.query(Author).get(1)
2017-12-17 10:01:07,007 INFO sqlalchemy.engine.base.Engine SELECT (SELECT count(books.id) AS count_1 
FROM books 
WHERE books.author_id = authors.id AND books.is_bestseller IS 1) AS anon_1, authors.id AS authors_id, authors.name AS authors_name 
FROM authors 
WHERE authors.id = ?
2017-12-17 10:01:07,007 INFO sqlalchemy.engine.base.Engine (1,)

And that's pretty cool.
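
As a final check, here is the same loop as before (again just a sketch); with the column_property in place, it hits the database only once:

authors = session.query(Author).all()  # single query, counts included

for author in authors:
    # bestseller_count is already loaded; no further SELECTs are issued.
    print(author.name, author.bestseller_count)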

Conclusion

As I mentioned earlier, you need to be careful when using ORMs. They make life a LOT easier, but you can still run into edge cases like these, where an innocent attribute access ends up sending a lot more queries to your database.

Also, before ending the blog post, I should mention that joined loading is not the solution in all cases. Every loading strategy that SQLAlchemy supports is a good fit for some cases and a bad fit for others. Luckily, the SQLAlchemy documentation on the different relationship loading techniques is an excellent resource.