Handling the “N + 1 selects” problem in SQLAlchemy
#databases , #programming , #python , #sqlalchemy
The "N + 1" selects problem is one of the most common problems one might run into when working on a database-backed web application. In this blog post, I'll describe what this problem exactly is, how ORMs (Object Relational Mappers) come into the picture, and how we can work around 2 variants of these problems when using SQLAlchemy.
Most web application frameworks these days provide some sort of an ORM to handle database queries. ORMs can be a double-edged sword. On the one hand they take care of a lot of common database-related issues. On the other hand, using an ORM doesn't mean you can forget about how the underlying database actually works.
A personal rule of thumb is - if an ORM is not focusing on hiding the database layer, it's most probably good to use.
In this blog post, we'll be using SQLAlchemy, which is an excellent database toolkit and ORM for Python applications. It has a vast API - I've been working with SQLAlchemy for a bit more than 5 years now and I can't recall a single instance where it didn't solve a use case I had. On top of that, it does not attempt to hide the database layer. Instead, it adds an abstraction layer on top that makes all the database-interactions in Python applications much easier.
To me, SQLAlchemy feels very "Pythonic".
Consider an application which serves as an online books catalog, something like
Goodreads. This most definitely involves having a
books table, and an
from sqlalchemy import BigInteger, Boolean, Column, ForeignKey, String from sqlalchemy.ext.declarative import declarative_base from sqlalchemy.orm import relationship Base = declarative_base() class Author(Base): __tablename__ = 'authors' id = Column(BigInteger, primary_key=True) name = Column(String(255), nullable=False) def __init__(self, name): self.name = name class Book(Base): __tablename__ = 'books' id = Column(BigInteger, primary_key=True) title = Column(String(255), nullable=False) author_id = Column(BigInteger, ForeignKey('authors.id'), nullable=False) is_bestseller = Column(Boolean, default=False) author = relationship('Author', backref='books') def __init__(self, title, author): self.title = title self.author = author
The two models are linked via the
author relationship. This means that
author property on a
Book object would return the associated
Author object. Since we've defined a
backref, accessing the
Author object would return a
Book objects that this author
Variant 1 - loading objects
SQLAlchemy defines a bunch of strategies about how such related objects should
be loaded from the database. The default is set to
lazy, which means fetching
Book would not immediately result in fetching the
To see this in action, the database engine needs to first be configured to echo all the queries being executed.
from sqlalchemy.engine import create_engine from sqlalchemy.orm import sessionmaker engine = create_engine('sqlite:///:memory:', echo=True) session_factory = sessionmaker(bind=engine)
If we now load a
book from the database and check its
In : book = session.query(Book).get(1) 2017-12-17 09:23:39,833 INFO sqlalchemy.engine.base.Engine BEGIN (implicit) 2017-12-17 09:23:39,833 INFO sqlalchemy.engine.base.Engine SELECT books.id AS books_id, books.title AS books_title, books.author_id AS books_author_id, books.is_bestseller AS books_is_bestseller FROM books WHERE books.id = ? 2017-12-17 09:23:39,833 INFO sqlalchemy.engine.base.Engine (1,) In : author = book.author 2017-12-17 09:23:43,586 INFO sqlalchemy.engine.base.Engine SELECT authors.id AS authors_id, authors.name AS authors_name FROM authors WHERE authors.id = ? 2017-12-17 09:23:43,586 INFO sqlalchemy.engine.base.Engine (1,)
it can be seen that SQLAlchemy is lazy-loading the
This is what the N + 1 problem is. If related objects are lazy loaded, the ORM has to execute extra queries to load them. And if the original resultset is large (assume 100 objects), then this involves one additional query for each original object.
Fortunately, SQLAlchemy lets you specify what kind of loading strategy to use, all of which are extensively documented here.
We can ask SQLAlchemy to load related objects using a
author relationship definition as follows.
class Book(Base): # ... author = relationship('Author', backref='books', lazy='joined')
author relationship this way tells SQLAlchemy that every time a
Book is loaded, the related
Author should be loaded as well via a
LEFT OUTER JOIN.
In : book = session.query(Book).get(1) 2017-12-17 09:51:04,034 INFO sqlalchemy.engine.base.Engine SELECT books.id AS books_id, books.title AS books_title, books.author_id AS books_author_id, books.is_bestseller AS books_is_bestseller, authors_1.id AS authors_1_id, authors_1.name AS authors_1_name FROM books LEFT OUTER JOIN authors AS authors_1 ON authors_1.id = books.author_id WHERE books.id = ? 2017-12-17 09:51:04,034 INFO sqlalchemy.engine.base.Engine (1,)
In case we don't want to load related
Authors on every
Book fetch, this can
also be done on-demand using
sqlalchemy.orm.joinedload when writing the
from sqlalchemy.orm import joinedload books = session.query(Book).\ options(joinedload(Book.author)).\ filter(Book.id == 42).\ one()
Variant 2 - loading scalars
What if there was the use case that we'd like to load
Author from the
database, and for each author, we'd like to know how many bestsellers they
We could define a
from sqlalchemy import func from sqlalchemy.orm import object_session from sqlalchemy.sql import and_ class Author(Base): ... @property def bestseller_count(self): return object_session(self).query(func.count(Book.id)).\ filter(and_(Book.author_id == self.id, Book.is_bestseller.is_(True))).\ scalar()
bestseller_count property on
Author objects is now going to
return the total number of bestseller books they wrote.
This works, but we still run into the N + 1 problem.
In : author = session.query(Author).get(1) 2017-12-17 09:58:20,179 INFO sqlalchemy.engine.base.Engine SELECT authors.id AS authors_id, authors.name AS authors_name FROM authors WHERE authors.id = ? 2017-12-17 09:58:20,179 INFO sqlalchemy.engine.base.Engine (1,) In : print(author.bestseller_count) 2017-12-17 09:58:20,180 INFO sqlalchemy.engine.base.Engine SELECT count(books.id) AS count_1 FROM books WHERE books.author_id = ? AND books.is_bestseller IS 1 2017-12-17 09:58:20,180 INFO sqlalchemy.engine.base.Engine (1,) 0
If we load 100 authors and then access the
bestseller_count property on all of
them, this means a 100 extra selects being executed.
There are multiple ways to solve this problem, but one solution is to define a
from sqlalchemy import func, select from sqlalchemy.orm import column_property from sqlalchemy.sql import and_ class Author(Base): ... bestseller_count = column_property( select([func.count(Book.id)]).\ where(and_(Book.author_id == id, Book.is_bestseller.is_(True))) )
This asks SQLAlchemy to map the
bestseller_count similar to how it defines a
Column. Defining it this way makes SQLAlchemy load the count in
the original query.
In : author = session.query(Author).get(1) 2017-12-17 10:01:07,007 INFO sqlalchemy.engine.base.Engine SELECT (SELECT count(books.id) AS count_1 FROM books WHERE books.author_id = authors.id AND books.is_bestseller IS 1) AS anon_1, authors.id AS authors_id, authors.name AS authors_name FROM authors WHERE authors.id = ? 2017-12-17 10:01:07,007 INFO sqlalchemy.engine.base.Engine (1,)
And that's pretty cool.
Like I mentioned earlier, you need to be careful when using ORMs. They make life a LOT easier, but you might still run into edge-cases like this when an innocent property access can end up sending a lot more queries to your database.
Also, before ending the blog post, I need to mention that joined loading is not the solution in all cases. Each and every loading strategy that SQLAlchemy supports is good for some cases and bad for some others. Luckily, the SQLAlchemy documentation on the different relationship loading techniques is an excellent resource.