The "N + 1" selects problem is one of the most common problems one might run into when working on a database-backed web application. In this blog post, I'll describe what this problem exactly is, how ORMs (Object Relational Mappers) come into the picture, and how we can work around 2 variants of these problems when using SQLAlchemy.
ORMs
Most web application frameworks these days provide some sort of an ORM to handle database queries. ORMs can be a double-edged sword. On the one hand they take care of a lot of common database-related issues. On the other hand, using an ORM doesn't mean you can forget about how the underlying database actually works.
A personal rule of thumb is - if an ORM is not focusing on hiding the database layer, it's most probably good to use.
In this blog post, we'll be using SQLAlchemy, which is an excellent database toolkit and ORM for Python applications. It has a vast API - I've been working with SQLAlchemy for a bit more than 5 years now and I can't recall a single instance where it didn't solve a use case I had. On top of that, it does not attempt to hide the database layer. Instead, it adds an abstraction layer on top that makes all the database-interactions in Python applications much easier.
To me, SQLAlchemy feels very "Pythonic".
Description
Consider an application which serves as an online books catalog, something like
Goodreads. This most definitely involves having a books
table, and an
authors
table.
from sqlalchemy import BigInteger, Boolean, Column, ForeignKey, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
Base = declarative_base()
class Author(Base):
__tablename__ = 'authors'
id = Column(BigInteger, primary_key=True)
name = Column(String(255), nullable=False)
def __init__(self, name):
self.name = name
class Book(Base):
__tablename__ = 'books'
id = Column(BigInteger, primary_key=True)
title = Column(String(255), nullable=False)
author_id = Column(BigInteger, ForeignKey('authors.id'), nullable=False)
is_bestseller = Column(Boolean, default=False)
author = relationship('Author', backref='books')
def __init__(self, title, author):
self.title = title
self.author = author
The two models are linked via the author
relationship. This means that
accessing the author
property on a Book
object would return the associated
Author
object. Since we've defined a backref
, accessing the books
property
for an Author
object would return a list
of Book
objects that this author
wrote.
Variant 1 - loading objects
SQLAlchemy defines a bunch of strategies about how such related objects should
be loaded from the database. The default is set to lazy
, which means fetching
a Book
would not immediately result in fetching the Author
.
To see this in action, the database engine needs to first be configured to echo all the queries being executed.
from sqlalchemy.engine import create_engine
from sqlalchemy.orm import sessionmaker
engine = create_engine('sqlite:///:memory:', echo=True)
session_factory = sessionmaker(bind=engine)
If we now load a book
from the database and check its author
,
In [1]: book = session.query(Book).get(1)
2017-12-17 09:23:39,833 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2017-12-17 09:23:39,833 INFO sqlalchemy.engine.base.Engine SELECT books.id AS books_id, books.title AS books_title, books.author_id AS books_author_id, books.is_bestseller AS books_is_bestseller
FROM books
WHERE books.id = ?
2017-12-17 09:23:39,833 INFO sqlalchemy.engine.base.Engine (1,)
In [2]: author = book.author
2017-12-17 09:23:43,586 INFO sqlalchemy.engine.base.Engine SELECT authors.id AS authors_id, authors.name AS authors_name
FROM authors
WHERE authors.id = ?
2017-12-17 09:23:43,586 INFO sqlalchemy.engine.base.Engine (1,)
it can be seen that SQLAlchemy is lazy-loading the author
.
This is what the N + 1 problem is. If related objects are lazy loaded, the ORM has to execute extra queries to load them. And if the original resultset is large (assume 100 objects), then this involves one additional query for each original object.
Fortunately, SQLAlchemy lets you specify what kind of loading strategy to use, all of which are extensively documented here.
We can ask SQLAlchemy to load related objects using a JOIN
statement,
modifying the author
relationship definition as follows.
class Book(Base):
# ...
author = relationship('Author', backref='books', lazy='joined')
Defining the author
relationship this way tells SQLAlchemy that every time a
Book
is loaded, the related Author
should be loaded as well via a LEFT OUTER JOIN
.
In [1]: book = session.query(Book).get(1)
2017-12-17 09:51:04,034 INFO sqlalchemy.engine.base.Engine SELECT books.id AS books_id, books.title AS books_title, books.author_id AS books_author_id, books.is_bestseller AS books_is_bestseller, authors_1.id AS authors_1_id,
authors_1.name AS authors_1_name
FROM books LEFT OUTER JOIN authors AS authors_1 ON authors_1.id = books.author_id
WHERE books.id = ?
2017-12-17 09:51:04,034 INFO sqlalchemy.engine.base.Engine (1,)
In case we don't want to load related Author
s on every Book
fetch, this can
also be done on-demand using sqlalchemy.orm.joinedload
when writing the
database query.
from sqlalchemy.orm import joinedload
books = session.query(Book).\
options(joinedload(Book.author)).\
filter(Book.id == 42).\
one()
Variant 2 - loading scalars
What if there was the use case that we'd like to load Author
from the
database, and for each author, we'd like to know how many bestsellers they
wrote.
We could define a property
.
from sqlalchemy import func
from sqlalchemy.orm import object_session
from sqlalchemy.sql import and_
class Author(Base):
...
@property
def bestseller_count(self):
return object_session(self).query(func.count(Book.id)).\
filter(and_(Book.author_id == self.id,
Book.is_bestseller.is_(True))).\
scalar()
Accessing the bestseller_count
property on Author
objects is now going to
return the total number of bestseller books they wrote.
This works, but we still run into the N + 1 problem.
In [1]: author = session.query(Author).get(1)
2017-12-17 09:58:20,179 INFO sqlalchemy.engine.base.Engine SELECT authors.id AS authors_id, authors.name AS authors_name
FROM authors
WHERE authors.id = ?
2017-12-17 09:58:20,179 INFO sqlalchemy.engine.base.Engine (1,)
In [2]: print(author.bestseller_count)
2017-12-17 09:58:20,180 INFO sqlalchemy.engine.base.Engine SELECT count(books.id) AS count_1
FROM books
WHERE books.author_id = ? AND books.is_bestseller IS 1
2017-12-17 09:58:20,180 INFO sqlalchemy.engine.base.Engine (1,)
0
If we load 100 authors and then access the bestseller_count
property on all of
them, this means a 100 extra selects being executed.
There are multiple ways to solve this problem, but one solution is to define a
sqlalchemy.orm.column_property
.
from sqlalchemy import func, select
from sqlalchemy.orm import column_property
from sqlalchemy.sql import and_
class Author(Base):
...
bestseller_count = column_property(
select([func.count(Book.id)]).\
where(and_(Book.author_id == id,
Book.is_bestseller.is_(True)))
)
This asks SQLAlchemy to map the bestseller_count
similar to how it defines a
normal table Column
. Defining it this way makes SQLAlchemy load the count in
the original query.
In [1]: author = session.query(Author).get(1)
2017-12-17 10:01:07,007 INFO sqlalchemy.engine.base.Engine SELECT (SELECT count(books.id) AS count_1
FROM books
WHERE books.author_id = authors.id AND books.is_bestseller IS 1) AS anon_1, authors.id AS authors_id, authors.name AS authors_name
FROM authors
WHERE authors.id = ?
2017-12-17 10:01:07,007 INFO sqlalchemy.engine.base.Engine (1,)
And that's pretty cool.
Conclusion
Like I mentioned earlier, you need to be careful when using ORMs. They make life a LOT easier, but you might still run into edge-cases like this when an innocent property access can end up sending a lot more queries to your database.
Also, before ending the blog post, I need to mention that joined loading is not the solution in all cases. Each and every loading strategy that SQLAlchemy supports is good for some cases and bad for some others. Luckily, the SQLAlchemy documentation on the different relationship loading techniques is an excellent resource.