起步
进行数十亿数据压测时,测试环境发生了一个很奇怪的现象:在对 SQLAlchemy 查出的结果做遍历操作时,随着集合越大,遍历的时间跟着变长,是肉眼可见的变长。
根据现象,一位腾讯的程序员向我提出了疑问:会不会遍历过程中 SQLAlchemy 又去查询了数据库?
环境准备
-- DDL
CREATE TABLE `student` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(10) NOT NULL,
`age` int(4) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
-- DML
INSERT INTO `test`.`student`(`name`, `age`) VALUES ('小庭', 16), ('小英', 17), ('小慧', 10);
奇怪的查数据库行为
在使用 SQLAlchemy 时,为了更方便的使用,我自己对 session 做了一层封装:
from contextlib import contextmanager
session = sessionmaker(create_engine("mysql+pymysql://root:123456@192.168.111.136/test"))
@contextmanager
def MSession():
""" session 管理器 """
sess = session()
yield sess
sess.commit()
这样做的好处是,我不用每一次都显式地调用 session() 和 commit()。代码组织起来也好看许多。完整代码如下:
from contextlib import contextmanager
from sqlalchemy import Column, String, Integer, create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext import declarative
Base = declarative.declarative_base()
class Student(Base):
__tablename__ = "student"
id = Column(Integer, primary_key=True, autoincrement=True)
name = Column(String, nullable=False)
age = Column(Integer, nullable=False)
session = sessionmaker(create_engine("mysql+pymysql://root:123456@192.168.111.136/test"))
@contextmanager
def MSession():
""" session 管理器 """
sess = session()
yield sess
sess.commit()
def main():
with MSession() as sess:
# 查询所有学生信息
students = sess.query(Student).all()
# 遍历查询结果
for stu in students:
print(f"查询结果: {stu.id} {stu.name} {stu.age}")
if __name__ == "__main__":
main()
如果就这样执行上述代码,可以看到输出合理、一切正常。但要是打开 SQLAlchemy 的 echo 行为,就会发现事情并没有那么简单。
设置 echo=True:
session = sessionmaker(
create_engine("mysql+pymysql://root:123456@192.168.111.136/test", echo=True)
)
此时再运行代码会看到如下输出(输出的一部分信息):
1. 2020-10-16 17:12:04,147 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2. 22020-10-16 17:12:04,147 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age
FROM student
3. 2020-10-16 17:12:04,148 INFO sqlalchemy.engine.base.Engine {}
4. 2020-10-16 17:12:04,158 INFO sqlalchemy.engine.base.Engine COMMIT
5. 2020-10-16 17:12:04,171 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
6. 2020-10-16 17:12:04,173 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age
FROM student
WHERE student.id = %(param_1)s
7. 2020-10-16 17:12:04,173 INFO sqlalchemy.engine.base.Engine {'param_1': 1}
查询结果: 1 小庭 16
8. 2020-10-16 17:12:04,183 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age
FROM student
WHERE student.id = %(param_1)s
9. 2020-10-16 17:12:04,183 INFO sqlalchemy.engine.base.Engine {'param_1': 2}
查询结果: 2 小英 17
10. 2020-10-16 17:12:04,195 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age
FROM student
WHERE student.id = %(param_1)s
11. 2020-10-16 17:12:04,195 INFO sqlalchemy.engine.base.Engine {'param_1': 3}
查询结果: 3 小慧 10
程序的执行顺序对应到日志从上到下的输出顺序。可以看到,最开始 sqlalchemy 确实执行了查询语句:
SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age FROM student
对应到代码的:
with MSession() as sess:
# 查询所有学生信息
students = sess.query(Student).all()
可是后面的行为就很奇怪了,每 print 一次查询结果之前,就会有一次数据库查询行为:
6. 2020-10-16 17:12:04,173 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age
FROM student
WHERE student.id = %(param_1)s
7. 2020-10-16 17:12:04,173 INFO sqlalchemy.engine.base.Engine {'param_1': 1}
查询结果: 1 小庭 16
...
假设 students 是一个包含 1w 个元素的集合,那么这里就会有 1w 次查询,对应 1w 次 io 操作。且不说这样查询多此一举,重要的是大量 io 操作会拖慢程序的运行速度。
expire\_on\_commit
起初我是在网上找原因,无果。继而到 SQLAlchemy 源码里来回逛几圈,终于在 Session 类的注释中找到了造成上述现象的原因。
# SQLAlchemy 源码,非关键内容省略
class Session(_SessionClassMethods):
"""
Manages persistence operations for ORM-mapped objects.
The Session's usage paradigm is described at :doc:`/orm/session`.
"""
...
def __init__(
self,
bind=None,
autoflush=True,
expire_on_commit=True,
...
):
"""
:param expire_on_commit: Defaults to ``True``. When ``True``, all
instances will be fully expired after each :meth:`~.commit`,
so that all attribute/object access subsequent to a completed
transaction will load from the most recent database state.
"""
大概意思就是,当 expire_on_commit=True
时,commit 之后所有实例都会过期,之后再访问这些过期实例的属性时,SQLAlchemy 会重新去数据库加载实例对应的数据记录。
如何解决
知道原因以后,解决方案就有了。目前我想到了两种:
把用到实例的逻辑移到 commit 操作之前,如:
with MSession() as sess: # 查询所有学生信息 students = sess.query(Student).all() # 遍历查询结果 for stu in students: print(f"查询结果: {stu.id} {stu.name} {stu.age}")
但这样的缺点是,当你需要处理实例的逻辑很耗时时,意味着会产生一个长事务。无论何种情况下,长事务都应该尽量避免。
将
expire_on_commit
置为 False:@contextmanager def MSession(): """ session 管理器 """ sess = session(expire_on_commit=False) # 关闭 expire_on_commit 行为 yield sess sess.commit()
还不快抢沙发