Post on 30-Oct-2019
How (and why!) to build a Django based project with
SQLAlchemy Core for data analysis
Hi! I'm Gleb Pushkov
Software developer, 6+ years (Python & Django)
Kyiv, Ukraine
glib.pushkov@gmail.com
https://github.com/glebtor
Link to slides
Why do we need SQLAlchemy Core in Django app?
your application mostly works with aggregations
you have a lot of data
you need precise and performant queries
you’re building advanced queries dynamically
you’re transforming complex queries from SQL to Python
database is not natively supported by Django
(e.g. SQL Azure, Sybase, Firebird)
You’re building some kind of Data-Analysis app, e.g.:
Cool new features:
Subqueries
Window functions
FilteredRelation
Conditional Expressions
Date, Math, Text functions
Custom db constraints
You have fewer reasons to switch to raw SQL!
But... Django ORM has its specifics
Property.objects.filter(city__startswith='K').select_related('owner')[:5]

SELECT "properties"."id", "users"."username" ...
FROM "properties"
LEFT OUTER JOIN "users" ON ("properties"."owner_id" = "users"."id")
WHERE "properties"."city" LIKE 'K%'
LIMIT 5
SQL getting simpler - python query getting complex

(Property.objects
    .filter(city__startswith='K')
    .annotate(owner_name=F('owner__username'))
    .values('id', 'owner_name')[:5])

SELECT "properties"."id", "users"."username" AS "owner_name"
FROM "properties"
LEFT OUTER JOIN "users" ON ("properties"."owner_id" = "users"."id")
WHERE "properties"."city" LIKE 'K%'
LIMIT 5
We JOIN all rows, and only then apply the LIMIT.
Explain
What we want to get
SELECT "properties_by_city"."id", "users"."username"
FROM (
    SELECT "properties"."id" AS "id", "properties"."owner_id" AS "owner_id"
    FROM "properties"
    WHERE "properties"."city" LIKE 'K%'
    LIMIT 5
) AS "properties_by_city"
LEFT OUTER JOIN "users" ON "users"."id" = "properties_by_city"."owner_id"
Looks good!
Explain
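This target query can be sanity-checked end to end; a minimal sketch using stdlib sqlite3 as a stand-in for Postgres (schema and sample rows are made up for the example):

```python
import sqlite3

# sqlite3 stands in for Postgres; schema and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE properties (id INTEGER PRIMARY KEY, owner_id INTEGER, city TEXT);
    INSERT INTO users VALUES (1, 'johndoe'), (2, 'janedoe');
    INSERT INTO properties VALUES
        (1, 1, 'Kyiv'), (2, 2, 'Kharkiv'), (3, 1, 'Lviv'),
        (4, 2, 'Kherson'), (5, 1, 'Kropyvnytskyi'), (6, 2, 'Kovel');
""")

# LIMIT is applied inside the derived table, before the JOIN,
# so at most 5 property rows ever reach the join.
rows = conn.execute("""
    SELECT properties_by_city.id, users.username
    FROM (SELECT id, owner_id FROM properties
          WHERE city LIKE 'K%' LIMIT 5) AS properties_by_city
    LEFT OUTER JOIN users ON users.id = properties_by_city.owner_id
""").fetchall()
print(len(rows))  # 5
```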
How it will look for SQLAlchemy Core
properties_by_city = (
    select([properties.c.uuid, properties.c.owner_id])
    .select_from(properties)
    .where(properties.c.city.like('K%'))
    .limit(5)
    .alias()
)

query = (
    select([properties_by_city.c.uuid, users.c.username])
    .select_from(properties_by_city.outerjoin(users))
)
How it will look for Django ORM
You can’t build such queries!
In ORM world everything is tied to models (in python) and tables (in db)
We can use `Subquery` in SELECT, WHERE, HAVING, but not in FROM. The "root" of a query is a model/table.
How similar query will look for Django ORM
properties_by_city = (
    Property.objects.filter(city__startswith='K')[:5].values('pk')
)

(Property.objects
    .filter(pk__in=Subquery(properties_by_city))
    .annotate(owner_name=F('owner__username'))
    .values('pk', 'owner_name'))
Looks good!
Explain
With Django ORM you get everything / with SQLAlchemy Core you get only what you asked for - they're on different layers
Django ORM       | SQLAlchemy
-----------------+------------------
Django ORM       | SQLAlchemy ORM
Non-public API   | SQLAlchemy Core
Raw SQL          | Raw SQL

Model.objects.filter(...).values(...).annotate(...).filter(...)
select([...]).select_from(...).where(...).group_by(...).having(...)

SQL:
SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ...
There is a distance between SQL & the ORM layer, so sometimes it's not clear which query will be generated
There is no freedom on ORM level!
SQLAlchemy example
Select properties by criteria
level1 = (
    select([
        properties.c.building_id,
        properties.c.sale_price,
        properties.c.owner_id,
    ])
    .select_from(properties)
    .where(properties.c.selling_status == 'for_sale')
    .where(properties.c.sale_price != None)
    .alias()
)
Join usernames
level2 = (
    select([
        level1.c.building_id,
        level1.c.sale_price,
        users.c.username,
    ])
    .select_from(level1.outerjoin(users))
    .alias()
)
Group by
level3 = (
    select([
        level2.c.building_id,
        func.count(level2.c.building_id).label('apartments_count'),
        func.sum(level2.c.sale_price).label('sum_price'),
        func.array_agg(level2.c.username).label('users'),
    ])
    .select_from(level2)
    .group_by(level2.c.building_id)
    .alias()
)
One more join at the top
level4 = (
    select([
        properties.c.total_apartments,
        level3.c.apartments_count,
        level3.c.sum_price,
        level3.c.users,
    ])
    .select_from(level3.join(properties, properties.c.uuid == level3.c.building_id))
)
-- level4 - join
SELECT properties.number_of_units, anon_1.apartments_count, anon_1.sum_price, anon_1.users
FROM (
    -- level3 - group by
    SELECT anon_2.building_id AS building_id,
           count(anon_2.building_id) AS apartments_count,
           sum(anon_2.sale_price) AS sum_price,
           array_agg(anon_2.username) AS users
    FROM (
        -- level2 - join
        SELECT anon_3.building_id AS building_id,
               anon_3.sale_price AS sale_price,
               users.username AS username
        FROM (
            -- level1
            SELECT properties.building_id AS building_id,
                   properties.sale_price AS sale_price,
                   properties.owner_id AS owner_id
            FROM properties
            WHERE properties.selling_status = :selling_status_1
              AND properties.sale_price IS NOT NULL
        ) AS anon_3
        LEFT OUTER JOIN users ON users.uuid = anon_3.owner_id
    ) AS anon_2
    GROUP BY anon_2.building_id
) AS anon_1
JOIN properties ON properties.uuid = anon_1.building_id
Wow! SQLAlchemy <3
Aggregation in subquery
SELECT * FROM "properties" WHERE "properties"."sale_price" = (
    SELECT MIN("properties"."sale_price") FROM "properties" WHERE "properties"."sale_price" > 1000000
)
SQL we want to get
Situation: we want to find the first price bigger than 1,000,000, and then get all properties with exactly that price.
So we need a MIN aggregation.
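A quick way to validate the single-query shape, using stdlib sqlite3 and invented prices:

```python
import sqlite3

# Invented prices; sqlite3 stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE properties (id INTEGER PRIMARY KEY, sale_price INTEGER);
    INSERT INTO properties VALUES
        (1, 900000), (2, 1234567), (3, 1234567), (4, 2000000);
""")

# One round-trip: the MIN aggregation lives in a scalar subquery.
rows = conn.execute("""
    SELECT * FROM properties WHERE sale_price = (
        SELECT MIN(sale_price) FROM properties WHERE sale_price > 1000000
    )
""").fetchall()
print(rows)  # [(2, 1234567), (3, 1234567)]
```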
Aggregation in subquery
min_price = Property.objects.filter(
    sale_price__gte=1000000
).aggregate(Min('sale_price'))['sale_price__min']  # evaluated :(
Property.objects.filter(sale_price=min_price)
SELECT * FROM "properties" WHERE "properties"."sale_price" = 1234567
We performed two separate queries
Aggregation in subquery - take 2 - subquery
min_price = (
    Property.objects.filter(sale_price__gte=1000000)
    .values('sale_price')
    .order_by('sale_price')[:1]
)
Property.objects.filter(sale_price=Subquery(min_price))
... WHERE "properties"."sale_price" = (SELECT U0."sale_price" FROM "properties" U0 WHERE U0."sale_price" >= 1000000 ORDER BY U0."sale_price" ASC LIMIT 1)
Now we have ORDER BY and LIMIT... too complicated for the code & the db
Aggregation in subquery - take 3 - not recommended
min_price_queryset = Property.objects.filter(sale_price__gte=1000000)
min_price_queryset.query.add_annotation(
    Min('sale_price'), 'min_price', is_summary=True
)
Property.objects.filter(
    sale_price=Subquery(min_price_queryset.values('min_price'))
)
WHERE "properties"."sale_price" = (SELECT MIN(U0."sale_price") AS "min_price" FROM "properties" U0 WHERE U0."sale_price" >= 1000000)
Aggregation in subquery - take 4 - template
class MinSalePrice(Subquery):
    template = "(SELECT MIN(sale_price) FROM (%(subquery)s) _subq)"
    output_field = models.IntegerField()

filtered_properties = Property.objects.filter(sale_price__gte=1000000)
Property.objects.filter(sale_price=MinSalePrice(filtered_properties))
Generated SQL is fine, but such an approach is a hack and ....
Aggregation in subquery - SQLAlchemy
min_price = (
    select([func.min(properties.c.sale_price)])
    .select_from(properties)
    .where(properties.c.sale_price >= 1000000)
)

query = (
    select([properties.c.id])
    .select_from(properties)
    .where(properties.c.sale_price == min_price)
)
Joins:
You can't join tables of non-related models
You can't perform RIGHT OUTER JOIN… yes, it’s very rare :)
Django decides for you which join type to apply (INNER or LEFT OUTER)
and always generates
JOIN "table2" ON ("table1"."table2_id" = "table2"."id")
which can be customized a bit by FilteredRelation
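For the missing RIGHT OUTER JOIN, the usual workaround is to swap the table order in a LEFT OUTER JOIN; a sqlite3 sketch with a made-up schema:

```python
import sqlite3

# Made-up minimal schema; sqlite3 stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE properties (id INTEGER PRIMARY KEY, owner_id INTEGER);
    INSERT INTO users VALUES (1, 'johndoe'), (2, 'janedoe');
    INSERT INTO properties VALUES (10, 1);
""")

# "properties RIGHT OUTER JOIN users" rewritten as
# "users LEFT OUTER JOIN properties" - same rows, swapped table order.
rows = conn.execute("""
    SELECT users.username, properties.id
    FROM users LEFT OUTER JOIN properties
        ON properties.owner_id = users.id
""").fetchall()
print(rows)  # [('johndoe', 10), ('janedoe', None)]
```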
Not supported by Django (yet)
Recursive CTE
Could be done via:
raw SQL
django-cte-forest (implemented via 'extra', has limitations)
SQLAlchemy
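What a recursive CTE gives you in raw SQL, sketched on stdlib sqlite3 (the categories table is invented for the example):

```python
import sqlite3

# Invented self-referencing table; sqlite3 supports WITH RECURSIVE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO categories VALUES (1, NULL), (2, 1), (3, 2), (4, NULL);
""")

# Category 1 plus all of its descendants, in one query.
rows = conn.execute("""
    WITH RECURSIVE tree(id) AS (
        SELECT id FROM categories WHERE id = 1
        UNION ALL
        SELECT c.id FROM categories c JOIN tree t ON c.parent_id = t.id
    )
    SELECT id FROM tree
""").fetchall()
print(sorted(r[0] for r in rows))  # [1, 2, 3]
```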
Combining multiple aggregations © Django docs <3

>>> book = Book.objects.first()
>>> book.authors.count()
2
>>> book.store_set.count()
3
>>> q = Book.objects.annotate(Count('authors'), Count('store'))
>>> q[0].authors__count
6
>>> q[0].store__count
6
Count('field', distinct=True) will fix this query, but other aggregations will not work as expected!
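The row multiplication behind those doubled counts is plain SQL behavior and can be reproduced with stdlib sqlite3 (made-up authors/stores tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (book_id INTEGER, author TEXT);
    CREATE TABLE stores  (book_id INTEGER, store  TEXT);
    INSERT INTO authors VALUES (1, 'a1'), (1, 'a2');
    INSERT INTO stores  VALUES (1, 's1'), (1, 's2'), (1, 's3');
""")

# Joining both tables multiplies rows: 2 authors x 3 stores = 6 joined
# rows, so both plain COUNTs see 6 instead of 2 and 3.
counts_joined = conn.execute("""
    SELECT COUNT(authors.author), COUNT(stores.store)
    FROM authors JOIN stores ON authors.book_id = stores.book_id
""").fetchone()
print(counts_joined)  # (6, 6)

# COUNT(DISTINCT ...) repairs the counts, but SUM/AVG over the
# multiplied rows would still be wrong.
counts_distinct = conn.execute("""
    SELECT COUNT(DISTINCT authors.author), COUNT(DISTINCT stores.store)
    FROM authors JOIN stores ON authors.book_id = stores.book_id
""").fetchone()
print(counts_distinct)  # (2, 3)
```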
Hard to read advanced queries
Hard to understand what's going on at the SQL level
Takes time & effort to convert SQL to python
Can’t control / change some parts of generated SQL
Queries may be inefficient
To sum up...
Usually all of the above is not a problem in 95%* of cases
* Just a number from my head
your application mostly works with aggregations
you have a lot of data
you need precise and performant queries
you’re transforming complex queries from SQL to Python
you’re building advanced queries dynamically
database is not natively supported by Django
(e.g. SQL Azure, Sybase, Firebird)
But only when you’re building some kind of Data-Analysis app, e.g.:
Ok, how to start??
1. Create `Engine` as a global variable and describe your connection
Engine = Pool + Dialect -> DBAPI connect() -> Database
QueuePool is default, to disable pooling use NullPool
sa_engine = create_engine(
    settings.DB_CONNECTION_URL,
    pool_recycle=settings.POOL_RECYCLE,
)
Pooling

4 uWSGI workers x 8 threads; each worker holds a Django connection and SQLAlchemy (NullPool) connections, with pgbouncer in front of Postgres.

1 instance produces up to 64 connections
1 connection to Postgres ~ 10 MB of RAM
124 connections == ~1.2 GB of RAM
2. Define tables

If you have models for tables:
Re-use django models (aldjemy)
Table reflection (django-sabridge)

No models:
Table reflection: messages = Table('messages', meta, autoload=True)
Define explicitly
Define inline with expressions
Define explicitly
users = Table('users', metadata,
    Column('id', Integer, primary_key=True),
    Column('username', String(150), nullable=False),
    Column('email', String(254)),
    Column('role', String(64), nullable=False),
)
Define explicitly
Or even keep it simple; but to simplify `join`, describe ForeignKeys:
users = Table('users', metadata,
    Column('id'),
    Column('username'),
    Column('email'),
    Column('role'),
)
properties = Table('properties', metadata,
    Column('owner_id', None, ForeignKey('users.id')),
    ...
)
Usage
Each returned row is a RowProxy:
all_users = engine.execute(
    select([users.c.username, users.c.email]).select_from(users)
).fetchall()
[('johndoe', 'john_doe@example.com'), ('janedoe', 'jane_doe@example.com')]

all_users[0].username / all_users[0]['username'] / all_users[0][0]
Define inline with expressions

from sqlalchemy import table, column

engine.execute(
    select([column('username'), column('email')])
    .select_from(table('users'))
).fetchall()

# Or if you need columns to be associated with tables:
user = table('user', column('id'), column('username'))
# then queries like: select([user.c.username, ...])
That’s all, start building your fancy queries!
But what about tests?
Switch connection!
def create_sa_engine(connection_url):
    extra = {...}
    if "pytest" in sys.modules:
        connection_url = _get_test_db_url(connection_url)
    return create_engine(connection_url, **extra)

engine = create_sa_engine(settings.REMOTE_DB_CONNECTION_URL)
As `engine` is a global variable, it will be evaluated earlier than any call of django.test.override_settings or similar approaches. The test db will be created by Django (if it's listed in settings).
Pytest + ResultProxy cursor issue
return self.process_rows(result_proxy)  # iterates over ResultProxy

If you have an exception during iteration over a cursor (ResultProxy), pytest will hang forever.
We have to close the cursor explicitly.
def close_cursors(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            for arg in chain(args, kwargs.values()):
                if isinstance(arg, ResultProxy):
                    arg.close()
            raise e
    return wrapper

@close_cursors
def process_rows(result_proxy: ResultProxy):
    ...
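The same decorator can be exercised without a database; in this sketch a FakeResultProxy stub stands in for SQLAlchemy's ResultProxy:

```python
from functools import wraps
from itertools import chain

class FakeResultProxy:
    """Stub standing in for sqlalchemy's ResultProxy: only tracks close()."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def close_cursors(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Close every cursor-like argument before re-raising,
            # so the test run does not hang on an open cursor.
            for arg in chain(args, kwargs.values()):
                if isinstance(arg, FakeResultProxy):
                    arg.close()
            raise
    return wrapper

@close_cursors
def process_rows(result_proxy):
    raise ValueError("boom mid-iteration")

rp = FakeResultProxy()
try:
    process_rows(rp)
except ValueError:
    pass
print(rp.closed)  # True
```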
TestCases & connections
TestCase - wraps test with a transaction and performs a rollback
TransactionTestCase - code is not wrapped in a transaction; truncates all tables
Application -> Django connection + SQLAlchemy connections -> Database (Read Committed)
When to use TestCase
if you write tests for code which works with only one of the connections
and test data is populated via the same connection

Keep in mind:
if tables are populated via the SQLAlchemy connection, you have to clean them up yourself;
it's possible to share data between connections, but it requires changing the transaction isolation level (READ UNCOMMITTED), which I would not recommend.
When to use TransactionTestCase
to test code which works with both connections
(no issues because of autocommit behavior)

Keep in mind:
these tests are slower
model-related tables are flushed automatically;
other tables have to be cleaned up by yourself (if you have such)
Drawbacks
A bit hard to start
Can't easily get a final SQL query with parameters
Slower tests
More connections to database
Can't reuse libraries which work with querysets (e.g. django-filters, pagination)
Benefits
Full control over SQL
Faster to express SQL in Python code
Easier to build application-specific SQL-generation layer
Readability & maintainability
Performance
Questions?
Thank you for your attention! Link to slides