Tips on Django ORM aggregation
Let's have Django 2 ORM models:from enum import Enum, unique from django.db import models @unique class TaskStatus(Enum): NOT_STARTED = 0 IN_PROGRESS = 1 DEPLOY = 2 DONE = 3 @unique class TaskPriority(Enum): LOW = 0 MEDIUM = 1 HIGH = 2 CRITICAL = 3 class User(models.Model): username = models.CharField(max_length=50) class Task(models.Model): # each task assigned to a user, many tasks per user assigned_to = models.ForeignKey(User, related_name='tasks') title = models.CharField(max_length=250) status = models.IntegerField(choices=[(v.value, v.name) for v in TaskStatus]) priority = models.IntegerField(choices=[(v.value, v.name) for v in TaskPriority]) hours_spent = models.DecimalField(max_digits=10, decimal_places=2)and sample data:
User | Task | Priority | Status | Hours spent |
---|---|---|---|---|
alex | T1 | medium | in progress | 4 |
T2 | high | deploy | 2 | |
john | T5 | medium | not started | 0 |
T6 | critical | done | 3 | |
T7 | medium | in progress | 5 |
Data aggregation
When we calculate value from many data values, it's called data aggregation or applying aggregation function. The function result could be called the data summary. Examples of aggregation functions: values count, sum, average, minumum or maximum.
Let's calculate some summaries for tasks from the example above:
- 5 tasks
- 2 tasks in progress
- 2 tasks for alex and 3 tasks for john
4 + 2 + 3 + 5 = 14
hours spent on tasks- 6 hours spent on alex tasks, 8 hours spent on john tasks
Calculate summaries with Django ORM
We can calculate tasks counts with QuerySet.count()
:
Task.objects.count() # all tasks count Task.objects.filter(in_progress=TaskStatus.IN_PROGRESS).count() # in progress tasks countor with
QuerySet.aggregate()
, and not only counts:
from django.db.models import Count, Sum, Q Task.objects.aggregate( tasks_cnt=Count('id'), # all tasks count active_tasks_cnt=Count('id', filter=Q(status=TaskStatus.IN_PROGRESS)), # active tasks hours_spent=Sum('hours_spent'), # hours spent on all tasks )or per user with
QuerySet.annotate()
:
User.objects.annotate( tasks_cnt=Count('tasks'), # all tasks count active_tasks_cnt=Count('task_set', filter=Q(status=TaskStatus.IN_PROGRESS)), hours_spent=Sum('tasks__hours_spent'), # hours spent on all tasks )
What's the difference?
QuerySet.count()
produces only one value so need one SQL query per count: all, active etcQuerySet.aggregate()
,QuerySet.annotate()
calculate many values with one SQL query.- There's no other one-value functions like
QuerySet.count()
so have to useaggregate()
if calculate sum, max, average etc. aggregate()
andannotate()
has similar syntax but the result type is different:dict
foraggregate()
andQuerySet
forannotate()
. The latter could be used for otherQuerySet
operations likefilter()
,count()
.
Emulate SQL GROUP BY with annotate()
applied to QuerySet.values()
There are two types of QuerySet
: model and values.
Annotating a model query set adds the summaries properties to the each model instance from the annotated QuerySet
.
If you annotate QuerySet
of 2 users with tasks_cnt
, active_tasks_cnt
, hours_spent
values then
you get 2 User
objects with 3 extra properties each: tasks_cnt
, active_tasks_cnt
, hours_spent
.
What if you want to know how many tasks are per each status? In SQL it's done via GROUP BY
:
SELECT status, COUNT(id) FROM task GROUP BY statusHow to do the same with Django ORM:
In: Task.objects.values('status') Out: <QuerySet [{'status': 0}, {'status': 1}, {'status': 2}, {'status': 3}]> In: Task.objects.values('status').annotate(cnt=Count('id')) Out: <QuerySet [{'cnt': 1, 'status': 0}, {'cnt': 2, 'status': 1}, {'cnt': 1, 'status': 2}, {'cnt': 1, 'status': 3}]>
annotate()
called on the QuerySet.values()
adds the summary value entry to the each values dict. The values()
parameter is the field for GROUP BY
column.
Aggregation filtering in Django 2 and Django 1.x
Count
, Sum
and other Aggregate
sub-classes accept the filter parameter in Django 2 version so the aggregated data filtering is simple:
Task.objects.aggregate( active_tasks_cnt=Count('id', filter=Q(status=TaskStatus.IN_PROGRESS) )Hovewer it's not supported in Django 1.x and requires
Case
trick:
from django.db.models import Count, Case, When, IntegerField Task.objects.aggregate( active_tasks_cnt=Count(Case( When(status=TaskStatus.IN_PROGRESS, then=1), output_field=IntegerField() )) )