Tips on Django ORM aggregation
Let's have Django 2 ORM models:from enum import Enum, unique from django.db import models @unique class TaskStatus(Enum): NOT_STARTED = 0 IN_PROGRESS = 1 DEPLOY = 2 DONE = 3 @unique class TaskPriority(Enum): LOW = 0 MEDIUM = 1 HIGH = 2 CRITICAL = 3 class User(models.Model): username = models.CharField(max_length=50) class Task(models.Model): # each task assigned to a user, many tasks per user assigned_to = models.ForeignKey(User, related_name='tasks') title = models.CharField(max_length=250) status = models.IntegerField(choices=[(v.value, v.name) for v in TaskStatus]) priority = models.IntegerField(choices=[(v.value, v.name) for v in TaskPriority]) hours_spent = models.DecimalField(max_digits=10, decimal_places=2)and sample data:
| User | Task | Priority | Status | Hours spent |
|---|---|---|---|---|
| alex | T1 | medium | in progress | 4 |
| T2 | high | deploy | 2 | |
| john | T5 | medium | not started | 0 |
| T6 | critical | done | 3 | |
| T7 | medium | in progress | 5 |
Data aggregation
When we calculate value from many data values, it's called data aggregation or applying aggregation function. The function result could be called the data summary. Examples of aggregation functions: values count, sum, average, minumum or maximum.
Let's calculate some summaries for tasks from the example above:
- 5 tasks
- 2 tasks in progress
- 2 tasks for alex and 3 tasks for john
4 + 2 + 3 + 5 = 14hours spent on tasks- 6 hours spent on alex tasks, 8 hours spent on john tasks
Calculate summaries with Django ORM
We can calculate tasks counts with QuerySet.count():
Task.objects.count() # all tasks count Task.objects.filter(in_progress=TaskStatus.IN_PROGRESS).count() # in progress tasks countor with
QuerySet.aggregate(), and not only counts:
from django.db.models import Count, Sum, Q
Task.objects.aggregate(
tasks_cnt=Count('id'), # all tasks count
active_tasks_cnt=Count('id', filter=Q(status=TaskStatus.IN_PROGRESS)), # active tasks
hours_spent=Sum('hours_spent'), # hours spent on all tasks
)
or per user with QuerySet.annotate():
User.objects.annotate(
tasks_cnt=Count('tasks'), # all tasks count
active_tasks_cnt=Count('task_set', filter=Q(status=TaskStatus.IN_PROGRESS)),
hours_spent=Sum('tasks__hours_spent'), # hours spent on all tasks
)
What's the difference?
QuerySet.count()produces only one value so need one SQL query per count: all, active etcQuerySet.aggregate(),QuerySet.annotate()calculate many values with one SQL query.- There's no other one-value functions like
QuerySet.count()so have to useaggregate()if calculate sum, max, average etc. aggregate()andannotate()has similar syntax but the result type is different:dictforaggregate()andQuerySetforannotate(). The latter could be used for otherQuerySetoperations likefilter(),count().
Emulate SQL GROUP BY with annotate() applied to QuerySet.values()
There are two types of QuerySet: model and values.
Annotating a model query set adds the summaries properties to the each model instance from the annotated QuerySet.
If you annotate QuerySet of 2 users with tasks_cnt, active_tasks_cnt, hours_spent values then
you get 2 User objects with 3 extra properties each: tasks_cnt, active_tasks_cnt, hours_spent.
What if you want to know how many tasks are per each status? In SQL it's done via GROUP BY:
SELECT status, COUNT(id) FROM task GROUP BY statusHow to do the same with Django ORM:
In: Task.objects.values('status')
Out: <QuerySet [{'status': 0}, {'status': 1}, {'status': 2}, {'status': 3}]>
In: Task.objects.values('status').annotate(cnt=Count('id'))
Out: <QuerySet [{'cnt': 1, 'status': 0}, {'cnt': 2, 'status': 1}, {'cnt': 1, 'status': 2}, {'cnt': 1, 'status': 3}]>
annotate() called on the QuerySet.values() adds the summary value entry to the each values dict. The values() parameter is the field for GROUP BY column.
Aggregation filtering in Django 2 and Django 1.x
Count, Sum and other Aggregate sub-classes accept the filter parameter in Django 2 version so the aggregated data filtering is simple:
Task.objects.aggregate(
active_tasks_cnt=Count('id', filter=Q(status=TaskStatus.IN_PROGRESS)
)
Hovewer it's not supported in Django 1.x and requires Case trick:
from django.db.models import Count, Case, When, IntegerField
Task.objects.aggregate(
active_tasks_cnt=Count(Case(
When(status=TaskStatus.IN_PROGRESS, then=1), output_field=IntegerField()
))
)
