updated August 31, 2020

Tips on Django ORM aggregation

Let's have Django 2 ORM models:
from enum import Enum, unique
from django.db import models


@unique
class TaskStatus(Enum):
  NOT_STARTED = 0
  IN_PROGRESS = 1
  DEPLOY = 2
  DONE = 3


@unique
class TaskPriority(Enum):
  LOW = 0
  MEDIUM = 1
  HIGH = 2
  CRITICAL = 3


class User(models.Model):
  username = models.CharField(max_length=50)


class Task(models.Model):
  # each task assigned to a user, many tasks per user
  assigned_to = models.ForeignKey(User, related_name='tasks')
  
  title = models.CharField(max_length=250)
  status = models.IntegerField(choices=[(v.value, v.name) for v in TaskStatus])
  priority = models.IntegerField(choices=[(v.value, v.name) for v in TaskPriority])
  hours_spent = models.DecimalField(max_digits=10, decimal_places=2)
and sample data:
User Task Priority Status Hours spent
alex T1 medium in progress 4
T2 high deploy 2
john T5 medium not started 0
T6 critical done 3
T7 medium in progress 5

Data aggregation

When we calculate value from many data values, it's called data aggregation or applying aggregation function. The function result could be called the data summary. Examples of aggregation functions: values count, sum, average, minumum or maximum.

Let's calculate some summaries for tasks from the example above:

  • 5 tasks
  • 2 tasks in progress
  • 2 tasks for alex and 3 tasks for john
  • 4 + 2 + 3 + 5 = 14 hours spent on tasks
  • 6 hours spent on alex tasks, 8 hours spent on john tasks

Calculate summaries with Django ORM

We can calculate tasks counts with QuerySet.count():

Task.objects.count()  # all tasks count
Task.objects.filter(in_progress=TaskStatus.IN_PROGRESS).count()  # in progress tasks count

or with QuerySet.aggregate(), and not only counts:
from django.db.models import Count, Sum, Q

Task.objects.aggregate(
  tasks_cnt=Count('id'),  # all tasks count
  active_tasks_cnt=Count('id', filter=Q(status=TaskStatus.IN_PROGRESS)),  # active tasks
  hours_spent=Sum('hours_spent'),  # hours spent on all tasks
)
or per user with QuerySet.annotate():
User.objects.annotate(
  tasks_cnt=Count('tasks'),  # all tasks count
  active_tasks_cnt=Count('task_set', filter=Q(status=TaskStatus.IN_PROGRESS)),
  hours_spent=Sum('tasks__hours_spent'),  # hours spent on all tasks
)

What's the difference?

  • QuerySet.count() produces only one value so need one SQL query per count: all, active etc QuerySet.aggregate(), QuerySet.annotate() calculate many values with one SQL query.
  • There's no other one-value functions like QuerySet.count() so have to use aggregate() if calculate sum, max, average etc.
  • aggregate() and annotate() has similar syntax but the result type is different: dict for aggregate() and QuerySet for annotate(). The latter could be used for other QuerySet operations like filter(), count().

Emulate SQL GROUP BY with annotate() applied to QuerySet.values()

There are two types of QuerySet: model and values. Annotating a model query set adds the summaries properties to the each model instance from the annotated QuerySet. If you annotate QuerySet of 2 users with tasks_cnt, active_tasks_cnt, hours_spent values then you get 2 User objects with 3 extra properties each: tasks_cnt, active_tasks_cnt, hours_spent.

What if you want to know how many tasks are per each status? In SQL it's done via GROUP BY:

SELECT status, COUNT(id) FROM task GROUP BY status

How to do the same with Django ORM:
In: Task.objects.values('status')
Out: <QuerySet [{'status': 0}, {'status': 1}, {'status': 2}, {'status': 3}]>
In: Task.objects.values('status').annotate(cnt=Count('id'))
Out: <QuerySet [{'cnt': 1, 'status': 0}, {'cnt': 2, 'status': 1}, {'cnt': 1, 'status': 2}, {'cnt': 1, 'status': 3}]>
annotate() called on the QuerySet.values() adds the summary value entry to the each values dict. The values() parameter is the field for GROUP BY column.

Aggregation filtering in Django 2 and Django 1.x

Count, Sum and other Aggregate sub-classes accept the filter parameter in Django 2 version so the aggregated data filtering is simple:

Task.objects.aggregate(
  active_tasks_cnt=Count('id', filter=Q(status=TaskStatus.IN_PROGRESS)
)

Hovewer it's not supported in Django 1.x and requires Case trick:
from django.db.models import Count, Case, When, IntegerField

Task.objects.aggregate(
  active_tasks_cnt=Count(Case(
        When(status=TaskStatus.IN_PROGRESS, then=1), output_field=IntegerField()
    ))
)