DB: do not fetch data and others when deleting rows (#10446)
* DB: do not fetch `data` and others when deleting rows

This task was cancelled again. The query shows a SELECT first that fetches
whole rows. I think we can reduce the time and memory used by fetching only the ids.

* DB: only fetch "id" when deleting rows

* DB: clean up old data using raw SQL from Django

We are facing an issue with this query because it takes too long to
execute (more than 30s), causing our DB to kill the query. This is because Django
performs a `SELECT` first to be able to trigger the `pre_delete` and `post_delete`
signals on each object it deletes.

We don't really need this here, so we are using raw SQL to bypass it and make
the query execute faster. This is not ideal, but we didn't find a better approach.

* DB: there are no results to fetch

The query is executed without requiring `.fetchall()`
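
As a rough illustration of the last two points, the sketch below runs a parameterized `DELETE` through a DB-API cursor and reads `cursor.rowcount` instead of calling `.fetchall()`. It uses the standard-library `sqlite3` module and an in-memory table as a stand-in for the real database, so the table layout, the sample dates, the retention value, and the `?` placeholder style (Django's cursor wrapper uses `%s`) are assumptions for the example, not the project's code.

```python
import sqlite3
from datetime import date, timedelta

# In-memory stand-in for the real table; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE analytics_pageview (id INTEGER PRIMARY KEY, date TEXT)")

today = date(2023, 6, 28)
ages_in_days = [10, 100, 150, 400]  # sample rows, newest first
conn.executemany(
    "INSERT INTO analytics_pageview (date) VALUES (?)",
    [((today - timedelta(days=age)).isoformat(),) for age in ages_in_days],
)

retention_days = 90  # hypothetical retention setting
days_ago = today - timedelta(days=retention_days)

cursor = conn.cursor()
# A DELETE produces no result set, so there is nothing to fetch afterwards.
cursor.execute(
    "DELETE FROM analytics_pageview WHERE date BETWEEN ? AND ?",
    [(days_ago - timedelta(days=90)).isoformat(), days_ago.isoformat()],
)
deleted = cursor.rowcount  # number of rows removed by the statement
remaining = conn.execute("SELECT COUNT(*) FROM analytics_pageview").fetchone()[0]
```

Only the rows aged 100 and 150 days fall inside the 90-day window that ends at `days_ago`; the newer row and the much older row are left in place.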
humitos authored Jun 28, 2023
1 parent 6d2f858 commit cd4535e
Showing 2 changed files with 27 additions and 9 deletions.
19 changes: 14 additions & 5 deletions readthedocs/analytics/tasks.py
@@ -1,12 +1,12 @@
"""Tasks for Read the Docs' analytics."""

from django.conf import settings
+from django.db import connection
from django.utils import timezone

import readthedocs
from readthedocs.worker import app

from .models import PageView
from .utils import send_to_analytics

DEFAULT_PARAMETERS = {
@@ -80,7 +80,16 @@ def delete_old_page_counts():
     """
     retention_days = settings.RTD_ANALYTICS_DEFAULT_RETENTION_DAYS
     days_ago = timezone.now().date() - timezone.timedelta(days=retention_days)
-    return PageView.objects.filter(
-        date__lt=days_ago,
-        date__gt=days_ago - timezone.timedelta(days=90),
-    ).delete()
+
+    # NOTE: we are using raw SQL here to avoid Django doing a SELECT first to
+    # send `pre_` and `post_` delete signals
+    # See https://docs.djangoproject.com/en/4.2/ref/models/querysets/#delete
+    with connection.cursor() as cursor:
+        cursor.execute(
+            # "SELECT COUNT(*) FROM analytics_pageview WHERE date BETWEEN %s AND %s",
+            "DELETE FROM analytics_pageview WHERE date BETWEEN %s AND %s",
+            [
+                days_ago - timezone.timedelta(days=90),
+                days_ago,
+            ],
+        )
17 changes: 13 additions & 4 deletions readthedocs/telemetry/tasks.py
@@ -1,6 +1,7 @@
"""Tasks related to telemetry."""

from django.conf import settings
+from django.db import connections
from django.utils import timezone

from readthedocs.builds.models import Build
@@ -33,7 +34,15 @@ def delete_old_build_data():
     """
     retention_days = settings.RTD_TELEMETRY_DATA_RETENTION_DAYS
     days_ago = timezone.now().date() - timezone.timedelta(days=retention_days)
-    return BuildData.objects.filter(
-        created__lt=days_ago,
-        created__gt=days_ago - timezone.timedelta(days=90),
-    ).delete()
+    # NOTE: we are using raw SQL here to avoid Django doing a SELECT first to
+    # send `pre_` and `post_` delete signals
+    # See https://docs.djangoproject.com/en/4.2/ref/models/querysets/#delete
+    with connections["telemetry"].cursor() as cursor:
+        cursor.execute(
+            # "SELECT COUNT(*) FROM telemetry_builddata WHERE created BETWEEN %s AND %s",
+            "DELETE FROM telemetry_builddata WHERE created BETWEEN %s AND %s",
+            [
+                days_ago - timezone.timedelta(days=90),
+                days_ago,
+            ],
+        )
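
One detail that may be worth checking when comparing the removed and added lines: the ORM filters used strict `__lt`/`__gt` lookups, whereas SQL `BETWEEN` includes both endpoints. The sketch below shows the difference on an in-memory `sqlite3` table; the table name `demo` and the dates are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (date TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO demo (date) VALUES (?)",
    [("2023-01-01",), ("2023-01-02",), ("2023-01-03",)],
)
bounds = ["2023-01-01", "2023-01-03"]

# Strict comparison, analogous to the date__gt / date__lt lookups.
exclusive = conn.execute(
    "SELECT COUNT(*) FROM demo WHERE date > ? AND date < ?", bounds
).fetchone()[0]

# BETWEEN is equivalent to date >= low AND date <= high.
inclusive = conn.execute(
    "SELECT COUNT(*) FROM demo WHERE date BETWEEN ? AND ?", bounds
).fetchone()[0]
```

For a periodic cleanup task the extra boundary rows are most likely harmless, but the two forms are not strictly equivalent.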
