feat: backfill data to clickhouse in batches and same thread #74
Conversation
Thanks for the pull request, @Ian2012! Please note that it may take us up to several weeks or months to complete a review and merge your PR. Feel free to add as much of the following information to the ticket as you can:
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here. Please let us know once your PR is ready for our review and all tests are green.
d8d4c80 to bb364df
Just a few nits, otherwise looks great
""" | ||
Return the queryset to be used for the insert | ||
""" | ||
return self.get_model().objects.all() | ||
if start_pk: | ||
return self.get_model().objects.filter(pk__gt=start_pk).order_by("pk") |
One last performance improvement we could make here is to use `.values()` or `.values_list()`, which would save a bunch of overhead on creating model instances, instead returning a dict or tuple. `values_list` performs best, but dealing with tuples can be annoying. We could also then choose the fields we want to select to just what we use, potentially saving a lot of data transfer from the db and memory on the job.
Dealing with `values_list` would break the serializer, while `values` would return the data already partially serialized (it still needs to handle dates and related names), and we would need to implement a serializer method for every sink. Not sure about the gains here.
Yeah, I've seen some pretty significant gains in the past from it, but not always. If you're comfortable with the performance as it is we can leave it and revisit later if necessary.
1c5281e to e752256
b2d450d to 358b2cf
- fix: query course overviews with django model
- fix: remove submitted_objects objects
- chore: quality fixes
- chore: remove old dump courses to clickhouse command
- fix: fix initial page, tests, and quality
- chore: handle pr comments
- test: improve coverage
- fix: paginator last page is included
- test: fixing test
- test: add test for get_queryset
f8d341d to a43189d
🚀
a43189d to 755f961
- test: add test for course_published changes
- chore: quality fixes
- test: add test for get_queryset changes in sinks
- test: add test for should_dump_item in course_published
- chore: use invalid emails
- chore: quality fixes
- test: add tests for course_published should_dump_item
- tmp
84d9c64 to 69fa84e
@Ian2012 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.
Description
This PR improves the backfill command to send data in configurable batches and drops the use of Celery. This doesn't affect the usual LMS traffic, and backfill time is reduced because ClickHouse favors batch inserts and serialization is more performant.
This PR:
- Uses `select_related` when needed.
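The batch-and-flush behavior described above can be sketched as: serialize each object once, accumulate into a fixed-size buffer, and issue one insert per full batch plus a final flush. The `send_batch` function below is a hypothetical stand-in for the actual ClickHouse insert; batch size and payload format are illustrative only:

```python
import json

def serialize(obj):
    """Stand-in for the sink's serializer: one JSON line per object."""
    return json.dumps(obj, sort_keys=True)

def send_batch(lines, sent):
    # Hypothetical transport: a real sink would send the joined payload
    # to ClickHouse as a single INSERT, which ClickHouse handles far
    # better than many row-by-row inserts.
    sent.append("\n".join(lines))

def dump_in_batches(objects, batch_size, sent):
    buffer = []
    for obj in objects:
        buffer.append(serialize(obj))
        if len(buffer) >= batch_size:
            send_batch(buffer, sent)
            buffer = []
    if buffer:  # flush the final, possibly partial, batch
        send_batch(buffer, sent)

sent = []
dump_in_batches([{"id": i} for i in range(5)], batch_size=2, sent=sent)
print(len(sent))  # 3 inserts: batches of 2, 2, and 1 rows
```

Running everything in one thread, as the PR title says, keeps this loop off Celery entirely: the buffer bounds memory use, and each batch is sent synchronously before the next one is built.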