initial commit for perf improvement in tasks save_stories_from_feed #23

Open — wants to merge 1 commit into base: main
Conversation

@opme opme commented Mar 6, 2023

Improve performance when saving stories from a feed. In the case where 500 stories need duplicate checking, this reduces the SQL queries sent to the database from 1000 to 2. I'm interested in feedback on this PR. So far it has only been tested against the small dokku feed set; I have the full 110k feeds running on a cloud server and will deploy it there as well.

  • Check for existing URLs in a single query (see the sketch after the todo list below).
  • Check for existing titles in a single query.
  • Make sure sources_id is in the query when doing these checks. It was already in the query when checking titles, but not for URLs. (Would like to add an index there.)

Todo:

  • Add an index on Stories.sources_id.
  • Batch insert of new stories into the stories table?
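
For concreteness, here is a minimal sketch of what the batched duplicate checks could look like. It is not the actual diff in this PR: the `Story` model and `filter_new_entries` helper below are illustrative stand-ins, written against SQLAlchemy 2.0-style declarative models, and the real project code may differ.

```python
from sqlalchemy import BigInteger, Integer, String, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Story(Base):
    """Illustrative stand-in for the project's stories table."""
    __tablename__ = "stories"
    id: Mapped[int] = mapped_column(BigInteger, primary_key=True)
    url: Mapped[str] = mapped_column(String, unique=True)
    title: Mapped[str] = mapped_column(String)
    sources_id: Mapped[int] = mapped_column(Integer, index=True)


def filter_new_entries(session: Session, sources_id: int, entries: list[dict]) -> list[dict]:
    """Drop feed entries whose URL or title is already in the database.

    Two queries total, instead of two queries per entry.
    """
    urls = [e["url"] for e in entries]
    titles = [e["title"] for e in entries]

    # One round trip for all URLs in the feed.
    existing_urls = set(
        session.scalars(select(Story.url).where(Story.url.in_(urls))).all()
    )
    # One round trip for all titles, scoped to this source.
    existing_titles = set(
        session.scalars(
            select(Story.title).where(
                Story.sources_id == sources_id,
                Story.title.in_(titles),
            )
        ).all()
    )

    return [
        e
        for e in entries
        if e["url"] not in existing_urls and e["title"] not in existing_titles
    ]
```

With a few hundred entries per feed, the IN lists stay small enough to send as a single statement each.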

@opme (Author) commented Mar 7, 2023

Running the full 110K feeds now with this pull request. Database CPU and network usage are reduced.
The setup is a 2-CPU app server with 16 workers and a 1-CPU Postgres server.
I still want to reduce the spikes.

[screenshot: database performance over 12 hours]
[screenshot: network usage, new vs. old]

The feeds table has the most reads:

[screenshot: table read counts]

Current CPU on the app server. I want to get it more even, without the hills and valleys: it maxes out both CPUs for some time and then falls to almost zero.

[screenshot: app server CPU]

@philbudne (Contributor) commented:

A few thoughts:

  1. I think we check sources_id only in the titles query because we want to allow the same title across different sources, but we want to avoid duplicate URLs anywhere in the output.
  2. Our current environment is (old) dedicated hardware with lots of free CPU cycles. Depending on your workload and available memory, it might make sense to keep an in-memory (Redis?) cache of titles and URLs (if you don't have lots of memory, Redis will end up paging and causing I/O); a rough sketch of that idea follows this list. I'm not convinced an SQL stories table is the best solution: it's a write-mostly table that (ideally) should contain every URL/title ever seen (I put in table archiving so that backups don't become more and more onerous over time). My original thought was to make stories a partitioned table (one table per month), but I couldn't find any ready-to-use implementations of partitioned-table management in alembic (never mind the fun of migrating back and forth). At least with table partitions we'd only need to back up the current month, since past months would be read-only. And I can't help thinking a non-SQL store might suit our needs as well (or better). We may also need to implement a duplicate-URL test in other contexts, which means implementing the solution as a library or service might make sense.
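
To make the cache idea in point 2 concrete, here is a rough sketch of URL de-duplication against a Redis set. The key name, the SHA-256 hashing of URLs, and the function names are illustrative choices, not anything in the project; note that `SADD` doubles as check-and-record because it returns 0 for members that are already present.

```python
import hashlib

import redis

r = redis.Redis()  # assumes a local Redis instance


def is_new_url(url: str) -> bool:
    # SADD returns 1 if the member was added (not seen before) and 0 if it
    # was already in the set, so check-and-record is a single round trip.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return r.sadd("seen:urls", digest) == 1


def filter_new_urls(urls: list[str]) -> list[str]:
    # Pipeline the SADDs so a whole feed costs one network round trip.
    pipe = r.pipeline()
    for url in urls:
        pipe.sadd("seen:urls", hashlib.sha256(url.encode("utf-8")).hexdigest())
    added = pipe.execute()
    return [u for u, a in zip(urls, added) if a == 1]
```

Hashing bounds the memory cost per URL, but the set still grows without limit; real use would need some expiry or partitioning scheme, which ties back to the memory caveat above.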

@philbudne (Contributor) commented Apr 25, 2023

  1. The reason we don't do batch inserts into the stories table is that the table has a unique constraint on the url column, so (at the very least) we'd need to do the insert with "ON CONFLICT DO NOTHING"; a sketch of that follows. I don't have enough past history with the project, PostgreSQL, or RDBs in general to be sure that querying for matches first has any benefit.
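
A hedged sketch of what such a batch insert could look like using SQLAlchemy's PostgreSQL dialect, reusing the illustrative `Story` model from the earlier sketch (so again not the project's actual code):

```python
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session


def insert_stories(session: Session, rows: list[dict]) -> int:
    """rows: dicts with url, title and sources_id keys; returns rows actually inserted."""
    stmt = (
        insert(Story)  # Story: the illustrative model from the earlier sketch
        .values(rows)
        # Rely on the unique constraint on url; duplicate URLs are silently skipped.
        .on_conflict_do_nothing(index_elements=["url"])
    )
    result = session.execute(stmt)
    session.commit()
    # On PostgreSQL, rowcount reflects the rows that were actually inserted.
    return result.rowcount
```

This sidesteps the pre-query entirely: duplicates are resolved by the constraint itself in one statement, which is why it's unclear whether querying for matches first buys anything.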

Lastly, let me warn you that the latest revision of the code (v0.14) has a new implementation of RSS fetch scheduling. It's in its infancy and not at ALL well tuned, either in terms of database access or algorithmic complexity (I can point to many places that are likely to have O(n^2) worst-case behavior)!
