initial commit for perf improvement in tasks save_stories_from_feed #23

Open — wants to merge 1 commit into base: main
Conversation

@opme opme commented Mar 6, 2023

Improve performance when saving stories from a feed. In the case where 500 stories need duplicate checking, this reduces the SQL queries sent to the database from 1000 to 2. I'm interested in feedback on this PR. So far it has only been tested against the small dokku feed set; I have the full 110k feeds running on a cloud server and will deploy it there as well.

  • Check for existing URLs in a single query (see the sketch after the todo list below).
  • Check for existing titles in a single query.
  • Make sure sources_id is in the query when doing these checks. It was already in the query when checking titles, but not for URLs. (Would like to add an index there.)

Todo:

  • Add an index on Stories.sources_id.
  • Batch insert of new stories into the stories table?
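
For concreteness, here is a minimal sketch of what the batched duplicate checks could look like. It is not the actual diff in this PR: the `Story` model and `filter_new_entries` helper below are illustrative stand-ins, written against SQLAlchemy 2.0-style declarative models, and the real project code may differ.

```python
from sqlalchemy import BigInteger, Integer, String, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Story(Base):
    """Illustrative stand-in for the project's stories table."""
    __tablename__ = "stories"
    id: Mapped[int] = mapped_column(BigInteger, primary_key=True)
    url: Mapped[str] = mapped_column(String, unique=True)
    title: Mapped[str] = mapped_column(String)
    sources_id: Mapped[int] = mapped_column(Integer, index=True)


def filter_new_entries(session: Session, sources_id: int, entries: list[dict]) -> list[dict]:
    """Drop feed entries whose URL or title is already in the database.

    Two queries total, instead of two queries per entry.
    """
    urls = [e["url"] for e in entries]
    titles = [e["title"] for e in entries]

    # One round trip for all URLs in the feed.
    existing_urls = set(
        session.scalars(select(Story.url).where(Story.url.in_(urls))).all()
    )
    # One round trip for all titles, scoped to this source.
    existing_titles = set(
        session.scalars(
            select(Story.title).where(
                Story.sources_id == sources_id,
                Story.title.in_(titles),
            )
        ).all()
    )

    return [
        e
        for e in entries
        if e["url"] not in existing_urls and e["title"] not in existing_titles
    ]
```

With a few hundred entries per feed, the IN lists stay small enough to send as a single statement each.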

@opme (Author) commented Mar 7, 2023

Running the full 110K feeds now with this pull request. Database CPU and network usage are reduced.
The setup is a 2-CPU app server with 16 workers and a 1-CPU Postgres server.
I still want to reduce the spikes.

[screenshot: database performance over 12 hours]
[screenshot: network usage, new vs. old]

The feeds table has the most reads:

[screenshot: table read counts]

Current CPU on the app server. I want to get it more even, without the hills and valleys: it maxes out both CPUs for some time and then falls to almost zero.

[screenshot: app server CPU]

@philbudne (Contributor) commented:

A few thoughts:

  1. I think we check sources_id only in the titles query because we want to allow the same title across different sources, but we want to avoid duplicate URLs anywhere in the output.
  2. Our current environment is (old) dedicated hardware with lots of free CPU cycles. Depending on your workload and available memory, it might make sense to keep an in-memory (Redis?) cache of titles and URLs (if you don't have lots of memory, Redis will end up paging and causing I/O); a rough sketch of that idea follows this list. I'm not convinced an SQL stories table is the best solution: it's a write-mostly table that (ideally) should contain every URL/title ever seen (I put in table archiving so that backups don't become more and more onerous over time). My original thought was to make stories a partitioned table (one table per month), but I couldn't find any ready-to-use implementations of partitioned-table management in alembic (never mind the fun of migrating back and forth). At least with table partitions we'd only need to back up the current month, since past months would be read-only. And I can't help thinking a non-SQL store might suit our needs as well (or better). We may also need to implement a duplicate-URL test in other contexts, which means implementing the solution as a library or service might make sense.
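
To make the cache idea in point 2 concrete, here is a rough sketch of URL de-duplication against a Redis set. The key name, the SHA-256 hashing of URLs, and the function names are illustrative choices, not anything in the project; note that `SADD` doubles as check-and-record because it returns 0 for members that are already present.

```python
import hashlib

import redis

r = redis.Redis()  # assumes a local Redis instance


def is_new_url(url: str) -> bool:
    # SADD returns 1 if the member was added (not seen before) and 0 if it
    # was already in the set, so check-and-record is a single round trip.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return r.sadd("seen:urls", digest) == 1


def filter_new_urls(urls: list[str]) -> list[str]:
    # Pipeline the SADDs so a whole feed costs one network round trip.
    pipe = r.pipeline()
    for url in urls:
        pipe.sadd("seen:urls", hashlib.sha256(url.encode("utf-8")).hexdigest())
    added = pipe.execute()
    return [u for u, a in zip(urls, added) if a == 1]
```

Hashing bounds the memory cost per URL, but the set still grows without limit; real use would need some expiry or partitioning scheme, which ties back to the memory caveat above.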

@philbudne (Contributor) commented Apr 25, 2023

  1. The reason we don't do batch inserts into the stories table is that the table has a unique constraint on the url column, so (at the very least) we'd need to do the insert with "ON CONFLICT DO NOTHING"; a sketch of that follows. I don't have enough past history with the project, PostgreSQL, or RDBs in general to be sure that querying for matches first has any benefit.
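
A hedged sketch of what such a batch insert could look like using SQLAlchemy's PostgreSQL dialect, reusing the illustrative `Story` model from the earlier sketch (so again not the project's actual code):

```python
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session


def insert_stories(session: Session, rows: list[dict]) -> int:
    """rows: dicts with url, title and sources_id keys; returns rows actually inserted."""
    stmt = (
        insert(Story)  # Story: the illustrative model from the earlier sketch
        .values(rows)
        # Rely on the unique constraint on url; duplicate URLs are silently skipped.
        .on_conflict_do_nothing(index_elements=["url"])
    )
    result = session.execute(stmt)
    session.commit()
    # On PostgreSQL, rowcount reflects the rows that were actually inserted.
    return result.rowcount
```

This sidesteps the pre-query entirely: duplicates are resolved by the constraint itself in one statement, which is why it's unclear whether querying for matches first buys anything.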

Lastly, let me warn you that the latest revision of the code (v0.14) has a new implementation of RSS fetch scheduling. It's in its infancy and not at ALL well tuned, either in terms of database access or algorithmic complexity (I can point to many places that are likely to have O(n^2) worst-case behavior)!
