Autopurging does not delete unread news items #110

Closed
fabolhak opened this issue Mar 11, 2017 · 46 comments · Fixed by #1931
@fabolhak

fabolhak commented Mar 11, 2017

Explain the Problem

Old feed items are not purged automatically. Since I don't mark all items as read, the database grows over time and becomes very large.

Steps to Reproduce

Clone the Docker image from the official repo, install the News app, and install/configure cron.

  1. Add a news feed with a large number of unread news items (e.g. http://usa.chinadaily.com.cn/china_rss.xml)
  2. Set the autopurge limit ("Maximum read count per feed") to a low value (e.g. 10)
  3. Wait for the next cron update (or trigger it manually)

Expected Behavior

The number of unread news items should be 10 and should not increase over time.

Actual Behavior

The number of unread news items is higher than 10 and increases when new items are being published.

System Information

  • News app version: 10.1.0
  • Nextcloud version: 11.0.1
  • PHP Version: 5.6.30
  • PHP Memory Limit: 512.0 MB
  • PHP Max Execution Time: 3600
  • PHP Upload max size: 511.0 MB
  • Database and version: sqlite3 (3.8.10)

Contents of nextcloud/data/nextcloud.log

nothing relevant


@fabolhak
Author

Similar to this: owncloud-archive/news#1024

@BernhardPosselt
Member

The number is added to the maximum number of entries the feed has served at some point, to prevent read items from reappearing as unread.

@fabolhak
Author

Hi, thanks for the fast reply. :)
I'm not sure if I understand the functionality of the "Maximum read count per feed" correctly.

So basically you are saying that when a feed serves its 100 most recent items (the number of feed items is always 100, but the items change over time) and the "Maximum read count per feed" is 10, then these two numbers add up? So in the Nextcloud News app there should never be more than 110 items?

@BernhardPosselt
Member

Exactly :)

You can view the number in the oc_news_feeds table (articles_per_update)

The logic itself is covered by an integration test so I'm pretty sure that it works (though there's still a chance it doesn't xD)
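
For anyone who wants to verify this against their own installation, a minimal check could look like the following (only a sketch; table and column names are the ones referenced in this thread, and feed id 42 is a placeholder):

    -- articles_per_update as recorded for a given feed (42 is a placeholder id)
    SELECT id, articles_per_update
      FROM oc_news_feeds
     WHERE id = 42;

    -- number of items actually stored for that feed; with purging working as
    -- described above, this should stay close to
    -- articles_per_update + "Maximum read count per feed"
    SELECT COUNT(*) AS item_count
      FROM oc_news_items
     WHERE feed_id = 42;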

@fabolhak
Author

OK, cool. I also investigated the code a little bit. However, my SQL and PHP are a bit rusty. 😅

In my current ownCloud instance I have this problem too. For example, the China Daily feed has an "articles_per_update" of 300. The maximum read count is configured to 60. However, when I check the database I find over 3800 items with the feed id of China Daily.

The ownCloud News app is a bit outdated, though. I also pulled the newest Nextcloud Docker image and installed the News app (see above). If the problem still exists, I will report back in a few days. :)

@fabolhak
Author

fabolhak commented Mar 13, 2017

Hey,
so the issue still exists in the Nextcloud version of the News app. I subscribed to the China Daily feed, which has an "articles_per_update" of 300. My maximum read count is configured to 10. Today I had more than 310 unread items in the News app.

I also had a look at the code. What I don't understand is this line:
https://github.com/nextcloud/news/blob/master/lib/Db/ItemMapper.php#L275

$status seems to be the status that is excluded from the auto-purge process. However, unread items are also excluded because of the | StatusFlag::UNREAD. This causes the auto-purge process to skip unread items as well, and we get the behavior described above (the database becomes very large over time).

When I comment out the latter part:
$status = StatusFlag::STARRED /*| StatusFlag::UNREAD*/;

The autopurge process seems to work, and the number of China Daily items stays constant at 310.

I'm not sure whether this is intended or not. But with the | StatusFlag::UNREAD the autopurge process only deletes read items that are over the "articles_per_update" + "maximum read count" limit.
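
To make the effect of that flag concrete, here is a rough sketch of the two variants as SQL (not the app's actual query; it assumes the flag values UNREAD = 2 and STARRED = 4, and uses placeholder values for the feed id and the purge threshold id):

    -- current behaviour: only items that are neither starred nor unread are
    -- candidates for purging (feed id 42 and id 100000 are placeholders)
    DELETE FROM oc_news_items
     WHERE feed_id = 42
       AND (status & (4 | 2)) = 0
       AND id < 100000;

    -- with StatusFlag::UNREAD dropped from the mask, unread items become
    -- candidates as well; only starred items are kept unconditionally
    DELETE FROM oc_news_items
     WHERE feed_id = 42
       AND (status & 4) = 0
       AND id < 100000;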

@fabolhak changed the title from "Autopurging of old news items does not work" to "Autopurging does not delete unread news items" Mar 24, 2017
@e-alfred
Contributor

e-alfred commented Jun 29, 2017

I have several feeds that get updated often, so they show "999+" after some time. This puts a heavy load on the database because the old feed items don't get purged. I logged this using the MySQL slow query log, as mentioned in the Nextcloud developer manual, and here is its output:

# Time: 2017-06-28T12:40:39.210304Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2768
# Query_time: 229.221434  Lock_time: 0.000387 Rows_sent: 0  Rows_examined: 54006
SET timestamp=1498653639;
SELECT COUNT(*) AS size FROM `oc_news_items` `items` JOIN `oc_news_feeds` `feeds` ON `feeds`.`id` = `items`.`feed_id` AND `feeds`.`deleted_at` = 0 AND `feeds`.`user_id` = 'alfred' AND ((`items`.`status` & 4) = 4)LEFT OUTER JOIN `oc_news_folders` `folders` ON `folders`.`id` = `feeds`.`folder_id` WHERE `feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0;
# Time: 2017-06-28T12:40:39.211512Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2765
# Query_time: 229.177449  Lock_time: 0.000232 Rows_sent: 0  Rows_examined: 54006
SET timestamp=1498653639;
SELECT COUNT(*) AS size FROM `oc_news_items` `items` JOIN `oc_news_feeds` `feeds` ON `feeds`.`id` = `items`.`feed_id` AND `feeds`.`deleted_at` = 0 AND `feeds`.`user_id` = 'alfred' AND ((`items`.`status` & 4) = 4)LEFT OUTER JOIN `oc_news_folders` `folders` ON `folders`.`id` = `feeds`.`folder_id` WHERE `feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0;
# Time: 2017-06-28T12:40:39.214785Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2756
# Query_time: 280.522846  Lock_time: 0.000236 Rows_sent: 0  Rows_examined: 64214
SET timestamp=1498653639;
SELECT COUNT(*) AS size FROM `oc_news_items` `items` JOIN `oc_news_feeds` `feeds` ON `feeds`.`id` = `items`.`feed_id` AND `feeds`.`deleted_at` = 0 AND `feeds`.`user_id` = 'alfred' AND ((`items`.`status` & 4) = 4)LEFT OUTER JOIN `oc_news_folders` `folders` ON `folders`.`id` = `feeds`.`folder_id` WHERE `feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0;
# Time: 2017-06-28T12:40:39.219031Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2959
# Query_time: 66.927684  Lock_time: 0.000692 Rows_sent: 0  Rows_examined: 11211
SET timestamp=1498653639;
SELECT COUNT(*) AS size FROM `oc_news_items` `items` JOIN `oc_news_feeds` `feeds` ON `feeds`.`id` = `items`.`feed_id` AND `feeds`.`deleted_at` = 0 AND `feeds`.`user_id` = 'alfred' AND ((`items`.`status` & 4) = 4)LEFT OUTER JOIN `oc_news_folders` `folders` ON `folders`.`id` = `feeds`.`folder_id` WHERE `feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0;
# Time: 2017-06-28T12:40:39.247040Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2993
# Query_time: 411.752023  Lock_time: 0.000392 Rows_sent: 0  Rows_examined: 64217
SET timestamp=1498653639;
SELECT `feeds`.*, COUNT(`items`.`id`) AS `unread_count` FROM `oc_news_feeds` `feeds` LEFT OUTER JOIN `oc_news_folders` `folders` ON `feeds`.`folder_id` = `folders`.`id` LEFT JOIN `oc_news_items` `items` ON `feeds`.`id` = `items`.`feed_id` AND (`items`.`status` & 2) = 2 WHERE `feeds`.`user_id` = 'alfred' AND (`feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0)AND `feeds`.`deleted_at` = 0 GROUP BY `feeds`.`id`;

The only app bogging down the database is basically the News app; no other queries were logged. Sometimes this results in heavy load, with the effect that the feeds are not updated anymore. Only resetting the News app and importing a backup *.opml file solves the problem.
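
For anyone trying to reproduce this measurement, the MySQL slow query log can be switched on at runtime; the values below are only examples, and logging adds some overhead of its own:

    -- enable the slow query log (requires appropriate privileges)
    SET GLOBAL slow_query_log = 'ON';
    -- log statements that take longer than 1 second
    SET GLOBAL long_query_time = 1;
    -- optional: choose where the log is written
    SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';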

@sonologic
Collaborator

This might be related to #191, the queries are slow because of the way status is handled.

So perhaps once that is fixed, this issue becomes a non-issue.

I'm trying to wrap my head around what is intended with the purging code here, but that needs a bit more study. It does look like there's a bug in this bit of code indeed, in that it doesn't delete unread items which then accumulate quickly on high-traffic feeds.

@sonologic
Collaborator

sonologic commented Jun 30, 2017

Diving into it a bit deeper, it appears to me that unread items are not purged by design, the assumption being that we should never throw away items that the user hasn't seen yet. The "Maximum read count per feed" setting mentioned above is described in the admin settings for the News app as "Defines the maximum amount of articles that can be read per feed which won't be deleted by the cleanup job".

I don't think there is any mechanism that purges unread items.

Given the experiences reported in this issue, we should probably introduce a mechanism to purge unread items if they exceed a certain (configurable?) threshold. I would propose a new configuration option 'Maximum items per feed'. If a feed has more than that number of items, the excess items are purged.

The default setting for 'Maximum items per feed' should be 'disabled' so that unread items are never deleted.

Example: 'Maximum items per feed' is set to 1000 and a feed has 1500 items. Then the oldest 500 items (determined by id) are purged, regardless of whether they are read or unread, with the exception of starred items of course (see the sketch below).

Just out of curiosity, how is anyone managing to keep up with feeds that have that many articles?
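
Such a per-feed cap could look roughly like this in SQL (only a sketch of the idea, not an agreed implementation; feed id 42 and the limit of 1000 are placeholders, and 4 is assumed to be the starred flag as in the logged queries above):

    -- keep the newest 1000 items of feed 42 (by id) plus all starred items,
    -- delete the rest
    DELETE FROM oc_news_items
     WHERE feed_id = 42
       AND (status & 4) = 0
       AND id NOT IN (
           SELECT id FROM (
               SELECT id
                 FROM oc_news_items
                WHERE feed_id = 42
                ORDER BY id DESC
                LIMIT 1000
           ) AS newest
       );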

@e-alfred
Contributor

e-alfred commented Jul 4, 2017

@sonologic Well, I have a bunch of high-volume forum feeds that I like to look through from time to time. The News app is a good place to keep them all together, so I only have to go to one location. Also, since it is part of Nextcloud, I do not have to run another feed reader application, which is a big bonus.

A limit on the number of items would be much appreciated, because that growth is what puts a high load on the database, as I showed above. It gets out of control after some time, and the only way to get rid of the problem is to reset the News app and reimport the *.opml file.

Also, for people with less powerful machines, this is an even bigger problem, I suppose. The News app shouldn't be a resource hog just because a few feeds produce a high volume of new items in a short time.

It does not have to be enabled by default, though: just deactivate it by default, let the user decide what to do, and describe it in the documentation.

@fabolhak
Author

fabolhak commented Jul 4, 2017

I changed this line to $status = StatusFlag::STARRED;

This works perfectly for me. With this fix the number of feed items is limited to "articles_per_update"+"maximum read count".

@sonologic
Collaborator

@fabolhak while that may be perfect for you, for a lot of other people it is far from perfect. Deleting unread items is unexpected behaviour. It must be a conscious opt-in, not the default behaviour.

Also, with #191 fixed, the load on the database will probably no longer manifest as it does now.

@piggz

piggz commented Aug 4, 2018

Any progress on this? I too would like to be able to delete items that are not marked as 'read'. Today I had trouble updating my Nextcloud instance because my SQLite db had grown to >600 MB; my VPS was low on space and the db couldn't be converted. I ended up doing a 'delete from oc_news_items' and a vacuum, and got the db down to 33 MB. Perhaps an option to keep X unread items, or to delete items older than X days, would do? I use the News app to skim through news feeds and wouldn't ever want to go back more than a few weeks. I guess the root cause is that I use my own app on Sailfish to sync from my Nextcloud instance, and I never set the read status. Perhaps I should set the read status of all items after I've downloaded them, but still, the db shouldn't grow to an unmanageable size.

@smoebody

smoebody commented Sep 20, 2018

In my opinion there is a misunderstanding in this function. Although it is named deleteReadOlderThanThreshold, which states that "it deletes read items older than the threshold" (and that is what the function actually does), I believe the author wanted to write deleteUnreadOlderThanThreshold, which is what the phpdoc describes:

Delete all items for feeds that have over $threshold unread and not starred items

A simple change from

$params = [false, false, $threshold];

to

$params = [true, false, $threshold];

should do the trick.

Update

More convenient, of course, would be a new function deleteUnreadOlderThanThreshold with its own configurable threshold (or zero if anybody wants to keep the old behavior).

I think I will do a PR.

@BernhardPosselt
Member

You do not want to delete unread items because you didn't read them yet. Works as intended.

@piggz

piggz commented Sep 25, 2018

It may work as intended, but this could be considered a feature request. There is definitely a use case for deleting old news, read or not, so as not to exceed disk space, if the user wishes that.

@muppeth

muppeth commented Jan 31, 2019

I know the app has been pretty much unmaintained for quite a while, but this ever-growing db is no fun. At the moment my db (lots of feeds and users, though) consists of 28 GB of feed items alone, which is 95% of the entire Nextcloud database; I have 9709932 unread vs 702188 read items.
It's clearly the users' fault, as people just add lots of feeds and never read them.
I think I will just mark everything older than a month or two as read. Will the read items then be removed from the db? Has anyone thought of a way to get rid of the (pretty much) waste at this point? I once just removed items older than a certain date, but that introduced a lot of issues with items re-appearing etc. Is there a way to change articles_per_update? For instance, if I were to run a query that changes all articles_per_update values above (say) 100 to 50, will that get overwritten on the next feed update?

Seeing how much of the db is used by unread items, I want to bring some order to it and purge as many items as possible to prevent the enormous db growth, without being forced to disable the News app altogether.
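
The bulk change described above would be something like the following sketch; whether the updater recalculates and overwrites articles_per_update on the next fetch is exactly the open question here, so treat it as an experiment:

    -- lower articles_per_update for all feeds where it is above 100
    -- (example values; the updater may overwrite this on the next fetch)
    UPDATE oc_news_feeds
       SET articles_per_update = 50
     WHERE articles_per_update > 100;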

@danopz
Contributor

danopz commented Mar 18, 2019

I once thought about normalizing feeds and feed items:

  • you would have one single list of feeds and items stored in your database
  • every user would be subscribed to a feed by a relation in the database - User x Feed
  • users would have relations to feed items for the different statuses like read/starred/deleted (probably we could delete the relation directly instead of using a flag)
  • when a user subscribes to an existing feed -> create the user-feed relation + create the relations for x feed items
  • when the user subscribes to a new feed, create the feed and set up the relations - basically what we currently have
  • feed items would be deleted once every subscribed user has read them
  • feeds would be deleted if no one is subscribed anymore

We could have some benefits:

  • every feed entry is retrieved once and not per subscription - saves CPU, RAM, energy
  • every feed + item is stored once and just linked - less database size

I just want to put my idea in, because in this case it could help reduce the db size. Let me know what you think and whether I should create a separate issue for it, or if it's just random nonsense.
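
For the sake of discussion, the model described above might look roughly like this (hypothetical table and column names, not the app's current schema):

    -- shared feeds and items, stored once
    CREATE TABLE news_feeds_shared (
        id   BIGINT PRIMARY KEY,
        url  VARCHAR(2048) NOT NULL
    );

    CREATE TABLE news_items_shared (
        id       BIGINT PRIMARY KEY,
        feed_id  BIGINT NOT NULL REFERENCES news_feeds_shared(id),
        guid     VARCHAR(2048) NOT NULL
    );

    -- one row per user per subscribed feed
    CREATE TABLE news_user_feeds (
        user_id  VARCHAR(64) NOT NULL,
        feed_id  BIGINT NOT NULL REFERENCES news_feeds_shared(id),
        PRIMARY KEY (user_id, feed_id)
    );

    -- per-user item state; deleting the row could stand in for "read",
    -- as suggested above
    CREATE TABLE news_user_items (
        user_id  VARCHAR(64) NOT NULL,
        item_id  BIGINT NOT NULL REFERENCES news_items_shared(id),
        starred  BOOLEAN NOT NULL DEFAULT FALSE,
        PRIMARY KEY (user_id, item_id)
    );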

@BernhardPosselt
Member

BernhardPosselt commented Mar 18, 2019

Right, it's a trade-off.

The reason I decided against it back in the day was simplicity: you make your data model a lot harder to use and change for a theoretical gain (the assumption being that people share the same feeds), especially when you can have many different versions of a feed.

Your assumption requires feed ids to be unique, which doesn't hold up in practice. You can get a different version of a feed based on authentication parameters, full-text feed settings, IP ranges, client headers, etc. You also need to anticipate future changes.

Even migrating this stuff could take days for certain installations (e.g. 1000+ users).

@smoebody

You do not want to delete unread items because you didn't read them yet. Works as intended.

No, it doesn't. It makes absolutely no sense to keep read items, so the threshold for autopurging was always meant to be the threshold for unread items. Every comment in the code indicates that; there is always talk of unread items regarding autopurging. In particular, the documentation for the autopurge threshold says:

Config.php:27

    private $autoPurgeCount;  // number of allowed unread articles per feed

itemService.php:250

    /**
     * This method deletes all unread feeds that are not starred and over the
     * count of $this->autoPurgeCount starting by the oldest. This is to clean
     * up the database so that old entries don't spam your db. As criteria for
     * old, the id is taken
     */
    public function autoPurgeOld()

itemMapper.php:335

    /**
     * Delete all items for feeds that have over $threshold unread and not
     * starred items
     *
     * @param int $threshold the number of items that should be deleted
     */
    public function deleteReadOlderThanThreshold($threshold)

I made PR #506, which removes the unread condition from the queries, and it's working for me. If anybody should doubt it, just increase the threshold from 200 to, let's say, 1000, and you will still save more items than you ever might want - without keeping a table full of garbage.

@BernhardPosselt
Member

BernhardPosselt commented May 12, 2019

No, it doesn't. It makes absolutely no sense to keep read items, so the threshold for autopurging was always meant to be the threshold for unread items

Autopurge works in the following way (even if comments are wonky, sorry but I implemented this stuff):

  • Per feed: keep a maximum number (the one you configure) of read items ordered by latest added
  • Delete all articles older than the stuff you marked as read which is not starred (basically stuff that you consumed but did not want to keep).

and you will still save more items than you ever might want - without keeping a table full of garbage.

The "garbage" that you mean is basically stuff that you did not read yet. That's why you have a feed reader. If you (don't) want to read it, mark it as read and it will be cleaned up for you automatically.

There isn't really a benefit in deleting unread items since data usage is minimal and the original use case for this feature is to not accumulate more and more data in your table that you are not interested in any more.

@piggz

piggz commented May 12, 2019

No, it doesn't. It makes absolutely no sense to keep read items, so the threshold for autopurging was always meant to be the threshold for unread items

Autopurge works in the following way (even if comments are wonky, sorry but I implemented this stuff):

* Per feed: keep a maximum number (the one you configure) of read items ordered by latest added

* Delete all articles older than the stuff you marked as read which is not starred (basically stuff that you consumed but did not want to keep).

and you will still save more items than you ever might want - without keeping a table full of garbage.

The "garbage" that you mean is basically stuff that you did not read yet. That's why you have a feed reader. If you (don't) want to read it, mark it as read and it will be cleaned up for you automatically.

There isn't really a benefit in deleting unread items since data usage is minimal and the original use case for this feature is to not accumulate more and more data in your table that you are not interested in any more.

Yes, but that is still a viable benefit... I have had cases of my Nextcloud db failing to upgrade because it had grown so large that I ran out of space on my hosting when the db upgrade occurred, and the size was taken up by unread articles. It's easy to say 'mark them as read', but that can mean endlessly scrolling through old stuff I'm never going to read. I'm sure a single option of 'also delete unread items', defaulting to off, would be good; but in the meantime, I may try the version from the PR.

@BernhardPosselt
Member

BernhardPosselt commented May 12, 2019

@piggz did you see the mark all as read button? And you can trigger the cleanup (aka the cron jobs) manually using the same php command that you use to update stuff or just wait until it runs as configured.

@piggz

piggz commented May 12, 2019

@BernhardPosselt I hadn't, although I don't use the web app; I use my own mobile app. I presume there is an API call I can make to mark all as read?

@BernhardPosselt
Member

Yep

@smoebody

smoebody commented May 13, 2019

@BernhardPosselt

Autopurge works in the following way (even if comments are wonky, sorry but I implemented this stuff):

  • Per feed: keep a maximum number (the one you configure) of read items ordered by latest added
  • Delete all articles older than the stuff you marked as read which is not starred (basically stuff that you consumed but did not want to keep).

Yes, this is exactly how it works, but tell me: what is the use of keeping read items after all? If you want to keep them, you should star them.

The "garbage" that you mean is basically stuff that you did not read yet. That's why you have a feed reader. If you (don't) want to read it, mark it as read and it will be cleaned up for you automatically.

There isn't really a benefit in deleting unread items since data usage is minimal and the original use case for this feature is to not accumulate more and more data in your table that you are not interested in any more.

Marking items as read that I do not want to read is a workaround, not a solution. Right now, items I do not want to read persist in the database and create garbage. This is a fact. Many people here complain about an ever-growing table because of this issue, including me. Several months have passed since my last comment, which is how long it took for my database to grow so much that I came back to this issue.

did you see the mark all as read button? And you can trigger the cleanup (aka the cron jobs) manually using the same php command that you use to update stuff or just wait until it runs as configured.

This is ridiculous. You suggest instructing all users of my Nextcloud that they have to mark all items as read, whether they read them or not, in order to keep my database clean. Mark all as read is useless as well, since it marks all items as read, including items which I might want to read because they are actually new items.

You should admit that at some point items are not news anymore and should be cleaned up automatically in order to keep the database sane. My PR does exactly that, and it simplifies the cleanup process as well. Please consider it.

@BernhardPosselt
Member

You should admit that at some point items are not news anymore and should be cleaned up automatically in order to keep the database sane. My PR does exactly that, and it simplifies the cleanup process as well. Please consider it.

RSS is not limited to news per se so your pull request would nuke stuff like podcasts and other things.

I think I understand your use case now: you basically want an incomplete stream of fresh entries that is limited to something like: "show me what's relevant for today"

That behavior is probably problematic on a global level: not every feed is strictly out of date, so if at all, this would have to be configured on a per-feed level IMHO. And then you probably want to keep stuff by date.

This is ridiculous. You suggest instructing all users of my Nextcloud that they have to mark all items as read, whether they read them or not, in order to keep my database clean. Mark all as read is useless as well, since it marks all items as read, including items which I might want to read because they are actually new items.

That is an entirely different issue: what about users importing tons of contacts and files? In general, you need to talk to and trust your users not to abuse it. On bigger installations this is probably an issue, but there you have a lot more resources available. I personally know of a setup for a Finnish university that works just fine.

What could be done though is to build in quota support as in: letting users only allocate a specific maximum amount of MBs in your db and otherwise refuse to update feeds. You can also expose that using the feed warning APIs in place. That way you don't have data loss for older items and a user should be able to fix it quickly.
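
As a rough way to see where such a quota would bite, the per-user item volume can be estimated from the existing tables (item count as a crude proxy for size; table and column names as in the logged queries above):

    -- items per user, heaviest users first
    SELECT feeds.user_id, COUNT(items.id) AS item_count
      FROM oc_news_items items
      JOIN oc_news_feeds feeds ON feeds.id = items.feed_id
     GROUP BY feeds.user_id
     ORDER BY item_count DESC;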

@smoebody

@BernhardPosselt

RSS is not limited to news per se so your pull request would nuke stuff like podcasts and other things.

How is that? Can you explain, please.

I think I understand your use case now: you basically want an incomplete stream of fresh entries that is limited to something like: "show me what's relevant for today"

That behavior is probably problematic on a global level: not every feed is strictly out of date, so if at all, this would have to be configured on a per-feed level IMHO. And then you probably want to keep stuff by date.

What do you mean by "not every feed is strictly out of date"? As I said: set the items to keep to 1000, which will hold 1000 items per feed. I cannot imagine a use case where anybody would want to read the 1001st item.

That is an entirely different issue: what about users importing tons of contacts and files? In general you need to talk to and trust your users to not abuse it.

Abuse and misuse are two different things. A user has to act manually to import files and contacts. That is on purpose, and the user can be held responsible for it.

A user importing 10 feeds, each generating 100 items per day (and such feeds exist), without reading them cannot be held responsible for that, although it fills the Nextcloud database with garbage.

What could be done though is to build in quota support as in: letting users only allocate a specific maximum amount of MBs in your db and otherwise refuse to update feeds. You can also expose that using the feed warning APIs in place. That way you don't have data loss for older items and a user should be able to fix it quickly.

I think you overcomplicate things. Also you bring the responsibility back to the user which is a bad idea IMO. The user will not understand.

I suggest keeping the items in a simple FIFO of a definable length (this is already there). If items disappear too soon, then increase the length; but making a difference between read and unread items regarding whether they should be kept or not leads to confusion and misunderstanding (which is already the case).
If this breaks how people are using the extension, please give an example. I cannot think of one.

Cheers

@BernhardPosselt
Member

BernhardPosselt commented May 13, 2019

What do you mean by "not every feed is strictly out of date"? As I said: set the items to keep to 1000, which will hold 1000 items per feed. I cannot imagine a use case where anybody would want to read the 1001st item.

Think of feeds containing images, videos, open issues on your customer management system (like github for instance where read means: done), error messages from logs, torrents, movie episodes and their availability, music etc. If you delete those, then you basically lose data. Therefore you can also not use a FIFO queue because you are deleting data that is not yet marked for deletion.

As an analogy: think of a bookmarks application that just deletes your old bookmarks because you went over the global limit.

If items disappear too soon, then increase the length

You don't really know what a safe amount of unread articles is. So people will increase it to something very big, which in turn will put in much more garbage than the current implementation.

I think you overcomplicate things. Also you bring the responsibility back to the user which is a bad idea IMO. The user will not understand.

You are arguing for deleting user data rather than explaining to the user that they can only keep a limited amount and their data usage is full. Lots of software like Dropbox and the Files app does that and it works just fine.

@smoebody

@BernhardPosselt

Think of feeds containing images, videos, open issues on your customer management system (like github for instance where read means: done), error messages from logs, torrents, movie episodes and their availability, music etc. If you delete those, then you basically lose data. Therefore you can also not use a FIFO queue because you are deleting data that is not yet marked for deletion.

I see. You are abusing the news-aggregator as a content-aggregator. Now it makes sense that you do not run into the problems described here. I use it as a replacement for https://feedly.com, which does not keep obsolete items.

As an analogy: think of a bookmarks application that just deletes your old bookmarks because you went over the global limit.

You are comparing direct data with indirect data. With a bookmarks application, I decide whether to store a new bookmark as well as whether to remove an old one. When I do nothing, no new bookmarks are created at all.
The news feed is the opposite: I do not decide which items are stored. However, you say that I have to decide what to keep and what to remove.

If items disappear too soon, then increase the length

You don't really know what a safe amount of unread articles is. So people will increase it to something very big, which in turn will put in much more garbage than the current implementation.

I think I said enough about the garbage the current implementation produces.

You are arguing for deleting user data rather than explaining to the user that they can only keep a limited amount and their data usage is full. Lots of software like Dropbox and the Files app does that and it works just fine.

I cannot follow your argumentation about deleting user data. I think we have different understandings of what a news item means.
For me it is an announcement of new information that becomes obsolete after a certain amount of time. With a news feed producing 10 new items per day, it is fine for me to keep only 20 days back, because I will not read them in any case.
A comic feed, producing one item per day (at most), will keep 200 days back, and still I will not read all of them.

Anyway, since I seem to be the only one having this issue with this app, I cannot be bothered anymore. I have a patch to the code, and it is easier to apply it after each update than to try to convince you of my perspective :)

Cheers

@BernhardPosselt
Member

I see. You are abusing the news-aggregator as a content-aggregator.

Yes. It really depends on what you are subscribing to and therefore deleting stuff should be pretty conservative and not surprising.

Dealing with high-frequency feeds in a manner like "show only stuff from the last 24h" could be appealing in certain cases, I guess. But it should be configurable on a per-feed basis and use a time span (compared to your quick PR, which is global and based on the number of items).

A comic feed, producing one item per day (at most), will keep 200 days back, and still I will not read all of them.

200 articles might sound like a lot, but they take up almost no space in your database. The comic feed that you mention is probably taking up 200 KB of disk storage at most. Sure, there are people who subscribe to 100 high-frequency feeds, don't read them, and maybe pile up an extreme of 1 GB of text inside their db over a few months. This issue, however, sounds more like a quota thing to me.

@programagor

programagor commented Oct 25, 2019

Having a per-feed setting like "Automatically delete all entries older than NUMBER days" would be perfect for my use-case.

@muppeth

muppeth commented Oct 27, 2019

At this moment I have 13 million unread news items vs 600,000 read (about 40 GB of data no one will ever read through). I wrote a small script that marks everything older than X days as read and then cleans up, but people started reporting weird issues where items re-appeared etc. I think, especially in the case of shared instances where people basically don't realize the litter they cause, there should be something that automatically purges things older than a certain age.

@pefimo

pefimo commented Feb 11, 2020

@BernhardPosselt Could you please give me a hint which integration test covers purging? (You've mentioned it earlier.) Method FeedService.updateAll starts with a comment stating it's not covered by any tests...

In another comment, you suggested it's a quota thing. Could you elaborate on that?

In general, I'm analysing the options we have to reduce the resource usage of a certain Nextcloud instance:

  1. Implement quota mechanism.
  2. Implement a plug-in, yet I don't think it would be easy to use a plug-in to make News purge unread items older than a given "age" configuration parameter.
  3. Purging both read and unread items with a command similar to the news:updater:after-update.

In order for such a quota mechanism to be safe and useful, it would need to:

  • be disabled by default;
  • refuse to update feeds that exceed the limit.

Any hints, comments?

@NilsGriebner

@pefimo Are you working on that? I'm interested in such a feature, too. After not reading my feeds for a year, this took down my database. In my opinion, a per-feed setting like "purge items after n days" would be great. I'm willing to help implement this feature.

@pefimo

pefimo commented Feb 24, 2020

@NilsGriebner Yes I am, but slowly... At the moment I'm analysing the code, setting up development environment and so on. I'd be very glad to cooperate -- thanks!

However, I don't visit Github often. Do you think we could move this cooperation somewhere else? You can reach me via email (pfm at untalkative.one) or jabber (pfm at disroot.org).

@NilsGriebner

@pefimo I'll contact you.

@NilsGriebner

NilsGriebner commented Feb 28, 2020

@BernhardPosselt

@pefimo and I discussed this over the last few days and agreed on this high-level concept:

We'd like to implement a quota system so that an administrator is able to create quotas and assign them to specific users. A quota data structure would consist of:

  • Name
  • Description
  • Storage limit in bytes
  • Limit in days

The quotas would be persisted in an additional db table. The relation between user and quota could be realized with a foreign key on the user table (see the sketch below). This architecture would enable us to expire feed items that are too old and/or keep track of disk space.

What do you think? Could this be a valid solution?

PS: @pefimo thank you for the awesome collaboration. It's a great pleasure working with you 😺 ❤️
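
For illustration only, the structure described above might translate into something like this (hypothetical table and column names, not an agreed schema):

    -- one row per quota definition
    CREATE TABLE news_quotas (
        id             BIGINT PRIMARY KEY,
        name           VARCHAR(64) NOT NULL,
        description    VARCHAR(255),
        storage_limit  BIGINT,   -- bytes; NULL = unlimited
        age_limit      INT       -- days;  NULL = keep forever
    );

    -- assignment of a quota to a user (a separate relation rather than a
    -- column on the core user table)
    CREATE TABLE news_user_quotas (
        user_id   VARCHAR(64) NOT NULL PRIMARY KEY,
        quota_id  BIGINT NOT NULL REFERENCES news_quotas(id)
    );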

@pefimo

pefimo commented Feb 28, 2020

Thank you too, @NilsGriebner! I'm looking forward to the implementation phase. 🙂

@BernhardPosselt
Member

@pefimo sounds good

@pefimo

pefimo commented Nov 19, 2020

Hi @muppeth @programagor @smoebody @danopz @piggz, you may be interested in the enhancement request I've just submitted: #953.

@SMillerDev added the API Impact (API/Backend code) label Dec 11, 2020
@cdauth
Contributor

cdauth commented Feb 20, 2021

This truly is a blocker. I believe on my instance only two people are using the news app. Each time I update the news app, my instance stays in maintenance mode for several hours because it takes so long to update the news table, making Nextcloud unusable during that time for all users.

I cannot understand how the attempts to fix this have been rejected with the argument that it breaks existing functionality. The current implementation makes Nextcloud completely unusable in some situations. How is some people losing some old news items more relevant than people not being able to access their files, calendar, passwords etc. for several hours?

I agree that when I'm subscribed to a blog that makes one or two posts a month, I wouldn't like it if unread (and even read) items were removed after one or two months (or ever). But when I'm subscribed to a news website that posts 100 articles a day, it just doesn't make sense to keep the last 100000 unread items in the database. With the infinite scrolling that the News app uses, it would probably take me a week to even scroll down that far, and my browser would probably crash on the way.

Both as a user and as an administrator, the behaviour that I would expect would be that there is a limit somewhere between 1000 and 10000 news items (read or unread) per feed.

@Grotax
Member

Grotax commented Feb 20, 2021

Check the Changelog to find a solution to your problem.

Otherwise, make a PR and implement it, or pay someone to do it for you. We are not paid to work on this application; we have other things in our lives that we need to deal with, and we are investing our free time to improve this application. We are not a company that you pay to provide you a service; read the license, it states clearly what this is.

Locked.

@nextcloud locked and limited conversation to collaborators Feb 20, 2021
@anoymouserver
Contributor

See the new FAQ entry for this topic.
