Indexes the same size as messages? #73
Comments
I don't see anything approaching your usage in real-world use (or in CI/unit testing). If you are not using substring search (which doesn't appear to be active in your config), the expectation is that the index should take up about 10% of message data size. (Substring indexes are more like 40+% of message size, because Xapian doesn't natively search substrings, so we have to index all component strings.) And that's what I see, picking a fairly large mailbox (Trash), which should have a collection of various message data:
356M index for a 5+G mailbox, i.e. <= 10%. So you'll have to debug locally more, since this is not reproducible on my side. Note that disk usage of mailbox using You can try rebuilding the indexes for a mailbox ( You can also try manually optimizing the Xapian indexes ( Otherwise, not sure. Xapian is optimized more for utility and stability than for disk usage, so it's possible that your use case isn't all that efficient for its design. |
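For anyone wanting to run the same comparison on their own store, here is a minimal sketch of the arithmetic above. It is not part of flatcurve or Dovecot; `dir_size` and `index_ratio` are hypothetical helper names, and the location of the index directory depends on your configuration. Note that it sums apparent file sizes, which can differ from the allocated disk blocks a tool like `du` reports.

```python
import os


def dir_size(path):
    """Sum the apparent size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def index_ratio(index_bytes, mail_bytes):
    """Return index size as a fraction of message data size."""
    if mail_bytes <= 0:
        raise ValueError("mailbox size must be positive")
    return index_bytes / mail_bytes


MIB = 1024 ** 2
GIB = 1024 ** 3

# Figures quoted above: 356M of index for a mailbox of at least 5G.
ratio = index_ratio(356 * MIB, 5 * GIB)
print(f"index is {ratio:.1%} of message data")  # roughly 7%
```

On a real system you would pass the results of `dir_size()` for the index directory and for the mail directory to `index_ratio()`. A ratio near or above 1.0 without substring search enabled would match the behavior reported in this issue.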
Yeah, so the vsize for the I've deleted and re-created the indexes many times, mostly trying to figure out where it's all going. Also I had a system-wide optimize run scheduled for each night and it seemed like every time that ran it inflated the indexes more. It was after a few nights of that when the disk was exhausted even though I had already done a full index run. One thing I'm also struggling with is trying to stem the problem by avoiding indexes on things like archive mailboxes that are large, old, and static. However, Anyway, I'm definitely not up for debugging it at a low level, so I guess I'll have to disable it. Super bummer because the fts-xapian plugin seems crashy, solr was very heavyweight and seemed to not work very well. Aside from eating all my disk, fts-flatcurve has been a breath of fresh air :) |
Definitely sounds like there is something different about the data in there that is causing different indexing behavior, but I can't comment on anything without analyzing it in detail, unfortunately.
That seems weird. I've never seen an optimize increase the size of indexes. As mentioned, xapian indexes are fairly sparse (i.e., if you delete a message it doesn't recover the space), so I really can't explain this.
That's not a flatcurve issue then, that's a core Dovecot issue. So if you can reproduce, or at least provide a nice reproducible test case, we would love a report in the core repo (https://github.com/dovecot/core).
OK, that's getting a bit more specific. Product-wise, there is no way we can ship a solution that doesn't index every message in the mailbox. (That's the number one complaint about FTS searches: if a message is not indexed, it doesn't appear in search results.) So it is very unlikely we will ever ship a solution for this; you would have to alter the source to get this done.
Understood, although we are (finally) getting close to a 2.4 release where flatcurve is the main search solution, so it would be great to get any feedback. That being said, this (currently) doesn't sound like a core bug, more like an optimization issue. Maintenance of flatcurve is being moved to core, so further debugging should probably be moved there. Anyway, I will close this ticket, as there is nothing obviously buggy about the code given the current info. |
I've got v1.0.5 built for dovecot 2.3.21 (in ubuntu 24.04) and it's working. However, it seems like the indexes grow and grow to be at least as large as, if not larger than, the messages themselves, which I assume is not intended behavior.
The above mailbox is almost all text, although it does have some attachments. Here's another example which is not quite 1:1:
but it still seems like a huge amount of fts index data for the content, no?
After running fts-flatcurve for a while and everything being indexed (I've also been unsuccessful in convincing it to not index everything including archive mailboxes) I end up running out of space in my mail store, even though that filesystem is less than half full of actual mail.
Are my expectations wrong? Any help with debugging what my problem is? fts-flatcurve config mostly just cobbled from examples:
Any help would be appreciated.