Indexes the same size as messages? #73

Closed
kk7ds opened this issue Dec 31, 2024 · 3 comments
Comments

@kk7ds

kk7ds commented Dec 31, 2024

I've got v1.0.5 built for dovecot 2.3.21 (in ubuntu 24.04) and it's working. However, it seems like the indexes grow and grow until they are at least as large as, if not larger than, the messages themselves, which I assume is not intended behavior.

# du -ch  --max-depth 0 dsmith/Maildir/.Sys.Postmaster/{fts-flatcurve,cur}
61M	dsmith/Maildir/.Sys.Postmaster/fts-flatcurve
61M	dsmith/Maildir/.Sys.Postmaster/cur
122M	total

The above mailbox is almost all text, although it does have some attachments. Here's another example which is not quite 1:1:

# du -ch  --max-depth 0 dsmith/Maildir/{fts-flatcurve,cur}
225M	dsmith/Maildir/fts-flatcurve
841M	dsmith/Maildir/cur

but it still seems like a huge amount of fts index data for the content, no?

After running fts-flatcurve for a while and everything being indexed (I've also been unsuccessful in convincing it to not index everything including archive mailboxes) I end up running out of space in my mail store, even though that filesystem is less than half full of actual mail.

Are my expectations wrong? Any help with debugging what my problem is? fts-flatcurve config mostly just cobbled from examples:

plugin {
    fts = flatcurve
    fts_languages = en
    fts_tokenizers = generic email-address
    fts_tokenizer_email_address = maxlen=100
    fts_tokenizer_generic = algorithm=simple maxlen=100
    fts_filters = normalizer-icu snowball stopwords
    fts_filters_en = lowercase snowball english-possessive stopwords

    fts_autoindex = yes
    fts_enforced = yes

    fts_autoindex_exclude = \Trash
    fts_autoindex_exclude2 = Archives*
    fts_autoindex_exclude3 = \Junk
    fts_autoindex_exclude4 = Spam
    fts_autoindex_exclude5 = spam
    fts_autoindex_exclude6 = \Sent

    fts_autoindex_max_recent_msgs = 999
}

Any help would be appreciated.

@slusarz
Owner

slusarz commented Dec 31, 2024

I don't see anything approaching your usage in real-world use (or in the CI/unit testing).

If not using substring search (which doesn't appear to be active in your config), the expectation is that the index should take up about 10% of message data size. (Substring indexes are more like 40+% of message size, because Xapian doesn't natively search substrings, so we have to index all component strings.)

And that's what I see, picking a fairly large mailbox (Trash), which should have a collection of various message data:

# du -chs /srv/mail/user/mdbox/mailboxes/Trash/dbox-Mails/fts-flatcurve/
356.0M  /srv/mail/user/mdbox/mailboxes/Trash/dbox-Mails/fts-flatcurve/
356.0M  total

# doveadm mailbox status -u user "messages vsize" Trash
Trash messages=90456 vsize=5594174696

# doveadm fts-flatcurve stats -u user Trash
Trash guid=457bc41733e62a5195650000e5379cf6 last_uid=283523 messages=90455 shards=4 version=1

A 356M index for a 5+G mailbox is <= 10%.

So you'll have to debug locally more, since this is not reproducible on my side.

Note that the on-disk size of a mailbox is not all that useful if you are using mailbox compression: indexing takes up about 10% of the uncompressed message data, so comparing to physical storage is irrelevant. You might want to try the vsize field of doveadm mailbox status, like I used above, to see the actual virtual size of the mailbox, which is the important metric for indexing.
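The comparison described above (index size against vsize rather than against on-disk mail size) can be scripted. This is a minimal sketch, not part of flatcurve or Dovecot: `index_pct` and `vsize_of` are hypothetical helper names, the status line is assumed to have the `vsize=<bytes>` shape shown in the output above, and the `356.0M` figure is assumed to mean MiB.

```shell
#!/bin/sh
# Express an FTS index size as a percentage of the mailbox vsize.
# Both arguments are byte counts. index_pct is a hypothetical helper.
index_pct() {
    awk -v idx="$1" -v vsize="$2" 'BEGIN {
        if (vsize <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * idx / vsize
    }'
}

# Pull the vsize=<bytes> field out of a line shaped like the
# "doveadm mailbox status" output quoted above (assumed format).
vsize_of() {
    printf '%s\n' "$1" | sed -n 's/.*vsize=\([0-9][0-9]*\).*/\1/p'
}

# Numbers from the Trash example: 356.0M (assumed MiB) of index
# against vsize=5594174696. Prints 6.7.
index_pct $((356 * 1024 * 1024)) \
    "$(vsize_of 'Trash messages=90456 vsize=5594174696')"
```

On a live system the two inputs would come from something like `du -sb` on the fts-flatcurve directory (a GNU du option) and the `doveadm mailbox status` call shown above. A result far above the roughly 10% expectation for a non-substring index is what would be worth investigating.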

You can try rebuilding the indexes for a mailbox (doveadm fts-flatcurve remove followed by doveadm index) to see if that helps.

You can also try manually optimizing the Xapian indexes (doveadm fts optimize). Xapian indexes are fairly sparse, so if you've deleted a lot of messages you'll have tons of extra disk usage until the indexes are compacted again.
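The rebuild and optimize steps suggested above could be wrapped as follows. This is a sketch under assumptions: `run` and `flatcurve_rebuild` are hypothetical helpers, the user and mailbox names are placeholders, and the argument order for `doveadm fts-flatcurve remove` is assumed to mirror the `stats` call shown earlier. By default it only prints the commands; setting `RUN=1` executes them.

```shell
#!/bin/sh
# Dry-run wrapper: print the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Rebuild and compact the flatcurve index for one mailbox, then show stats.
# $1 = user, $2 = mailbox (both placeholders supplied by the caller).
flatcurve_rebuild() {
    run doveadm fts-flatcurve remove -u "$1" "$2" &&
    run doveadm index -u "$1" "$2" &&
    run doveadm fts optimize -u "$1" &&
    run doveadm fts-flatcurve stats -u "$1" "$2"
}
```

Calling `flatcurve_rebuild user Trash` without `RUN=1` lists the four commands in order, which makes it easy to review them before anything touches the indexes. Measuring the index directory before and after each step would show which one, if any, changes the size.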

Otherwise, not sure. Xapian is optimized more for utility and stability rather than disk usage, so it's possible that your use case isn't all that efficient for its design.

@kk7ds
Author

kk7ds commented Dec 31, 2024

Yeah, so the vsize for the Sys.Postmaster box above is 65M (vs 61M on disk). My inbox though is 2.2G so I guess the index of ~200M is about right for that one. Not sure why the postmaster one is 10x worse.

I've deleted and re-created the indexes many times, mostly trying to figure out where it's all going. Also I had a system-wide optimize run scheduled for each night and it seemed like every time that ran it inflated the indexes more. It was after a few nights of that when the disk was exhausted even though I had already done a full index run.

One thing I'm also struggling with is trying to stem the problem by avoiding indexes on things like archive mailboxes that are large, old, and static. However, doveadm index and optimize both seem to ignore the autoindex_exclude lists, and index the mailbox anyway. It's super frustrating because I'd really just like to index inbox and nothing else, but it seems hard/impossible to limit it like that.

I'd really like to be able to also limit the index to like the last 24 months or something in addition to excluding Archive*. I have some users with every email they've ever received in their inbox, which I suspect is not that unusual, but I really don't care to allow them to search for 20-year-old emails.

Anyway, I'm definitely not up for debugging it at a low level, so I guess I'll have to disable it. Super bummer because the fts-xapian plugin seems crashy, solr was very heavyweight and seemed to not work very well. Aside from eating all my disk, fts-flatcurve has been a breath of fresh air :)

@slusarz
Owner

slusarz commented Jan 4, 2025

> Yeah, so the vsize for the Sys.Postmaster box above is 65M (vs 61M on disk). My inbox though is 2.2G so I guess the index of ~200M is about right for that one. Not sure why the postmaster one is 10x worse.

Definitely sounds like there is something different about the data in there that is causing different indexing behavior, but I can't comment on anything without analyzing it in detail, unfortunately.

> I've deleted and re-created the indexes many times, mostly trying to figure out where it's all going. Also I had a system-wide optimize run scheduled for each night and it seemed like every time that ran it inflated the indexes more.

That seems weird. I've never seen an optimize increase the size of indexes. As mentioned, xapian indexes are fairly sparse (i.e., if you delete a message it doesn't recover the space), so I really can't explain this.

> However, doveadm index and optimize both seem to ignore the autoindex_exclude lists, and index the mailbox anyway. It's super frustrating because I'd really just like to index inbox and nothing else, but it seems hard/impossible to limit it like that.

That's not a flatcurve issue then, that's a core Dovecot issue. So if you can reproduce, or at least provide a nice reproducible test case, we would love a report in the core repo (https://github.com/dovecot/core).
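Until that is resolved, a scheduled job can choose by name which mailboxes it passes to `doveadm index`, instead of indexing everything. The following is only a sketch: `should_index` and `filter_mailboxes` are hypothetical helpers, the patterns are assumptions loosely based on the name-based exclude entries in the config quoted earlier, and special-use flags such as `\Trash` cannot be matched this way. It only limits what the scheduled run touches. Depending on settings such as `fts_enforced`, a search in a skipped mailbox may still cause it to be indexed.

```shell
#!/bin/sh
# Decide by mailbox name whether a scheduled index run should touch it.
# The pattern list is an assumption; adjust it to the local folder names.
should_index() {
    case "$1" in
        Archives*|Spam|spam|Trash|Junk|Sent) return 1 ;;
        *) return 0 ;;
    esac
}

# Read a newline-separated mailbox list on stdin and print only the
# names that should_index accepts.
filter_mailboxes() {
    while IFS= read -r mbox; do
        if should_index "$mbox"; then
            printf '%s\n' "$mbox"
        fi
    done
}
```

A possible use is `doveadm mailbox list -u user | filter_mailboxes | while IFS= read -r m; do doveadm index -u user "$m"; done`. For the simplest case described above, running `doveadm index -u user INBOX` on its own avoids the filter entirely.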

> I'd really like to be able to also limit the index to like the last 24 months or something in addition to excluding Archive*. I have some users with every email they've ever received in their inbox, which I suspect is not that unusual, but I really don't care to allow them to search for 20-year-old emails.

OK, that's getting a bit more specific. Product-wise, there is no way we can ship a solution that doesn't index every message in the mailbox. (That's the number one complaint about FTS searches: a message that is not indexed doesn't appear in search results.) So it is very unlikely we will ever ship a solution for this; you would have to alter the source to get this done.

> Anyway, I'm definitely not up for debugging it at a low level, so I guess I'll have to disable it. Super bummer because the fts-xapian plugin seems crashy, solr was very heavyweight and seemed to not work very well. Aside from eating all my disk, fts-flatcurve has been a breath of fresh air :)

Understood, although we are (finally) getting close to a 2.4 release where flatcurve is the main search solution, so it would be great to get any feedback. That being said, this (currently) doesn't sound like a core bug, more like an optimization issue. Maintenance of flatcurve is also being moved to core, so further debugging should probably be moved there.

Anyway, I will close this ticket, as there is nothing obviously buggy about the code given the current info.

@slusarz slusarz closed this as completed Jan 4, 2025