
Cache mx_server_is_in? to speed up disposable_mx_server? #1

Open · wants to merge 6 commits into main
Conversation

@jurajjuricic-st commented Feb 20, 2025

PR to upstream: micke#283

end
end

MX_SERVERS_CACHE[cache_key] = result unless cache_key.nil?

Review comment:

This is not thread-safe, which might lead to multiple threads trying to calculate the value for the same key. Probably not a big deal, though.
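As a rough sketch only (illustrative names, assuming MX_SERVERS_CACHE is a plain Hash constant), the write could be wrapped in a Mutex:

MX_SERVERS_CACHE = {}
MX_SERVERS_CACHE_MUTEX = Mutex.new

def store_mx_result(cache_key, result)
  return result if cache_key.nil?

  # Serialize writes so two threads never race on the same key;
  # the first value stored for a key wins and is returned to both callers.
  MX_SERVERS_CACHE_MUTEX.synchronize do
    MX_SERVERS_CACHE[cache_key] = result unless MX_SERVERS_CACHE.key?(cache_key)
    MX_SERVERS_CACHE[cache_key]
  end
end

Given the point below that identical inputs produce identical results, the lock would mainly avoid redundant work rather than fix a correctness bug.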

Author:

If the inputs are the same (i.e., no edge case where a DNS record changes during execution), the result will be the same, so it doesn't matter if two threads write at the same time... I think 🤔

mx_servers_str = mx_servers.map(&:exchange).map(&:to_s).sort.join
return domain if mx_servers_str == ""

"#{domain_list.object_id}_#{domain_list.length}_#{mx_servers_str.downcase}"

Review comment:

Can we use a hashing function here instead? I might be missing the idea of this cache key function, though.

Zlib.crc32("#{domain_list.length}_#{mx_servers_str.downcase}")
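(Sketch only: Zlib is in the standard library but needs require "zlib", and the helper name here is made up rather than taken from the diff. The trade-off is that CRC32 can collide, so two different MX sets could occasionally share a key.)

require "zlib"

def mx_cache_key(domain_list, mx_servers_str)
  # CRC32 turns the potentially long concatenated MX string into a small integer key.
  Zlib.crc32("#{domain_list.length}_#{mx_servers_str.downcase}")
end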

Author:

In the case of checking against disposable MX servers, domain_list is a set of 160k disposable domains.
It could be something else, though, so I thought we'd capture the object id to avoid misfiring the cache between similar requests, while still not having to iterate over the whole set.

I also thought about some kind of consistent sampling of the domain list, but maybe we don't need to go there?

Review comment:

I think the current approach seems reasonable, and we can see what the maintainers think when you open a PR against the open source project.

Review comment:

Btw, object_id is unique only while the object exists. So whenever an object is garbage collected, its id might be assigned to a new object. It doesn't reflect the content of the object.
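If that ever becomes a problem, one alternative (just a sketch, not part of this PR, assuming domain_list is an enumerable of strings) would be to fingerprint the list contents once when it is loaded and use that instead of object_id:

require "digest"

# Iterates the ~160k entries a single time at load; the digest is then reused
# for every cache key, so per-lookup cost stays low and the key survives GC.
def domain_list_fingerprint(domain_list)
  Digest::SHA256.hexdigest(domain_list.sort.join("\n"))
end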

Author:

Yeah, I'm aware of that, and... that's the compromise. In my book, as long as the cache is either valid or invalidated, it's fine.
In this case, if the object gets garbage collected, the cache entry implicitly gets invalidated and the service will recalculate the query the next time it arrives (with a new domain_list and its object_id). Yes, it might slow things down a bit every now and then, but it's good enough for our use case and probably most others (bulk import and similar).
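For reference, the overall shape of what is being cached looks roughly like this (simplified sketch using stdlib Resolv and a naive suffix match; the gem's actual method names and matching logic differ):

require "resolv"

MX_SERVERS_CACHE = {}

def cached_mx_server_is_in?(domain, domain_list)
  # DNS lookup for the domain's MX records.
  mx_servers = Resolv::DNS.open { |dns| dns.getresources(domain, Resolv::DNS::Resource::IN::MX) }
  mx_servers_str = mx_servers.map(&:exchange).map(&:to_s).sort.join
  return false if mx_servers_str.empty?

  cache_key = "#{domain_list.object_id}_#{domain_list.length}_#{mx_servers_str.downcase}"
  return MX_SERVERS_CACHE[cache_key] if MX_SERVERS_CACHE.key?(cache_key)

  # The expensive part being cached: scanning the disposable-domain set per MX host.
  result = mx_servers.any? do |mx|
    host = mx.exchange.to_s.downcase
    domain_list.any? { |candidate| host == candidate || host.end_with?(".#{candidate}") }
  end

  MX_SERVERS_CACHE[cache_key] = result
end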
