Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache mx_server_is_in? to speed up disposable_mx_server? #1

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 1 addition & 9 deletions lib/valid_email2/address.rb
Original file line number Diff line number Diff line change
Expand Up @@ -130,15 +130,7 @@ def domain_is_in?(domain_list)
end

def mx_server_is_in?(domain_list)
@dns.mx_servers(address.domain).any? { |mx_server|
return false unless mx_server.respond_to?(:exchange)

mx_server = mx_server.exchange.to_s

domain_list.any? { |domain|
mx_server.end_with?(domain) && mx_server =~ /\A(?:.+\.)*?#{domain}\z/
}
}
@dns.mx_servers_disposable?(address.domain, domain_list)
end

def address_contain_multibyte_characters?
Expand Down
31 changes: 30 additions & 1 deletion lib/valid_email2/dns.rb
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ module ValidEmail2
class Dns
MAX_CACHE_SIZE = 1_000
CACHE = {}

MX_SERVERS_CACHE = {}
CacheEntry = Struct.new(:records, :cached_at, :ttl)

def self.prune_cache
Expand All @@ -26,6 +26,26 @@ def mx_servers(domain)
fetch(domain, Resolv::DNS::Resource::IN::MX)
end

def mx_servers_disposable?(domain, domain_list)
servers = mx_servers(domain)
cache_key = generate_mx_cache_key(domain, domain_list, servers)
return MX_SERVERS_CACHE[cache_key] if !cache_key.nil? && MX_SERVERS_CACHE.key?(cache_key)

result = servers.any? do |mx_server|
return false unless mx_server.respond_to?(:exchange)

mx_server = mx_server.exchange.to_s

domain_list.any? do |disposable_domain|
mx_server.end_with?(disposable_domain) && mx_server.match?(/\A(?:.+\.)*?#{disposable_domain}\z/)
end
end

MX_SERVERS_CACHE[cache_key] = result unless cache_key.nil?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not thread-safe, which might lead to multiple threads trying to calculate value for the same key. Probably not a big deal though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the inputs are the same (i.e., no edge case in DNS record change happening during execution), the result will be the same so it doesn't matter if two threads write at the same time... I think 🤔


result
end

def a_servers(domain)
fetch(domain, Resolv::DNS::Resource::IN::A)
end
Expand Down Expand Up @@ -67,5 +87,14 @@ def resolv_config
config[:nameserver] = @dns_nameserver if @dns_nameserver
config
end

def generate_mx_cache_key(domain, domain_list, mx_servers)
return if mx_servers.empty? || domain_list.empty?

mx_servers_str = mx_servers.map(&:exchange).map(&:to_s).sort.join
return domain if mx_servers_str.empty?

"#{domain_list.object_id}_#{domain_list.length}_#{mx_servers_str.downcase}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use hashing function here instead? I might be missing the idea of this cache key function tho.

Zlib.crc32("#{domain_list.length}_#{mx_servers_str.downcase}")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of checking against disposable MX servers, domain_list is a set of 160k disposable domains.
It could be something else tho, so I thought we'd capture the object id to avoid misfiring cache between similar requests, while still not having to iterate over the whole set.

I also thought about some kind of consistent sampling of the domain list, but maybe we don't need to go there?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current approach seems reasonable and we can see what the maintainers think when you open a PR with the open source project.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, object_id is unique only while object exist. So whenever object is garbage collected, this id might be assigned to a new object. It doesn't refer to the content of the object.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm aware of that and... that's the compromise. In my book, as long as the cache is either valid or invalidated, it's fine.
In this case, if object gets garbage collected then the cache entry implicitly gets invalidated and the service will recalculate the query next time it arrives (with a new domain_list and its object_id). Yes, it might slow things down a bit every now and then, but it's good enough for our and probably most other use cases (bulk import and similar).

end
end
end
45 changes: 43 additions & 2 deletions spec/address_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -111,14 +111,55 @@
let(:email_instance) { described_class.new(email_address, dns_instance) }
let(:ttl) { 1_000 }
let(:mock_resolv_dns) { instance_double(Resolv::DNS) }
let(:mock_mx_records) { [double("MX", exchange: "mx.ymail.com", preference: 10, ttl: ttl)] }
let(:mock_mx_records) { [double("MX", exchange: "mx.ymail.com", preference: 10, ttl:)] }

before do
allow(email_instance).to receive(:null_mx?).and_return(false)
allow(Resolv::DNS).to receive(:open).and_yield(mock_resolv_dns)
allow(mock_resolv_dns).to receive(:timeouts=)
end

describe "#disposable_mx_server?" do
let(:disposable_email_address) { "[email protected]" }
let(:disposable_mx_server) { ValidEmail2.disposable_emails.select { |domain| domain.count(".") == 1 }.sample }
let(:disposable_email_instance) { described_class.new(disposable_email_address, dns_instance) }
let(:mock_disposable_mx_records) { [double("MX", exchange: "mx.#{disposable_mx_server}", preference: 10, ttl:)] }

before do
allow(mock_resolv_dns).to receive(:getresources)
.with(disposable_email_instance.address.domain, Resolv::DNS::Resource::IN::MX)
.and_return(mock_disposable_mx_records)

allow(mock_resolv_dns).to receive(:getresources)
.with(email_instance.address.domain, Resolv::DNS::Resource::IN::MX)
.and_return(mock_mx_records)
end

it "is false if the MX server is not in the disposable_emails list" do
expect(email_instance).not_to be_disposable_mx_server
end

it "is true if the MX server is in the disposable_emails list" do
expect(disposable_email_instance).to be_disposable_mx_server
end

it "is false and then true when the MX record changes from non-disposable to disposable" do
allow(mock_resolv_dns).to receive(:getresources)
.with(disposable_email_instance.address.domain, Resolv::DNS::Resource::IN::MX)
.and_return(mock_mx_records) # non-disposable MX records

expect(disposable_email_instance).not_to be_disposable_mx_server

ValidEmail2::Dns.clear_cache

allow(mock_resolv_dns).to receive(:getresources)
.with(disposable_email_instance.address.domain, Resolv::DNS::Resource::IN::MX)
.and_return(mock_disposable_mx_records) # disposable MX records

expect(disposable_email_instance).to be_disposable_mx_server
end
end

describe "#valid_strict_mx?" do
let(:cached_at) { Time.now }
let(:mock_cache_data) { { [email_instance.address.domain, Resolv::DNS::Resource::IN::MX] => ValidEmail2::Dns::CacheEntry.new(mock_mx_records, cached_at, ttl) } }
Expand Down Expand Up @@ -251,7 +292,7 @@
describe "#valid_mx?" do
let(:cached_at) { Time.now }
let(:mock_cache_data) { { [email_instance.address.domain, Resolv::DNS::Resource::IN::MX] => ValidEmail2::Dns::CacheEntry.new(mock_a_records, cached_at, ttl) } }
let(:mock_a_records) { [double("A", address: "192.168.1.1", ttl: ttl)] }
let(:mock_a_records) { [double("A", address: "192.168.1.1", ttl:)] }

before do
allow(email_instance).to receive(:mx_servers).and_return(mock_mx_records)
Expand Down
15 changes: 13 additions & 2 deletions spec/benchmark_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -5,14 +5,25 @@
describe "Performance testing" do
let(:disposable_domain) { ValidEmail2.disposable_emails.first }

it "has acceptable lookup performance" do
it "disposable_domain? has acceptable lookup performance" do
address = ValidEmail2::Address.new("[email protected]")

# preload list and check size
expect(ValidEmail2.disposable_emails).to be_a(Set)
expect(ValidEmail2.disposable_emails.count).to be > 30000

# check lookup timing
expect { address.disposable_domain? }.to perform_under(0.0001).sample(10).times
expect { address.disposable_domain? }.to perform_under(0.0001).sec.sample(10).times
end

it "disposable_mx_server? has acceptable lookup performance" do
address = ValidEmail2::Address.new("[email protected]")

# preload list and check size
expect(ValidEmail2.disposable_emails).to be_a(Set)
expect(ValidEmail2.disposable_emails.count).to be > 30000

# check lookup timing
expect { address.disposable_mx_server? }.to perform_under(0.0001).sec.warmup(1).times.sample(10).times
end
end