
Memcached Benchmark

Dor Laor edited this page Feb 19, 2015 · 12 revisions

The following describes the Memcached benchmark in enough detail to make it reproducible. Let us know if anything is missing.

Latest Results (Feb 19)

Raw data:

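| CPUs | Seastar Memcached with DPDK | Stock Memcached (multi-process) | Stock Memcached (multi-threaded) |
|------|-----------------------------|---------------------------------|----------------------------------|
| 2    | 553,175                     | 350,844                         | 321,287                          |
| 4    | 1,021,918                   | 615,270                         | 573,149                          |
| 6    | 1,703,790                   | 857,428                         | 709,502                          |
| 8    | 2,149,162                   | 1,102,417                       | 741,356                          |
| 10   | 2,629,885                   | 1,335,069                       | 608,014                          |
| 12   | 2,870,919                   | 1,528,598                       | 608,968                          |
| 14   | 3,217,044                   | 1,726,642                       | 440,658                          |
| 16   | 3,460,167                   | 1,887,060                       | 603,479                          |
| 18   | 4,049,397                   | 2,167,573                       | 902,192                          |
| 20   | 4,426,457                   | 2,281,064                       | 1,128,469                        |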

As you can see, at 20 cores Seastar's Memcached server is about 4X faster than the stock multi-threaded Memcached. The latter suffers from various locking issues, especially the mutex_trylock busy-wait loop. To squeeze more performance out of stock Memcached, we also ran it as multiple single-threaded processes that share nothing. It's not a fair comparison, since this way memory isn't shared and some responsibility and complexity shift to the client. Even with this approach, Seastar outperforms stock Memcached by roughly 2X at 20 cores.
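The ratios quoted above can be checked directly against the 20-core row of the raw data. A small sketch (the helper name `speedup` is hypothetical, introduced only for this illustration):

```shell
# speedup A B: print A / B rounded to two decimals.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 4426457 1128469   # Seastar vs. multi-threaded stock, 20 cores: 3.92
speedup 4426457 2281064   # Seastar vs. multi-process stock, 20 cores:  1.94
```

The gap to the multi-threaded server varies with core count. It is widest at 14 cores, where the same calculation (3,217,044 / 440,658) gives 7.30.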

It is worth noting that Seastar was designed for much more complex scenarios than Memcached and should excel even more when a high level of parallelism is needed.

Collectd/Graphite statistics

The stats were retrieved using Graphite and the internal collectd client during a run with 4 cores. The top right graph shows the packet coalescing rate: as the load increases (the top left graph shows the idle time shrinking to zero), each packet processing round handles 30 packets.

The bottom right graph shows the number of tasks executed per core, which is 1,250,000 per second. Note that this counts Seastar tasks, not Memcached operations. The bottom left graph shows the number of network packets each core handles (in this setup, between 200k and 250k per second).

Linux perf data

Let's compare the most CPU-hungry functions in each configuration using perf top. First, let's look at Seastar's perf data when running on DPDK. The kernel code is completely out of the picture. Not surprisingly, the most CPU-intensive function is the hash function, followed by memory deallocation and allocation.

Percent Binary Function
8.54% memcached boost::intrusive::hashtable_impl<boost::intrusive::mhtraits<memcache::item,
4.67% memcached memory::cpu_pages::free
3.74% memcached deleter::~deleter
2.84% memcached promise<>::promise
2.72% memcached rte_pktmbuf_alloc
2.51% memcached ixgbe_xmit_pkts
2.50% memcached memory::small_pool::allocate
2.49% libc-2.20.so
2.02% memcached dpdk::dpdk_qp::send
1.95% memcached net::interface::dispatch_packet
1.94% memcached promise<>::~promise
1.82% memcached memcache_ascii_parser::parse
1.74% memcached _ZNO6futureII11foreign_ptrIN5boost13intrusive_ptrIN8memcache4itemILb0EEEEEEEE6rescueIZN17smp_message_queue15async_work_itemIZN11distributedINS3_5cache
1.71% memcached net::ipv4::get_packet
1.33% libc-2.20.so
1.33% memcached net::packet::packet
1.24% memcached scattered_message::append_static<unsigned
1.22% memcached memcache::ascii_protocol::handle
1.22% memcached net::tcp<net::ipv4_traits>::tcb::output_one
1.21% memcached future<temporary_buffer
1.14% memcached net::packet::impl::allocate_if_needed
1.10% memcached std::_Hashtable<net::l4connid<net::ipv4_traits>,
1.09% memcached net::packet::share
1.08% memcached smp_message_queue::process_queue<2ul,
1.02% memcached memory::cpu_pages::allocate_small
1.01% memcached net::packet::impl::allocate
0.93% memcached memory::cpu_pages::translate
0.91% memcached memcache::intrusive_ptr_release
0.91% memcached dpdk::dpdk_qp::tx_buf_factory::get
0.88% memcached _ZZN3net3tcpINS_11ipv4_traitsEEC4ERNS_7ipv4_l4ILNS_15ip_protocol_numE6EEEENUlvE_clEv
0.82% memcached memory::allocate
0.81% memcached memcache_ascii_parser::parse(char*,
0.76% memcached promise<>::set_value
0.74% memcached net::native_connected_socket_impl<net::tcp<net::ipv4_traits
0.74% memcached net::ipv4::send(net::ipv4_address,
0.70% memcached smp_message_queue::async_work_item<std::enable_if<is_future<future<foreign_ptr<boost::intrusive_ptr<memcache::item
0.64% memcached net::native_connected_socket_impl<net::tcp<net::ipv4_traits
0.61% memcached reactor::run_tasks
0.61% memcached memory::free
0.60% memcached _ZNSt17_Function_handlerIFNSt12experimental8optionalIN3net6packetEEEvEZNS2_9interfaceC4ESt10shared_ptrINS2_6deviceEEEUlvE0_E9_M_invokeERKSt9_Any_data
0.60% memcached future_state<>::operator=
0.55% memcached dpdk::dpdk_qp::tx_buf::reset_zc
0.55% memcached net::tcp<net::ipv4_traits>::tcb::can_send
0.53% memcached lw_shared_ptr<memcache::tcp_server::connection>::~lw_shared_ptr
0.53% memcached _ZN6futureII11foreign_ptrIN5boost13intrusive_ptrIN8memcache4itemILb0EEEEEEEE4thenIZNS3_14ascii_protocolILb0EE10handle_getILb0EEES_IIEER13output_stream
0.52% memcached net::tcp<net::ipv4_traits>::received
0.52% memcached net::packet::allocate_headroom
0.50% memcached do_until_continued<memcache::tcp_server::start()::{lambda()#1}::operator()()
0.49% memcached _ZZN6futureIIEE4thenIZN3net28native_connected_socket_implINS2_3tcpINS2_11ipv4_traitsEEEE23native_data_source_impl3getEvEUlvE0_EENSt9result_ofIFT_vEE4t
0.49% memcached foreign_ptr<boost::intrusive_ptr<memcache::item
0.49% memcached net::tcp<net::ipv4_traits>::tcb::send

Next is the perf top output of Seastar running on its POSIX network stack. Now the kernel is all over the place, and the hash function shows up only in 11th place!

Percent Binary Function
3.81% [kernel] ipt_do_table
3.22% [kernel] copy_user_enhanced_fast_string
2.74% [kernel] tcp_sendmsg
2.34% [kernel] sock_poll
2.25% [kernel] __nf_conntrack_find_get
2.17% [kernel] _raw_spin_lock
2.09% [kernel] __fget
1.77% [kernel] _raw_spin_lock_irqsave
1.67% [kernel] __skb_clone
1.65% memcached memory::cpu_pages::free
1.61% memcached boost::intrusive::hashtable_impl<boost::intrusive::mhtraits<memcache::item
1.36% [kernel] reschedule_interrupt
1.35% [kernel] nf_iterate
1.32% [kernel] sock_has_perm
1.18% [kernel] __ip_local_out
1.17% [kernel] tcp_poll
1.15% memcached memory::small_pool::allocate
1.08% [kernel] tcp_packet
1.06% [kernel] nf_conntrack_in
1.04% [kernel] skb_entail
1.04% memcached reactor::run_tasks
1.03% [kernel] avc_has_perm
1.03% memcached promise<temporary_buffer
1.02% [kernel] sys_epoll_ctl
0.97% [kernel] tcp_recvmsg
0.91% [kernel] tcp_transmit_skb
0.89% libc-2.20.so
0.87% [kernel] __alloc_skb
0.86% memcached memcache_ascii_parser::parse
0.72% memcached promise<>::promise
0.72% [kernel] _raw_spin_lock_bh
0.70% [kernel] system_call_after_swapgs
0.61% [kernel] tcp_rearm_rto
0.60% memcached memory::cpu_pages::allocate_small
0.57% memcached deleter::~deleter
0.56% [kernel] __local_bh_enable_ip
0.54% [kernel] selinux_file_permission
0.54% memcached scattered_message::append_static<unsigned
0.54% [kernel] tcp_v4_rcv
0.51% [kernel] ip_queue_xmit
0.51% [kernel] tcp_write_xmit
0.50% [kernel] selinux_socket_sock_rcv_skb
0.49% memcached reactor_backend_epoll::get_epoll_future
0.49% memcached net::packet::packet
0.48% [kernel] sock_def_readable
0.46% [kernel] ip_finish_output
0.46% memcached do_until_continued<memcache::tcp_server::start()::{lambda()#1}::operator()()
0.45% [kernel] tcp_wfree
0.44% [kernel] system_call
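
To put a number on how much of the profile the kernel occupies, one can total the sampled percentages per binary. The following is a minimal sketch, assuming a listing has been saved as text in the same three-column layout as above (percent, binary, function). The function name `perf_share_by_binary` is hypothetical.

```shell
# Sum the "Percent" column per "Binary" for a perf top listing read from stdin.
# Only lines whose first field looks like "3.81%" are counted, so the
# "Percent Binary Function" header is skipped automatically.
perf_share_by_binary() {
    awk '$1 ~ /^[0-9.]+%$/ {
             pct = $1
             sub(/%$/, "", pct)      # strip the trailing percent sign
             total[$2] += pct        # accumulate by binary name
         }
         END {
             for (bin in total) printf "%s %.2f\n", bin, total[bin]
         }' | LC_ALL=C sort
}
```

Running `perf_share_by_binary < listing.txt` (where `listing.txt` is a placeholder file name) prints one line per binary. The listings on this page stop at roughly 0.45%, so the totals cover only the symbols shown, not the whole profile.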

Test bed:

  • Server 1: Memcached server
  • Server 2: Memcached client (memaslap)

Software

Server 1 (Stock Memcached) Setup:

  • Memcached version 1.4.17
  • One, single threaded, Memcached process per CPU
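
This page does not list the exact command lines used for the multi-process runs. One plausible way to start one single-threaded process per CPU is sketched below. The port numbering (11211 plus the CPU index), the use of `taskset` for pinning, the function name, and the `DRY_RUN` switch are all assumptions made for illustration; `-t` (thread count) and `-p` (TCP port) are standard memcached options.

```shell
# Start one single-threaded memcached per CPU, each pinned to its own core.
# Assumptions (not taken from the benchmark page): ports start at 11211 and
# increase by one per process; DRY_RUN=1 prints the commands instead of
# running them.
start_memcached_per_cpu() {
    local ncpu=$1 base_port=${2:-11211} i=0
    while [ "$i" -lt "$ncpu" ]; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "taskset -c $i memcached -t 1 -p $((base_port + i))"
        else
            taskset -c "$i" memcached -t 1 -p "$((base_port + i))" &
        fi
        i=$((i + 1))
    done
}
```

With this layout the client has to spread its keys across the ports itself, which is the extra client-side complexity mentioned in the results discussion.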

Server 1 (Seastar Memcached with DPDK) Setup:

  1. Fetch DPDK from upstream (support for i40e is not sufficient in 1.8.0)
  2. Update config/common_linuxapp:
     • set CONFIG_RTE_MBUF_REFCNT to 'n'
     • set CONFIG_RTE_MAX_MEMSEG=4096
  3. Follow the instructions in the Seastar README on DPDK installation for 1.8.0
  4. Define hugepages: 2048,2048 pages
  5. Compile Seastar
  6. Run the server:
     sudo build/release/apps/memcached/memcached --network-stack native --dpdk-pmd --dhcp 0 --host-ipv4-addr $seastar_ip --netmask-ipv4-addr 255.255.255.0 --collectd 0 --smp $cpu
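
The two changes to config/common_linuxapp listed above can be applied with `sed`. This is a minimal sketch, assuming GNU sed and assuming both options already appear in the file as `NAME=value` lines. The function name `patch_dpdk_config` is hypothetical.

```shell
# Set CONFIG_RTE_MBUF_REFCNT=n and CONFIG_RTE_MAX_MEMSEG=4096 in a DPDK
# build config. Rewrites existing lines in place; it does not append
# options that are missing from the file.
patch_dpdk_config() {
    local cfg=$1
    sed -i \
        -e 's/^CONFIG_RTE_MBUF_REFCNT=.*/CONFIG_RTE_MBUF_REFCNT=n/' \
        -e 's/^CONFIG_RTE_MAX_MEMSEG=.*/CONFIG_RTE_MAX_MEMSEG=4096/' \
        "$cfg"
}
```

From the root of the DPDK source tree this would be invoked as `patch_dpdk_config config/common_linuxapp`, before building DPDK.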

Server 2 (memaslap) Setup

  • memaslap from libmemcached-1.0.18
  • Disable irqbalance
  • Pin the IRQ smp_affinity of the 40Gb card so that each interrupt is handled by a single CPU
  • For $cpu < 6, run 12 memaslap processes:
    for ((i = 0; i < 12; ++i)); do taskset -c $i memaslap -s $seastar_ip:11211 -t 60s -T 1 -c 60 -X 64 & done
  • For $cpu >= 6, run 52 memaslap processes:
    for ((i = 0; i < 52; ++i)); do taskset -c $i memaslap -s $seastar_ip:11211 -t 60s -T 1 -c 60 -X 64 & done
  • Verify that there are no misses in each test, and restart memcached for each test
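
The two launch loops above can be folded into one helper that picks the number of client processes from the server's core count. This is a minimal sketch, assuming `$seastar_ip` is set as in the original commands. The function names and the `DRY_RUN` switch are hypothetical additions for illustration; the memaslap flags are copied unchanged from the commands above.

```shell
# Number of memaslap processes for a given server core count:
# 12 clients when the server uses fewer than 6 cores, 52 otherwise.
memaslap_clients() {
    if [ "$1" -lt 6 ]; then echo 12; else echo 52; fi
}

# Launch the clients, one pinned to each client CPU, and wait for them.
# DRY_RUN=1 prints the commands instead of running them.
run_memaslap() {
    local server_cpus=$1 n i=0
    n=$(memaslap_clients "$server_cpus")
    while [ "$i" -lt "$n" ]; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "taskset -c $i memaslap -s $seastar_ip:11211 -t 60s -T 1 -c 60 -X 64"
        else
            taskset -c "$i" memaslap -s "$seastar_ip:11211" -t 60s -T 1 -c 60 -X 64 &
        fi
        i=$((i + 1))
    done
    wait   # returns at once in dry-run mode; otherwise blocks until all clients exit
}
```

For example, `run_memaslap "$cpu"` starts the 60-second run that matches the `--smp $cpu` value given to the server.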

Hardware

Same as HTTPD Test
