In this lesson you will learn about BPF maps, the persistent storage mechanism available to BPF programs. The assignments will give you hands-on experience with extending the “value” size/content, and reading the contents from userspace.
In this lesson we will only cover two simple map types: BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_PERCPU_ARRAY.
Creating a BPF map is done by defining a global struct bpf_map_def (in xdp_prog_kern.c), placed in a special SEC("maps") section, as below:
struct bpf_map_def SEC("maps") xdp_stats_map = {
	.type        = BPF_MAP_TYPE_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(struct datarec),
	.max_entries = XDP_ACTION_MAX,
};
BPF maps are generic key/value stores (hence the key_size and value_size parameters), with a given type and a maximum number of allowed entries (max_entries). Here we focus on the simple BPF_MAP_TYPE_ARRAY, which means max_entries array elements get allocated when the map is first created.
The BPF map is accessible from both the BPF program (kernel) side and from userspace. How this is done and how they differ is part of this lesson.
It is worth pointing out that everything goes through the bpf syscall. This means that the user space program must create the maps and programs with separate invocations of the bpf syscall. So how does a BPF program reference a BPF map?
This happens by first loading all the BPF maps, and storing their corresponding file descriptors (FDs). Then the ELF relocation table is used to identify each reference the BPF program makes to a given map; each such reference is then rewritten, so the BPF byte code instructions use the right map FD for each map.
All this needs to be done before the BPF program itself can be loaded into the kernel. Fortunately, the libbpf library handles the ELF object decoding and map reference relocation, transparently to the user space program performing the loads.
As you learned in basic02, the libbpf API has "objects" and functions that work on/with these objects. The struct bpf_object represents the ELF object itself (and is returned from our load_bpf_and_xdp_attach() function).
Similarly to what we did for BPF programs, our loader has a function called find_map_fd() (in xdp_load_and_stats.c), which uses the library function bpf_object__find_map_by_name() to find the struct bpf_map object with a given name. (Note that the map name comes from the ELF object and can be longer than the name the kernel stores after loading the map.) After finding the struct bpf_map, we obtain the map file descriptor via bpf_map__fd(). There is also a libbpf function that wraps these two steps, called bpf_object__find_map_fd_by_name().
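For orientation, here is a minimal sketch of what such a lookup helper could look like; the error handling in the tutorial's own find_map_fd() may differ:

int find_map_fd(struct bpf_object *bpf_obj, const char *mapname)
{
	struct bpf_map *map;

	/* Lookup the struct bpf_map by the name given in the ELF object */
	map = bpf_object__find_map_by_name(bpf_obj, mapname);
	if (!map) {
		fprintf(stderr, "ERR: cannot find map by name: %s\n", mapname);
		return -1;
	}

	/* Return the FD the kernel assigned when the map was created */
	return bpf_map__fd(map);
}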
The contents of a map are read from userspace via the function bpf_map_lookup_elem(), which is a simple syscall wrapper that operates on the map file descriptor (FD). The syscall looks up the key and stores the value into the memory area supplied by the value pointer. It is up to the calling userspace program to ensure that the memory allocated to hold the returned value is large enough to store the type of data contained in the map. In our example we demonstrate how userspace can query the map FD and get back some info in struct bpf_map_info via the syscall wrapper bpf_obj_get_info_by_fd().
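As a rough sketch (assuming the includes already present in xdp_load_and_stats.c, e.g. bpf/bpf.h and stdio.h), reading one element and querying the map info could look like this; the fixed XDP_PASS key and the error handling are illustrative:

static int read_stats_once(int map_fd)
{
	struct bpf_map_info info = { 0 };
	__u32 info_len = sizeof(info);
	struct datarec value = { 0 };
	__u32 key = XDP_PASS; /* array maps are indexed by __u32 keys */

	/* Ask the kernel about the map behind this FD (type, sizes, ...) */
	if (bpf_obj_get_info_by_fd(map_fd, &info, &info_len) != 0) {
		fprintf(stderr, "ERR: cannot get info for map fd %d\n", map_fd);
		return -1;
	}

	/* Lookup one key; caller supplies memory large enough for the value */
	if (bpf_map_lookup_elem(map_fd, &key, &value) != 0) {
		fprintf(stderr, "ERR: lookup failed for key 0x%X\n", key);
		return -1;
	}

	printf("XDP_PASS rx_packets: %llu\n",
	       (unsigned long long)value.rx_packets);
	return 0;
}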
For example, the program xdp_load_and_stats
will periodically read the
xdp_stats_map value and produce some stats.
The assignments have “hint” marks in the code via Assignment#num
comments.
The current assignment code only counts packets. It is your assignment to extend this to also count bytes.
Notice how the BPF map xdp_stats_map used:
.value_size = sizeof(struct datarec)
The BPF map itself has no knowledge about the data structure used for the value record; it only knows the size. (The BPF Type Format (BTF) is an advanced topic that allows associating data-structure knowledge with a map via debug info, but we ignore that for now.) Thus, it is up to the two sides (userspace and the BPF-prog kernel side) to ensure they stay in sync on the content and structure of the value. The hint here about the data structure used comes from sizeof(struct datarec), which indicates that struct datarec is used.
This struct datarec is defined in the include file common_kern_user.h as:
/* This is the data record stored in the map */
struct datarec {
	__u64 rx_packets;
	/* Assignment#1: Add byte counters */
};
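The later steps in this lesson reference an rx_bytes member, so after solving Assignment#1 the record will likely look roughly like this:

struct datarec {
	__u64 rx_packets;
	__u64 rx_bytes; /* Assignment#1: byte counter */
};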
The next step is to update the kernel-side BPF program: xdp_prog_kern.c.
To figure out the length of the packet, you need to learn about the context variable *ctx, of type struct xdp_md, that the BPF program gets a pointer to when it is invoked by the kernel. This struct xdp_md is a little odd, as all members have type __u32. However, these are not their real data types: access to this data structure is remapped by the kernel when the program is loaded, so that accesses actually go through struct xdp_buff and struct xdp_rxq_info.
struct xdp_md {
	/* (Note: type __u32 is NOT the real-type) */
	__u32 data;
	__u32 data_end;
	__u32 data_meta;
	/* Below access go through struct xdp_rxq_info */
	__u32 ingress_ifindex; /* rxq->dev->ifindex */
	__u32 rx_queue_index;  /* rxq->queue_index */
};
While we know this, the compiler doesn’t. So we need to type-cast the fields into void pointers before we can use them:
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
The next step is calculating the number of bytes in each packet by simply subtracting data from data_end, and updating the datarec member:
__u64 bytes = data_end - data; /* Calculate packet length */
lock_xadd(&rec->rx_bytes, bytes);
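Putting the pieces together, the kernel-side function could end up looking roughly like this; the section name and the fixed XDP_PASS key are illustrative, and it assumes the lock_xadd() helper, xdp_stats_map and struct datarec already present in xdp_prog_kern.c:

SEC("xdp_stats1")
int xdp_stats1_func(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data     = (void *)(long)ctx->data;
	struct datarec *rec;
	__u32 key = XDP_PASS;
	__u64 bytes;

	/* Lookup the record for this action in the array map */
	rec = bpf_map_lookup_elem(&xdp_stats_map, &key);
	if (!rec)
		return XDP_ABORTED;

	/* Count the packet and its length atomically */
	bytes = data_end - data;
	lock_xadd(&rec->rx_packets, 1);
	lock_xadd(&rec->rx_bytes, bytes);

	return XDP_PASS;
}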
Now it is time to update the userspace program that reads the stats (in xdp_load_and_stats.c). Update the functions:
- map_collect() to also collect rx_bytes
- stats_print() to also print rx_bytes (adjust the format string)
Notice how the BPF map xdp_stats_map we defined above is actually an array, with max_entries=XDP_ACTION_MAX. The idea is to keep stats per (enum) xdp_action, but our program does not yet take advantage of this.
The assignment is to extend the userspace stats tool (in xdp_load_and_stats.c) to collect and print these extra stats.
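As a rough starting point, a plain BPF_MAP_TYPE_ARRAY can simply be walked key by key; map_fd is the FD obtained earlier via find_map_fd(), and how this fits into the existing map_collect()/stats_print() split (and mapping the numeric action to a name) is left to you:

struct datarec value = { 0 };
__u32 key;

/* Walk every xdp_action index stored in the array map */
for (key = 0; key < XDP_ACTION_MAX; key++) {
	if (bpf_map_lookup_elem(map_fd, &key, &value) != 0)
		continue;
	printf("action:%u packets:%llu bytes:%llu\n", key,
	       (unsigned long long)value.rx_packets,
	       (unsigned long long)value.rx_bytes);
}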
Thus far, we have used atomic operations to increment our stats counters; however, this is expensive as it inserts memory barriers to make sure different CPUs don’t garble each other’s data. We can avoid this by using another array type that stores its data in per-CPU storage. The drawback of this is that we move the burden of summing to userspace.
To achieve this, the first step is to change the map type (in xdp_prog_kern.c) to BPF_MAP_TYPE_PERCPU_ARRAY. If you only make this change, the userspace program will detect it and complain, as we query the map FD for some info (via bpf_obj_get_info_by_fd()) and e.g. check the map type. Remember that it is userspace's responsibility to make sure the data record for the value is large enough.
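The change itself is a one-line edit to the map definition shown earlier; only the .type field differs:

struct bpf_map_def SEC("maps") xdp_stats_map = {
	.type        = BPF_MAP_TYPE_PERCPU_ARRAY, /* changed from BPF_MAP_TYPE_ARRAY */
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(struct datarec),
	.max_entries = XDP_ACTION_MAX,
};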
The next step is writing a function, in xdp_load_and_stats.c, that gets the values per CPU and sums them. You can copy-paste this and call it from the switch-case statement in the function map_collect():
/* BPF_MAP_TYPE_PERCPU_ARRAY */
void map_get_value_percpu_array(int fd, __u32 key, struct datarec *value)
{
	/* For percpu maps, userspace gets a value per possible CPU */
	unsigned int nr_cpus = bpf_num_possible_cpus();
	struct datarec values[nr_cpus];
	__u64 sum_bytes = 0;
	__u64 sum_pkts = 0;
	int i;

	if (bpf_map_lookup_elem(fd, &key, values) != 0) {
		fprintf(stderr,
			"ERR: bpf_map_lookup_elem failed key:0x%X\n", key);
		return;
	}

	/* Sum values from each CPU */
	for (i = 0; i < nr_cpus; i++) {
		sum_pkts  += values[i].rx_packets;
		sum_bytes += values[i].rx_bytes;
	}
	value->rx_packets = sum_pkts;
	value->rx_bytes   = sum_bytes;
}
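One possible way to call it from the switch-case in map_collect() is to dispatch on the map type reported by bpf_obj_get_info_by_fd(); map_get_value_array() here is a hypothetical helper for the plain array case, and the tutorial's own structure may differ:

switch (map_type) {
case BPF_MAP_TYPE_ARRAY:
	map_get_value_array(fd, key, &value);        /* hypothetical helper */
	break;
case BPF_MAP_TYPE_PERCPU_ARRAY:
	map_get_value_percpu_array(fd, key, &value); /* defined above */
	break;
default:
	fprintf(stderr, "ERR: unknown map type: %u\n", map_type);
	break;
}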