<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/bpf/hashtab.c, branch v5.10</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.10</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.10'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-11-06T03:55:57Z</updated>
<entry>
<title>bpf: Zero-fill re-used per-cpu map element</title>
<updated>2020-11-06T03:55:57Z</updated>
<author>
<name>David Verbeiren</name>
<email>david.verbeiren@tessares.net</email>
</author>
<published>2020-11-04T11:23:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d3bec0138bfbe58606fc1d6f57a4cdc1a20218db'/>
<id>urn:sha1:d3bec0138bfbe58606fc1d6f57a4cdc1a20218db</id>
<content type='text'>
Zero-fill element values for all other cpus than current, just as
when not using prealloc. This is the only way the bpf program can
ensure known initial values for all cpus ('onallcpus' cannot be
set when coming from the bpf program).

The scenario is: bpf program inserts some elements in a per-cpu
map, then deletes some (or userspace does). When later adding
new elements using bpf_map_update_elem(), the bpf program can
only set the value of the new elements for the current cpu.
When prealloc is enabled, previously deleted elements are re-used.
Without the fix, values for other cpus remain whatever they were
when the re-used entry was previously freed.

A selftest is added to validate correct operation in above
scenario as well as in case of LRU per-cpu map element re-use.

Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
Signed-off-by: David Verbeiren &lt;david.verbeiren@tessares.net&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: Matthieu Baerts &lt;matthieu.baerts@tessares.net&gt;
Acked-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20201104112332.15191-1-david.verbeiren@tessares.net
</content>
</entry>
<entry>
<title>bpf: Allow for map-in-map with dynamic inner array map entries</title>
<updated>2020-10-11T17:21:04Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2020-10-10T23:40:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4a8f87e60f6db40e640f1db555d063b2c4dea5f1'/>
<id>urn:sha1:4a8f87e60f6db40e640f1db555d063b2c4dea5f1</id>
<content type='text'>
Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
("bpf: Relax max_entries check for most of the inner map types") added support
for dynamic inner max elements for most map-in-map types. Exceptions were maps
like array or prog array where the map_gen_lookup() callback uses the maps'
max_entries field as a constant when emitting instructions.

We recently implemented Maglev consistent hashing into Cilium's load balancer
which uses map-in-map with an outer map being hash and inner being array holding
the Maglev backend table for each service. This has been designed this way in
order to reduce overall memory consumption given the outer hash map allows to
avoid preallocating a large, flat memory area for all services. Also, the
number of service mappings is not always known a-priori.

The use case for dynamic inner array map entries is to further reduce memory
overhead, for example, some services might just have a small number of back
ends while others could have a large number. Right now the Maglev backend table
for small and large number of backends would need to have the same inner array
map entries which adds a lot of unneeded overhead.

Dynamic inner array map entries can be realized by avoiding the inlined code
generation for their lookup. The lookup will still be efficient since it will
be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
inline code generation and relaxes array_map_meta_equal() check to ignore both
maps' max_entries. This also still allows to have faster lookups for map-in-map
when BPF_F_INNER_MAP is not specified and hence dynamic max_entries not needed.

Example code generation where inner map is dynamic sized array:

  # bpftool p d x i 125
  int handle__sys_enter(void * ctx):
  ; int handle__sys_enter(void *ctx)
     0: (b4) w1 = 0
  ; int key = 0;
     1: (63) *(u32 *)(r10 -4) = r1
     2: (bf) r2 = r10
  ;
     3: (07) r2 += -4
  ; inner_map = bpf_map_lookup_elem(&amp;outer_arr_dyn, &amp;key);
     4: (18) r1 = map[id:468]
     6: (07) r1 += 272
     7: (61) r0 = *(u32 *)(r2 +0)
     8: (35) if r0 &gt;= 0x3 goto pc+5
     9: (67) r0 &lt;&lt;= 3
    10: (0f) r0 += r1
    11: (79) r0 = *(u64 *)(r0 +0)
    12: (15) if r0 == 0x0 goto pc+1
    13: (05) goto pc+1
    14: (b7) r0 = 0
    15: (b4) w6 = -1
  ; if (!inner_map)
    16: (15) if r0 == 0x0 goto pc+6
    17: (bf) r2 = r10
  ;
    18: (07) r2 += -4
  ; val = bpf_map_lookup_elem(inner_map, &amp;key);
    19: (bf) r1 = r0                               | No inlining but instead
    20: (85) call array_map_lookup_elem#149280     | call to array_map_lookup_elem()
  ; return val ? *val : -1;                        | for inner array lookup.
    21: (15) if r0 == 0x0 goto pc+1
  ; return val ? *val : -1;
    22: (61) r6 = *(u32 *)(r0 +0)
  ; }
    23: (bc) w0 = w6
    24: (95) exit

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2020-09-22T23:45:34Z</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2020-09-22T23:45:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3ab0a7a0c349a1d7beb2bb371a62669d1528269d'/>
<id>urn:sha1:3ab0a7a0c349a1d7beb2bb371a62669d1528269d</id>
<content type='text'>
Two minor conflicts:

1) net/ipv4/route.c, adding a new local variable while
   moving another local variable and removing it's
   initial assignment.

2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
   One pretty prints the port mode differently, whilst another
   changes the driver to try and obtain the port mode from
   the port node rather than the switch node.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: Do not use bucket_lock for hashmap iterator</title>
<updated>2020-09-04T00:36:41Z</updated>
<author>
<name>Yonghong Song</name>
<email>yhs@fb.com</email>
</author>
<published>2020-09-02T23:53:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dc0988bbe1bd41e2fa555e4a6f890b819a34b49b'/>
<id>urn:sha1:dc0988bbe1bd41e2fa555e4a6f890b819a34b49b</id>
<content type='text'>
Currently, for hashmap, the bpf iterator will grab a bucket lock, a
spinlock, before traversing the elements in the bucket. This can ensure
all bpf visted elements are valid. But this mechanism may cause
deadlock if update/deletion happens to the same bucket of the
visited map in the program. For example, if we added bpf_map_update_elem()
call to the same visited element in selftests bpf_iter_bpf_hash_map.c,
we will have the following deadlock:

  ============================================
  WARNING: possible recursive locking detected
  5.9.0-rc1+ #841 Not tainted
  --------------------------------------------
  test_progs/1750 is trying to acquire lock:
  ffff9a5bb73c5e70 (&amp;htab-&gt;buckets[i].raw_lock){....}-{2:2}, at: htab_map_update_elem+0x1cf/0x410

  but task is already holding lock:
  ffff9a5bb73c5e20 (&amp;htab-&gt;buckets[i].raw_lock){....}-{2:2}, at: bpf_hash_map_seq_find_next+0x94/0x120

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&amp;htab-&gt;buckets[i].raw_lock);
    lock(&amp;htab-&gt;buckets[i].raw_lock);

   *** DEADLOCK ***
   ...
  Call Trace:
   dump_stack+0x78/0xa0
   __lock_acquire.cold.74+0x209/0x2e3
   lock_acquire+0xba/0x380
   ? htab_map_update_elem+0x1cf/0x410
   ? __lock_acquire+0x639/0x20c0
   _raw_spin_lock_irqsave+0x3b/0x80
   ? htab_map_update_elem+0x1cf/0x410
   htab_map_update_elem+0x1cf/0x410
   ? lock_acquire+0xba/0x380
   bpf_prog_ad6dab10433b135d_dump_bpf_hash_map+0x88/0xa9c
   ? find_held_lock+0x34/0xa0
   bpf_iter_run_prog+0x81/0x16e
   __bpf_hash_map_seq_show+0x145/0x180
   bpf_seq_read+0xff/0x3d0
   vfs_read+0xad/0x1c0
   ksys_read+0x5f/0xe0
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ...

The bucket_lock first grabbed in seq_ops-&gt;next() called by bpf_seq_read(),
and then grabbed again in htab_map_update_elem() in the bpf program, causing
deadlocks.

Actually, we do not need bucket_lock here, we can just use rcu_read_lock()
similar to netlink iterator where the rcu_read_{lock,unlock} likes below:
 seq_ops-&gt;start():
     rcu_read_lock();
 seq_ops-&gt;next():
     rcu_read_unlock();
     /* next element */
     rcu_read_lock();
 seq_ops-&gt;stop();
     rcu_read_unlock();

Compared to old bucket_lock mechanism, if concurrent updata/delete happens,
we may visit stale elements, miss some elements, or repeat some elements.
I think this is a reasonable compromise. For users wanting to avoid
stale, missing/repeated accesses, bpf_map batch access syscall interface
can be used.

Signed-off-by: Yonghong Song &lt;yhs@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20200902235340.2001375-1-yhs@fb.com
</content>
</entry>
<entry>
<title>bpf: Introduce sleepable BPF programs</title>
<updated>2020-08-28T19:20:33Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@kernel.org</email>
</author>
<published>2020-08-27T22:01:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1e6c62a8821557720a9b2ea9617359b264f2f67c'/>
<id>urn:sha1:1e6c62a8821557720a9b2ea9617359b264f2f67c</id>
<content type='text'>
Introduce sleepable BPF programs that can request such property for themselves
via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
to use helpers like bpf_copy_from_user() that might sleep. At present only
fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
when they are attached to kernel functions that are known to allow sleeping.

The non-sleepable programs are relying on implicit rcu_read_lock() and
migrate_disable() to protect life time of programs, maps that they use and
per-cpu kernel structures used to pass info between bpf programs and the
kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
should not be enclosed in migrate_disable() as well. Therefore
rcu_read_lock_trace is used to protect the life time of sleepable progs.

There are many networking and tracing program types. In many cases the
'struct bpf_prog *' pointer itself is rcu protected within some other kernel
data structure and the kernel code is using rcu_dereference() to load that
program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
Instead sleepable bpf programs are allowed with bpf trampoline only. The
program pointers are hard-coded into generated assembly of bpf trampoline and
synchronize_rcu_tasks_trace() is used to protect the life time of the program.
The same trampoline can hold both sleepable and non-sleepable progs.

When rcu_read_lock_trace is held it means that some sleepable bpf program is
running from bpf trampoline. Those programs can use bpf arrays and preallocated
hash/lru maps. These map types are waiting on programs to complete via
synchronize_rcu_tasks_trace();

Updates to trampoline now has to do synchronize_rcu_tasks_trace() and
synchronize_rcu_tasks() to wait for sleepable progs to finish and for
trampoline assembly to finish.

This is the first step of introducing sleepable progs. Eventually dynamically
allocated hash maps can be allowed and networking program types can become
sleepable too.

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Acked-by: Andrii Nakryiko &lt;andriin@fb.com&gt;
Acked-by: KP Singh &lt;kpsingh@google.com&gt;
Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
</content>
</entry>
<entry>
<title>bpf: Add map_meta_equal map ops</title>
<updated>2020-08-28T13:41:30Z</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2020-08-28T01:18:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f4d05259213ff1e91f767c91dcab455f68308fac'/>
<id>urn:sha1:f4d05259213ff1e91f767c91dcab455f68308fac</id>
<content type='text'>
Some properties of the inner map is used in the verification time.
When an inner map is inserted to an outer map at runtime,
bpf_map_meta_equal() is currently used to ensure those properties
of the inserting inner map stays the same as the verification
time.

In particular, the current bpf_map_meta_equal() checks max_entries which
turns out to be too restrictive for most of the maps which do not use
max_entries during the verification time.  It limits the use case that
wants to replace a smaller inner map with a larger inner map.  There are
some maps do use max_entries during verification though.  For example,
the map_gen_lookup in array_map_ops uses the max_entries to generate
the inline lookup code.

To accommodate differences between maps, the map_meta_equal is added
to bpf_map_ops.  Each map-type can decide what to check when its
map is used as an inner map during runtime.

Also, some map types cannot be used as an inner map and they are
currently black listed in bpf_map_meta_alloc() in map_in_map.c.
It is not unusual that the new map types may not aware that such
blacklist exists.  This patch enforces an explicit opt-in
and only allows a map to be used as an inner map if it has
implemented the map_meta_equal ops.  It is based on the
discussion in [1].

All maps that support inner map has its map_meta_equal points
to bpf_map_meta_equal in this patch.  A later patch will
relax the max_entries check for most maps.  bpf_types.h
counts 28 map types.  This patch adds 23 ".map_meta_equal"
by using coccinelle.  -5 for
	BPF_MAP_TYPE_PROG_ARRAY
	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
	BPF_MAP_TYPE_STRUCT_OPS
	BPF_MAP_TYPE_ARRAY_OF_MAPS
	BPF_MAP_TYPE_HASH_OF_MAPS

The "if (inner_map-&gt;inner_map_meta)" check in bpf_map_meta_alloc()
is moved such that the same error is returned.

[1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next</title>
<updated>2020-08-04T01:27:40Z</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2020-08-04T01:27:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2e7199bd773bff3220184d071ed9c9cd34950e51'/>
<id>urn:sha1:2e7199bd773bff3220184d071ed9c9cd34950e51</id>
<content type='text'>
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-08-04

The following pull-request contains BPF updates for your *net-next* tree.

We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 135 files changed, 4603 insertions(+), 1013 deletions(-).

The main changes are:

1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF
   syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko.

2) Add BPF iterator for map elements and to iterate all BPF programs for efficient
   in-kernel inspection, from Yonghong Song and Alexei Starovoitov.

3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid
   unwinder errors, from Song Liu.

4) Allow cgroup local storage map to be shared between programs on the same
   cgroup. Also extend BPF selftests with coverage, from YiFei Zhu.

5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM
   load instructions, from Jean-Philippe Brucker.

6) Follow-up fixes on BPF socket lookup in combination with reuseport group
   handling. Also add related BPF selftests, from Jakub Sitnicki.

7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for
   socket create/release as well as bind functions, from Stanislav Fomichev.

8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct
   xdp_statistics, from Peilin Ye.

9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime.

10) Extend BPF kernel test infra with skb-&gt;family and skb-&gt;{local,remote}_ip{4,6}
    fields and allow user space to specify skb-&gt;dev via ifindex, from Dmitry Yakunin.

11) Fix a bpftool segfault due to missing program type name and make it more robust
    to prevent them in future gaps, from Quentin Monnet.

12) Consolidate cgroup helper functions across selftests and fix a v6 localhost
    resolver issue, from John Fastabend.
====================

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2020-08-02T08:02:12Z</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2020-08-02T08:02:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bd0b33b24897ba9ddad221e8ac5b6f0e38a2e004'/>
<id>urn:sha1:bd0b33b24897ba9ddad221e8ac5b6f0e38a2e004</id>
<content type='text'>
Resolved kernel/bpf/btf.c using instructions from merge commit
69138b34a7248d2396ab85c8652e20c0c39beaba

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: Fix map leak in HASH_OF_MAPS map</title>
<updated>2020-07-29T23:30:22Z</updated>
<author>
<name>Andrii Nakryiko</name>
<email>andriin@fb.com</email>
</author>
<published>2020-07-29T04:09:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1d4e1eab456e1ee92a94987499b211db05f900ea'/>
<id>urn:sha1:1d4e1eab456e1ee92a94987499b211db05f900ea</id>
<content type='text'>
Fix HASH_OF_MAPS bug of not putting inner map pointer on bpf_map_elem_update()
operation. This is due to per-cpu extra_elems optimization, which bypassed
free_htab_elem() logic doing proper clean ups. Make sure that inner map is put
properly in optimized case as well.

Fixes: 8c290e60fa2a ("bpf: fix hashmap extra_elems logic")
Signed-off-by: Andrii Nakryiko &lt;andriin@fb.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Song Liu &lt;songliubraving@fb.com&gt;
Link: https://lore.kernel.org/bpf/20200729040913.2815687-1-andriin@fb.com
</content>
</entry>
<entry>
<title>bpf: Implement bpf iterator for hash maps</title>
<updated>2020-07-26T03:16:33Z</updated>
<author>
<name>Yonghong Song</name>
<email>yhs@fb.com</email>
</author>
<published>2020-07-23T18:41:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d6c4503cc29638f328e1a6e6fefbdbda401c28fc'/>
<id>urn:sha1:d6c4503cc29638f328e1a6e6fefbdbda401c28fc</id>
<content type='text'>
The bpf iterators for hash, percpu hash, lru hash
and lru percpu hash are implemented. During link time,
bpf_iter_reg-&gt;check_target() will check map type
and ensure the program access key/value region is
within the map defined key/value size limit.

For percpu hash and lru hash maps, the bpf program
will receive values for all cpus. The map element
bpf iterator infrastructure will prepare value
properly before passing the value pointer to the
bpf program.

This patch set supports readonly map keys and
read/write map values. It does not support deleting
map elements, e.g., from hash tables. If there is
a user case for this, the following mechanism can
be used to support map deletion for hashtab, etc.
  - permit a new bpf program return value, e.g., 2,
    to let bpf iterator know the map element should
    be removed.
  - since bucket lock is taken, the map element will be
    queued.
  - once bucket lock is released after all elements under
    this bucket are traversed, all to-be-deleted map
    elements can be deleted.

Signed-off-by: Yonghong Song &lt;yhs@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20200723184114.590470-1-yhs@fb.com
</content>
</entry>
</feed>
