<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/bpf/syscall.c, branch v4.6</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.6</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.6'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2016-04-28T21:29:45Z</updated>
<entry>
<title>bpf: fix refcnt overflow</title>
<updated>2016-04-28T21:29:45Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-04-28T01:56:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=92117d8443bc5afacc8d5ba82e541946310f106e'/>
<id>urn:sha1:92117d8443bc5afacc8d5ba82e541946310f106e</id>
<content type='text'>
On a system with &gt;32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK,
the malicious application may overflow 32-bit bpf program refcnt.
It's also possible to overflow map refcnt on 1Tb system.
Impose 32k hard limit which means that the same bpf program or
map cannot be shared by more than 32k processes.

Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: add missing map_flags to bpf_map_show_fdinfo</title>
<updated>2016-03-25T15:36:41Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2016-03-24T23:30:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=322cea2f41adb62c975f46a3242f4e3b43226fa1'/>
<id>urn:sha1:322cea2f41adb62c975f46a3242f4e3b43226fa1</id>
<content type='text'>
Add map_flags attribute to bpf_map_show_fdinfo(), so that tools like
tc can check for them when loading objects from a pinned entry, e.g.
if user intent wrt allocation (BPF_F_NO_PREALLOC) is different to the
pinned object, it can bail out. Follow-up to 6c9059817432 ("bpf:
pre-allocate hash map elements"), so that tc can still support this
with v4.6.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: bpf_stackmap_copy depends on CONFIG_PERF_EVENTS</title>
<updated>2016-03-10T04:26:51Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-03-10T02:56:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b8cdc05173f05d212627b7aba7ec47fa334a79f2'/>
<id>urn:sha1:b8cdc05173f05d212627b7aba7ec47fa334a79f2</id>
<content type='text'>
0-day bot reported build error:
kernel/built-in.o: In function `map_lookup_elem':
&gt;&gt; kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy'
when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not.
Add weak definition to resolve it.
This code path in map_lookup_elem() is never taken
when CONFIG_PERF_EVENTS is not set.

Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
Reported-by: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: convert stackmap to pre-allocation</title>
<updated>2016-03-08T20:28:31Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-03-08T05:57:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=557c0c6e7df8e14a46bd7560d193fa5bbc00a858'/>
<id>urn:sha1:557c0c6e7df8e14a46bd7560d193fa5bbc00a858</id>
<content type='text'>
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.

The call_rcu is no longer feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However the very unlikley race between user space map_lookup() and
program-side recycling is possible:
     cpu0                          cpu1
     ----                          ----
user does lookup(stackidX)
starts copying ips into buffer
                                   delete(stackidX)
                                   calls bpf_get_stackid()
				   which recyles the element and
                                   overwrites with new stack trace

To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack trace.

Note that similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.

Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: pre-allocate hash map elements</title>
<updated>2016-03-08T20:28:31Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-03-08T05:57:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6c90598174322b8888029e40dd84a4eb01f56afe'/>
<id>urn:sha1:6c90598174322b8888029e40dd84a4eb01f56afe</id>
<content type='text'>
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree-&gt;spin_lock(kmem_cache_node-&gt;lock)...spin_unlock-&gt;kprobe-&gt;
bpf_prog-&gt;map_update-&gt;kmalloc-&gt;spin_lock(of the same kmem_cache_node-&gt;lock)
and deadlocks.

The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.

Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.

While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.

Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.

Since all hash map elements are now pre-allocated we can remove
atomic increment of htab-&gt;count and save few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.

Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida	2.1M
pcpu_ida nolock	2.3M
bt		2.4M
kmalloc		1.8M
hlist+spinlock	2.3M
pcpu_freelist	2.6M

4 cpu:
pcpu_ida	1.5M
pcpu_ida nolock	1.8M
bt w/smp_align	1.7M
bt no/smp_align	1.1M
kmalloc		0.7M
hlist+spinlock	0.2M
pcpu_freelist	2.0M

8 cpu:
pcpu_ida	0.7M
bt w/smp_align	0.8M
kmalloc		0.4M
pcpu_freelist	1.5M

32 cpu:
kmalloc		0.13M
pcpu_freelist	0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simlified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist

hlist+spinlock is the simplest free list with single spinlock.
As expeceted it has very bad scaling in SMP.

kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map-&gt;max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: prevent kprobe+bpf deadlocks</title>
<updated>2016-03-08T20:28:30Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-03-08T05:57:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b121d1e74d1f24654bdc3165d3db1ca149501356'/>
<id>urn:sha1:b121d1e74d1f24654bdc3165d3db1ca149501356</id>
<content type='text'>
if kprobe is placed within update or delete hash map helpers
that hold bucket spin lock and triggered bpf program is trying to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending existing recursion prevention mechanism.

Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: add lookup/update support for per-cpu hash and array maps</title>
<updated>2016-02-06T08:34:36Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-02-02T06:39:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=15a07b33814d14ca817887dbea8530728dc0fbe4'/>
<id>urn:sha1:15a07b33814d14ca817887dbea8530728dc0fbe4</id>
<content type='text'>
The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all-cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.

Example of single counter aggregation in user space:
  unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
  long values[nr_cpus];
  long value = 0;

  bpf_lookup_elem(fd, key, values);
  for (i = 0; i &lt; nr_cpus; i++)
    value += values[i];

The user space must provide round_up(value_size, 8) * nr_cpus
array to get/set values, since kernel will use 'long' copy
of per-cpu values to try to copy good counters atomically.
It's a best-effort, since bpf programs and user space are racing
to access the same memory.

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2015-12-04T02:09:12Z</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2015-12-04T02:03:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f188b951f33a0464338f94f928338f84fc0e4392'/>
<id>urn:sha1:f188b951f33a0464338f94f928338f84fc0e4392</id>
<content type='text'>
Conflicts:
	drivers/net/ethernet/renesas/ravb_main.c
	kernel/bpf/syscall.c
	net/ipv4/ipmr.c

All three conflicts were cases of overlapping changes.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: fix allocation warnings in bpf maps and integer overflow</title>
<updated>2015-12-03T04:36:00Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@kernel.org</email>
</author>
<published>2015-11-30T00:59:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=01b3f52157ff5a47d6d8d796f396a4b34a53c61d'/>
<id>urn:sha1:01b3f52157ff5a47d6d8d796f396a4b34a53c61d</id>
<content type='text'>
For large map-&gt;value_size the user space can trigger memory allocation warnings like:
WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
__alloc_pages_nodemask+0x695/0x14e0()
Call Trace:
 [&lt;     inline     &gt;] __dump_stack lib/dump_stack.c:15
 [&lt;ffffffff82743b56&gt;] dump_stack+0x68/0x92 lib/dump_stack.c:50
 [&lt;ffffffff81244ec9&gt;] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
 [&lt;ffffffff812450f9&gt;] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
 [&lt;     inline     &gt;] __alloc_pages_slowpath mm/page_alloc.c:2989
 [&lt;ffffffff81554e95&gt;] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
 [&lt;ffffffff816188fe&gt;] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
 [&lt;     inline     &gt;] alloc_pages include/linux/gfp.h:451
 [&lt;ffffffff81550706&gt;] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
 [&lt;ffffffff815a1c89&gt;] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
 [&lt;ffffffff815a1cef&gt;] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
 [&lt;     inline     &gt;] kmalloc_large include/linux/slab.h:390
 [&lt;ffffffff81627784&gt;] __kmalloc+0x234/0x250 mm/slub.c:3525
 [&lt;     inline     &gt;] kmalloc include/linux/slab.h:463
 [&lt;     inline     &gt;] map_update_elem kernel/bpf/syscall.c:288
 [&lt;     inline     &gt;] SYSC_bpf kernel/bpf/syscall.c:744

To avoid never succeeding kmalloc with order &gt;= MAX_ORDER check that
elem-&gt;value_size and computed elem_size are within limits for both hash and
array type maps.
Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size &lt;= 512,
so keep those kmalloc-s as-is.

Large value_size can cause integer overflows in elem_size and map.pages
formulas, so check for that as well.

Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: fix clearing on persistent program array maps</title>
<updated>2015-11-25T17:14:09Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2015-11-24T20:28:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c9da161c6517ba12154059d3b965c2cbaf16f90f'/>
<id>urn:sha1:c9da161c6517ba12154059d3b965c2cbaf16f90f</id>
<content type='text'>
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.

Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.

Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.

Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.

Joint work with Alexei.

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
