<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/bpf, branch v5.1</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.1</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.1'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2019-04-26T00:20:06Z</updated>
<entry>
<title>bpf: mark registers in all frames after pkt/null checks</title>
<updated>2019-04-26T00:20:06Z</updated>
<author>
<name>Paul Chaignon</name>
<email>paul.chaignon@orange.com</email>
</author>
<published>2019-04-24T19:50:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c6a9efa1d8353d8960d152e7d469d952b01495c0'/>
<id>urn:sha1:c6a9efa1d8353d8960d152e7d469d952b01495c0</id>
<content type='text'>
In case of a null check on a pointer inside a subprog, we should mark all
registers with this pointer as either safe or unknown, in both the current
and previous frames.  Currently, only spilled registers and registers in
the current frame are marked.  Packet bound checks in subprogs have the
same issue.  This patch fixes it to mark registers in previous frames as
well.

A good reproducer for null checks looks as follow:

1: ptr = bpf_map_lookup_elem(map, &amp;key);
2: ret = subprog(ptr) {
3:   return ptr != NULL;
4: }
5: if (ret)
6:   value = *ptr;

With the above, the verifier will complain on line 6 because it sees ptr
as map_value_or_null despite the null check in subprog 1.

Note that this patch fixes another resulting bug when using
bpf_sk_release():

1: sk = bpf_sk_lookup_tcp(...);
2: subprog(sk) {
3:   if (sk)
4:     bpf_sk_release(sk);
5: }
6: if (!sk)
7:   return 0;
8: return 1;

In the above, mark_ptr_or_null_regs will warn on line 6 because it will
try to free the reference state, even though it was already freed on
line 3.

Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
Signed-off-by: Paul Chaignon &lt;paul.chaignon@orange.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>xdp: fix cpumap redirect SKB creation bug</title>
<updated>2019-03-29T19:15:02Z</updated>
<author>
<name>Jesper Dangaard Brouer</name>
<email>brouer@redhat.com</email>
</author>
<published>2019-03-29T09:18:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=676e4a6fe703f2dae699ee9d56f14516f9ada4ea'/>
<id>urn:sha1:676e4a6fe703f2dae699ee9d56f14516f9ada4ea</id>
<content type='text'>
We want to avoid leaking pointer info from xdp_frame (that is placed in
top of frame) like commit 6dfb970d3dbd ("xdp: avoid leaking info stored in
frame data on page reuse"), and followup commit 97e19cce05e5 ("bpf:
reserve xdp_frame size in xdp headroom") that reserve this headroom.

These changes also affected how cpumap constructed SKBs, as xdpf-&gt;headroom
size changed, the skb data starting point were in-effect shifted with 32
bytes (sizeof xdp_frame). This was still okay, as the cpumap frame_size
calculation also included xdpf-&gt;headroom which were reduced by same amount.

A bug was introduced in commit 77ea5f4cbe20 ("bpf/cpumap: make sure
frame_size for build_skb is aligned if headroom isn't"), where the
xdpf-&gt;headroom became part of the SKB_DATA_ALIGN rounding up. This
round-up to find the frame_size is in principle still correct as it does
not exceed the 2048 bytes frame_size (which is max for ixgbe and i40e),
but the 32 bytes offset of pkt_data_start puts this over the 2048 bytes
limit. This cause skb_shared_info to spill into next frame. It is a little
hard to trigger, as the SKB need to use above 15 skb_shinfo-&gt;frags[] as
far as I calculate. This does happen in practise for TCP streams when
skb_try_coalesce() kicks in.

KASAN can be used to detect these wrong memory accesses, I've seen:
 BUG: KASAN: use-after-free in skb_try_coalesce+0x3cb/0x760
 BUG: KASAN: wild-memory-access in skb_release_data+0xe2/0x250

Driver veth also construct a SKB from xdp_frame in this way, but is not
affected, as it doesn't reserve/deduct the room (used by xdp_frame) from
the SKB headroom. Instead is clears the pointers via xdp_scrub_frame(),
and allows SKB to use this area.

The fix in this patch is to do like veth and instead allow SKB to (re)use
the area occupied by xdp_frame, by clearing via xdp_scrub_frame().  (This
does kill the idea of the SKB being able to access (mem) info from this
area, but I guess it was a bad idea anyhow, and it was already killed by
the veth changes.)

Fixes: 77ea5f4cbe20 ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't")
Signed-off-by: Jesper Dangaard Brouer &lt;brouer@redhat.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: remove incorrect 'verifier bug' warning</title>
<updated>2019-03-26T20:02:16Z</updated>
<author>
<name>Paul Chaignon</name>
<email>paul.chaignon@orange.com</email>
</author>
<published>2019-03-20T12:58:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=927cb78177ae3773d0d27404566a93cb8e88890c'/>
<id>urn:sha1:927cb78177ae3773d0d27404566a93cb8e88890c</id>
<content type='text'>
The BPF verifier checks the maximum number of call stack frames twice,
first in the main CFG traversal (do_check) and then in a subsequent
traversal (check_max_stack_depth).  If the second check fails, it logs a
'verifier bug' warning and errors out, as the number of call stack frames
should have been verified already.

However, the second check may fail without indicating a verifier bug: if
the excessive function calls reside in dead code, the main CFG traversal
may not visit them; the subsequent traversal visits all instructions,
including dead code.

This case raises the question of how invalid dead code should be treated.
This patch implements the conservative option and rejects such code.

Signed-off-by: Paul Chaignon &lt;paul.chaignon@orange.com&gt;
Tested-by: Xiao Han &lt;xiao.han@orange.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: fix use after free in bpf_evict_inode</title>
<updated>2019-03-26T00:38:49Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2019-03-25T14:54:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1da6c4d9140cb7c13e87667dc4e1488d6c8fc10f'/>
<id>urn:sha1:1da6c4d9140cb7c13e87667dc4e1488d6c8fc10f</id>
<content type='text'>
syzkaller was able to generate the following UAF in bpf:

  BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
  BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
  Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423

  CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+
  #110
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
  Google 01/01/2011
  Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x244/0x39d lib/dump_stack.c:113
    print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
    kasan_report_error mm/kasan/report.c:354 [inline]
    kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
    lookup_last fs/namei.c:2269 [inline]
    path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
    filename_lookup+0x26a/0x520 fs/namei.c:2348
    user_path_at_empty+0x40/0x50 fs/namei.c:2608
    user_path include/linux/namei.h:62 [inline]
    do_mount+0x180/0x1ff0 fs/namespace.c:2980
    ksys_mount+0x12d/0x140 fs/namespace.c:3258
    __do_sys_mount fs/namespace.c:3272 [inline]
    __se_sys_mount fs/namespace.c:3269 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x457569
  Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
  48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 &lt;48&gt; 3d 01 f0 ff
  ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
  RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
  RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
  RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
  R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
  R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff

  Allocated by task 9424:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
    __do_kmalloc mm/slab.c:3722 [inline]
    __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
    kstrdup+0x39/0x70 mm/util.c:49
    bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
    vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
    do_symlinkat+0x242/0x2d0 fs/namei.c:4154
    __do_sys_symlink fs/namei.c:4173 [inline]
    __se_sys_symlink fs/namei.c:4171 [inline]
    __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

  Freed by task 9425:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
    kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
    __cache_free mm/slab.c:3498 [inline]
    kfree+0xcf/0x230 mm/slab.c:3817
    bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
    evict+0x4b9/0x980 fs/inode.c:558
    iput_final fs/inode.c:1550 [inline]
    iput+0x674/0xa90 fs/inode.c:1576
    do_unlinkat+0x733/0xa30 fs/namei.c:4069
    __do_sys_unlink fs/namei.c:4110 [inline]
    __se_sys_unlink fs/namei.c:4108 [inline]
    __x64_sys_unlink+0x42/0x50 fs/namei.c:4108
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

In this scenario path lookup under RCU is racing with the final
unlink in case of symlinks. As Linus puts it in his analysis:

  [...] We actually RCU-delay the inode freeing itself, but
  when we do the final iput(), the "evict()" function is called
  synchronously. Now, the simple fix would seem to just RCU-delay
  the kfree() of the symlink data in bpf_evict_inode(). Maybe
  that's the right thing to do. [...]

Al suggested to piggy-back on the -&gt;destroy_inode() callback in
order to implement RCU deferral there which can then kfree() the
inode-&gt;i_link eventually right before putting inode back into
inode cache. By reusing free_inode_nonrcu() from there we can
avoid the need for our own inode cache and just reuse generic
one as we currently do.

And in-fact on top of all this we should just get rid of the
bpf_evict_inode() entirely. This means truncate_inode_pages_final()
and clear_inode() will then simply be called by the fs core via
evict(). Dropping the reference should really only be done when
inode is unhashed and nothing reachable anymore, so it's better
also moved into the final -&gt;destroy_inode() callback.

Fixes: 0f98621bef5d ("bpf, inode: add support for symlinks and fix mtime/ctime")
Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com
Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com
Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com
Suggested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Analyzed-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Acked-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/
</content>
</entry>
<entry>
<title>bpf: verifier: propagate liveness on all frames</title>
<updated>2019-03-22T02:57:02Z</updated>
<author>
<name>Jakub Kicinski</name>
<email>jakub.kicinski@netronome.com</email>
</author>
<published>2019-03-21T21:34:36Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=83d163124cf1104cca5b668d5fe6325715a60855'/>
<id>urn:sha1:83d163124cf1104cca5b668d5fe6325715a60855</id>
<content type='text'>
Commit 7640ead93924 ("bpf: verifier: make sure callees don't prune
with caller differences") connected up parentage chains of all
frames of the stack.  It didn't, however, ensure propagate_liveness()
propagates all liveness information along those chains.

This means pruning happening in the callee may generate explored
states with incomplete liveness for the chains in lower frames
of the stack.

The included selftest is similar to the prior one from commit
7640ead93924 ("bpf: verifier: make sure callees don't prune with
caller differences"), where callee would prune regardless of the
difference in r8 state.

Now we also initialize r9 to 0 or 1 based on a result from get_random().
r9 is never read so the walk with r9 = 0 gets pruned (correctly) after
the walk with r9 = 1 completes.

The selftest is so arranged that the pruning will happen in the
callee.  Since callee does not propagate read marks of r8, the
explored state at the pruning point prior to the callee will
now ignore r8.

Propagate liveness on all frames of the stack when pruning.

Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
Signed-off-by: Jakub Kicinski &lt;jakub.kicinski@netronome.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: do not restore dst_reg when cur_state is freed</title>
<updated>2019-03-21T11:18:18Z</updated>
<author>
<name>Xu Yu</name>
<email>xuyu@linux.alibaba.com</email>
</author>
<published>2019-03-21T10:00:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0803278b0b4d8eeb2b461fb698785df65a725d9e'/>
<id>urn:sha1:0803278b0b4d8eeb2b461fb698785df65a725d9e</id>
<content type='text'>
Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.

Call trace:

  dump_stack+0xbf/0x12e
  print_address_description+0x6a/0x280
  kasan_report+0x237/0x360
  sanitize_ptr_alu+0x85a/0x8d0
  adjust_ptr_min_max_vals+0x8f2/0x1ca0
  adjust_reg_min_max_vals+0x8ed/0x22e0
  do_check+0x1ca6/0x5d00
  bpf_check+0x9ca/0x2570
  bpf_prog_load+0xc91/0x1030
  __se_sys_bpf+0x61e/0x1f00
  do_syscall_64+0xc8/0x550
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fault injection trace:

  kfree+0xea/0x290
  free_func_state+0x4a/0x60
  free_verifier_state+0x61/0xe0
  push_stack+0x216/0x2f0	          &lt;- inject failslab
  sanitize_ptr_alu+0x2b1/0x8d0
  adjust_ptr_min_max_vals+0x8f2/0x1ca0
  adjust_reg_min_max_vals+0x8ed/0x22e0
  do_check+0x1ca6/0x5d00
  bpf_check+0x9ca/0x2570
  bpf_prog_load+0xc91/0x1030
  __se_sys_bpf+0x61e/0x1f00
  do_syscall_64+0xc8/0x550
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

When kzalloc() fails in push_stack(), free_verifier_state() will free
current verifier state. As push_stack() returns, dst_reg was restored
if ptr_is_dst_reg is false. However, as member of the cur_state,
dst_reg is also freed, and error occurs when dereferencing dst_reg.
Simply fix it by testing ret of push_stack() before restoring dst_reg.

Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Signed-off-by: Xu Yu &lt;xuyu@linux.alibaba.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>bpf: Only print ref_obj_id for refcounted reg</title>
<updated>2019-03-21T01:24:35Z</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2019-03-18T17:37:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=cba368c1f01c27ed62fca7a853531845d263bb01'/>
<id>urn:sha1:cba368c1f01c27ed62fca7a853531845d263bb01</id>
<content type='text'>
Naresh reported that test_align fails because of the mismatch at the
verbose printout of the register states.  The reason is due to the newly
added ref_obj_id.

ref_obj_id is only useful for refcounted reg.  Thus, this patch fixes it
by only printing ref_obj_id for refcounted reg.  While at it, it also uses
comma instead of space to separate between "id" and "ref_obj_id".

Fixes: 1b986589680a ("bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release")
Reported-by: Naresh Kamboju &lt;naresh.kamboju@linaro.org&gt;
Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Acked-by: Andrii Nakryiko &lt;andriin@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Try harder when allocating memory for large maps</title>
<updated>2019-03-18T15:48:25Z</updated>
<author>
<name>Martynas Pumputis</name>
<email>m@lambda.lt</email>
</author>
<published>2019-03-18T15:10:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f01a7dbe98ae4265023fa5d3af0f076f0b18a647'/>
<id>urn:sha1:f01a7dbe98ae4265023fa5d3af0f076f0b18a647</id>
<content type='text'>
It has been observed that sometimes a higher order memory allocation
for BPF maps fails when there is no obvious memory pressure in a system.
E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
could not be created due to vmalloc unable to allocate 75497472B,
when the system's memory consumption (in MB) was the following:

    Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727

Later analysis [1] by Michal Hocko showed that the vmalloc was not trying
to reclaim memory from the page cache and was failing prematurely due to
__GFP_NORETRY.

Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by
__GFP_RETRY_MAYFAIL with more useful semantic") and [1], we can replace
__GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer
and will try harder to fulfil allocation requests.

Unfortunately, replacing the body of the BPF map memory allocation
function with the kvmalloc_node helper function is not an option at
this point in time, given 1) kmalloc is non-optional for higher order
allocations, and 2) passing __GFP_RETRY_MAYFAIL to the kmalloc would
stress the slab allocator too much for large requests.

The change has been tested with the workloads mentioned above and by
observing oom_kill value from /proc/vmstat.

[1]: https://lore.kernel.org/bpf/20190310071318.GW5232@dhcp22.suse.cz/

Signed-off-by: Martynas Pumputis &lt;m@lambda.lt&gt;
Acked-by: Yonghong Song &lt;yhs@fb.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Link: https://lore.kernel.org/bpf/20190318153940.GL8924@dhcp22.suse.cz/
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf</title>
<updated>2019-03-16T19:20:08Z</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2019-03-16T19:20:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0aedadcf6b4863a0d6eaad05a26425cc52944027'/>
<id>urn:sha1:0aedadcf6b4863a0d6eaad05a26425cc52944027</id>
<content type='text'>
Daniel Borkmann says:

====================
pull-request: bpf 2019-03-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a umem memory leak on cleanup in AF_XDP, from Björn.

2) Fix BTF to properly resolve forward-declared enums into their corresponding
   full enum definition types during deduplication, from Andrii.

3) Fix libbpf to reject invalid flags in xsk_socket__create(), from Magnus.

4) Fix accessing invalid pointer returned from bpf_tcp_sock() and
   bpf_sk_fullsock() after bpf_sk_release() was called, from Martin.

5) Fix generation of load/store DW instructions in PPC JIT, from Naveen.

6) Various fixes in BPF helper function documentation in bpf.h UAPI header
   used to bpf-helpers(7) man page, from Quentin.

7) Fix segfault in BPF test_progs when prog loading failed, from Yonghong.
====================

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release</title>
<updated>2019-03-13T19:04:35Z</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2019-03-12T17:23:02Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1b986589680a2a5b6fc1ac196ea69925a93d9dd9'/>
<id>urn:sha1:1b986589680a2a5b6fc1ac196ea69925a93d9dd9</id>
<content type='text'>
Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
can still be accessed after bpf_sk_release(sk).
Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
This patch addresses them together.

A simple reproducer looks like this:

	sk = bpf_sk_lookup_tcp();
	/* if (!sk) ... */
	tp = bpf_tcp_sock(sk);
	/* if (!tp) ... */
	bpf_sk_release(sk);
	snd_cwnd = tp-&gt;snd_cwnd; /* oops! The verifier does not complain. */

The problem is the verifier did not scrub the register's states of
the tcp_sock ptr (tp) after bpf_sk_release(sk).

[ Note that when calling bpf_tcp_sock(sk), the sk is not always
  refcount-acquired. e.g. bpf_tcp_sock(skb-&gt;sk). The verifier works
  fine for this case. ]

Currently, the verifier does not track if a helper's return ptr (in REG_0)
is "carry"-ing one of its argument's refcount status. To carry this info,
the reg1-&gt;id needs to be stored in reg0.

One approach was tried, like "reg0-&gt;id = reg1-&gt;id", when calling
"bpf_tcp_sock()".  The main idea was to avoid adding another "ref_obj_id"
for the same reg.  However, overlapping the NULL marking and ref
tracking purpose in one "id" does not work well:

	ref_sk = bpf_sk_lookup_tcp();
	fullsock = bpf_sk_fullsock(ref_sk);
	tp = bpf_tcp_sock(ref_sk);
	if (!fullsock) {
	     bpf_sk_release(ref_sk);
	     return 0;
	}
	/* fullsock_reg-&gt;id is marked for NOT-NULL.
	 * Same for tp_reg-&gt;id because they have the same id.
	 */

	/* oops. verifier did not complain about the missing !tp check */
	snd_cwnd = tp-&gt;snd_cwnd;

Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
scrub all reg states which has a ref_obj_id match.  It is done with the
changes in release_reg_references() in this patch.

While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
bpf_sk_fullsock() to avoid these helpers from returning
another ptr. It will make bpf_sk_release(tp) possible:

	sk = bpf_sk_lookup_tcp();
	/* if (!sk) ... */
	tp = bpf_tcp_sock(sk);
	/* if (!tp) ... */
	bpf_sk_release(tp);

A separate helper "bpf_get_listener_sock()" will be added in a later
patch to do sk_to_full_sk().

Misc change notes:
- To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
  from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON.  ARG_PTR_TO_SOCKET
  is removed from bpf.h since no helper is using it.

- arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
  because ARG_PTR_TO_SOCK_COMMON is the only one and skb-&gt;sk is not
  refcounted.  All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
  take ARG_PTR_TO_SOCK_COMMON.

- check_refcount_ok() ensures is_acquire_function() cannot take
  arg_type_may_be_refcounted() as its argument.

- The check_func_arg() can only allow one refcount-ed arg.  It is
  guaranteed by check_refcount_ok() which ensures at most one arg can be
  refcounted.  Hence, it is a verifier internal error if &gt;1 refcount arg
  found in check_func_arg().

- In release_reference(), release_reference_state() is called
  first to ensure a match on "reg-&gt;ref_obj_id" can be found before
  scrubbing the reg states with release_reg_references().

- reg_is_refcounted() is no longer needed.
  1. In mark_ptr_or_null_regs(), its usage is replaced by
     "ref_obj_id &amp;&amp; ref_obj_id == id" because,
     when is_null == true, release_reference_state() should only be
     called on the ref_obj_id obtained by a acquire helper (i.e.
     is_acquire_function() == true).  Otherwise, the following
     would happen:

	sk = bpf_sk_lookup_tcp();
	/* if (!sk) { ... } */
	fullsock = bpf_sk_fullsock(sk);
	if (!fullsock) {
		/*
		 * release_reference_state(fullsock_reg-&gt;ref_obj_id)
		 * where fullsock_reg-&gt;ref_obj_id == sk_reg-&gt;ref_obj_id.
		 *
		 * Hence, the following bpf_sk_release(sk) will fail
		 * because the ref state has already been released in the
		 * earlier release_reference_state(fullsock_reg-&gt;ref_obj_id).
		 */
		bpf_sk_release(sk);
	}

  2. In release_reg_references(), the current reg_is_refcounted() call
     is unnecessary because the id check is enough.

- The type_is_refcounted() and type_is_refcounted_or_null()
  are no longer needed also because reg_is_refcounted() is removed.

Fixes: 655a51e536c0 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
Reported-by: Lorenz Bauer &lt;lmb@cloudflare.com&gt;
Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
</feed>
