<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/net/core/skmsg.c, branch v5.7</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.7</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.7'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-02-25T00:20:09Z</updated>
<entry>
<title>bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites.</title>
<updated>2020-02-25T00:20:09Z</updated>
<author>
<name>David Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2020-02-24T14:01:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3d9f773cf2876c01a505b9fe27270901d464e90a'/>
<id>urn:sha1:3d9f773cf2876c01a505b9fe27270901d464e90a</id>
<content type='text'>
All of these cases are strictly of the form:

	preempt_disable();
	BPF_PROG_RUN(...);
	preempt_enable();

Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
with:

	migrate_disable();
	BPF_PROG_RUN(...);
	migrate_enable();

On non RT enabled kernels this maps to preempt_disable/enable() and on RT
enabled kernels this solely prevents migration, which is sufficient as
there is no requirement to prevent reentrancy to any BPF program from a
preempting task. The only requirement is that the program stays on the same
CPU.

Therefore, this is a trivially correct transformation.

The seccomp loop does not need protection over the loop. It only needs
protection per BPF filter program

[ tglx: Converted to bpf_prog_run_pin_on_cpu() ]

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de
</content>
</entry>
<entry>
<title>net, sk_msg: Clear sk_user_data pointer on clone if tagged</title>
<updated>2020-02-21T21:29:45Z</updated>
<author>
<name>Jakub Sitnicki</name>
<email>jakub@cloudflare.com</email>
</author>
<published>2020-02-18T17:10:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f1ff5ce2cd5ef3335f19c0f6576582c87045b04f'/>
<id>urn:sha1:f1ff5ce2cd5ef3335f19c0f6576582c87045b04f</id>
<content type='text'>
sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at reference-counted
object, like struct sk_psock.

One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.

The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying, we don't hold any locks at that time.

This approach relies on an assumption that sk_user_data holds a pointer to
an object aligned at least 2 bytes. A manual audit of existing users of
rcu_dereference_sk_user_data helper confirms our assumption.

Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
char value or a pathological case of "struct { char c; }". To be safe, warn
when the flag-bit is set when setting sk_user_data to catch any future
misuses.

It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.

Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.

Signed-off-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
</content>
</entry>
<entry>
<title>net, sk_msg: Don't check if sock is locked when tearing down psock</title>
<updated>2020-01-22T19:30:20Z</updated>
<author>
<name>Jakub Sitnicki</name>
<email>jakub@cloudflare.com</email>
</author>
<published>2020-01-21T12:31:47Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=58c8db929db1c1d785a6f5d8f8692e5dbcc35e84'/>
<id>urn:sha1:58c8db929db1c1d785a6f5d8f8692e5dbcc35e84</id>
<content type='text'>
As John Fastabend reports [0], psock state tear-down can happen on receive
path *after* unlocking the socket, if the only other psock user, that is
sockmap or sockhash, releases its psock reference before tcp_bpf_recvmsg
does so:

 tcp_bpf_recvmsg()
  psock = sk_psock_get(sk)                         &lt;- refcnt 2
  lock_sock(sk);
  ...
                                  sock_map_free()  &lt;- refcnt 1
  release_sock(sk)
  sk_psock_put()                                   &lt;- refcnt 0

Remove the lockdep check for socket lock in psock tear-down that got
introduced in 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during
tear down").

[0] https://lore.kernel.org/netdev/5e25dc995d7d_74082aaee6e465b441@john-XPS-13-9370.notmuch/

Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
Reported-by: syzbot+d73682fcf7fee6982fe3@syzkaller.appspotmail.com
Suggested-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Acked-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: Sockmap, ensure sock lock held during tear down</title>
<updated>2020-01-15T22:26:13Z</updated>
<author>
<name>John Fastabend</name>
<email>john.fastabend@gmail.com</email>
</author>
<published>2020-01-11T06:12:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7e81a35302066c5a00b4c72d83e3ea4cad6eeb5b'/>
<id>urn:sha1:7e81a35302066c5a00b4c72d83e3ea4cad6eeb5b</id>
<content type='text'>
The sock_map_free() and sock_hash_free() paths used to delete sockmap
and sockhash maps walk the maps and destroy psock and bpf state associated
with the socks in the map. When done the socks no longer have BPF programs
attached and will function normally. This can happen while the socks in
the map are still "live" meaning data may be sent/received during the walk.

Currently, though we don't take the sock_lock when the psock and bpf state
is removed through this path. Specifically, this means we can be writing
into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
while they are also being called from the networking side. This is not
safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
believed it was safe. Further its not clear to me its even a good idea
to try and do this on "live" sockets while networking side might also
be using the socket. Instead of trying to reason about using the socks
from both sides lets realize that every use case I'm aware of rarely
deletes maps, in fact kubernetes/Cilium case builds map at init and
never tears it down except on errors. So lets do the simple fix and
grab sock lock.

This patch wraps sock deletes from maps in sock lock and adds some
annotations so we catch any other cases easier.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Song Liu &lt;songliubraving@fb.com&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com
</content>
</entry>
<entry>
<title>net: skmsg: fix TLS 1.3 crash with full sk_msg</title>
<updated>2019-11-29T06:40:29Z</updated>
<author>
<name>Jakub Kicinski</name>
<email>jakub.kicinski@netronome.com</email>
</author>
<published>2019-11-27T20:16:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=031097d9e079e40dce401031d1012e83d80eaf01'/>
<id>urn:sha1:031097d9e079e40dce401031d1012e83d80eaf01</id>
<content type='text'>
TLS 1.3 started using the entry at the end of the SG array
for chaining-in the single byte content type entry. This mostly
works:

[ E E E E E E . . ]
  ^           ^
   start       end

                 E &lt; content type
               /
[ E E E E E E C . ]
  ^           ^
   start       end

(Where E denotes a populated SG entry; C denotes a chaining entry.)

If the array is full, however, the end will point to the start:

[ E E E E E E E E ]
  ^
   start
   end

And we end up overwriting the start:

    E &lt; content type
   /
[ C E E E E E E E ]
  ^
   start
   end

The sg array is supposed to be a circular buffer with start and
end markers pointing anywhere. In case where start &gt; end
(i.e. the circular buffer has "wrapped") there is an extra entry
reserved at the end to chain the two halves together.

[ E E E E E E . . l ]

(Where l is the reserved entry for "looping" back to front.

As suggested by John, let's reserve another entry for chaining
SG entries after the main circular buffer. Note that this entry
has to be pointed to by the end entry so its position is not fixed.

Examples of full messages:

[ E E E E E E E E . l ]
  ^               ^
   start           end

   &lt;---------------.
[ E E . E E E E E E l ]
      ^ ^
   end   start

Now the end will always point to an unused entry, so TLS 1.3
can always use it.

Fixes: 130b392c6cd6 ("net: tls: Add tls 1.3 support")
Signed-off-by: Jakub Kicinski &lt;jakub.kicinski@netronome.com&gt;
Reviewed-by: Simon Horman &lt;simon.horman@netronome.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: skmsg, fix potential psock NULL pointer dereference</title>
<updated>2019-11-21T21:43:45Z</updated>
<author>
<name>John Fastabend</name>
<email>john.fastabend@gmail.com</email>
</author>
<published>2019-11-21T16:25:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8163999db445021f2651a8a47b5632483e8722ea'/>
<id>urn:sha1:8163999db445021f2651a8a47b5632483e8722ea</id>
<content type='text'>
Report from Dan Carpenter,

 net/core/skmsg.c:792 sk_psock_write_space()
 error: we previously assumed 'psock' could be null (see line 790)

 net/core/skmsg.c
   789 psock = sk_psock(sk);
   790 if (likely(psock &amp;&amp; sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)))
 Check for NULL
   791 schedule_work(&amp;psock-&gt;work);
   792 write_space = psock-&gt;saved_write_space;
                     ^^^^^^^^^^^^^^^^^^^^^^^^
   793          rcu_read_unlock();
   794          write_space(sk);

Ensure psock dereference on line 792 only occurs if psock is not null.

Reported-by: Dan Carpenter &lt;dan.carpenter@oracle.com&gt;
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net/tls: fix sk_msg trim on fallback to copy mode</title>
<updated>2019-11-06T02:07:47Z</updated>
<author>
<name>Jakub Kicinski</name>
<email>jakub.kicinski@netronome.com</email>
</author>
<published>2019-11-04T23:36:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=683916f6a84023407761d843048f1aea486b2612'/>
<id>urn:sha1:683916f6a84023407761d843048f1aea486b2612</id>
<content type='text'>
sk_msg_trim() tries to only update curr pointer if it falls into
the trimmed region. The logic, however, does not take into the
account pointer wrapping that sk_msg_iter_var_prev() does nor
(as John points out) the fact that msg-&gt;sg is a ring buffer.

This means that when the message was trimmed completely, the new
curr pointer would have the value of MAX_MSG_FRAGS - 1, which is
neither smaller than any other value, nor would it actually be
correct.

Special case the trimming to 0 length a little bit and rework
the comparison between curr and end to take into account wrapping.

This bug caused the TLS code to not copy all of the message, if
zero copy filled in fewer sg entries than memcopy would need.

Big thanks to Alexander Potapenko for the non-KMSAN reproducer.

v2:
 - take into account that msg-&gt;sg is a ring buffer (John).

Link: https://lore.kernel.org/netdev/20191030160542.30295-1-jakub.kicinski@netronome.com/ (v1)

Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
Reported-by: syzbot+f8495bff23a879a6d0bd@syzkaller.appspotmail.com
Reported-by: syzbot+6f50c99e8f6194bf363f@syzkaller.appspotmail.com
Co-developed-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Jakub Kicinski &lt;jakub.kicinski@netronome.com&gt;
Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net/core/skmsg: Delete an unnecessary check before the function call “consume_skb”</title>
<updated>2019-08-24T23:24:53Z</updated>
<author>
<name>Markus Elfring</name>
<email>elfring@users.sourceforge.net</email>
</author>
<published>2019-08-22T16:00:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dd016aca28f67603f32c4e666805db519df2120a'/>
<id>urn:sha1:dd016aca28f67603f32c4e666805db519df2120a</id>
<content type='text'>
The consume_skb() function performs also input parameter validation.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring &lt;elfring@users.sourceforge.net&gt;
Acked-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: sockmap/tls, close can race with map free</title>
<updated>2019-07-22T14:04:17Z</updated>
<author>
<name>John Fastabend</name>
<email>john.fastabend@gmail.com</email>
</author>
<published>2019-07-19T17:29:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=95fa145479fbc0a0c1fd3274ceb42ec03c042a4a'/>
<id>urn:sha1:95fa145479fbc0a0c1fd3274ceb42ec03c042a4a</id>
<content type='text'>
When a map free is called and in parallel a socket is closed we
have two paths that can potentially reset the socket prot ops, the
bpf close() path and the map free path. This creates a problem
with which prot ops should be used from the socket closed side.

If the map_free side completes first then we want to call the
original lowest level ops. However, if the tls path runs first
we want to call the sockmap ops. Additionally there was no locking
around prot updates in TLS code paths so the prot ops could
be changed multiple times once from TLS path and again from sockmap
side potentially leaving ops pointed at either TLS or sockmap
when psock and/or tls context have already been destroyed.

To fix this race first only update ops inside callback lock
so that TLS, sockmap and lowest level all agree on prot state.
Second and a ULP callback update() so that lower layers can
inform the upper layer when they are being removed allowing the
upper layer to reset prot ops.

This gets us close to allowing sockmap and tls to be stacked
in arbitrary order but will save that patch for *next trees.

v4:
 - make sure we don't free things for device;
 - remove the checks which swap the callbacks back
   only if TLS is at the top.

Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Jakub Kicinski &lt;jakub.kicinski@netronome.com&gt;
Reviewed-by: Dirk van der Merwe &lt;dirk.vandermerwe@netronome.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>bpf: sockmap fix msg-&gt;sg.size account on ingress skb</title>
<updated>2019-05-13T23:31:43Z</updated>
<author>
<name>John Fastabend</name>
<email>john.fastabend@gmail.com</email>
</author>
<published>2019-05-13T14:19:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=cabede8b4f2b746232aa25730a0b752de1cb82ca'/>
<id>urn:sha1:cabede8b4f2b746232aa25730a0b752de1cb82ca</id>
<content type='text'>
When converting a skb to msg-&gt;sg we forget to set the size after the
latest ktls/tls code conversion. This patch can be reached by doing
a redir into ingress path from BPF skb sock recv hook. Then trying to
read the size fails.

Fix this by setting the size.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
</feed>
