<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/include/net/netns, branch v6.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.2</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2022-12-07T04:04:34Z</updated>
<entry>
<title>net: xsk: Don't include &lt;linux/rculist.h&gt;</title>
<updated>2022-12-07T04:04:34Z</updated>
<author>
<name>Christophe JAILLET</name>
<email>christophe.jaillet@wanadoo.fr</email>
</author>
<published>2022-12-03T16:51:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e9b4aeed56699b469206d05e706ddf2db95700a9'/>
<id>urn:sha1:e9b4aeed56699b469206d05e706ddf2db95700a9</id>
<content type='text'>
There is no need to include &lt;linux/rculist.h&gt; here.

Prefer the less invasive &lt;linux/types.h&gt; which is needed for 'hlist_head'.
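
For context, a minimal sketch of the affected header (paraphrased, not a
verbatim copy of include/net/netns/xdp.h): the struct only stores an
hlist_head, and that type is defined in &lt;linux/types.h&gt;, so the heavier
rculist include is unnecessary.

  #include &lt;linux/mutex.h&gt;
  #include &lt;linux/types.h&gt;	/* provides struct hlist_head */

  struct netns_xdp {
  	struct mutex		lock;
  	struct hlist_head	list;	/* only the type is needed, not the rculist API */
  };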

Signed-off-by: Christophe JAILLET &lt;christophe.jaillet@wanadoo.fr&gt;
Acked-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Link: https://lore.kernel.org/r/88d6a1d88764cca328610854f890a9ca1f4b029e.1670086246.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>sctp: add dif and sdif check in asoc and ep lookup</title>
<updated>2022-11-18T11:42:54Z</updated>
<author>
<name>Xin Long</name>
<email>lucien.xin@gmail.com</email>
</author>
<published>2022-11-16T20:01:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0af03170637f47fb5cc6501d4b2dcbf1c14772a9'/>
<id>urn:sha1:0af03170637f47fb5cc6501d4b2dcbf1c14772a9</id>
<content type='text'>
This patch first adds a pernet global, l3mdev_accept, to decide whether to
accept packets from an l3mdev when an SCTP socket is not bound to any
interface. It is set to 1 to avoid any possible compatibility issue, and
the next patch introduces a sysctl that allows changing it.

Then, similar to inet/udp_sk_bound_dev_eq(), sctp_sk_bound_dev_eq() is
added to check whether either dif or sdif equals sk_bound_dev_if, and, if
sk_bound_dev_if is not set, to check that sdif is 0 or l3mdev_accept is 1.
This function is used to match an association or an endpoint; it is called
by sctp_addrs_lookup_transport() and sctp_endpoint_is_match() (a sketch
follows the call tree below). All functions that need updating are:

sctp_rcv():
  asoc:
  __sctp_rcv_lookup()
    __sctp_lookup_association() -&gt; sctp_addrs_lookup_transport()
    __sctp_rcv_lookup_harder()
      __sctp_rcv_init_lookup()
         __sctp_lookup_association() -&gt; sctp_addrs_lookup_transport()
      __sctp_rcv_walk_lookup()
         __sctp_rcv_asconf_lookup()
           __sctp_lookup_association() -&gt; sctp_addrs_lookup_transport()

  ep:
  __sctp_rcv_lookup_endpoint() -&gt; sctp_endpoint_is_match()

sctp_connect():
  sctp_endpoint_is_peeled_off()
    __sctp_lookup_association()
      sctp_has_association()
        sctp_lookup_association()
          __sctp_lookup_association() -&gt; sctp_addrs_lookup_transport()

sctp_diag_dump_one():
  sctp_transport_lookup_process() -&gt; sctp_addrs_lookup_transport()
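
A minimal sketch of what the new check can look like, assuming the pernet
flag lives in net-&gt;sctp.l3mdev_accept and reusing inet_bound_dev_eq() (not
necessarily the exact code of this patch):

  static inline bool sctp_sk_bound_dev_eq(struct net *net, int bound_dev_if,
  					  int dif, int sdif)
  {
  	bool l3mdev_accept = true;

  #if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
  	l3mdev_accept = !!READ_ONCE(net-&gt;sctp.l3mdev_accept);
  #endif
  	/* bound socket: dif or sdif must equal sk_bound_dev_if;
  	 * unbound socket: accept unless the packet came through an l3mdev
  	 * (sdif != 0) and l3mdev_accept is 0.
  	 */
  	return inet_bound_dev_eq(l3mdev_accept, bound_dev_if, dif, sdif);
  }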

Signed-off-by: Xin Long &lt;lucien.xin@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>udp: Introduce optional per-netns hash table.</title>
<updated>2022-11-16T09:43:35Z</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-11-14T21:57:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9804985bf27f8fbcf0d96c7435b5ad94a2a6ea20'/>
<id>urn:sha1:9804985bf27f8fbcf0d96c7435b5ad94a2a6ea20</id>
<content type='text'>
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP's, so a performance drop can occur with fewer sockets.

On an EC2 c5.24xlarge instance (192 GiB memory), with iperf3 running in a
different netns, creating 32Mi sockets without data transfer in the root
netns causes a regression for the iperf3 connection.

  uhash_entries		sockets		length		Gbps
	    64K		      1		     1		5.69
			    1Mi		    16		5.27
			    2Mi		    32		4.90
			    4Mi		    64		4.09
			    8Mi		   128		2.96
			   16Mi		   256		2.06
			   32Mi		   512		1.12

The per-netns hash table breaks the lengthy lists into shorter ones.  It is
useful on a multi-tenant system with thousands of netns.  With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.

The max size of the per-netns table is 64K as well.  This is because the
hash range produced by udp_hashfn() always fits within a 64K window for
the same netns, so we cannot make full use of a table with more than 64K
buckets.

  /* 0 &lt; num &lt; 64K  -&gt;  X &lt; hash &lt; X + 64K */
  (num + net_hash_mix(net)) &amp; mask;
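
For reference, the hashing helper in include/net/udp.h is essentially the
following, which is why every netns only ever hits a 64K-wide window of
buckets:

  static inline u32 udp_hashfn(const struct net *net, u32 num, u32 mask)
  {
  	/* num &lt; 64K, so the result spans at most 64K buckets per netns */
  	return (num + net_hash_mix(net)) &amp; mask;
  }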

Also, the min size is 128.  We use a bitmap to search for an available
port in udp_lib_get_port().  To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.

The sysctl usage is the same as with TCP:

  $ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
  UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)

  # sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = 65536  # can be changed by uhash_entries

  # sysctl net.ipv4.udp_child_hash_entries
  net.ipv4.udp_child_hash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = -65536  # share the global table

  # sysctl -w net.ipv4.udp_child_hash_entries=100
  net.ipv4.udp_child_hash_entries = 100

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = 128  # own a per-netns table with 2^n buckets

We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future.  Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.

[0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>udp: Set NULL to sk-&gt;sk_prot-&gt;h.udp_table.</title>
<updated>2022-11-16T09:43:35Z</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-11-14T21:57:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=67fb43308f4b354f13aabcc66dd5d99bfbb7e838'/>
<id>urn:sha1:67fb43308f4b354f13aabcc66dd5d99bfbb7e838</id>
<content type='text'>
We will soon introduce an optional per-netns hash table
for UDP.

This means we cannot use the global sk-&gt;sk_prot-&gt;h.udp_table
to fetch a UDP hash table.

Instead, set sk-&gt;sk_prot-&gt;h.udp_table to NULL for UDP and fetch the
proper table from net-&gt;ipv4.udp_table.

Note that we still need sk-&gt;sk_prot-&gt;h.udp_table for UDP-Lite.
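
One way the table selection can be expressed (the helper name here is an
assumption, not necessarily what the patch uses):

  static struct udp_table *udp_get_table_prot(struct sock *sk)
  {
  	/* UDP: h.udp_table is NULL, so use the (possibly per-netns) table;
  	 * UDP-Lite: keep using the table pointed to by sk_prot.
  	 */
  	return sk-&gt;sk_prot-&gt;h.udp_table ? : sock_net(sk)-&gt;ipv4.udp_table;
  }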

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: add sysctls for TCP PLB parameters</title>
<updated>2022-10-28T09:47:42Z</updated>
<author>
<name>Mubashir Adnan Qureshi</name>
<email>mubashirq@google.com</email>
</author>
<published>2022-10-26T13:51:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bd456f283b66704920fae8e655ebc769cb743420'/>
<id>urn:sha1:bd456f283b66704920fae8e655ebc769cb743420</id>
<content type='text'>
PLB (Protective Load Balancing) is a host-based mechanism for load
balancing across switch links. It leverages congestion signals (e.g. ECN)
from the transport layer to randomly change the path of a connection
experiencing congestion. PLB changes the path of the connection by
changing the outgoing IPv6 flow label for IPv6 connections (implemented
in Linux by calling sk_rethink_txhash()). Because of this implementation
mechanism, PLB can currently only work for IPv6 traffic. For more
information, see the SIGCOMM 2022 paper:
  https://doi.org/10.1145/3544216.3544226
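
As a rough illustration only (a hypothetical helper, not the PLB code added
by this series): once enough consecutive congested rounds are observed, the
socket rethinks its tx hash, which yields a new IPv6 flow label and hence,
potentially, a new switch path.

  static void plb_maybe_repath(struct sock *sk, u32 ce_ratio, u32 cong_thresh,
  			     u32 *cong_rounds, u32 rehash_rounds)
  {
  	if (ce_ratio &lt; cong_thresh) {
  		*cong_rounds = 0;	/* congestion cleared, keep the current path */
  		return;
  	}
  	if (++(*cong_rounds) &gt;= rehash_rounds) {
  		sk_rethink_txhash(sk);	/* new txhash -&gt; new IPv6 flow label */
  		*cong_rounds = 0;
  	}
  }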

This commit adds new sysctl knobs and sets their default values for
TCP PLB.

Signed-off-by: Mubashir Adnan Qureshi &lt;mubashirq@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netns: Replace zero-length array with DECLARE_FLEX_ARRAY() helper</title>
<updated>2022-09-29T01:51:47Z</updated>
<author>
<name>Gustavo A. R. Silva</name>
<email>gustavoars@kernel.org</email>
</author>
<published>2022-09-26T23:02:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=63a8bf85568b30438431d4d78afb2fde4a280760'/>
<id>urn:sha1:63a8bf85568b30438431d4d78afb2fde4a280760</id>
<content type='text'>
Zero-length arrays are deprecated and we are moving towards adopting
C99 flexible-array members instead. So, replace zero-length array
declarations in anonymous unions with the new DECLARE_FLEX_ARRAY()
helper macro.

This helper allows for flexible-array members in unions.
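
Illustrative before/after on a made-up struct (not the exact netns
definition touched here):

  #include &lt;linux/stddef.h&gt;	/* DECLARE_FLEX_ARRAY() */
  #include &lt;linux/types.h&gt;

  /* before: zero-length array in an anonymous union */
  struct example_old {
  	union {
  		struct { u32 id; u32 len; };
  		char raw[0];
  	};
  };

  /* after: C99 flexible-array member, valid inside a union */
  struct example_new {
  	union {
  		struct { u32 id; u32 len; };
  		DECLARE_FLEX_ARRAY(char, raw);
  	};
  };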

Link: https://github.com/KSPP/linux/issues/193
Link: https://github.com/KSPP/linux/issues/225
Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Signed-off-by: Gustavo A. R. Silva &lt;gustavoars@kernel.org&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/YzIvfGXxfjdXmIS3@work
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net/smc: Unbind r/w buffer size from clcsock and make them tunable</title>
<updated>2022-09-22T10:58:21Z</updated>
<author>
<name>Tony Lu</name>
<email>tonylu@linux.alibaba.com</email>
</author>
<published>2022-09-20T09:52:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0227f058aa29f5ab6f6ec79c3a36ae41f1e03a13'/>
<id>urn:sha1:0227f058aa29f5ab6f6ec79c3a36ae41f1e03a13</id>
<content type='text'>
Currently, SMC uses smc-&gt;sk.sk_{rcv|snd}buf to create the send buffer
and the RMB, and the buffer sizes are taken from tcp_{w|r}mem of the
clcsock.

The buffer sizes from the TCP socket don't fit SMC well. SMC-R/-D
buffers are generally larger than TCP's to get higher performance,
because they use different underlying devices and paths.

So this patch unbinds the buffer sizes from TCP and introduces two
sysctl knobs to tune them independently. These knobs are per net
namespace, so they also work for containers.
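
A minimal sketch of the intent (the netns_smc field names are assumptions):
size the new socket's buffers from the per-netns knobs instead of copying
the clcsock's tcp_{w|r}mem values.

  static void smc_sock_init_bufsizes(struct sock *sk)
  {
  	struct net *net = sock_net(sk);

  	/* assumed per-netns knobs, tunable via sysctl in each namespace */
  	WRITE_ONCE(sk-&gt;sk_sndbuf, READ_ONCE(net-&gt;smc.sysctl_wmem));
  	WRITE_ONCE(sk-&gt;sk_rcvbuf, READ_ONCE(net-&gt;smc.sysctl_rmem));
  }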

Signed-off-by: Tony Lu &lt;tonylu@linux.alibaba.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>net/smc: Introduce a specific sysctl for TEST_LINK time</title>
<updated>2022-09-22T10:58:21Z</updated>
<author>
<name>Wen Gu</name>
<email>guwen@linux.alibaba.com</email>
</author>
<published>2022-09-20T09:52:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=77eee32514314209961af5c2982e871ecb364445'/>
<id>urn:sha1:77eee32514314209961af5c2982e871ecb364445</id>
<content type='text'>
SMC-R tests the viability of a link by sending out TEST_LINK LLC
messages over the RoCE fabric when connections on the link have been
idle for longer than the keepalive interval (testlink time).

But using tcp_keepalive_time as the testlink time is not quite
suitable, because it defaults to no less than two hours [1], which is
too long for a single link to discover that the peer is dead. The
active host will keep sending messages over the dead peer's link (QP)
and cannot find out until it gets IB_WC_RETRY_EXC_ERR error CQEs, which
normally takes longer than the TEST_LINK timeout (SMC_LLC_WAIT_TIME).

So this patch introduces an independent sysctl for SMC-R to set the
link keepalive time, in order to detect link failure in time. The
default value is 30 seconds.
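
A minimal sketch, assuming the new knob is stored per netns in jiffies (the
field name is an assumption):

  static unsigned long smcr_testlink_time(struct net *net)
  {
  	/* replaces the previous use of tcp_keepalive_time; default 30 * HZ */
  	return READ_ONCE(net-&gt;smc.sysctl_smcr_testlink_time);
  }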

[1] https://www.rfc-editor.org/rfc/rfc1122#page-101

Signed-off-by: Wen Gu &lt;guwen@linux.alibaba.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Introduce optional per-netns ehash.</title>
<updated>2022-09-20T17:21:50Z</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-09-08T01:10:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d1e5e6408b305ff78b825d437df8d3f77e82a4be'/>
<id>urn:sha1:d1e5e6408b305ff78b825d437df8d3f77e82a4be</id>
<content type='text'>
The more sockets we have in the hash table, the longer we spend looking
up a socket.  When a number of small workloads run on the same host,
they penalise each other and cause performance degradation.

The root cause might be a single workload that consumes much more
resources than the others.  It often happens on a cloud service where
different workloads share the same computing resource.

On an EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), with iperf3 running in a different netns, creating 24Mi sockets
without data transfer in the root netns causes about a 10% performance
regression for the iperf3 connection.

 thash_entries		sockets		length		Gbps
	524288		      1		     1		50.7
			   24Mi		    48		45.1

This is basically related to the length of each hash bucket's list.
To see how performance drops along the list length, I set thash_entries
to 131072 (1Mi / 8); here is the result.

 thash_entries		sockets		length		Gbps
        131072		      1		     1		50.7
			    1Mi		     8		49.9
			    2Mi		    16		48.9
			    4Mi		    32		47.3
			    8Mi		    64		44.6
			   16Mi		   128		40.6
			   24Mi		   192		36.3
			   32Mi		   256		32.5
			   40Mi		   320		27.0
			   48Mi		   384		25.0

To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.

With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours.  In addition, we can reduce lock contention.

We can control the ehash size by a new sysctl knob.  However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash.  The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.

We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).

  # dmesg | cut -d ' ' -f 5- | grep "established hash"
  TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

  # sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

  # sysctl -w net.ipv4.tcp_child_ehash_entries=100
  net.ipv4.tcp_child_ehash_entries = 100

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets

When two or more processes in the same netns create a per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:

  1) Share the global ehash and create per-netns ehash

  First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
  netns sysctl knobs where we can safely change tcp_child_ehash_entries
  and then clone()/unshare() again to create a per-netns ehash (see the
  sketch after this list).

  2) Control write on sysctl by BPF

  We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
  sysctl knobs.
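
As a concrete sketch of option 1 (hypothetical userspace code, error
handling omitted, requires CAP_SYS_ADMIN):

  #define _GNU_SOURCE
  #include &lt;sched.h&gt;
  #include &lt;stdio.h&gt;

  static void set_child_ehash_entries(const char *val)
  {
  	FILE *f = fopen("/proc/sys/net/ipv4/tcp_child_ehash_entries", "w");

  	if (f) {
  		fputs(val, f);
  		fclose(f);
  	}
  }

  int main(void)
  {
  	set_child_ehash_entries("0");	/* keep sharing the global ehash */
  	unshare(CLONE_NEWNET);		/* netns A: private sysctl knobs, shared ehash */
  	set_child_ehash_entries("16384");
  	unshare(CLONE_NEWNET);		/* netns B: allocates its own 16K-bucket ehash */
  	return 0;
  }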

Note that the global ehash allocated at boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy.  By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node.  Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.

Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:

  tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
  tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)

As a bonus, we can dismantle a netns faster.  Currently, while destroying
a netns, we call inet_twsk_purge(), which walks through the global ehash.
It can potentially be big because it can hold many sockets, not only
TIME_WAIT ones, from all netns.  With a split ehash, inet_twsk_purge()
only needs to clean up the TIME_WAIT sockets in each netns.

With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.

In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.</title>
<updated>2022-09-20T17:21:49Z</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-09-08T01:10:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e9bd0cca09d13ac2f08d25e195203e42d4ad1ce8'/>
<id>urn:sha1:e9bd0cca09d13ac2f08d25e195203e42d4ad1ce8</id>
<content type='text'>
We will soon introduce an optional per-netns ehash and access hash
tables via net-&gt;ipv4.tcp_death_row-&gt;hashinfo instead of &amp;tcp_hashinfo
in most places.

It could harm the fast path because dereferences of two fields in net
and tcp_death_row might incur two extra cache line misses.  To save one
dereference, let's place tcp_death_row back in netns_ipv4 and fetch
hashinfo via net-&gt;ipv4.tcp_death_row.hashinfo.
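
A sketch of the layout change (fields abridged): with the struct embedded,
the hash info is reachable from a single dereference of net instead of
chasing a separate tcp_death_row pointer first.

  struct netns_ipv4 {
  	struct inet_timewait_death_row tcp_death_row;	/* embedded, not a pointer */
  	/* ... */
  };

  /* fast path: one dereference of 'net' */
  struct inet_hashinfo *hinfo = net-&gt;ipv4.tcp_death_row.hashinfo;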

Note tcp_death_row was initially placed in netns_ipv4, and commit
fbb8295248e1 ("tcp: allocate tcp_death_row outside of struct netns_ipv4")
changed it to a pointer so that we can fire TIME_WAIT timers after freeing
net.  However, we don't do so after commit 04c494e68a13 ("Revert "tcp/dccp:
get rid of inet_twsk_purge()""), so we need not define tcp_death_row as a
pointer.

Also, we move refcount_dec_and_test(&amp;tw_refcount) from tcp_sk_exit() to
tcp_sk_exit_batch() as a debug check.

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
</feed>
