<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/include/net/netns, branch v4.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.2</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2015-07-20T12:58:19Z</updated>
<entry>
<title>netfilter: fix netns dependencies with conntrack templates</title>
<updated>2015-07-20T12:58:19Z</updated>
<author>
<name>Pablo Neira Ayuso</name>
<email>pablo@netfilter.org</email>
</author>
<published>2015-07-13T13:11:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0838aa7fcfcd875caa7bcc5dab0c3fd40444553d'/>
<id>urn:sha1:0838aa7fcfcd875caa7bcc5dab0c3fd40444553d</id>
<content type='text'>
Quoting Daniel Borkmann:

"When adding connection tracking template rules to a netns, f.e. to
configure netfilter zones, the kernel will endlessly busy-loop as soon
as we try to delete the given netns in case there's at least one
template present, which is problematic i.e. if there is such bravery that
the priviledged user inside the netns is assumed untrusted.

Minimal example:

  ip netns add foo
  ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
  ip netns del foo

What happens is that when nf_ct_iterate_cleanup() is being called from
nf_conntrack_cleanup_net_list() for a provided netns, we always end up
with a net-&gt;ct.count &gt; 0 and thus jump back to i_see_dead_people. We
don't get a soft-lockup as we still have a schedule() point, but the
serving CPU spins on 100% from that point onwards.

Since templates are normally allocated with nf_conntrack_alloc(), we
also bump net-&gt;ct.count. The issue why they are not yet nf_ct_put() is
because the per netns .exit() handler from x_tables (which would eventually
invoke xt_CT's xt_ct_tg_destroy() that drops reference on info-&gt;ct) is
called in the dependency chain at a *later* point in time than the per
netns .exit() handler for the connection tracker.

This is clearly a chicken'n'egg problem: after the connection tracker
.exit() handler, we've teared down all the connection tracking
infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
invoked at a later point in time during the netns cleanup, as that would
lead to a use-after-free. At the same time, we cannot make x_tables depend
on the connection tracker module, so that the xt_ct_tg_destroy() would
be invoked earlier in the cleanup chain."

Daniel confirms this has to do with the order in which modules are loaded or
having compiled nf_conntrack as modules while x_tables built-in. So we have no
guarantees regarding the order in which netns callbacks are executed.

Fix this by allocating the templates through kmalloc() from the respective
SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch
is marked as unlikely since conntrack templates are rarely allocated and only
from the configuration plane path.

Note that templates are not kept in any list to avoid further dependencies with
nf_conntrack anymore, thus, the tmpl larval list is removed.

Reported-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
Tested-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2015-06-24T09:58:51Z</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2015-06-24T09:58:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3a07bd6fead4f00f67b1bf5f551e686661c4f52c'/>
<id>urn:sha1:3a07bd6fead4f00f67b1bf5f551e686661c4f52c</id>
<content type='text'>
Conflicts:
	drivers/net/ethernet/mellanox/mlx4/main.c
	net/packet/af_packet.c

Both conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netfilter: don't pull include/linux/netfilter.h from netns headers</title>
<updated>2015-06-18T19:14:31Z</updated>
<author>
<name>Pablo Neira Ayuso</name>
<email>pablo@netfilter.org</email>
</author>
<published>2015-06-17T15:28:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a263653ed798216c0069922d7b5237ca49436007'/>
<id>urn:sha1:a263653ed798216c0069922d7b5237ca49436007</id>
<content type='text'>
This pulls the full hook netfilter definitions from all those that include
net_namespace.h.

Instead let's just include the bare minimum required in the new
linux/netfilter_defs.h file, and use it from the netfilter netns header files.

I also needed to include in.h and in6.h from linux/netfilter.h otherwise we hit
this compilation error:

In file included from include/linux/netfilter_defs.h:4:0,
                 from include/net/netns/netfilter.h:4,
                 from include/net/net_namespace.h:22,
                 from include/linux/netdevice.h:43,
                 from net/netfilter/nfnetlink_queue_core.c:23:
include/uapi/linux/netfilter.h:76:17: error: field ‘in’ has incomplete type struct in_addr in;

And also explicit include linux/netfilter.h in several spots.

Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>netfilter: use forward declaration instead of including linux/proc_fs.h</title>
<updated>2015-06-18T19:14:30Z</updated>
<author>
<name>Pablo Neira Ayuso</name>
<email>pablo@netfilter.org</email>
</author>
<published>2015-06-17T15:28:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=10c04a8e715cca824f96bcbf4af07f5a40985357'/>
<id>urn:sha1:10c04a8e715cca824f96bcbf4af07f5a40985357</id>
<content type='text'>
We don't need to pull the full definitions in that file, a simple forward
declaration is enough.

Moreover, include linux/procfs.h from nf_synproxy_core, otherwise this hits a
compilation error due to missing declarations, ie.

net/netfilter/nf_synproxy_core.c: In function ‘synproxy_proc_init’:
net/netfilter/nf_synproxy_core.c:326:2: error: implicit declaration of function ‘proc_create’ [-Werror=implicit-function-declaration]
  if (!proc_create("synproxy", S_IRUGO, net-&gt;proc_net_stat,
  ^

Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>sctp: fix ASCONF list handling</title>
<updated>2015-06-14T19:55:49Z</updated>
<author>
<name>Marcelo Ricardo Leitner</name>
<email>marcelo.leitner@gmail.com</email>
</author>
<published>2015-06-12T13:16:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2d45a02d0166caf2627fe91897c6ffc3b19514c4'/>
<id>urn:sha1:2d45a02d0166caf2627fe91897c6ffc3b19514c4</id>
<content type='text'>
-&gt;auto_asconf_splist is per namespace and mangled by functions like
sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.

Also, the call to inet_sk_copy_descendant() was backuping
-&gt;auto_asconf_list through the copy but was not honoring
-&gt;do_auto_asconf, which could lead to list corruption if it was
different between both sockets.

This commit thus fixes the list handling by using -&gt;addr_wq_lock
spinlock to protect the list. A special handling is done upon socket
creation and destruction for that. Error handlig on sctp_init_sock()
will never return an error after having initialized asconf, so
sctp_destroy_sock() can be called without addrq_wq_lock. The lock now
will be take on sctp_close_sock(), before locking the socket, so we
don't do it in inverse order compared to sctp_addr_wq_timeout_handler().

Instead of taking the lock on sctp_sock_migrate() for copying and
restoring the list values, it's preferred to avoid rewritting it by
implementing sctp_copy_descendant().

Issue was found with a test application that kept flipping sysctl
default_auto_asconf on and off, but one could trigger it by issuing
simultaneous setsockopt() calls on multiple sockets or by
creating/destroying sockets fast enough. This is only triggerable
locally.

Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).")
Reported-by: Ji Jianwen &lt;jiji@redhat.com&gt;
Suggested-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;
Suggested-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Acked-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: Marcelo Ricardo Leitner &lt;marcelo.leitner@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next</title>
<updated>2015-05-31T07:02:30Z</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2015-05-31T07:02:30Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=583d3f5af2a6dfa7866715d9e062dbfb3b66a6f0'/>
<id>urn:sha1:583d3f5af2a6dfa7866715d9e062dbfb3b66a6f0</id>
<content type='text'>
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they are:

1) default CONFIG_NETFILTER_INGRESS to y for easier compile-testing of all
   options.

2) Allow to bind a table to net_device. This introduces the internal
   NFT_AF_NEEDS_DEV flag to perform a mandatory check for this binding.
   This is required by the next patch.

3) Add the 'netdev' table family, this new table allows you to create ingress
   filter basechains. This provides access to the existing nf_tables features
   from ingress.

4) Kill unused argument from compat_find_calc_{match,target} in ip_tables
   and ip6_tables, from Florian Westphal.
====================

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp/dccp: warn user for preferred ip_local_port_range</title>
<updated>2015-05-27T18:35:36Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-05-27T18:34:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ed2dfd900992aa7b6b3d0abd8ec9a7e9d2c7f827'/>
<id>urn:sha1:ed2dfd900992aa7b6b3d0abd8ec9a7e9d2c7f827</id>
<content type='text'>
After commit 07f4c90062f8f ("tcp/dccp: try to not exhaust
ip_local_port_range in connect()") it is advised to have an even number
of ports described in /proc/sys/net/ipv4/ip_local_port_range

This means start/end values should have a different parity.

Let's warn sysadmins of this, so that they can update their settings
if they want to.

Suggested-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netfilter: nf_tables: add netdev table to filter from ingress</title>
<updated>2015-05-26T16:41:23Z</updated>
<author>
<name>Pablo Neira Ayuso</name>
<email>pablo@netfilter.org</email>
</author>
<published>2015-05-26T16:41:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ed6c4136f1571bd6ab362afc3410905a8a69ca42'/>
<id>urn:sha1:ed6c4136f1571bd6ab362afc3410905a8a69ca42</id>
<content type='text'>
This allows us to create netdev tables that contain ingress chains. Use
skb_header_pointer() as we may see shared sk_buffs at this stage.

This change provides access to the existing nf_tables features from the ingress
hook.

Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
</content>
</entry>
<entry>
<title>tcp: add rfc3168, section 6.1.1.1. fallback</title>
<updated>2015-05-19T20:53:37Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2015-05-19T19:04:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=492135557dc090a1abb2cfbe1a412757e3ed68ab'/>
<id>urn:sha1:492135557dc090a1abb2cfbe1a412757e3ed68ab</id>
<content type='text'>
This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:

  [...] A host that receives no reply to an ECN-setup SYN within the
  normal SYN retransmission timeout interval MAY resend the SYN and
  any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

 1) Normal ECN-capable path:

    SYN ECE CWR -----&gt;
                &lt;----- SYN ACK ECE
            ACK -----&gt;

 2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN -----&gt;
                &lt;----- SYN ACK
            ACK -----&gt;

In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)

In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:

  Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
  Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
  http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.

tp-&gt;ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Florian Westphal &lt;fw@strlen.de&gt;
Signed-off-by: Mirja Kühlewind &lt;mirja.kuehlewind@tik.ee.ethz.ch&gt;
Signed-off-by: Brian Trammell &lt;trammell@tik.ee.ethz.ch&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Dave That &lt;dave.taht@gmail.com&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>ipv6: Flow label state ranges</title>
<updated>2015-05-04T01:58:01Z</updated>
<author>
<name>Tom Herbert</name>
<email>tom@herbertland.com</email>
</author>
<published>2015-04-29T22:33:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=82a584b7cd366511a22e37675b029cf2fb58e291'/>
<id>urn:sha1:82a584b7cd366511a22e37675b029cf2fb58e291</id>
<content type='text'>
This patch divides the IPv6 flow label space into two ranges:
0-7ffff is reserved for flow label manager, 80000-fffff will be
used for creating auto flow labels (per RFC6438). This only affects how
labels are set on transmit, it does not affect receive. This range split
can be disbaled by systcl.

Background:

IPv6 flow labels have been an unmitigated disappointment thus far
in the lifetime of IPv6. Support in HW devices to use them for ECMP
is lacking, and OSes don't turn them on by default. If we had these
we could get much better hashing in IPv6 networks without resorting
to DPI, possibly eliminating some of the motivations to to define new
encaps in UDP just for getting ECMP.

Unfortunately, the initial specfications of IPv6 did not clarify
how they are to be used. There has always been a vague concept that
these can be used for ECMP, flow hashing, etc. and we do now have a
good standard how to this in RFC6438. The problem is that flow labels
can be either stateful or stateless (as in RFC6438), and we are
presented with the possibility that a stateless label may collide
with a stateful one.  Attempts to split the flow label space were
rejected in IETF. When we added support in Linux for RFC6438, we
could not turn on flow labels by default due to this conflict.

This patch splits the flow label space and should give us
a path to enabling auto flow labels by default for all IPv6 packets.
This is an API change so we need to consider compatibility with
existing deployment. The stateful range is chosen to be the lower
values in hopes that most uses would have chosen small numbers.

Once we resolve the stateless/stateful issue, we can proceed to
look at enabling RFC6438 flow labels by default (starting with
scaled testing).

Signed-off-by: Tom Herbert &lt;tom@herbertland.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
