<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/net/netlink, branch v4.3</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.3</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.3'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2015-10-22T14:18:28Z</updated>
<entry>
<title>netlink: fix locking around NETLINK_LIST_MEMBERSHIPS</title>
<updated>2015-10-22T14:18:28Z</updated>
<author>
<name>David Herrmann</name>
<email>dh.herrmann@gmail.com</email>
</author>
<published>2015-10-21T09:47:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=47191d65b647af5eb5c82ede70ed4c24b1e93ef4'/>
<id>urn:sha1:47191d65b647af5eb5c82ede70ed4c24b1e93ef4</id>
<content type='text'>
Currently, NETLINK_LIST_MEMBERSHIPS grabs the netlink table while copying
the membership state to user-space. However, grabing the netlink table is
effectively a write_lock_irq(), and as such we should not be triggering
page-faults in the critical section.

This can be easily reproduced by the following snippet:
    int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    void *p = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
    int r = getsockopt(s, 0x10e, 9, p, (void*)((char*)p + 4092));

This should work just fine, but currently triggers EFAULT and a possible
WARN_ON below handle_mm_fault().

Fix this by reducing locking of NETLINK_LIST_MEMBERSHIPS to a read-side
lock. The write-lock was overkill in the first place, and the read-lock
allows page-faults just fine.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netlink: Trim skb to alloc size to avoid MSG_TRUNC</title>
<updated>2015-10-19T02:34:12Z</updated>
<author>
<name>Arad, Ronen</name>
<email>ronen.arad@intel.com</email>
</author>
<published>2015-10-15T08:55:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=db65a3aaf29ecce2e34271d52e8d2336b97bd9fe'/>
<id>urn:sha1:db65a3aaf29ecce2e34271d52e8d2336b97bd9fe</id>
<content type='text'>
netlink_dump() allocates skb based on the calculated min_dump_alloc or
a per socket max_recvmsg_len.
min_alloc_size is maximum space required for any single netdev
attributes as calculated by rtnl_calcit().
max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
It is capped at 16KiB.
The intention is to avoid small allocations and to minimize the number
of calls required to obtain dump information for all net devices.

netlink_dump packs as many small messages as could fit within an skb
that was sized for the largest single netdev information. The actual
space available within an skb is larger than what is requested. It could
be much larger and up to near 2x with align to next power of 2 approach.

Allowing netlink_dump to use all the space available within the
allocated skb increases the buffer size a user has to provide to avoid
truncaion (i.e. MSG_TRUNG flag set).

It was observed that with many VLANs configured on at least one netdev,
a larger buffer of near 64KiB was necessary to avoid "Message truncated"
error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
min_alloc_size was only little over 32KiB.

This patch trims skb to allocated size in order to allow the user to
avoid truncation with more reasonable buffer size.

Signed-off-by: Ronen Arad &lt;ronen.arad@intel.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netlink: Replace rhash_portid with bound</title>
<updated>2015-09-24T19:07:08Z</updated>
<author>
<name>Herbert Xu</name>
<email>herbert@gondor.apana.org.au</email>
</author>
<published>2015-09-22T03:38:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=da314c9923fed553a007785a901fd395b7eb6c19'/>
<id>urn:sha1:da314c9923fed553a007785a901fd395b7eb6c19</id>
<content type='text'>
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
&gt;
&gt; store_release and load_acquire are different from the usual memory
&gt; barriers and can't be paired this way.  You have to pair store_release
&gt; and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

&gt; depend on memory barriers embedded in other data structures like the
&gt; above.  Here, especially, rhashtable_insert() would have write barrier
&gt; *before* the entry is hashed not necessarily *after*, which means that
&gt; in the above case, a socket which appears to have set bound to a
&gt; reader might not visible when the reader tries to look up the socket
&gt; on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

&gt; There's no reason to be overly smart here.  This isn't a crazy hot
&gt; path, write barriers tend to be very cheap, store_release more so.
&gt; Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

&gt; &gt; @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
&gt; &gt;  		}
&gt; &gt;  	}
&gt; &gt;
&gt; &gt; -	if (!nlk-&gt;portid) {
&gt; &gt; +	if (!nlk-&gt;bound) {
&gt;
&gt; I don't think you can skip load_acquire here just because this is the
&gt; second deref of the variable.  That doesn't change anything.  Race
&gt; condition could still happen between the first and second tests and
&gt; skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk-&gt;portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

&gt; &gt; @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
&gt; &gt;  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
&gt; &gt;  		return -EPERM;
&gt; &gt;
&gt; &gt; -	if (!nlk-&gt;portid)
&gt; &gt; +	if (!nlk-&gt;bound)
&gt;
&gt; Don't we need load_acquire here too?  Is this path holding a lock
&gt; which makes that unnecessary?

Ditto.

---8&lt;---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk-&gt;bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Nacked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netlink: Fix autobind race condition that leads to zero port ID</title>
<updated>2015-09-21T05:55:31Z</updated>
<author>
<name>Herbert Xu</name>
<email>herbert@gondor.apana.org.au</email>
</author>
<published>2015-09-18T11:16:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1f770c0a09da855a2b51af6d19de97fb955eca85'/>
<id>urn:sha1:1f770c0a09da855a2b51af6d19de97fb955eca85</id>
<content type='text'>
The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netlink, mmap: transform mmap skb into full skb on taps</title>
<updated>2015-09-11T21:36:49Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2015-09-10T18:05:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1853c949646005b5959c483becde86608f548f24'/>
<id>urn:sha1:1853c949646005b5959c483becde86608f548f24</id>
<content type='text'>
Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmaped.

I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
skb has it pointed to netlink_skb_destructor(), set in the handler
netlink_ring_setup_skb(). There, skb-&gt;head is being set to NULL, so
that in such cases, __kfree_skb() doesn't perform a skb_release_data()
via skb_release_all(), where skb-&gt;head is possibly being freed through
kfree(head) into slab allocator, although netlink mmap skb-&gt;head points
to the mmap buffer. Similarly, the same has to be done also for large
netlink skbs where the data area is vmalloced. Therefore, as discussed,
make a copy for these rather rare cases for now. This fixes the issue
on my and Ken-ichirou's test-cases.

Reference: http://thread.gmane.org/gmane.linux.network/371129
Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
Reported-by: Ken-ichirou MATSUZAWA &lt;chamaken@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Tested-by: Ken-ichirou MATSUZAWA &lt;chamaken@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netlink, mmap: fix edge-case leakages in nf queue zero-copy</title>
<updated>2015-09-10T04:43:22Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2015-09-10T00:10:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6bb0fef489f667cf701853054f44579754f00a06'/>
<id>urn:sha1:6bb0fef489f667cf701853054f44579754f00a06</id>
<content type='text'>
When netlink mmap on receive side is the consumer of nf queue data,
it can happen that in some edge cases, we write skb shared info into
the user space mmap buffer:

Assume a possible rx ring frame size of only 4096, and the network skb,
which is being zero-copied into the netlink skb, contains page frags
with an overall skb-&gt;len larger than the linear part of the netlink
skb.

skb_zerocopy(), which is generic and thus not aware of the fact that
shared info cannot be accessed for such skbs then tries to write and
fill frags, thus leaking kernel data/pointers and in some corner cases
possibly writing out of bounds of the mmap area (when filling the
last slot in the ring buffer this way).

I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has
an advertised length larger than 4096, where the linear part is visible
at the slot beginning, and the leaked sizeof(struct skb_shared_info)
has been written to the beginning of the next slot (also corrupting
the struct nl_mmap_hdr slot header incl. status etc), since skb-&gt;end
points to skb-&gt;data + ring-&gt;frame_size - NL_MMAP_HDRLEN.

The fix adds and lets __netlink_alloc_skb() take the actual needed
linear room for the network skb + meta data into account. It's completely
irrelevant for non-mmaped netlink sockets, but in case mmap sockets
are used, it can be decided whether the available skb_tailroom() is
really large enough for the buffer, or whether it needs to internally
fallback to a normal alloc_skb().

&gt;From nf queue side, the information whether the destination port is
an mmap RX ring is not really available without extra port-to-socket
lookup, thus it can only be determined in lower layers i.e. when
__netlink_alloc_skb() is called that checks internally for this. I
chose to add the extra ldiff parameter as mmap will then still work:
We have data_len and hlen in nfqnl_build_packet_message(), data_len
is the full length (capped at queue-&gt;copy_range) for skb_zerocopy()
and hlen some possible part of data_len that needs to be copied; the
rem_len variable indicates the needed remaining linear mmap space.

The only other workaround in nf queue internally would be after
allocation time by f.e. cap'ing the data_len to the skb_tailroom()
iff we deal with an mmap skb, but that would 1) expose the fact that
we use a mmap skb to upper layers, and 2) trim the skb where we
otherwise could just have moved the full skb into the normal receive
queue.

After the patch, in my test case the ring slot doesn't fit and therefore
shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and
thus needs to be picked up via recv().

Fixes: 3ab1f683bf8b ("nfnetlink: add support for memory mapped netlink")
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netlink, mmap: don't walk rx ring on poll if receive queue non-empty</title>
<updated>2015-09-10T04:42:51Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2015-09-09T23:20:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a66e36568e30ed3714c0e3a12bd3b64696343ff5'/>
<id>urn:sha1:a66e36568e30ed3714c0e3a12bd3b64696343ff5</id>
<content type='text'>
In case of netlink mmap, there can be situations where received frames
have to be placed into the normal receive queue. The ring buffer indicates
this through NL_MMAP_STATUS_COPY, so the user is asked to pick them up
via recvmsg(2) syscall, and to put the slot back to NL_MMAP_STATUS_UNUSED.

Commit 0ef707700f1c ("netlink: rx mmap: fix POLLIN condition") changed
polling, so that we walk in the worst case the whole ring through the
new netlink_has_valid_frame(), for example, when the ring would have no
NL_MMAP_STATUS_VALID, but at least one NL_MMAP_STATUS_COPY frame.

Since we do a datagram_poll() already earlier to pick up a mask that could
possibly contain POLLIN | POLLRDNORM already (due to NL_MMAP_STATUS_COPY),
we can skip checking the rx ring entirely.

In case the kernel is compiled with !CONFIG_NETLINK_MMAP, then all this is
irrelevant anyway as netlink_poll() is just defined as datagram_poll().

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netlink: rx mmap: fix POLLIN condition</title>
<updated>2015-08-31T04:55:51Z</updated>
<author>
<name>Ken-ichirou MATSUZAWA</name>
<email>chamaken@gmail.com</email>
</author>
<published>2015-08-30T22:54:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0ef707700f1cef2357ce655fc86a4de5e41fa4b5'/>
<id>urn:sha1:0ef707700f1cef2357ce655fc86a4de5e41fa4b5</id>
<content type='text'>
Poll() returns immediately after setting the kernel current frame
(ring-&gt;head) to SKIP from user space even though there is no new
frame. And in a case of all frames is VALID, user space program
unintensionally sets (only) kernel current frame to UNUSED, then
calls poll(), it will not return immediately even though there are
VALID frames.

To avoid situations like above, I think we need to scan all frames
to find VALID frames at poll() like netlink_alloc_skb(),
netlink_forward_ring() finding an UNUSED frame at skb allocation.

Signed-off-by: Ken-ichirou MATSUZAWA &lt;chamas@h4.dion.ne.jp&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netlink: mmap: fix lookup frame position</title>
<updated>2015-08-29T05:25:42Z</updated>
<author>
<name>Ken-ichirou MATSUZAWA</name>
<email>chamaken@gmail.com</email>
</author>
<published>2015-08-28T07:05:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7084a315897715776d1764f5fd9250609e515beb'/>
<id>urn:sha1:7084a315897715776d1764f5fd9250609e515beb</id>
<content type='text'>
__netlink_lookup_frame() was always called with the same "pos"
value in netlink_forward_ring(). It will look at the same ring entry
header over and over again, every time through this loop. Then cycle
through the whole ring, advancing ring-&gt;head, not "pos" until it
equals the "ring-&gt;head != head" loop test fails.

Signed-off-by: Ken-ichirou MATSUZAWA &lt;chamas@h4.dion.ne.jp&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netlink: add NETLINK_CAP_ACK socket option</title>
<updated>2015-08-29T05:25:42Z</updated>
<author>
<name>Christophe Ricard</name>
<email>christophe.ricard@gmail.com</email>
</author>
<published>2015-08-28T05:07:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0a6a3a23ea6efde079a5b77688541a98bf202721'/>
<id>urn:sha1:0a6a3a23ea6efde079a5b77688541a98bf202721</id>
<content type='text'>
Since commit c05cdb1b864f ("netlink: allow large data transfers from
user-space"), the kernel may fail to allocate the necessary room for the
acknowledgment message back to userspace. This patch introduces a new
socket option that trims off the payload of the original netlink message.

The netlink message header is still included, so the user can guess from
the sequence number what is the message that has triggered the
acknowledgment.

Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
Signed-off-by: Christophe Ricard &lt;christophe-h.ricard@st.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
