<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/net/ipv4, branch v3.16</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v3.16</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v3.16'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2014-07-29T01:46:34Z</updated>
<entry>
<title>ip: make IP identifiers less predictable</title>
<updated>2014-07-29T01:46:34Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2014-07-26T06:58:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=04ca6973f7c1a0d8537f2d9906a0cf8e69886d75'/>
<id>urn:sha1:04ca6973f7c1a0d8537f2d9906a0cf8e69886d75</id>
<content type='text'>
In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.

With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.

This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.

Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.

This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.

We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.

For IPv6, adds saddr into the hash as well, but not nexthdr.

If I ping the patched target, we can see ID are now hard to predict.

21:57:11.008086 IP (...)
    A &gt; target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
    target &gt; A: ICMP echo reply, seq 1, length 64

21:57:12.013133 IP (...)
    A &gt; target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
    target &gt; A: ICMP echo reply, seq 2, length 64

21:57:13.016580 IP (...)
    A &gt; target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
    target &gt; A: ICMP echo reply, seq 3, length 64

[1] TCP sessions uses a per flow ID generator not changed by this patch.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Jeffrey Knockel &lt;jeffk@cs.unm.edu&gt;
Reported-by: Jedidiah R. Crandall &lt;crandall@cs.unm.edu&gt;
Cc: Willy Tarreau &lt;w@1wt.eu&gt;
Cc: Hannes Frederic Sowa &lt;hannes@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>ipv4: fix buffer overflow in ip_options_compile()</title>
<updated>2014-07-22T03:16:26Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2014-07-21T05:17:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=10ec9472f05b45c94db3c854d22581a20b97db41'/>
<id>urn:sha1:10ec9472f05b45c94db3c854d22581a20b97db41</id>
<content type='text'>
There is a benign buffer overflow in ip_options_compile spotted by
AddressSanitizer[1] :

Its benign because we always can access one extra byte in skb-&gt;head
(because header is followed by struct skb_shared_info), and in this case
this byte is not even used.

[28504.910798] ==================================================================
[28504.912046] AddressSanitizer: heap-buffer-overflow in ip_options_compile
[28504.913170] Read of size 1 by thread T15843:
[28504.914026]  [&lt;ffffffff81802f91&gt;] ip_options_compile+0x121/0x9c0
[28504.915394]  [&lt;ffffffff81804a0d&gt;] ip_options_get_from_user+0xad/0x120
[28504.916843]  [&lt;ffffffff8180dedf&gt;] do_ip_setsockopt.isra.15+0x8df/0x1630
[28504.918175]  [&lt;ffffffff8180ec60&gt;] ip_setsockopt+0x30/0xa0
[28504.919490]  [&lt;ffffffff8181e59b&gt;] tcp_setsockopt+0x5b/0x90
[28504.920835]  [&lt;ffffffff8177462f&gt;] sock_common_setsockopt+0x5f/0x70
[28504.922208]  [&lt;ffffffff817729c2&gt;] SyS_setsockopt+0xa2/0x140
[28504.923459]  [&lt;ffffffff818cfb69&gt;] system_call_fastpath+0x16/0x1b
[28504.924722]
[28504.925106] Allocated by thread T15843:
[28504.925815]  [&lt;ffffffff81804995&gt;] ip_options_get_from_user+0x35/0x120
[28504.926884]  [&lt;ffffffff8180dedf&gt;] do_ip_setsockopt.isra.15+0x8df/0x1630
[28504.927975]  [&lt;ffffffff8180ec60&gt;] ip_setsockopt+0x30/0xa0
[28504.929175]  [&lt;ffffffff8181e59b&gt;] tcp_setsockopt+0x5b/0x90
[28504.930400]  [&lt;ffffffff8177462f&gt;] sock_common_setsockopt+0x5f/0x70
[28504.931677]  [&lt;ffffffff817729c2&gt;] SyS_setsockopt+0xa2/0x140
[28504.932851]  [&lt;ffffffff818cfb69&gt;] system_call_fastpath+0x16/0x1b
[28504.934018]
[28504.934377] The buggy address ffff880026382828 is located 0 bytes to the right
[28504.934377]  of 40-byte region [ffff880026382800, ffff880026382828)
[28504.937144]
[28504.937474] Memory state around the buggy address:
[28504.938430]  ffff880026382300: ........ rrrrrrrr rrrrrrrr rrrrrrrr
[28504.939884]  ffff880026382400: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28504.941294]  ffff880026382500: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
[28504.942504]  ffff880026382600: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28504.943483]  ffff880026382700: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28504.944511] &gt;ffff880026382800: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
[28504.945573]                         ^
[28504.946277]  ffff880026382900: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.094949]  ffff880026382a00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.096114]  ffff880026382b00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.097116]  ffff880026382c00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.098472]  ffff880026382d00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
[28505.099804] Legend:
[28505.100269]  f - 8 freed bytes
[28505.100884]  r - 8 redzone bytes
[28505.101649]  . - 8 allocated bytes
[28505.102406]  x=1..7 - x allocated bytes + (8-x) redzone bytes
[28505.103637] ==================================================================

[1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net-gre-gro: Fix a bug that breaks the forwarding path</title>
<updated>2014-07-16T21:45:26Z</updated>
<author>
<name>Jerry Chu</name>
<email>hkchu@google.com</email>
</author>
<published>2014-07-14T22:54:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c3caf1192f904de2f1381211f564537235d50de3'/>
<id>urn:sha1:c3caf1192f904de2f1381211f564537235d50de3</id>
<content type='text'>
Fixed a bug that was introduced by my GRE-GRO patch
(bf5a755f5e9186406bbf50f4087100af5bd68e40 net-gre-gro: Add GRE
support to the GRO stack) that breaks the forwarding path
because various GSO related fields were not set. The bug will
cause on the egress path either the GSO code to fail, or a
GRE-TSO capable (NETIF_F_GSO_GRE) NICs to choke. The following
fix has been tested for both cases.

Signed-off-by: H.K. Jerry Chu &lt;hkchu@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>GRE: enable offloads for GRE</title>
<updated>2014-07-11T20:53:39Z</updated>
<author>
<name>Amritha Nambiar</name>
<email>amritha.nambiar@intel.com</email>
</author>
<published>2014-07-11T00:29:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d0a7ebbc119738439ff00f7fadbd343ae20ea5e8'/>
<id>urn:sha1:d0a7ebbc119738439ff00f7fadbd343ae20ea5e8</id>
<content type='text'>
To get offloads to work with Generic Routing Encapsulation (GRE), the
outer transport header has to be reset after skb_push is done. This
patch has the support for this fix and hence GRE offloading.

Signed-off-by: Amritha Nambiar &lt;amritha.nambiar@intel.com&gt;
Signed-off-by: Joseph Gasparakis &lt;joseph.gasparakis@intel.com&gt;
Tested-By: Jim Young &lt;jamesx.m.young@intel.com&gt;
Signed-off-by: Jeff Kirsher &lt;jeffrey.t.kirsher@intel.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>ip_tunnel: fix ip_tunnel_lookup</title>
<updated>2014-07-09T02:35:09Z</updated>
<author>
<name>Dmitry Popov</name>
<email>ixaphire@qrator.net</email>
</author>
<published>2014-07-04T22:26:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e0056593b61253f1a8a9941dacda22e73b963cdc'/>
<id>urn:sha1:e0056593b61253f1a8a9941dacda22e73b963cdc</id>
<content type='text'>
This patch fixes 3 similar bugs where incoming packets might be routed into
wrong non-wildcard tunnels:

1) Consider the following setup:
    ip address add 1.1.1.1/24 dev eth0
    ip address add 1.1.1.2/24 dev eth0
    ip tunnel add ipip1 remote 2.2.2.2 local 1.1.1.1 mode ipip dev eth0
    ip link set ipip1 up

Incoming ipip packets from 2.2.2.2 were routed into ipip1 even if it has dst =
1.1.1.2. Moreover even if there was wildcard tunnel like
   ip tunnel add ipip0 remote 2.2.2.2 local any mode ipip dev eth0
but it was created before explicit one (with local 1.1.1.1), incoming ipip
packets with src = 2.2.2.2 and dst = 1.1.1.2 were still routed into ipip1.

Same issue existed with all tunnels that use ip_tunnel_lookup (gre, vti)

2)  ip address add 1.1.1.1/24 dev eth0
    ip tunnel add ipip1 remote 2.2.146.85 local 1.1.1.1 mode ipip dev eth0
    ip link set ipip1 up

Incoming ipip packets with dst = 1.1.1.1 were routed into ipip1, no matter what
src address is. Any remote ip address which has ip_tunnel_hash = 0 raised this
issue, 2.2.146.85 is just an example, there are more than 4 million of them.
And again, wildcard tunnel like
   ip tunnel add ipip0 remote any local 1.1.1.1 mode ipip dev eth0
wouldn't be ever matched if it was created before explicit tunnel like above.

Gre &amp; vti tunnels had the same issue.

3)  ip address add 1.1.1.1/24 dev eth0
    ip tunnel add gre1 remote 2.2.146.84 local 1.1.1.1 key 1 mode gre dev eth0
    ip link set gre1 up

Any incoming gre packet with key = 1 were routed into gre1, no matter what
src/dst addresses are. Any remote ip address which has ip_tunnel_hash = 0 raised
the issue, 2.2.146.84 is just an example, there are more than 4 million of them.
Wildcard tunnel like
   ip tunnel add gre2 remote any local any key 1 mode gre dev eth0
wouldn't be ever matched if it was created before explicit tunnel like above.

All this stuff happened because while looking for a wildcard tunnel we didn't
check that matched tunnel is a wildcard one. Fixed.

Signed-off-by: Dmitry Popov &lt;ixaphire@qrator.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: fix false undo corner cases</title>
<updated>2014-07-08T04:40:48Z</updated>
<author>
<name>Yuchung Cheng</name>
<email>ycheng@google.com</email>
</author>
<published>2014-07-02T19:07:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6e08d5e3c8236e7484229e46fdf92006e1dd4c49'/>
<id>urn:sha1:6e08d5e3c8236e7484229e46fdf92006e1dd4c49</id>
<content type='text'>
The undo code assumes that, upon entering loss recovery, TCP
1) always retransmit something
2) the retransmission never fails locally (e.g., qdisc drop)

so undo_marker is set in tcp_enter_recovery() and undo_retrans is
incremented only when tcp_retransmit_skb() is successful.

When the assumption is broken because TCP's cwnd is too small to
retransmit or the retransmit fails locally. The next (DUP)ACK
would incorrectly revert the cwnd and the congestion state in
tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
may enter the recovery state. The sender repeatedly enter and
(incorrectly) exit recovery states if the retransmits continue to
fail locally while receiving (DUP)ACKs.

The fix is to initialize undo_retrans to -1 and start counting on
the first retransmission. Always increment undo_retrans even if the
retransmissions fail locally because they couldn't cause DSACKs to
undo the cwnd reduction.

Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>igmp: fix the problem when mc leave group</title>
<updated>2014-07-08T04:30:55Z</updated>
<author>
<name>dingtianhong</name>
<email>dingtianhong@huawei.com</email>
</author>
<published>2014-07-02T05:50:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=52ad353a5344f1f700c5b777175bdfa41d3cd65a'/>
<id>urn:sha1:52ad353a5344f1f700c5b777175bdfa41d3cd65a</id>
<content type='text'>
The problem was triggered by these steps:

1) create socket, bind and then setsockopt for add mc group.
   mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
   mreq.imr_interface.s_addr = inet_addr("192.168.1.2");
   setsockopt(sockfd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &amp;mreq, sizeof(mreq));

2) drop the mc group for this socket.
   mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
   mreq.imr_interface.s_addr = inet_addr("0.0.0.0");
   setsockopt(sockfd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &amp;mreq, sizeof(mreq));

3) and then drop the socket, I found the mc group was still used by the dev:

   netstat -g

   Interface       RefCnt Group
   --------------- ------ ---------------------
   eth2		   1	  255.0.0.37

Normally even though the IP_DROP_MEMBERSHIP return error, the mc group still need
to be released for the netdev when drop the socket, but this process was broken when
route default is NULL, the reason is that:

The ip_mc_leave_group() will choose the in_dev by the imr_interface.s_addr, if input addr
is NULL, the default route dev will be chosen, then the ifindex is got from the dev,
then polling the inet-&gt;mc_list and return -ENODEV, but if the default route dev is NULL,
the in_dev and ifIndex is both NULL, when polling the inet-&gt;mc_list, the mc group will be
released from the mc_list, but the dev didn't dec the refcnt for this mc group, so
when dropping the socket, the mc_list is NULL and the dev still keep this group.

v1-&gt;v2: According Hideaki's suggestion, we should align with IPv6 (RFC3493) and BSDs,
	so I add the checking for the in_dev before polling the mc_list, make sure when
	we remove the mc group, dec the refcnt to the real dev which was using the mc address.
	The problem would never happened again.

Signed-off-by: Ding Tianhong &lt;dingtianhong@huawei.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>ipv4: icmp: Fix pMTU handling for rare case</title>
<updated>2014-07-08T00:22:57Z</updated>
<author>
<name>Edward Allcutt</name>
<email>edward.allcutt@openmarket.com</email>
</author>
<published>2014-06-30T15:16:02Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=68b7107b62983f2cff0948292429d5f5999df096'/>
<id>urn:sha1:68b7107b62983f2cff0948292429d5f5999df096</id>
<content type='text'>
Some older router implementations still send Fragmentation Needed
errors with the Next-Hop MTU field set to zero. This is explicitly
described as an eventuality that hosts must deal with by the
standard (RFC 1191) since older standards specified that those
bits must be zero.

Linux had a generic (for all of IPv4) implementation of the algorithm
described in the RFC for searching a list of MTU plateaus for a good
value. Commit 46517008e116 ("ipv4: Kill ip_rt_frag_needed().")
removed this as part of the changes to remove the routing cache.
Subsequently any Fragmentation Needed packet with a zero Next-Hop
MTU has been discarded without being passed to the per-protocol
handlers or notifying userspace for raw sockets.

When there is a router which does not implement RFC 1191 on an
MTU limited path then this results in stalled connections since
large packets are discarded and the local protocols are not
notified so they never attempt to lower the pMTU.

One example I have seen is an OpenBSD router terminating IPSec
tunnels. It's worth pointing out that this case is distinct from
the BSD 4.2 bug which incorrectly calculated the Next-Hop MTU
since the commit in question dismissed that as a valid concern.

All of the per-protocols handlers implement the simple approach from
RFC 1191 of immediately falling back to the minimum value. Although
this is sub-optimal it is vastly preferable to connections hanging
indefinitely.

Remove the Next-Hop MTU != 0 check and allow such packets
to follow the normal path.

Fixes: 46517008e116 ("ipv4: Kill ip_rt_frag_needed().")
Signed-off-by: Edward Allcutt &lt;edward.allcutt@openmarket.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: Fix divide by zero when pushing during tcp-repair</title>
<updated>2014-07-03T01:21:03Z</updated>
<author>
<name>Christoph Paasch</name>
<email>christoph.paasch@uclouvain.be</email>
</author>
<published>2014-06-28T16:26:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5924f17a8a30c2ae18d034a86ee7581b34accef6'/>
<id>urn:sha1:5924f17a8a30c2ae18d034a86ee7581b34accef6</id>
<content type='text'>
When in repair-mode and TCP_RECV_QUEUE is set, we end up calling
tcp_push with mss_now being 0. If data is in the send-queue and
tcp_set_skb_tso_segs gets called, we crash because it will divide by
mss_now:

[  347.151939] divide error: 0000 [#1] SMP
[  347.152907] Modules linked in:
[  347.152907] CPU: 1 PID: 1123 Comm: packetdrill Not tainted 3.16.0-rc2 #4
[  347.152907] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  347.152907] task: f5b88540 ti: f3c82000 task.ti: f3c82000
[  347.152907] EIP: 0060:[&lt;c1601359&gt;] EFLAGS: 00210246 CPU: 1
[  347.152907] EIP is at tcp_set_skb_tso_segs+0x49/0xa0
[  347.152907] EAX: 00000b67 EBX: f5acd080 ECX: 00000000 EDX: 00000000
[  347.152907] ESI: f5a28f40 EDI: f3c88f00 EBP: f3c83d10 ESP: f3c83d00
[  347.152907]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  347.152907] CR0: 80050033 CR2: 083158b0 CR3: 35146000 CR4: 000006b0
[  347.152907] Stack:
[  347.152907]  c167f9d9 f5acd080 000005b4 00000002 f3c83d20 c16013e6 f3c88f00 f5acd080
[  347.152907]  f3c83da0 c1603b5a f3c83d38 c10a0188 00000000 00000000 f3c83d84 c10acc85
[  347.152907]  c1ad5ec0 00000000 00000000 c1ad679c 010003e0 00000000 00000000 f3c88fc8
[  347.152907] Call Trace:
[  347.152907]  [&lt;c167f9d9&gt;] ? apic_timer_interrupt+0x2d/0x34
[  347.152907]  [&lt;c16013e6&gt;] tcp_init_tso_segs+0x36/0x50
[  347.152907]  [&lt;c1603b5a&gt;] tcp_write_xmit+0x7a/0xbf0
[  347.152907]  [&lt;c10a0188&gt;] ? up+0x28/0x40
[  347.152907]  [&lt;c10acc85&gt;] ? console_unlock+0x295/0x480
[  347.152907]  [&lt;c10ad24f&gt;] ? vprintk_emit+0x1ef/0x4b0
[  347.152907]  [&lt;c1605716&gt;] __tcp_push_pending_frames+0x36/0xd0
[  347.152907]  [&lt;c15f4860&gt;] tcp_push+0xf0/0x120
[  347.152907]  [&lt;c15f7641&gt;] tcp_sendmsg+0xf1/0xbf0
[  347.152907]  [&lt;c116d920&gt;] ? kmem_cache_free+0xf0/0x120
[  347.152907]  [&lt;c106a682&gt;] ? __sigqueue_free+0x32/0x40
[  347.152907]  [&lt;c106a682&gt;] ? __sigqueue_free+0x32/0x40
[  347.152907]  [&lt;c114f0f0&gt;] ? do_wp_page+0x3e0/0x850
[  347.152907]  [&lt;c161c36a&gt;] inet_sendmsg+0x4a/0xb0
[  347.152907]  [&lt;c1150269&gt;] ? handle_mm_fault+0x709/0xfb0
[  347.152907]  [&lt;c15a006b&gt;] sock_aio_write+0xbb/0xd0
[  347.152907]  [&lt;c1180b79&gt;] do_sync_write+0x69/0xa0
[  347.152907]  [&lt;c1181023&gt;] vfs_write+0x123/0x160
[  347.152907]  [&lt;c1181d55&gt;] SyS_write+0x55/0xb0
[  347.152907]  [&lt;c167f0d8&gt;] sysenter_do_call+0x12/0x28

This can easily be reproduced with the following packetdrill-script (the
"magic" with netem, sk_pacing and limit_output_bytes is done to prevent
the kernel from pushing all segments, because hitting the limit without
doing this is not so easy with packetdrill):

0   socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0  setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0

+0  bind(3, ..., ...) = 0
+0  listen(3, 1) = 0

+0  &lt; S 0:0(0) win 32792 &lt;mss 1460&gt;
+0  &gt; S. 0:0(0) ack 1 &lt;mss 1460&gt;
+0.1  &lt; . 1:1(0) ack 1 win 65000

+0  accept(3, ..., ...) = 4

// This forces that not all segments of the snd-queue will be pushed
+0 `tc qdisc add dev tun0 root netem delay 10ms`
+0 `sysctl -w net.ipv4.tcp_limit_output_bytes=2`
+0 setsockopt(4, SOL_SOCKET, 47, [2], 4) = 0

+0 write(4,...,10000) = 10000
+0 write(4,...,10000) = 10000

// Set tcp-repair stuff, particularly TCP_RECV_QUEUE
+0 setsockopt(4, SOL_TCP, 19, [1], 4) = 0
+0 setsockopt(4, SOL_TCP, 20, [1], 4) = 0

// This now will make the write push the remaining segments
+0 setsockopt(4, SOL_SOCKET, 47, [20000], 4) = 0
+0 `sysctl -w net.ipv4.tcp_limit_output_bytes=130000`

// Now we will crash
+0 write(4,...,1000) = 1000

This happens since ec3423257508 (tcp: fix retransmission in repair
mode). Prior to that, the call to tcp_push was prevented by a check for
tp-&gt;repair.

The patch fixes it, by adding the new goto-label out_nopush. When exiting
tcp_sendmsg and a push is not required, which is the case for tp-&gt;repair,
we go to this label.

When repairing and calling send() with TCP_RECV_QUEUE, the data is
actually put in the receive-queue. So, no push is required because no
data has been added to the send-queue.

Cc: Andrew Vagin &lt;avagin@openvz.org&gt;
Cc: Pavel Emelyanov &lt;xemul@parallels.com&gt;
Fixes: ec3423257508 (tcp: fix retransmission in repair mode)
Signed-off-by: Christoph Paasch &lt;christoph.paasch@uclouvain.be&gt;
Acked-by: Andrew Vagin &lt;avagin@openvz.org&gt;
Acked-by: Pavel Emelyanov &lt;xemul@parallels.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix</title>
<updated>2014-07-01T06:40:58Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2014-06-30T08:26:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7f502361531e9eecb396cf99bdc9e9a59f7ebd7f'/>
<id>urn:sha1:7f502361531e9eecb396cf99bdc9e9a59f7ebd7f</id>
<content type='text'>
We have two different ways to handle changes to sk-&gt;sk_dst

First way (used by TCP) assumes socket lock is owned by caller, and use
no extra lock : __sk_dst_set() &amp; __sk_dst_reset()

Another way (used by UDP) uses sk_dst_lock because socket lock is not
always taken. Note that sk_dst_lock is not softirq safe.

These ways are not inter changeable for a given socket type.

ipv4_sk_update_pmtu(), added in linux-3.8, added a race, as it used
the socket lock as synchronization, but users might be UDP sockets.

Instead of converting sk_dst_lock to a softirq safe version, use xchg()
as we did for sk_rx_dst in commit e47eb5dfb296b ("udp: ipv4: do not use
sk_dst_lock from softirq context")

In a follow up patch, we probably can remove sk_dst_lock, as it is
only used in IPv6.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Steffen Klassert &lt;steffen.klassert@secunet.com&gt;
Fixes: 9cb3a50c5f63e ("ipv4: Invalidate the socket cached route on pmtu events if possible")
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
