<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/net/tipc, branch v4.3</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.3</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.3'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2015-10-22T02:13:48Z</updated>
<entry>
<title>tipc: conditionally expand buffer headroom over udp tunnel</title>
<updated>2015-10-22T02:13:48Z</updated>
<author>
<name>Jon Paul Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2015-10-19T15:43:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e53567948f82368247b4b1a63fcab4c76ef7d51c'/>
<id>urn:sha1:e53567948f82368247b4b1a63fcab4c76ef7d51c</id>
<content type='text'>
In commit d999297c3dbbe ("tipc: reduce locking scope during packet reception")
we altered the packet retransmission function. Since then, when
restransmitting packets, we create a clone of the original buffer
using __pskb_copy(skb, MIN_H_SIZE), where MIN_H_SIZE is the size of
the area we want to have copied, but also the smallest possible TIPC
packet size. The value of MIN_H_SIZE is 24.

Unfortunately, __pskb_copy() also has the effect that the headroom
of the cloned buffer takes the size MIN_H_SIZE. This is too small
for carrying the packet over the UDP tunnel bearer, which requires
a minimum headroom of 28 bytes. A change to just use pskb_copy()
lets the clone inherit the original headroom of 80 bytes, but also
assumes that the copied data area is of at least that size, something
that is not always the case. So that is not a viable solution.

We now fix this by adding a check for sufficient headroom in the
transmit function of udp_media.c, and expanding it when necessary.

Fixes: commit d999297c3dbbe ("tipc: reduce locking scope during packet reception")
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: allow non-linear first fragment buffer</title>
<updated>2015-10-22T02:08:24Z</updated>
<author>
<name>Jon Paul Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2015-10-19T15:33:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=45c8b7b175ceb2d542e0fe15247377bf3bce29ec'/>
<id>urn:sha1:45c8b7b175ceb2d542e0fe15247377bf3bce29ec</id>
<content type='text'>
The current code for message reassembly is erroneously assuming that
the the first arriving fragment buffer always is linear, and then goes
ahead resetting the fragment list of that buffer in anticipation of
more arriving fragments.

However, if the buffer already happens to be non-linear, we will
inadvertently drop the already attached fragment list, and later
on trig a BUG() in __pskb_pull_tail().

We see this happen when running fragmented TIPC multicast across UDP,
something made possible since
commit d0f91938bede ("tipc: add ip/udp media type")

We fix this by not resetting the fragment list when the buffer is non-
linear, and by initiatlizing our private fragment list tail pointer to
the tail of the existing fragment list.

Fixes: commit d0f91938bede ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: extend broadcast link window size</title>
<updated>2015-10-22T02:02:17Z</updated>
<author>
<name>Jon Paul Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2015-10-19T13:21:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=53387c4e22ac33d27a552b3d56bad932bd32531b'/>
<id>urn:sha1:53387c4e22ac33d27a552b3d56bad932bd32531b</id>
<content type='text'>
The default fix broadcast window size is currently set to 20 packets.
This is a very low value, set at a time when we were still testing on
10 Mb/s hubs, and a change to it is long overdue.

Commit 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure")
revealed a problem with this low value. For messages of importance LOW,
the backlog queue limit will be  calculated to 30 packets, while a
single, maximum sized message of 66000 bytes, carried across a 1500 MTU
network consists of 46 packets.

This leads to the following scenario (among others leading to the same
situation):

1: Msg 1 of 46 packets is sent. 20 packets go to the transmit queue, 26
   packets to the backlog queue.
2: Msg 2 of 46 packets is attempted sent, but rejected because there is
   no more space in the backlog queue at this level. The sender is added
   to the wakeup queue with a "pending packets chain size" number of 46.
3: Some packets in the transmit queue are acked and released. We try to
   wake up the sender, but the pending size of 46 is bigger than the LOW
   wakeup limit of 30, so this doesn't happen.
5: Subsequent acks releases all the remaining buffers. Each time we test
   for the wakeup criteria and find that 46 still is larger than 30,
   even after both the transmit and the backlog queues are empty.
6: The sender is never woken up and given a chance to send its message.
   He is stuck.

We could now loosen the wakeup criteria (used by link_prepare_wakeup())
to become equal to the send criteria (used by tipc_link_xmit()), i.e.,
by ignoring the "pending packets chain size" value altogether, or we can
just increase the queue limits so that the criteria can be satisfied
anyway. There are good reasons (potentially multiple waiting senders) to
not opt for the former solution, so we choose the latter one.

This commit fixes the problem by giving the broadcast link window a
default value of 50 packets. We also introduce a new minimum link
window size BCLINK_MIN_WIN of 32, which is enough to always avoid the
described situation. Finally, in order to not break any existing users
which may set the window explicitly, we enforce that the window is set
to the new minimum value in case the user is trying to set it to
anything lower.

Fixes: 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure")
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Reviewed-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: move fragment importance field to new header position</title>
<updated>2015-10-15T02:10:08Z</updated>
<author>
<name>Jon Paul Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2015-10-14T13:23:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dde4b5ae65de659b9ec64bafdde0430459fcb495'/>
<id>urn:sha1:dde4b5ae65de659b9ec64bafdde0430459fcb495</id>
<content type='text'>
In commit e3eea1eb47a ("tipc: clean up handling of message priorities")
we introduced a field in the packet header for keeping track of the
priority of fragments, since this value is not present in the specified
protocol header. Since the value so far only is used at the transmitting
end of the link, we have not yet officially defined it as part of the
protocol.

Unfortunately, the field we use for keeping this value, bits 13-15 in
in word 5, has turned out to be a poor choice; it is already used by the
broadcast protocol for carrying the 'network id' field of the sending
node. Since packet fragments also need to be transported across the
broadcast protocol, the risk of conflict is obvious, and we see this
happen when we use network identities larger than 2^13-1. This has
escaped our testing because we have so far only been using small network
id values.

We now move this field to bits 0-2 in word 9, a field that is guaranteed
to be unused by all involved protocols.

Fixes: e3eea1eb47a ("tipc: clean up handling of message priorities")
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: eliminate risk of stalled link synchronization</title>
<updated>2015-10-14T13:06:40Z</updated>
<author>
<name>Jon Paul Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2015-10-13T16:41:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0f8b8e28fb3241f9fd82ce13bac2b40c35e987e0'/>
<id>urn:sha1:0f8b8e28fb3241f9fd82ce13bac2b40c35e987e0</id>
<content type='text'>
In commit 6e498158a827 ("tipc: move link synch and failover to link aggregation level")
we introduced a new mechanism for performing link failover and
synchronization. We have now detected a bug in this mechanism.

During link synchronization we use the arrival of any packet on
the tunnel link to trig a check for whether it has reached the
synchronization point or not. This has turned out to be too
permissive, since it may cause an arriving non-last SYNCH packet to
end the synch state, just to see the next SYNCH packet initiate a
new synch state with a new, higher synch point. This is not fatal,
but should be avoided, because it may significantly extend the
synchronization period, while at the same time we are not allowed
to send NACKs if packets are lost. In the worst case, a low-traffic
user may see its traffic stall until a LINK_PROTOCOL state message
trigs the link to leave synchronization state.

At the same time, LINK_PROTOCOL packets which happen to have a (non-
valid) sequence number lower than the tunnel link's rcv_nxt value will
be consistently dropped, and will never be able to resolve the situation
described above.

We fix this by exempting LINK_PROTOCOL packets from the sequence number
check, as they should be. We also reduce (but don't completely
eliminate) the risk of entering multiple synchronization states by only
allowing the (logically) first SYNCH packet to initiate a synchronization
state. This works independently of actual packet arrival order.

Fixes: commit 6e498158a827 ("tipc: move link synch and failover to link aggregation level")

Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: reinitialize pointer after skb linearize</title>
<updated>2015-09-21T05:31:20Z</updated>
<author>
<name>Erik Hugne</name>
<email>erik.hugne@ericsson.com</email>
</author>
<published>2015-09-18T08:46:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4e3ae00100945d39e1f83b7c0179a114ccf55759'/>
<id>urn:sha1:4e3ae00100945d39e1f83b7c0179a114ccf55759</id>
<content type='text'>
The msg pointer into header may change after skb linearization.
We must reinitialize it after calling skb_linearize to prevent
operating on a freed or invalid pointer.

Signed-off-by: Erik Hugne &lt;erik.hugne@ericsson.com&gt;
Reported-by: Tamás Végh &lt;tamas.vegh@ericsson.com&gt;
Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: tipc: fix stall during bclink wakeup procedure</title>
<updated>2015-09-09T05:50:26Z</updated>
<author>
<name>Kolmakov Dmitriy</name>
<email>kolmakov.dmitriy@huawei.com</email>
</author>
<published>2015-09-07T09:05:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7845989cb4b3da1db903918c844fccb9817d34a0'/>
<id>urn:sha1:7845989cb4b3da1db903918c844fccb9817d34a0</id>
<content type='text'>
If an attempt to wake up users of broadcast link is made when there is
no enough place in send queue than it may hang up inside the
tipc_sk_rcv() function since the loop breaks only after the wake up
queue becomes empty. This can lead to complete CPU stall with the
following message generated by RCU:

INFO: rcu_sched self-detected stall on CPU { 0}  (t=2101 jiffies
					g=54225 c=54224 q=11465)
Task dump for CPU 0:
tpch            R  running task        0 39949  39948 0x0000000a
 ffffffff818536c0 ffff88181fa037a0 ffffffff8106a4be 0000000000000000
 ffffffff818536c0 ffff88181fa037c0 ffffffff8106d8a8 ffff88181fa03800
 0000000000000001 ffff88181fa037f0 ffffffff81094a50 ffff88181fa15680
Call Trace:
 &lt;IRQ&gt;  [&lt;ffffffff8106a4be&gt;] sched_show_task+0xae/0x120
 [&lt;ffffffff8106d8a8&gt;] dump_cpu_task+0x38/0x40
 [&lt;ffffffff81094a50&gt;] rcu_dump_cpu_stacks+0x90/0xd0
 [&lt;ffffffff81097c3b&gt;] rcu_check_callbacks+0x3eb/0x6e0
 [&lt;ffffffff8106e53f&gt;] ? account_system_time+0x7f/0x170
 [&lt;ffffffff81099e64&gt;] update_process_times+0x34/0x60
 [&lt;ffffffff810a84d1&gt;] tick_sched_handle.isra.18+0x31/0x40
 [&lt;ffffffff810a851c&gt;] tick_sched_timer+0x3c/0x70
 [&lt;ffffffff8109a43d&gt;] __run_hrtimer.isra.34+0x3d/0xc0
 [&lt;ffffffff8109aa95&gt;] hrtimer_interrupt+0xc5/0x1e0
 [&lt;ffffffff81030d52&gt;] ? native_smp_send_reschedule+0x42/0x60
 [&lt;ffffffff81032f04&gt;] local_apic_timer_interrupt+0x34/0x60
 [&lt;ffffffff810335bc&gt;] smp_apic_timer_interrupt+0x3c/0x60
 [&lt;ffffffff8165a3fb&gt;] apic_timer_interrupt+0x6b/0x70
 [&lt;ffffffff81659129&gt;] ? _raw_spin_unlock_irqrestore+0x9/0x10
 [&lt;ffffffff8107eb9f&gt;] __wake_up_sync_key+0x4f/0x60
 [&lt;ffffffffa313ddd1&gt;] tipc_write_space+0x31/0x40 [tipc]
 [&lt;ffffffffa313dadf&gt;] filter_rcv+0x31f/0x520 [tipc]
 [&lt;ffffffffa313d699&gt;] ? tipc_sk_lookup+0xc9/0x110 [tipc]
 [&lt;ffffffff81659259&gt;] ? _raw_spin_lock_bh+0x19/0x30
 [&lt;ffffffffa314122c&gt;] tipc_sk_rcv+0x2dc/0x3e0 [tipc]
 [&lt;ffffffffa312e7ff&gt;] tipc_bclink_wakeup_users+0x2f/0x40 [tipc]
 [&lt;ffffffffa313ce26&gt;] tipc_node_unlock+0x186/0x190 [tipc]
 [&lt;ffffffff81597c1c&gt;] ? kfree_skb+0x2c/0x40
 [&lt;ffffffffa313475c&gt;] tipc_rcv+0x2ac/0x8c0 [tipc]
 [&lt;ffffffffa312ff58&gt;] tipc_l2_rcv_msg+0x38/0x50 [tipc]
 [&lt;ffffffff815a76d3&gt;] __netif_receive_skb_core+0x5a3/0x950
 [&lt;ffffffff815a98d3&gt;] __netif_receive_skb+0x13/0x60
 [&lt;ffffffff815a993e&gt;] netif_receive_skb_internal+0x1e/0x90
 [&lt;ffffffff815aa138&gt;] napi_gro_receive+0x78/0xa0
 [&lt;ffffffffa07f93f4&gt;] tg3_poll_work+0xc54/0xf40 [tg3]
 [&lt;ffffffff81597c8c&gt;] ? consume_skb+0x2c/0x40
 [&lt;ffffffffa07f9721&gt;] tg3_poll_msix+0x41/0x160 [tg3]
 [&lt;ffffffff815ab0f2&gt;] net_rx_action+0xe2/0x290
 [&lt;ffffffff8104b92a&gt;] __do_softirq+0xda/0x1f0
 [&lt;ffffffff8104bc26&gt;] irq_exit+0x76/0xa0
 [&lt;ffffffff81004355&gt;] do_IRQ+0x55/0xf0
 [&lt;ffffffff8165a12b&gt;] common_interrupt+0x6b/0x6b
 &lt;EOI&gt;

The issue occurs only when tipc_sk_rcv() is used to wake up postponed
senders:

	tipc_bclink_wakeup_users()
		// wakeupq - is a queue which consists of special
		// 		 messages with SOCK_WAKEUP type.
		tipc_sk_rcv(wakeupq)
			...
			while (skb_queue_len(inputq)) {
				filter_rcv(skb)
					// Here the type of message is checked
					// and if it is SOCK_WAKEUP then
					// it tries to wake up a sender.
					tipc_write_space(sk)
						wake_up_interruptible_sync_poll()
			}

After the sender thread is woke up it can gather control and perform
an attempt to send a message. But if there is no enough place in send
queue it will call link_schedule_user() function which puts a message
of type SOCK_WAKEUP to the wakeup queue and put the sender to sleep.
Thus the size of the queue actually is not changed and the while()
loop never exits.

The approach I proposed is to wake up only senders for which there is
enough place in send queue so the described issue can't occur.
Moreover the same approach is already used to wake up senders on
unicast links.

I have got into the issue on our product code but to reproduce the
issue I changed a benchmark test application (from
tipcutils/demos/benchmark) to perform the following scenario:
	1. Run 64 instances of test application (nodes). It can be done
	   on the one physical machine.
	2. Each application connects to all other using TIPC sockets in
	   RDM mode.
	3. When setup is done all nodes start simultaneously send
	   broadcast messages.
	4. Everything hangs up.

The issue is reproducible only when a congestion on broadcast link
occurs. For example, when there are only 8 nodes it works fine since
congestion doesn't occur. Send queue limit is 40 in my case (I use a
critical importance level) and when 64 nodes send a message at the
same moment a congestion occurs every time.

Signed-off-by: Dmitry S Kolmakov &lt;kolmakov.dmitriy@huawei.com&gt;
Reviewed-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: fix stale link problem during synchronization</title>
<updated>2015-08-23T23:14:45Z</updated>
<author>
<name>Jon Paul Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2015-08-20T06:12:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2be80c2d87de789550982e74a11e9f9ff5940845'/>
<id>urn:sha1:2be80c2d87de789550982e74a11e9f9ff5940845</id>
<content type='text'>
Recent changes to the link synchronization means that we can now just
drop packets arriving on the synchronizing link before the synch point
is reached. This has lead to significant simplifications to the
implementation, but also turns out to have a flip side that we need
to consider.

Under unlucky circumstances, the two endpoints may end up
repeatedly dropping each other's packets, while immediately
asking for retransmission of the same packets, just to drop
them once more. This pattern will eventually be broken when
the synch point is reached on the other link, but before that,
the endpoints may have arrived at the retransmission limit
(stale counter) that indicates that the link should be broken.
We see this happen at rare occasions.

The fix for this is to not ask for retransmissions when a link is in
state LINK_SYNCHING. The fact that the link has reached this state
means that it has already received the first SYNCH packet, and that it
knows the synch point. Hence, it doesn't need any more packets until the
other link has reached the synch point, whereafter it can go ahead and
ask for the missing packets.

However, because of the reduced traffic on the synching link that
follows this change, it may now take longer to discover that the
synch point has been reached. We compensate for this by letting all
packets, on any of the links, trig a check for synchronization
termination. This is possible because the packets themselves don't
contain any information that is needed for discovering this condition.

Reviewed-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: interrupt link synchronization when a link goes down</title>
<updated>2015-08-23T23:14:45Z</updated>
<author>
<name>Jon Paul Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2015-08-20T06:12:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5ae2f8e6857968d6dddbd3879ed0a32b860e02d1'/>
<id>urn:sha1:5ae2f8e6857968d6dddbd3879ed0a32b860e02d1</id>
<content type='text'>
When we introduced the new link failover/synch mechanism
in commit 6e498158a827fd515b514842e9a06bdf0f75ab86
("tipc: move link synch and failover to link aggregation level"),
we missed the case when the non-tunnel link goes down during the link
synchronization period. In this case the tunnel link will remain in
state LINK_SYNCHING, something leading to unpredictable behavior when
the failover procedure is initiated.

In this commit, we ensure that the node and remaining link goes
back to regular communication state (SELF_UP_PEER_UP/LINK_ESTABLISHED)
when one of the parallel links goes down. We also ensure that we don't
re-enter synch mode if subsequent SYNCH packets arrive on the remaining
link.

Reviewed-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: eliminate risk of premature link setup during failover</title>
<updated>2015-08-23T23:14:45Z</updated>
<author>
<name>Jon Paul Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2015-08-20T06:12:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=17b2063077a7478e5fd3c34b04a059dbb8474638'/>
<id>urn:sha1:17b2063077a7478e5fd3c34b04a059dbb8474638</id>
<content type='text'>
When a link goes down, and there is still a working link towards its
destination node, a failover is initiated, and the failed link is not
allowed to re-establish until that procedure is finished. To ensure
this, the concerned link endpoints are set to state LINK_FAILINGOVER,
and the node endpoints to NODE_FAILINGOVER during the failover period.

However, if the link reset is due to a disabled bearer, the corres-
ponding link endpoint is deleted, and only the node endpoint knows
about the ongoing failover. Now, if the disabled bearer is re-enabled
during the failover period, the discovery mechanism may create a new
link endpoint that is ready to be established, despite that this is not
permitted. This situation may cause both the ongoing failover and any
subsequent link synchronization to fail.

In this commit, we ensure that a newly created link goes directly to
state LINK_FAILINGOVER if the corresponding node state is
NODE_FAILINGOVER. This eliminates the problem described above.

Furthermore, we tighten the criteria for which packets are allowed
to end a failover state in the function tipc_node_check_state().
By checking that the receiving link is up and running, instead of just
checking that it is not in failover mode, we eliminate the risk that
protocol packets from the re-created link may cause the failover to
be prematurely terminated.

Reviewed-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
