<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/net/tipc/msg.c, branch v5.4</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.4</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.4'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2019-10-02T15:02:05Z</updated>
<entry>
<title>tipc: fix unlimited bundling of small messages</title>
<updated>2019-10-02T15:02:05Z</updated>
<author>
<name>Tuong Lien</name>
<email>tuong.t.lien@dektech.com.au</email>
</author>
<published>2019-10-02T11:49:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e95584a889e1902fdf1ded9712e2c3c3083baf96'/>
<id>urn:sha1:e95584a889e1902fdf1ded9712e2c3c3083baf96</id>
<content type='text'>
We have identified a problem with the "oversubscription" policy in the
link transmission code.

When small messages are transmitted, and the sending link has reached
the transmit window limit, those messages will be bundled and put into
the link backlog queue. However, bundles of data messages are counted
at the 'CRITICAL' level, so that the counter for that level, instead of
the counter for the real, bundled message's level is the one being
increased.
Subsequent, to-be-bundled data messages at non-CRITICAL levels continue
to be tested against the unchanged counter for their own level, while
contributing to an unrestrained increase at the CRITICAL backlog level.

This leaves a gap in congestion control algorithm for small messages
that can result in starvation for other users or a "real" CRITICAL
user. Even that eventually can lead to buffer exhaustion &amp; link reset.

We fix this by keeping a 'target_bskb' buffer pointer at each levels,
then when bundling, we only bundle messages at the same importance
level only. This way, we know exactly how many slots a certain level
have occupied in the queue, so can manage level congestion accurately.

By bundling messages at the same level, we even have more benefits. Let
consider this:
- One socket sends 64-byte messages at the 'CRITICAL' level;
- Another sends 4096-byte messages at the 'LOW' level;

When a 64-byte message comes and is bundled the first time, we put the
overhead of message bundle to it (+ 40-byte header, data copy, etc.)
for later use, but the next message can be a 4096-byte one that cannot
be bundled to the previous one. This means the last bundle carries only
one payload message which is totally inefficient, as for the receiver
also! Later on, another 64-byte message comes, now we make a new bundle
and the same story repeats...

With the new bundling algorithm, this will not happen, the 64-byte
messages will be bundled together even when the 4096-byte message(s)
comes in between. However, if the 4096-byte messages are sent at the
same level i.e. 'CRITICAL', the bundling algorithm will again cause the
same overhead.

Also, the same will happen even with only one socket sending small
messages at a rate close to the link transmit's one, so that, when one
message is bundled, it's transmitted shortly. Then, another message
comes, a new bundle is created and so on...

We will solve this issue radically by another patch.

Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
Reported-by: Hoang Le &lt;hoang.h.le@dektech.com.au&gt;
Acked-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: Tuong Lien &lt;tuong.t.lien@dektech.com.au&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: fix changeover issues due to large packet</title>
<updated>2019-07-25T22:55:47Z</updated>
<author>
<name>Tuong Lien</name>
<email>tuong.t.lien@dektech.com.au</email>
</author>
<published>2019-07-24T01:56:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2320bcdae62887555701ea78a46b640ff6b63868'/>
<id>urn:sha1:2320bcdae62887555701ea78a46b640ff6b63868</id>
<content type='text'>
In conjunction with changing the interfaces' MTU (e.g. especially in
the case of a bonding) where the TIPC links are brought up and down
in a short time, a couple of issues were detected with the current link
changeover mechanism:

1) When one link is up but immediately forced down again, the failover
procedure will be carried out in order to failover all the messages in
the link's transmq queue onto the other working link. The link and node
state is also set to FAILINGOVER as part of the process. The message
will be transmited in form of a FAILOVER_MSG, so its size is plus of 40
bytes (= the message header size). There is no problem if the original
message size is not larger than the link's MTU - 40, and indeed this is
the max size of a normal payload messages. However, in the situation
above, because the link has just been up, the messages in the link's
transmq are almost SYNCH_MSGs which had been generated by the link
synching procedure, then their size might reach the max value already!
When the FAILOVER_MSG is built on the top of such a SYNCH_MSG, its size
will exceed the link's MTU. As a result, the messages are dropped
silently and the failover procedure will never end up, the link will
not be able to exit the FAILINGOVER state, so cannot be re-established.

2) The same scenario above can happen more easily in case the MTU of
the links is set differently or when changing. In that case, as long as
a large message in the failure link's transmq queue was built and
fragmented with its link's MTU &gt; the other link's one, the issue will
happen (there is no need of a link synching in advance).

3) The link synching procedure also faces with the same issue but since
the link synching is only started upon receipt of a SYNCH_MSG, dropping
the message will not result in a state deadlock, but it is not expected
as design.

The 1) &amp; 3) issues are resolved by the last commit that only a dummy
SYNCH_MSG (i.e. without data) is generated at the link synching, so the
size of a FAILOVER_MSG if any then will never exceed the link's MTU.

For the 2) issue, the only solution is trying to fragment the messages
in the failure link's transmq queue according to the working link's MTU
so they can be failovered then. A new function is made to accomplish
this, it will still be a TUNNEL PROTOCOL/FAILOVER MSG but if the
original message size is too large, it will be fragmented &amp; reassembled
at the receiving side.

Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Acked-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: Tuong Lien &lt;tuong.t.lien@dektech.com.au&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: buffer overflow handling in listener socket</title>
<updated>2018-09-29T18:24:22Z</updated>
<author>
<name>Tung Nguyen</name>
<email>tung.q.nguyen@dektech.com.au</email>
</author>
<published>2018-09-28T18:23:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6787927475e52f6933e3affce365dabb2aa2fadf'/>
<id>urn:sha1:6787927475e52f6933e3affce365dabb2aa2fadf</id>
<content type='text'>
Default socket receive buffer size for a listener socket is 2Mb. For
each arriving empty SYN, the linux kernel allocates a 768 bytes buffer.
This means that a listener socket can serve maximum 2700 simultaneous
empty connection setup requests before it hits a receive buffer
overflow, and much fewer if the SYN is carrying any significant
amount of data.

When this happens the setup request is rejected, and the client
receives an ECONNREFUSED error.

This commit mitigates this problem by letting the client socket try to
retransmit the SYN message multiple times when it sees it rejected with
the code TIPC_ERR_OVERLOAD. Retransmission is done at random intervals
in the range of [100 ms, setup_timeout / 4], as many times as there is
room for within the setup timeout limit.

Signed-off-by: Tung Nguyen &lt;tung.q.nguyen@dektech.com.au&gt;
Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: refactor function tipc_msg_reverse()</title>
<updated>2018-09-29T18:24:22Z</updated>
<author>
<name>Jon Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2018-09-28T18:23:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5cbdbd1a1f30a083aada44595ca42952fc31e866'/>
<id>urn:sha1:5cbdbd1a1f30a083aada44595ca42952fc31e866</id>
<content type='text'>
The function tipc_msg_reverse() is reversing the header of a message
while reusing the original buffer. We have seen at several occasions
that this may have unfortunate side effects when the buffer to be
reversed is a clone.

In one of the following commits we will again need to reverse cloned
buffers, so this is the right time to permanently eliminate this
problem. In this commit we let the said function always consume the
original buffer and replace it with a new one when applicable.

Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: eliminate buffer cloning in function tipc_msg_extract()</title>
<updated>2018-06-30T11:48:16Z</updated>
<author>
<name>Tung Nguyen</name>
<email>tung.q.nguyen@dektech.com.au</email>
</author>
<published>2018-06-28T20:25:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ef9be755697f1b841c2a219a05df1a72ccd6f471'/>
<id>urn:sha1:ef9be755697f1b841c2a219a05df1a72ccd6f471</id>
<content type='text'>
The function tipc_msg_extract() is using skb_clone() to clone inner
messages from a message bundle buffer. Although this method is safe,
it has an undesired effect that each buffer clone inherits the
true-size of the bundling buffer. As a result, the buffer clone
almost always ends up with being copied anyway by the message
validation function. This makes the cloning into a sub-optimization.

In this commit we take the consequence of this realization, and copy
each inner message to a separately allocated buffer up front in the
extraction function.

As a bonus we can now eliminate the two cases where we had to copy
re-routed packets that may potentially go out on the wire again.

Signed-off-by: Tung Nguyen &lt;tung.q.nguyen@dektech.com.au&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: obsolete TIPC_ZONE_SCOPE</title>
<updated>2018-03-17T21:11:46Z</updated>
<author>
<name>Jon Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2018-03-15T15:48:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=928df1880e24bcd47d6359ff86df24db3dfba3c3'/>
<id>urn:sha1:928df1880e24bcd47d6359ff86df24db3dfba3c3</id>
<content type='text'>
Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all
aspects handled the same way, both on the publishing node and on the
receiving nodes.

Despite previous ambitions to the contrary, this is never going to change,
so we take the conseqeunce of this and obsolete TIPC_ZONE_SCOPE and related
macros/functions. Whenever a user is doing a bind() or a sendmsg() attempt
using ZONE_SCOPE we translate this internally to CLUSTER_SCOPE, while we
remain compatible with users and remote nodes still using ZONE_SCOPE.

Furthermore, the non-formalized scope value 0 has always been permitted
for use during lookup, with the same meaning as ZONE_SCOPE/CLUSTER_SCOPE.
We now permit it even as binding scope, but for compatibility reasons we
choose to not change the value of TIPC_CLUSTER_SCOPE.

Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: fix skb truesize/datasize ratio control</title>
<updated>2018-02-08T20:30:40Z</updated>
<author>
<name>Hoang Le</name>
<email>hoang.h.le@dektek.com.au</email>
</author>
<published>2018-02-08T16:16:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=55b3280d1e471795c08dbbe17325720a843e104c'/>
<id>urn:sha1:55b3280d1e471795c08dbbe17325720a843e104c</id>
<content type='text'>
In commit d618d09a68e4 ("tipc: enforce valid ratio between skb truesize
and contents") we introduced a test for ensuring that the condition
truesize/datasize &lt;= 4 is true for a received buffer. Unfortunately this
test has two problems.

- Because of the integer arithmetics the test
  if (skb-&gt;truesize / buf_roundup_len(skb) &gt; 4) will miss all
  ratios [4 &lt; ratio &lt; 5], which was not the intention.
- The buffer returned by skb_copy() inherits skb-&gt;truesize of the
  original buffer, which doesn't help the situation at all.

In this commit, we change the ratio condition and replace skb_copy()
with a call to skb_copy_expand() to finally get this right.

Acked-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: fall back to smaller MTU if allocation of local send skb fails</title>
<updated>2017-12-01T20:21:25Z</updated>
<author>
<name>Jon Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2017-11-30T15:47:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4c94cc2d3d57a2e843ab10887f67faa82c2337f9'/>
<id>urn:sha1:4c94cc2d3d57a2e843ab10887f67faa82c2337f9</id>
<content type='text'>
When sending node local messages the code is using an 'mtu' of 66060
bytes to avoid unnecessary fragmentation. During situations of low
memory tipc_msg_build() may sometimes fail to allocate such large
buffers, resulting in unnecessary send failures. This can easily be
remedied by falling back to a smaller MTU, and then reassemble the
buffer chain as if the message were arriving from a remote node.

At the same time, we change the initial MTU setting of the broadcast
link to a lower value, so that large messages always are fragmented
into smaller buffers even when we run in single node mode. Apart from
obtaining the same advantage as for the 'fallback' solution above, this
turns out to give a significant performance improvement. This can
probably be explained with the __pskb_copy() operation performed on the
buffer for each recipient during reception. We found the optimal value
for this, considering the most relevant skb pool, to be 3744 bytes.

Acked-by: Ying Xue &lt;ying.xue@ericsson.com&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: enforce valid ratio between skb truesize and contents</title>
<updated>2017-11-16T01:49:00Z</updated>
<author>
<name>Jon Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2017-11-15T20:23:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d618d09a68e4eed7a435beb2e355250f6f40664a'/>
<id>urn:sha1:d618d09a68e4eed7a435beb2e355250f6f40664a</id>
<content type='text'>
The socket level flow control is based on the assumption that incoming
buffers meet the condition (skb-&gt;truesize / roundup(skb-&gt;len) &lt;= 4),
where the latter value is rounded off upwards to the nearest 1k number.
This does empirically hold true for the device drivers we know, but we
cannot trust that it will always be so, e.g., in a system with jumbo
frames and very small packets.

We now introduce a check for this condition at packet arrival, and if
we find it to be false, we copy the packet to a new, smaller buffer,
where the condition will be true. We expect this to affect only a small
fraction of all incoming packets, if at all.

Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tipc: refactor function filter_rcv()</title>
<updated>2017-10-13T15:46:00Z</updated>
<author>
<name>Jon Maloy</name>
<email>jon.maloy@ericsson.com</email>
</author>
<published>2017-10-13T09:04:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=64ac5f5977df5b276374fb2f051082129f5cdb22'/>
<id>urn:sha1:64ac5f5977df5b276374fb2f051082129f5cdb22</id>
<content type='text'>
In the following commits we will need to handle multiple incoming and
rejected/returned buffers in the function socket.c::filter_rcv().
As a preparation for this, we generalize the function by handling
buffer queues instead of individual buffers. We also introduce a
help function tipc_skb_reject(), and rename filter_rcv() to
tipc_sk_filter_rcv() in line with other functions in socket.c.

Signed-off-by: Jon Maloy &lt;jon.maloy@ericsson.com&gt;
Acked-by: Ying Xue &lt;ying.xue@windriver.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
