linux/net/tipc/socket.c, branch v4.11

tipc: close the connection if protocol messages contain errors

2017-04-28T16:20:42Z

When a socket is shutting down, we notify the peer node about the connection termination by reusing an incoming message if possible. If the last received message was a connection acknowledgment message, we reverse this message and set the error code to TIPC_ERR_NO_PORT and send it to peer. In tipc_sk_proto_rcv(), we never check for message errors while processing the connection acknowledgment or probe messages. Thus this message performs the usual flow control accounting and leaves the session hanging. In this commit, we terminate the connection when we receive such error messages. Signed-off-by: Parthasarathy Bhuvaragan Reviewed-by: Jon Maloy Signed-off-by: David S. Miller

tipc: improve error validations for sockets in CONNECTING state

2017-04-28T16:20:42Z

Until now, the checks for sockets in CONNECTING state was based on the assumption that the incoming message was always from the peer's accepted data socket. However an application using a non-blocking socket sends an implicit connect, this socket which is in CONNECTING state can receive error messages from the peer's listening socket. As we discard these messages, the application socket hangs as there due to inactivity. In addition to this, there are other places where we process errors but do not notify the user. In this commit, we process such incoming error messages and notify our users about them using sk_state_change(). Signed-off-by: Parthasarathy Bhuvaragan Reviewed-by: Jon Maloy Signed-off-by: David S. Miller

tipc: Fix missing connection request handling

2017-04-28T16:20:42Z

In filter_connect, we use waitqueue_active() to check for any connections to wakeup. But waitqueue_active() is missing memory barriers while accessing the critical sections, leading to inconsistent results. In this commit, we replace this with an SMP safe wq_has_sleeper() using the generic socket callback sk_data_ready(). Signed-off-by: Parthasarathy Bhuvaragan Reviewed-by: Jon Maloy Signed-off-by: David S. Miller

tipc: fix socket flow control accounting error at tipc_recv_stream

2017-04-25T15:45:38Z

Until now in tipc_recv_stream(), we update the received unacknowledged bytes based on a stack variable and not based on the actual message size. If the user buffer passed at tipc_recv_stream() is smaller than the received skb, the size variable in stack differs from the actual message size in the skb. This leads to a flow control accounting error causing permanent congestion. In this commit, we fix this accounting error by always using the size of the incoming message. Fixes: 10724cc7bb78 ("tipc: redesign connection-level flow control") Signed-off-by: Parthasarathy Bhuvaragan Reviewed-by: Jon Maloy Signed-off-by: David S. Miller

tipc: fix socket flow control accounting error at tipc_send_stream

2017-04-25T15:45:37Z

Until now in tipc_send_stream(), we return -1 when the socket encounters link congestion even if the socket had successfully sent partial data. This is incorrect as the application resends the same the partial data leading to data corruption at receiver's end. In this commit, we return the partially sent bytes as the return value at link congestion. Fixes: 10724cc7bb78 ("tipc: redesign connection-level flow control") Signed-off-by: Parthasarathy Bhuvaragan Reviewed-by: Jon Maloy Signed-off-by: David S. Miller

net: Work around lockdep limitation in sockets that use sockets

2017-03-10T02:23:27Z

Lockdep issues a circular dependency warning when AFS issues an operation through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem. The theory lockdep comes up with is as follows: (1) If the pagefault handler decides it needs to read pages from AFS, it calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but creating a call requires the socket lock: mmap_sem must be taken before sk_lock-AF_RXRPC (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind() binds the underlying UDP socket whilst holding its socket lock. inet_bind() takes its own socket lock: sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET (3) Reading from a TCP socket into a userspace buffer might cause a fault and thus cause the kernel to take the mmap_sem, but the TCP socket is locked whilst doing this: sk_lock-AF_INET must be taken before mmap_sem However, lockdep's theory is wrong in this instance because it deals only with lock classes and not individual locks. The AF_INET lock in (2) isn't really equivalent to the AF_INET lock in (3) as the former deals with a socket entirely internal to the kernel that never sees userspace. This is a limitation in the design of lockdep. Fix the general case by: (1) Double up all the locking keys used in sockets so that one set are used if the socket is created by userspace and the other set is used if the socket is created by the kernel. (2) Store the kern parameter passed to sk_alloc() in a variable in the sock struct (sk_kern_sock). This informs sock_lock_init(), sock_init_data() and sk_clone_lock() as to the lock keys to be used. Note that the child created by sk_clone_lock() inherits the parent's kern setting. (3) Add a 'kern' parameter to ->accept() that is analogous to the one passed in to ->create() that distinguishes whether kernel_accept() or sys_accept4() was the caller and can be passed to sk_alloc(). Note that a lot of accept functions merely dequeue an already allocated socket. I haven't touched these as the new socket already exists before we get the parameter. Note also that there are a couple of places where I've made the accepted socket unconditionally kernel-based: irda_accept() rds_rcp_accept_one() tcp_accept_from_sock() because they follow a sock_create_kern() and accept off of that. Whilst creating this, I noticed that lustre and ocfs don't create sockets through sock_create_kern() and thus they aren't marked as for-kernel, though they appear to be internal. I wonder if these should do that so that they use the new set of lock keys. Signed-off-by: David Howells Signed-off-by: David S. Miller

sched/headers: Prepare to move signal wakeup & sigpending methods from into

2017-03-02T07:42:32Z

Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar

tipc: Fix tipc_sk_reinit race conditions

2017-02-17T17:28:35Z

There are two problems with the function tipc_sk_reinit. Firstly it's doing a manual walk over an rhashtable. This is broken as an rhashtable can be resized and if you manually walk over it during a resize then you may miss entries. Secondly it's missing memory barriers as previously the code used spinlocks which provide the barriers implicitly. This patch fixes both problems. Fixes: 07f6c4bc048a ("tipc: convert tipc reference table to...") Signed-off-by: Herbert Xu Acked-by: Ying Xue Signed-off-by: David S. Miller

tipc: uninitialized return code in tipc_setsockopt()

2017-01-25T17:41:34Z

We shuffled some code around and added some new case statements here and now "res" isn't initialized on all paths. Fixes: 01fd12bb189a ("tipc: make replicast a user selectable option") Signed-off-by: Dan Carpenter Signed-off-by: David S. Miller

tipc: make replicast a user selectable option

2017-01-20T17:10:17Z

If the bearer carrying multicast messages supports broadcast, those messages will be sent to all cluster nodes, irrespective of whether these nodes host any actual destinations socket or not. This is clearly wasteful if the cluster is large and there are only a few real destinations for the message being sent. In this commit we extend the eligibility of the newly introduced "replicast" transmit option. We now make it possible for a user to select which method he wants to be used, either as a mandatory setting via setsockopt(), or as a relative setting where we let the broadcast layer decide which method to use based on the ratio between cluster size and the message's actual number of destination nodes. In the latter case, a sending socket must stick to a previously selected method until it enters an idle period of at least 5 seconds. This eliminates the risk of message reordering caused by method change, i.e., when changes to cluster size or number of destinations would otherwise mandate a new method to be used. Reviewed-by: Parthasarathy Bhuvaragan Acked-by: Ying Xue Signed-off-by: Jon Maloy Signed-off-by: David S. Miller