linux/kernel/futex.c, branch v3.14

futex: revert back to the explicit waiter counting code

2014-03-21T05:11:17Z

Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid taking the hb->lock if there's nothing to wake up") causes java threads getting stuck on futexes when runing specjbb on a power7 numa box. The cause appears to be that the powerpc spinlocks aren't using the same ticket lock model that we use on x86 (and other) architectures, which in turn result in the "spin_is_locked()" test in hb_waiters_pending() occasionally reporting an unlocked spinlock even when there are pending waiters. So this reinstates Davidlohr Bueso's original explicit waiter counting code, which I had convinced Davidlohr to drop in favor of figuring out the pending waiters by just using the existing state of the spinlock and the wait queue. Reported-and-tested-by: Srikar Dronamraju Original-code-by: Davidlohr Bueso Signed-off-by: Linus Torvalds

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-01-20T18:42:08Z

Pull scheduler changes from Ingo Molnar: - Add the initial implementation of SCHED_DEADLINE support: a real-time scheduling policy where tasks that meet their deadlines and periodically execute their instances in less than their runtime quota see real-time scheduling and won't miss any of their deadlines. Tasks that go over their quota get delayed (Available to privileged users for now) - Clean up and fix preempt_enable_no_resched() abuse all around the tree - Do sched_clock() performance optimizations on x86 and elsewhere - Fix and improve auto-NUMA balancing - Fix and clean up the idle loop - Apply various cleanups and fixes * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) sched: Fix __sched_setscheduler() nice test sched: Move SCHED_RESET_ON_FORK into attr::sched_flags sched: Fix up attr::sched_priority warning sched: Fix up scheduler syscall LTP fails sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls sched/core: Fix htmldocs warnings sched/deadline: No need to check p if dl_se is valid sched/deadline: Remove unused variables sched/deadline: Fix sparse static warnings m68k: Fix build warning in mac_via.h sched, thermal: Clean up preempt_enable_no_resched() abuse sched, net: Fixup busy_loop_us_clock() sched, net: Clean up preempt_enable_no_resched() abuse sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding sched/preempt, locking: Rework local_bh_{dis,en}able() sched/clock, x86: Avoid a runtime condition in native_sched_clock() sched/clock: Fix up clear_sched_clock_stable() sched/clock, x86: Use a static_key for sched_clock_stable sched/clock: Remove local_irq_disable() from the clocks sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs ...

futexes: Fix futex_hashsize initialization

2014-01-16T14:14:32Z

"futexes: Increase hash table size for better performance" introduces a new alloc_large_system_hash() call. alloc_large_system_hash() however may allocate less memory than requested, e.g. limited by MAX_ORDER. Hence pass a pointer to alloc_large_system_hash() which will contain the hash shift when the function returns. Afterwards correctly set futex_hashsize. Fixes a crash on s390 where the requested allocation size was 4MB but only 1MB was allocated. Signed-off-by: Heiko Carstens Cc: Darren Hart Cc: Peter Zijlstra Cc: Paul E. McKenney Cc: Waiman Long Cc: Jason Low Cc: Davidlohr Bueso Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris Signed-off-by: Ingo Molnar

rtmutex: Turn the plist into an rb-tree

2014-01-13T12:41:50Z

Turn the pi-chains from plist to rb-tree, in the rt_mutex code, and provide a proper comparison function for -deadline and -priority tasks. This is done mainly because: - classical prio field of the plist is just an int, which might not be enough for representing a deadline; - manipulating such a list would become O(nr_deadline_tasks), which might be to much, as the number of -deadline task increases. Therefore, an rb-tree is used, and tasks are queued in it according to the following logic: - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the one with the higher (lower, actually!) prio wins; - among a -priority and a -deadline task, the latter always wins; - among two -deadline tasks, the one with the earliest deadline wins. Queueing and dequeueing functions are changed accordingly, for both the list of a task's pi-waiters and the list of tasks blocked on a pi-lock. Signed-off-by: Peter Zijlstra Signed-off-by: Dario Faggioli Signed-off-by: Juri Lelli Signed-off-again-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar

futexes: Avoid taking the hb->lock if there's nothing to wake up

2014-01-13T10:45:21Z

In futex_wake() there is clearly no point in taking the hb->lock if we know beforehand that there are no tasks to be woken. While the hash bucket's plist head is a cheap way of knowing this, we cannot rely 100% on it as there is a racy window between the futex_wait call and when the task is actually added to the plist. To this end, we couple it with the spinlock check as tasks trying to enter the critical region are most likely potential waiters that will be added to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters. Furthermore, the futex ordering guarantees are preserved, ensuring that waiters either observe the changed user space value before blocking or is woken by a concurrent waker. For wakers, this is done by relying on the barriers in get_futex_key_refs() -- for archs that do not have implicit mb in atomic_inc(), we explicitly add them through a new futex_get_mm function. For waiters we rely on the fact that spin_lock calls already update the head counter, so spinners are visible even if the lock hasn't been acquired yet. For more details please refer to the updated comments in the code and related discussion: https://lkml.org/lkml/2013/11/26/556 Special thanks to tglx for careful review and feedback. Suggested-by: Linus Torvalds Reviewed-by: Darren Hart Reviewed-by: Thomas Gleixner Reviewed-by: Peter Zijlstra Signed-off-by: Davidlohr Bueso Cc: Paul E. McKenney Cc: Mike Galbraith Cc: Jeff Mahoney Cc: Scott Norton Cc: Tom Vaden Cc: Aswin Chandramouleeswaran Cc: Waiman Long Cc: Jason Low Cc: Andrew Morton Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar

futexes: Document multiprocessor ordering guarantees

2014-01-13T10:45:19Z

That's essential, if you want to hack on futexes. Reviewed-by: Darren Hart Reviewed-by: Peter Zijlstra Reviewed-by: Paul E. McKenney Signed-off-by: Thomas Gleixner Signed-off-by: Davidlohr Bueso Cc: Mike Galbraith Cc: Jeff Mahoney Cc: Linus Torvalds Cc: Randy Dunlap Cc: Scott Norton Cc: Tom Vaden Cc: Aswin Chandramouleeswaran Cc: Waiman Long Cc: Jason Low Cc: Andrew Morton Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar

futexes: Increase hash table size for better performance

2014-01-13T10:45:18Z

Currently, the futex global hash table suffers from its fixed, smallish (for today's standards) size of 256 entries, as well as its lack of NUMA awareness. Large systems, using many futexes, can be prone to high amounts of collisions; where these futexes hash to the same bucket and lead to extra contention on the same hb->lock. Furthermore, cacheline bouncing is a reality when we have multiple hb->locks residing on the same cacheline and different futexes hash to adjacent buckets. This patch keeps the current static size of 16 entries for small systems, or otherwise, 256 * ncpus (or larger as we need to round the number to a power of 2). Note that this number of CPUs accounts for all CPUs that can ever be available in the system, taking into consideration things like hotpluging. While we do impose extra overhead at bootup by making the hash table larger, this is a one time thing, and does not shadow the benefits of this patch. Furthermore, as suggested by tglx, by cache aligning the hash buckets we can avoid access across cacheline boundaries and also avoid massive cache line bouncing if multiple cpus are hammering away at different hash buckets which happen to reside in the same cache line. Also, similar to other core kernel components (pid, dcache, tcp), by using alloc_large_system_hash() we benefit from its NUMA awareness and thus the table is distributed among the nodes instead of in a single one. For a custom microbenchmark that pounds on the uaddr hashing -- making the wait path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of futexes, we can see the following benefits on a 80-core, 8-socket 1Tb server: +---------+--------------------+------------------------+-----------------------+-------------------------------+ | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) | +---------+--------------------+------------------------+-----------------------+-------------------------------+ | 512 | 32426 | 50531 (+55.8%) | 255274 (+687.2%) | 292553 (+802.2%) | | 256 | 65360 | 99588 (+52.3%) | 443563 (+578.6%) | 508088 (+677.3%) | | 128 | 125635 | 200075 (+59.2%) | 742613 (+491.1%) | 835452 (+564.9%) | | 80 | 193559 | 323425 (+67.1%) | 1028147 (+431.1%) | 1130304 (+483.9%) | | 64 | 247667 | 443740 (+79.1%) | 997300 (+302.6%) | 1145494 (+362.5%) | | 32 | 628412 | 721401 (+14.7%) | 965996 (+53.7%) | 1122115 (+78.5%) | +---------+--------------------+------------------------+-----------------------+-------------------------------+ Reviewed-by: Darren Hart Reviewed-by: Peter Zijlstra Reviewed-by: Paul E. McKenney Reviewed-by: Waiman Long Reviewed-and-tested-by: Jason Low Reviewed-by: Thomas Gleixner Signed-off-by: Davidlohr Bueso Cc: Mike Galbraith Cc: Jeff Mahoney Cc: Linus Torvalds Cc: Scott Norton Cc: Tom Vaden Cc: Aswin Chandramouleeswaran Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar

futexes: Clean up various details

2014-01-13T10:45:17Z

- Remove unnecessary head variables. - Delete unused parameter in queue_unlock(). Reviewed-by: Darren Hart Reviewed-by: Peter Zijlstra Reviewed-by: Paul E. McKenney Reviewed-by: Thomas Gleixner Signed-off-by: Jason Low Signed-off-by: Davidlohr Bueso Cc: Mike Galbraith Cc: Jeff Mahoney Cc: Linus Torvalds Cc: Scott Norton Cc: Tom Vaden Cc: Aswin Chandramouleeswaran Cc: Waiman Long Cc: Andrew Morton Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar

futex: move user address verification up to common code

2013-12-12T17:53:51Z

When debugging the read-only hugepage case, I was confused by the fact that get_futex_key() did an access_ok() only for the non-shared futex case, since the user address checking really isn't in any way specific to the private key handling. Now, it turns out that the shared key handling does effectively do the equivalent checks inside get_user_pages_fast() (it doesn't actually check the address range on x86, but does check the page protections for being a user page). So it wasn't actually a bug, but the fact that we treat the address differently for private and shared futexes threw me for a loop. Just move the check up, so that it gets done for both cases. Also, use the 'rw' parameter for the type, even if it doesn't actually matter any more (it's a historical artifact of the old racy i386 "page faults from kernel space don't check write protections"). Cc: Thomas Gleixner Signed-off-by: Linus Torvalds

futex: fix handling of read-only-mapped hugepages

2013-12-12T17:38:42Z

The hugepage code had the exact same bug that regular pages had in commit 7485d0d3758e ("futexes: Remove rw parameter from get_futex_key()"). The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix regression with read only mappings"), but the transparent hugepage case (added in a5b338f2b0b1: "thp: update futex compound knowledge") case remained broken. Found by Dave Jones and his trinity tool. Reported-and-tested-by: Dave Jones Cc: stable@kernel.org # v2.6.38+ Acked-by: Thomas Gleixner Cc: Mel Gorman Cc: Darren Hart Cc: Andrea Arcangeli Cc: Oleg Nesterov Signed-off-by: Linus Torvalds