|
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
|
|
This patch removes the call to unblank() from printk, and avoids calling
unblank at irq time _unless_ oops_in_progress is 1. I also export
oops_in_progress so drivers that care, like radeonfb, can test it and know
what to do. I audited call sites of unblank_screen(), console_unblank(),
etc... and I _hope_ I got them all. The patch includes a small patch to
the s390 bust_spinlocks code that sets oops_in_progress back to 0 _after_
unblanking, for example.
I added a few might_sleep() to help us catch possible remaining callers.
I'll soon write a document explaining fbdev locking. The current situation
after this patch is that:
- All callbacks have console_semaphore held (fbdev's are fully
  serialised).
- Everything is called in schedule'able context, except the cfb_*
  rendering operations and cursor operations, with the special case of
  unblank, which can be called at any time when "oops_in_progress" is true. A
  driver that needs to sleep in its unblank implementation is welcome to
  test that variable and use a fallback path (or just do nothing if it's
  not simple).
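A driver's unblank hook might follow the rule above roughly as sketched
here. This is a user-space model, not kernel code: oops_in_progress is a
plain int standing in for the exported kernel variable, and
example_unblank(), example_wake_hw_slow() and the two state flags are
hypothetical names invented for the illustration.

```c
#include <stdbool.h>

/* Stand-in for the kernel's exported flag (assumption: modelled as an int). */
static int oops_in_progress;

/* Hypothetical driver state, for illustration only. */
static bool hw_awake;
static bool slept;

/* Hypothetical slow path: in a real driver this could sleep. */
static void example_wake_hw_slow(void)
{
    slept = true;      /* models a call that may schedule */
    hw_awake = true;
}

/* Returns true if the screen is unblanked on return. */
static bool example_unblank(void)
{
    if (oops_in_progress) {
        /*
         * We may be in atomic context here, so sleeping is not allowed.
         * Do nothing (or use a non-sleeping fallback) instead.
         */
        return hw_awake;
    }
    example_wake_hw_slow();
    return true;
}
```

The choice made here is the simplest one the text allows: skip the
operation entirely during an oops rather than attempt a partial wakeup.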
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Quiet a warning when compiling without CONFIG_SMP
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Due to the patent situation at least in the USA, the exports of
kernel/rcupdate.c should be EXPORT_SYMBOL_GPL.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch against 12-rc1 adds seccomp to the ppc64 arch. I tested it
successfully with the seccomp_test. I didn't bother to change the syscall
exit not to check for TIF_SECCOMP, in theory that bit could be optimized
but it's an optimization in the slow path, and current code is a bit
simpler. I also verified it still compiles and works fine on x86 and
x86-64.
Instead of the TIF_32BIT redefine, if you want to change x86-64 to use
TIF_32BIT too (instead of TIF_IA32), let me know.
Signed-off-by: Andrea Arcangeli <andrea@cpushare.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
kernel/cpuset.c:1428:41: warning: non-ANSI function declaration
Signed-off-by: Randy Dunlap <rddunlap@osdl.org>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The attached patch replaces backing_dev_info::memory_backed with a capabilities
bitmap. The capabilities available include:
(*) BDI_CAP_NO_ACCT_DIRTY
Set if the pages associated with this backing device should not be
tracked by the dirty page accounting.
(*) BDI_CAP_NO_WRITEBACK
Set if dirty pages associated with this backing device should not have
writepage() or writepages() invoked upon them to clean them.
(*) Capability markers that indicate what a backing device is capable of
with regard to memory mapping facilities. These flags indicate whether a
device can be mapped directly, whether it can be copied for a mapping,
and whether direct mappings can be read, written and/or executed. This
information is primarily aimed at improving no-MMU private mapping
support.
The patch also provides convenience functions for determining the dirty-page
capabilities available on backing devices directly or on the backing devices
associated with a mapping. These are provided to keep line length down when
checking for the capabilities.
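As a rough illustration of the capability bitmap and the convenience
functions, here is a user-space sketch. The flag names
BDI_CAP_NO_ACCT_DIRTY and BDI_CAP_NO_WRITEBACK come from the description
above; the bit values, the cut-down structures and the helper names are
assumptions made for this example and may not match the patch.

```c
#include <stdbool.h>

/* Bit values are illustrative assumptions, not taken from the patch. */
#define BDI_CAP_NO_ACCT_DIRTY 0x01u /* pages not tracked by dirty accounting */
#define BDI_CAP_NO_WRITEBACK  0x02u /* no writepage()/writepages() calls */

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct backing_dev_info {
    unsigned int capabilities;
};

struct address_space {
    struct backing_dev_info *backing_dev_info;
};

/* Helper names below are assumptions for this sketch. */
static inline bool bdi_cap_account_dirty(const struct backing_dev_info *bdi)
{
    return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
}

static inline bool bdi_cap_writeback_dirty(const struct backing_dev_info *bdi)
{
    return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
}

/* Same questions, asked of the backing device behind a mapping. */
static inline bool mapping_cap_account_dirty(const struct address_space *m)
{
    return bdi_cap_account_dirty(m->backing_dev_info);
}

static inline bool mapping_cap_writeback_dirty(const struct address_space *m)
{
    return bdi_cap_writeback_dirty(m->backing_dev_info);
}
```

A memory-backed device in this model would set both flags, so both helpers
return false for it, while a device with no flags set is accounted and
written back as usual.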
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
Call get_futex_value_locked in futex_wait with futex hash bucket locked and
only enqueue the futex if futex has the expected value. Simplify
futex_requeue.
Signed-off-by: Jakub Jelinek <jakub@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The posix cpu timers introduced code that will not work with an arbitrary
type for cputime_t. In particular the division of two cputime_t values
broke the s390 build because cputime_t is defined as an unsigned long long.
The first problem is the division of a cputime_t value by a number of
threads. That is a cputime_t divided by an integer. The patch adds
another macro cputime_div to the cputime macro regime which implements this
type of division and replaces all occurrences of a cputime / nthread in the
posix cpu timer code.
Next problem is bump_cpu_timer. This function is severely broken:
1) In the body of the first if statement a timer->it.cpu.incr.sched is
used as the second argument of do_div. do_div expects an unsigned long
as "base" parameter but timer->it.cpu.incr.sched is an unsigned long
long. If the timer increment ever happens to be >= 2^32 the result is
wrong and if the lower 32 bits are zero this even crashes with a fixed
point divide exception.
2) The cputime_le(now.cpu, timer->it.cpu.expires.cpu) in the else if
condition is wrong. The cputime_le() reads as "now.cpu <=
timer->it.cpu.expires.cpu" and the subsequent cputime_ge() reads as
"now.cpu >= timer->it.cpu.expires.cpu". That means that the two values
need to be equal for the body of the second if to have any effect.
The first cputime_le should be a cputime_ge.
3) timer->it.cpu.expires.cpu and delta in the else part of the if are of
type cputime_t. A division of two cputime_t values is undefined (think
of cputime_t as e.g. a struct timespec, that just doesn't work). We
could add a primitive for this type of division but we'd end up with a
64 bit division or something even more complicated.
The solution for bump_cpu_timer is to use the "slow" division algorithm
that does shifts and subtracts. That adds yet another cputime macro,
cputime_halve to do the right shift of a cputime value.
The next problem is in arm_timer. The UPDATE_CLOCK macro does the wrong
thing for it_prof_expires and it_virt_expires. Expanded the macro and
added the cputime magic to it_prof/it_virt.
The remaining problems are rather simple, timespec_to_jiffies instead of
timespec_to_cputime and several cases where cputime_eq with cputime_zero
needs to be used instead of "== 0".
What still worries me a bit is the use of "timer->it.cpu.incr.sched == 0" as
the check for whether the timer is armed at all. It should work but it's not
really clean.
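The "slow" shift-and-subtract division mentioned above can be sketched in
user space as follows. This is not the code from the patch: cputime_t is
assumed to be an unsigned long long here, the cputime_* macros are minimal
stand-ins, and cputime_slow_div() is a hypothetical helper that only uses
add, subtract, compare and halve.

```c
/* Assumption: cputime_t modelled as a 64-bit scalar for this sketch. */
typedef unsigned long long cputime_t;

/* Minimal stand-ins for the cputime macro family. */
#define cputime_add(a, b) ((a) + (b))
#define cputime_sub(a, b) ((a) - (b))
#define cputime_ge(a, b)  ((a) >= (b))
#define cputime_halve(a)  ((a) >> 1)

/*
 * Hypothetical helper: returns floor(n / d) for d != 0 without using a
 * division instruction on cputime_t values.
 */
static unsigned long long cputime_slow_div(cputime_t n, cputime_t d)
{
    unsigned long long quotient = 0;
    int shift = 0;

    /*
     * Double d while 2*d still fits into n. The test is written as
     * "n - d >= d" so that 2*d is never computed before we know it
     * cannot overflow.
     */
    while (cputime_ge(n, d) && cputime_ge(cputime_sub(n, d), d)) {
        d = cputime_add(d, d);
        shift++;
    }

    /* Walk back down, subtracting each power-of-two multiple that fits. */
    for (; shift >= 0; shift--) {
        if (cputime_ge(n, d)) {
            n = cputime_sub(n, d);
            quotient += 1ULL << shift;
        }
        d = cputime_halve(d);
    }
    return quotient;
}
```

Because d is only ever halved after having been doubled, the halving is
exact and the loop ends with d back at its original value or below it.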
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes the problem of POSIX timers returning too early due to not
accounting for the time starting mid-jiffy.
Signed-off-by: George Anzinger <george@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The cpuset code to update mems_generation could (in theory) deadlock on
cpuset_sem if it needed to allocate some memory while creating (mkdir) or
removing (rmdir) a cpuset, and so already held cpuset_sem. Some other process
would have to mess with this task's cpuset memory placement at the same
time.
We avoid this possible deadlock by always updating mems_generation after we
grab cpuset_sem on such operations, before we risk any operations that
might require memory allocation.
Thanks to Jack Steiner <steiner@sgi.com> for noticing this.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Found by sparse... since we are passing kernel param to a syscall handler,
we need to do the set_fs() wrappers.
Signed-off-by: Randolph Chung <tausq@debian.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The register_kprobe() routine was calling spin_unlock_irqrestore() wrongly.
This patch removes the unwanted spin_unlock_irqrestore() call in the
register_kprobe() routine.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
CON_BOOT is like early printk in that it allows for output really early on.
It's better than early printk because it unregisters automatically when a
real console is initialised. So if you don't get consoles registering in
console_init, there isn't a huge delay between the boot console
unregistering and the real console starting. This is the case on PA-RISC
where we have serial ports that aren't discovered until the PCI bus has
been walked.
I think all the current early printk users could be converted to this
scheme with a minimal amount of effort.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The following exports are necessary to allow loadable modules to define new
clocks. Without these the mmtimer driver cannot be built correctly as a
module (there is another mmtimer specific fix necessary to get it to build
properly but that will be a separate patch):
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Made GENERIC_HARDIRQ mechanism work for ia64 and CPU hotplug. When a write
to /proc/irq is handled it is not appropriate to perform set_rte
immediately, since there is a race when the interrupt is asserted while the
re-program is happening. Hence such programming is only safe when we do
the re-program at the time of servicing an interrupt. This got broken when
GENERIC_HARDIRQ got introduced for ia64.
- added CONFIG_PENDING_IRQ so default /proc/irq write handler can do the right
thing.
TBD: We currently don't handle redirectable hint either in the display, or
when we handle writes to /proc/irq/XX/smp_affinity. We need an arch
specific way to account for the presence of "r" hint when we handle the
proc write.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Cc: <linux-ia64@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
lock->break_lock is set when a lock is contended, but cleared only in
cond_resched_lock. Users of need_lockbreak (journal_commit_transaction,
copy_pte_range, unmap_vmas) don't necessarily use cond_resched_lock on it.
So, if the lock has been contended at some time in the past, break_lock
remains set thereafter, and the fastpath keeps dropping lock unnecessarily.
This hangs the system if you make a change like I did, forever restarting a
loop before making any progress. And even users of cond_resched_lock may
well suffer an initial unnecessary lockbreak.
There seems to be no point at which break_lock can be cleared when
unlocking, any point being either too early or too late; but that's okay,
it's only of interest while the lock is held. So clear it whenever the
lock is acquired - and any waiting contenders will quickly set it again.
Additional locking overhead? well, this is only when CONFIG_PREEMPT is on.
Since cond_resched_lock's spin_lock clears break_lock, no need to clear it
itself; and use need_lockbreak there too, preferring optimizer to #ifdefs.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This kills swsusp_resume; it should be arch-neutral but some i386 code
sneaked in. And arch-specific code is better done in assembly anyway.
Plus it fixes memory leaks in error paths.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This kills an unused macro and a write-only variable, and adds messages where
something goes wrong with suspending devices.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This adds a few more places where it is possible to freeze kernel threads.
From: Nigel Cunningham <ncunningham@cyclades.com>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch changes the update of the cmos clock to be timer driven rather
than poll driven by the timer interrupt function. If the clock is not
being synced to an outside source the timer is removed and thus system
overhead is nil in that case. The update frequency is still ~11 minutes
and missing the update window still causes a retry in 60 seconds.
We want the calls to sync_cmos_clock() to be made in a consistent environment.
This was not true when calling it directly from the NTP call code. The
change means that sync_cmos_clock() is ALWAYS called from run_timers(), i.e.
as a timer call back function.
Also, call the timer code only through the timer interface (set a short timer
to do it from the ntp call).
Signed-off-by: George Anzinger <george@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Move the ppc64 specific cond_syscall(ppc_rtas) into sys_ni.c so that it
takes effect. With this fixed we can remove the #define hack.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch extracts all the operations on counters protected by the page
table lock (currently rss and anon_rss) into definitions in
include/linux/sched.h. All rss operations are performed through the
following macros:
get_mm_counter(mm, member) -> Obtain the value of a counter
set_mm_counter(mm, member, value) -> Set the value of a counter
update_mm_counter(mm, member, value) -> Add to a counter
inc_mm_counter(mm, member) -> Increment a counter
dec_mm_counter(mm, member) -> Decrement a counter
With this patch it becomes easier to add new counters and it is possible to
redefine the method of counter handling. The counters are an issue for
scalability since they are used in frequently used code paths and may cause
cache line bouncing.
For example, one may not use counters at all and count the pages when needed,
switch to atomic operations if the mm_struct locking changes or split the rss
into counters that can be locally incremented.
The relevant fields of the mm_struct are renamed with a leading underscore
to catch out people who are not using the accessor macros.
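A minimal user-space sketch of such accessor macros is shown below. The
macro names and the leading-underscore convention come from the
description above; the macro bodies and the cut-down structure are
assumptions for illustration, not necessarily what the patch contains.

```c
/* Hypothetical stand-in structure holding only the two counters. */
struct mm_struct {
    unsigned long _rss;       /* underscore: do not touch directly */
    unsigned long _anon_rss;
};

/* Token pasting turns "rss" into the field name "_rss". */
#define get_mm_counter(mm, member)           ((mm)->_##member)
#define set_mm_counter(mm, member, value)    ((mm)->_##member = (value))
#define update_mm_counter(mm, member, value) ((mm)->_##member += (value))
#define inc_mm_counter(mm, member)           ((mm)->_##member++)
#define dec_mm_counter(mm, member)           ((mm)->_##member--)
```

Code that still writes mm->rss directly fails to compile against this
layout, which is the point of the rename. Swapping the plain arithmetic
for another counting scheme only requires changing the macro bodies.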
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This makes it hard(er) to mix argument orders by mistake for things like
kmalloc() and friends, since silent integer promotion is now caught by
sparse.
|
|
|
|
On 4-way SMP, about one reboot in twenty hangs while killing processes:
exit needs exclusive tasklist_lock, but something still holds read_lock.
do_signal_stop race case misses unlock, and fixing it fixes the symptom.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
panic() doesn't flush the filesystem cache anymore. The comment above the
function still claims it does.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch converts verify_area to access_ok in arch/i386, fs/, kernel/ and a
few other bits that didn't fit in the other patches or that I actually was
able to test on my hardware - this is by far the best tested of all the
patches.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Allow admin to enable only some of the Magic-Sysrq functions. This allows
admin to disable sysrq functions he considers dangerous (e.g. sending kill
signal, remounting fs RO) while keeping the possibility to use the others
(e.g. debug deadlocks by dumps of processes etc.).
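One way to picture the selective enabling is a bitmask check like the
following sketch. The category names and bit values are assumptions chosen
for the example, as is the convention that 0 disables everything and 1
enables everything; sysrq_allowed() is a hypothetical helper.

```c
#include <stdbool.h>

/* Hypothetical category bits; the patch's real values may differ. */
#define SYSRQ_CAT_DUMP    0x08u /* debugging dumps of processes etc. */
#define SYSRQ_CAT_SYNC    0x10u /* sync */
#define SYSRQ_CAT_REMOUNT 0x20u /* remount read-only */
#define SYSRQ_CAT_SIGNAL  0x40u /* sending kill signals */

/*
 * enabled: admin-configured value (assumed: 0 = none, 1 = all,
 *          anything else = bitmask of allowed categories).
 * category: the single category bit of the requested function.
 */
static bool sysrq_allowed(unsigned int enabled, unsigned int category)
{
    if (enabled == 1)
        return true;
    return (enabled & category) != 0;
}
```

With enabled set to SYSRQ_CAT_DUMP | SYSRQ_CAT_SYNC, process dumps and
sync remain usable while kill signals and read-only remounts are refused.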
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
In platform swsusp mode, we were forgetting to spin disks down, leading to
ugly emergency shutdown. This synchronizes platform method with other
methods and actually helps.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: <mjg59@scrf.ucam.org>
When using a fully modularized kernel it is necessary to activate resume
manually as the device node might not be available during kernel init.
This patch implements a new sysfs attribute '/sys/power/resume' which allows
for manual activation of software resume. When read from it prints the
configured resume device in 'major:minor' format. When written to it expects
a device in 'major:minor' format. This device is then checked for a suspended
image and resume is started if a valid image is found. The original
functionality is left in place.
It should be used from initramfs, or with care.
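Parsing the 'major:minor' string written to the attribute could look
roughly like this user-space sketch. parse_resume_device() is a
hypothetical helper, not the function used by the patch, and it relies on
standard sscanf() rather than any kernel API.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse "major:minor" (a trailing newline, as added
 * by echo, is accepted). Returns 0 on success, -1 on malformed input.
 */
static int parse_resume_device(const char *buf,
                               unsigned int *dev_major,
                               unsigned int *dev_minor)
{
    char trailing;
    /*
     * The final " %c" only converts if non-whitespace junk follows the
     * minor number, so exactly 2 conversions means a clean match.
     */
    int n = sscanf(buf, "%u:%u %c", dev_major, dev_minor, &trailing);

    return n == 2 ? 0 : -1;
}
```

From a shell in the initramfs this corresponds to something like writing
"8:3" into /sys/power/resume, where 8:3 is only an example device number.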
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The following patch is designed to fix a problem in the current implementation
of swsusp in mainline kernels. Namely, swsusp uses an array of page backup
entries (aka pagedir) to store pointers to memory pages that must be saved
during suspend and restored during resume.
Unfortunately, the pagedir has to be located in a contiguous chunk of memory
and it sometimes turns out that an 8-order or even 9-order allocation is
needed for this purpose. It sometimes is impossible to get such an allocation
and swsusp may fail during either suspend or resume due to the lack of memory,
although theoretically there is enough free memory for it to succeed.
Moreover, swsusp is more likely to fail for this reason during resume, which
means that it may fail during resume after a successful suspend (this actually
has happened for some people, including me :-)) and this, potentially, may
lead to the loss of data.
The problem is fixed by replacing the pagedir with a linked list so that
high-order memory allocations are avoided (the patches make swsusp use only
0-order allocations). Unfortunately this means that it's necessary to change
assembly routines used to restore the image after it's been loaded from swap
so that they walk the list instead of walking the array.
This patch makes swsusp allocate only individual pages during resume. It
contains the necessary changes to the assembly routines etc. for i386 and
x86-64.
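The idea of building the pagedir from single-page chunks chained into a
list can be sketched in user space as follows. The struct layout, the
4096-byte PAGE_SIZE, the use of calloc() in place of a page allocator, and
the function names are all assumptions for illustration.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096u /* assumed page size for this sketch */

/* Hypothetical page backup entry with a link to the next entry. */
struct pbe {
    unsigned long address;      /* copy of the page */
    unsigned long orig_address; /* where the page belongs */
    struct pbe *next;
};

#define PBES_PER_PAGE (PAGE_SIZE / sizeof(struct pbe))

/* Free a list whose entries were allocated in PAGE_SIZE chunks. */
static void free_pagedir(struct pbe *head)
{
    while (head) {
        struct pbe *chunk = head;
        size_t i;

        /* Step to the first entry of the next chunk before freeing. */
        for (i = 0; i < PBES_PER_PAGE && head; i++)
            head = head->next;
        free(chunk);
    }
}

/* Build a list of nr entries using only single-page allocations. */
static struct pbe *alloc_pagedir(unsigned int nr)
{
    struct pbe *head = NULL, **link = &head;

    while (nr > 0) {
        size_t n = nr < PBES_PER_PAGE ? nr : PBES_PER_PAGE;
        struct pbe *chunk = calloc(1, PAGE_SIZE); /* one "0-order" page */
        size_t i;

        if (!chunk) {
            free_pagedir(head);
            return NULL;
        }
        for (i = 0; i < n; i++) {
            *link = &chunk[i];
            link = &chunk[i].next;
        }
        nr -= (unsigned int)n;
    }
    return head;
}

/* Walk the list, as the restore code must now do instead of indexing. */
static unsigned int pagedir_len(const struct pbe *p)
{
    unsigned int n = 0;

    for (; p; p = p->next)
        n++;
    return n;
}
```

No allocation here is ever larger than one page, regardless of how many
entries are requested, which is the property the patch is after.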
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
into ppc970.osdl.org:/home/torvalds/v2.6/linux
|
|
into mars.ravnborg.org:/home/sam/bk/kbuild
|
|
This is a megarollup of ~60 patches which give various things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Kernel-docify comments
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Update function parameter description in block/fs code
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This could be part of the unknown 2% performance regression with
db transaction processing benchmark.
The four functions in the following patch used to be inline. They
have been un-inlined since 2.6.7.
We measured that re-inlining them on 2.6.9 improves performance
for the db transaction processing benchmark by 0.2% (on real hardware :-)
The cost is a larger kernel: 928 bytes more text on x86, and
2728 bytes on ia64. But it is certainly worth the money for enterprise
customers, since they improve performance on enterprise workloads.
# size vmlinux.*
   text    data     bss     dec     hex filename
3261844  717184  262020 4241048  40b698 vmlinux.x86.orig
3262772  717488  262020 4242280  40bb68 vmlinux.x86.inline
   text    data     bss     dec     hex filename
5836933  903828  201940 6942701  69efed vmlinux.ia64.orig
5839661  903460  201940 6945061  69f925 vmlinux.ia64.inline
Could we possibly reintroduce them?
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Clean up find_busiest_group a bit. New sched-domains code means we can't have
groups without a CPU.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix up a few small warts in the periodic multiprocessor rebalancing code.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Move balancing fields into struct sched_domain, so we can get more useful
results on systems with multiple domains (eg SMT+SMP, CMP+NUMA, SMP+NUMA,
etc).
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Some fixes for unsynchronised TSCs. A task's timestamp may have been set by
another CPU. Although we try to adjust this correctly with the
timestamp_last_tick field, there is no guarantee this will be exactly right.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes a bug with the recently added printk-times feature.
In the case where a printk consists of only the log level (followed
subsequently by printks with more text for the same line), the printk-times
code doesn't correctly recognize the end of the string, and starts emitting
chars at the 0 byte at the end of the string.
The patch below fixes this problem. It also adjusts the handling of
printed_len in the routine, which was affected by the printk-times feature.
Signed-off-by: Tim Bird <tim.bird@am.sony.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is my cpuset patch, with the following changes in the last two weeks:
1) Updated to 2.6.8.1-mm1
2) [Simon Derr <Simon.Derr@bull.net>] Fix new cpuset to begin empty,
not copied from parent. Needed to avoid breaking exclusive property.
3) [Dinakar Guniguntala <dino@in.ibm.com>] Finish initializing top
cpuset from cpu_possible_map after smp_init() called.
4) [Paul Jackson <pj@sgi.com>] Check on each call to __alloc_pages()
if the current task's cpuset mems_allowed has changed. Use a cpuset
generation number, bumped on any cpuset memory placement change,
to make this check efficient. Update the task's mems_allowed from
its cpuset, if the cpuset has changed.
5) [Paul Jackson <pj@sgi.com>] If a task is moved to another cpuset,
then update its cpus_allowed, using set_cpus_allowed().
6) [Paul Jackson <pj@sgi.com>] Update Documentation/cpusets.txt to
reflect above changes (4) and (5).
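The generation-number check from item (4) can be modelled in user space as
below. All structures and function names are hypothetical stand-ins, node
masks are reduced to a single unsigned long, and locking is left out
entirely, so this only shows why the per-allocation check is cheap.

```c
#include <stdbool.h>

/* Hypothetical cut-down cpuset: one bit per Memory Node. */
struct cpuset {
    unsigned long mems_allowed;
    int mems_generation;
};

/* Hypothetical cut-down task. */
struct task {
    struct cpuset *cpuset;
    unsigned long mems_allowed;
    int cpuset_mems_generation;
};

/* Global counter, bumped on any cpuset memory placement change. */
static int global_mems_generation = 1;

static void cpuset_set_mems(struct cpuset *cs, unsigned long mems)
{
    cs->mems_allowed = mems;
    cs->mems_generation = ++global_mems_generation;
}

/*
 * Called on the allocation path: one integer compare in the common case.
 * Returns true if the task's mems_allowed had to be refreshed.
 */
static bool task_refresh_mems(struct task *t)
{
    if (t->cpuset_mems_generation == t->cpuset->mems_generation)
        return false;
    t->mems_allowed = t->cpuset->mems_allowed;
    t->cpuset_mems_generation = t->cpuset->mems_generation;
    return true;
}
```

Comparing two integers avoids comparing whole node masks on every call,
and a task only copies the mask after its cpuset has actually changed.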
I continue to recommend the following patch for inclusion in your 2.6.9-*mm
series, when that opens. It provides an important facility for high
performance computing on large systems. Simon Derr of Bull (France) and
myself are the primary authors. Erich Focht has indicated that NEC is also
a potential user of this patch on the TX-7 NUMA machines, and that he
"would very much welcome the inclusion of cpusets."
I offer this update to lkml, in order to invite continued feedback.
The one prerequisite patch for this cpuset patch was just posted before this
one. That was a patch to provide a new bitmap list format, of which
cpusets is the first user.
This patch has been built on top of 2.6.8.1-mm1, for the arch's:
i386 x86_64 sparc ia64 powerpc-405 powerpc-750 sparc64
with and without CONFIG_CPUSETS. It has been booted and tested on ia64
(sn2_defconfig, SN2 hardware). The 'alpha' arch also built, except for
what seems to be an unrelated toolchain problem (crosstool ld sigsegv) in
the final link step.
===
Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to
a set of tasks.
Cpusets constrain the CPU and Memory placement of tasks to only the
processor and memory resources within a task's current cpuset. They form a
nested hierarchy visible in a virtual file system. These are the essential
hooks, beyond what is already present, required to manage dynamic job
placement on large systems.
Cpusets require small kernel hooks in init, exit, fork, mempolicy,
sched_setaffinity, page_alloc and vmscan. And they require a "struct
cpuset" pointer, a cpuset_mems_generation, and a "mems_allowed" nodemask_t
(to go along with the "cpus_allowed" cpumask_t that's already there) in
each task struct.
These hooks:
1) establish and propagate cpusets,
2) enforce CPU placement in sched_setaffinity,
3) enforce Memory placement in mbind and sys_set_mempolicy,
4) restrict page allocation and scanning to mems_allowed, and
5) restrict migration and set_cpus_allowed to cpus_allowed.
The other required hook, restricting task scheduling to CPUs in a task's
cpus_allowed mask, is already present.
Cpusets extend the usefulness of the existing placement support that was
added to Linux 2.6 kernels: sched_setaffinity() for CPU placement, and
mbind() and set_mempolicy() for memory placement. On smaller or dedicated
use systems, the existing calls are often sufficient.
On larger NUMA systems running more than one performance-critical job,
it is necessary to be able to manage jobs in their entirety. This includes
providing a job with exclusive CPU and memory that no other job can use,
and being able to list all tasks currently in a cpuset.
A given job running within a cpuset would likely use the existing
placement calls to manage its CPU and memory placement in more detail.
Cpusets are named, nested sets of CPUs and Memory Nodes. Each cpuset is
represented by a directory in the cpuset virtual file system, normally
mounted at /dev/cpuset.
Each cpuset directory provides the following files, which can be
read and written:
    cpus:
        List of CPUs allowed to tasks in that cpuset.
    mems:
        List of Memory Nodes allowed to tasks in that cpuset.
    tasks:
        List of pid's of tasks in that cpuset.
    cpu_exclusive:
        Flag (0 or 1) - if set, cpuset has exclusive use of
        its CPUs (no sibling or cousin cpuset may overlap CPUs).
    mem_exclusive:
        Flag (0 or 1) - if set, cpuset has exclusive use of
        its Memory Nodes (no sibling or cousin may overlap).
    notify_on_release:
        Flag (0 or 1) - if set, then /sbin/cpuset_release_agent
        will be invoked, with the name (/dev/cpuset relative path)
        of that cpuset in argv[1], when the last user of it (task
        or child cpuset) goes away. This supports automatic
        cleanup of abandoned cpusets.
In addition one new filetype is added to the /proc file system:
    /proc/<pid>/cpuset:
        For each task (pid), list its cpuset path, relative to the
        root of the cpuset file system. This file is read-only.
New cpusets are created using 'mkdir' (at the shell or in C). Old ones are
removed using 'rmdir'. The above files are accessed using read(2) and
write(2) system calls, or shell commands such as 'cat' and 'echo'.
The CPUs and Memory Nodes in a given cpuset are always a subset of its
parent. The root cpuset has all possible CPUs and Memory Nodes in the
system. A cpuset may be exclusive (cpu or memory) only if its parent is
similarly exclusive.
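The two rules in the paragraph above (subset of the parent, exclusive only
under an exclusive parent) can be written down as a small validity check.
The struct, the one-bit-per-CPU masks and the function names are
assumptions for this sketch and are not taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical cut-down cpuset: one bit per CPU / Memory Node. */
struct cpuset {
    unsigned long cpus;
    unsigned long mems;
    bool cpu_exclusive;
    bool mem_exclusive;
};

/* True if every bit set in child is also set in parent. */
static bool mask_is_subset(unsigned long child, unsigned long parent)
{
    return (child & ~parent) == 0;
}

/* Check a prospective child configuration against its parent. */
static bool cpuset_child_ok(const struct cpuset *child,
                            const struct cpuset *parent)
{
    if (!mask_is_subset(child->cpus, parent->cpus))
        return false;
    if (!mask_is_subset(child->mems, parent->mems))
        return false;
    /* Exclusivity may only be claimed under an exclusive parent. */
    if (child->cpu_exclusive && !parent->cpu_exclusive)
        return false;
    if (child->mem_exclusive && !parent->mem_exclusive)
        return false;
    return true;
}
```

The check for overlap between exclusive siblings and cousins is not
modelled here; it would need access to the rest of the hierarchy.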
See further Documentation/cpusets.txt, at the top of the following
patch.
/proc interface:
It is useful, when learning and making new uses of cpusets and placement, to be
able to see the current values of a task's cpus_allowed and
mems_allowed, which are the actual placement used by the kernel scheduler and
memory allocator.
The cpus_allowed and mems_allowed values are needed by user space apps that
are micromanaging placement, such as when moving an app to a obtained by
that app within its cpuset using sched_setaffinity, mbind and
set_mempolicy.
The cpus_allowed value is also available via the sched_getaffinity system
call. But since the entire rest of the cpuset API, including the display
of mems_allowed added here, is via an ascii style presentation in /proc and
/dev/cpuset, it is worth the extra couple lines of code to display
cpus_allowed in the same way.
This patch adds the display of these two fields to the 'status' file in the
/proc/<pid> directory of each task. The fields are only added if
CONFIG_CPUSETS is enabled (which is also needed to define the mems_allowed
field of each task). The new output lines look like:
$ tail -2 /proc/1/status
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
Mems_allowed: ffffffff,ffffffff
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Simon Derr <simon.derr@bull.net>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The attached patch causes process and session keyrings to be shared
properly when CLONE_THREAD is in force. It does this by moving the keyring
pointers into struct signal_struct[*].
[*] I have a patch to rename this to struct thread_group that I'll revisit
after the advent of 2.6.11.
Furthermore, once this patch is applied, process keyrings will no longer be
allocated at fork, but will instead only be allocated when needed.
Allocating them at fork was a way of half getting around the sharing across
threads problem, but that's no longer necessary.
This revision of the patch has the documentation changes patch rolled into it
and no longer abstracts the locking for signal_struct into a pair of macros.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The current cond_syscall #defines add a semicolon on the end, and then
folks leave the semicolons off in kernel/sys_ni.c, which confuses editors
that are language-aware and is just generally bad style. This sweeps all
the users and makes sys_ni.c look like normal C code.
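The style point can be illustrated with a user-space analogue. This sketch
assumes GCC or Clang on an ELF target and uses the weak and alias function
attributes; the kernel's own cond_syscall definition is architecture
specific and is not reproduced here. The sys_example_* names are
hypothetical.

```c
#include <errno.h>

/* Fallback used for every system call that is not implemented. */
long sys_ni_syscall(void)
{
    return -ENOSYS;
}

/*
 * No trailing semicolon inside the macro: each user supplies its own,
 * so the list below reads like ordinary C declarations.
 */
#define cond_syscall(x) \
    long x(void) __attribute__((weak, alias("sys_ni_syscall")))

cond_syscall(sys_example_a);
cond_syscall(sys_example_b);
```

If the macro ended in a semicolon and users also wrote one, the expansion
would leave a stray empty declaration at file scope; if users omitted it,
the source would no longer look like C to language-aware editors.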
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch pulls together the compat_sigevent structs. It also
consolidates the copying of these structures into the kernel.
The only part of the second union in sigevent that the kernel looks at
currently is the _tid, so that is the only bit we copy.
This patch depends on my previous two patches "add and use
COMPAT_SIGEV_PAD_SIZE" and "Consolidate the last compat sigvals".
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch removes the redundant compiler barrier. As Linus once said "The
mb() should make sure that gcc cannot move things around...".
Signed-off-by: Coywolf Qi Hunt <coywolf@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
While looking into the issues Jeremy had with the RLIMIT_SIGPENDING limit,
it occurred to me that the normal setting of this limit is bizarrely low.
The initial hard limit setting (MAX_SIGPENDING) was taken from the old
max_queued_signals parameter, which was for the entire system in aggregate.
But even as a per-user limit, the 1024 value is incongruously low for this.
On my machine, RLIMIT_NPROC allows me 8192 processes, but only 1024 queued
signals, i.e. fewer even than one pending signal in each process. (To me,
this really puts in doubt the sensibility of using a per-user limit for
this rather than a per-process one, i.e. counted in sighand_struct or
signal_struct, which could have a much smaller reasonable value. I don't
recall the rationale for making this new limit per-user in the first
place.)
This patch sets the default RLIMIT_SIGPENDING limit at boot time, using the
calculation that decides the default RLIMIT_NPROC limit. This uses the
same value for those two limits, which I think is still pretty conservative
on the RLIMIT_SIGPENDING value.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|