linux/kernel/sysctl.c, branch v2.6.25

sched: revert load_balance_monitor() changes

2008-03-04T16:54:06Z

The following commits cause a number of regressions: commit 58e2d4ca581167c2a079f4ee02be2f0bc52e8729 Author: Srivatsa Vaddagiri Date: Fri Jan 25 21:08:00 2008 +0100 sched: group scheduling, change how cpu load is calculated commit 6b2d7700266b9402e12824e11e0099ae6a4a6a79 Author: Srivatsa Vaddagiri Date: Fri Jan 25 21:08:00 2008 +0100 sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups Namely: - very frequent wakeups on SMP, reported by PowerTop users. - cacheline trashing on (large) SMP - some latencies larger than 500ms While there is a mergeable patch to fix the latter, the former issues are not fixable in a manner suitable for .25 (we're at -rc3 now). Hence we revert them and try again in v2.6.26. Signed-off-by: Peter Zijlstra CC: Srivatsa Vaddagiri Tested-by: Alexey Zaytsev Signed-off-by: Ingo Molnar

hugetlb: fix overcommit locking

2008-02-14T00:21:18Z

proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't hold hugetlb_lock over the call. Use a dummy variable to store the sysctl result, like in hugetlb_sysctl_handler(), then grab the lock to update nr_overcommit_huge_pages. Signed-off-by: Nishanth Aravamudan Reported-by: Miles Lane Cc: Adam Litke Cc: David Gibson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

sched: rt-group: interface

2008-02-13T14:45:39Z

Change the rt_ratio interface to rt_runtime_us, to match rt_period_us. This avoids picking a granularity for the ratio. Extend the /sys/kernel/uids// interface to allow setting the group's rt_runtime. Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar

printk_ratelimit() functions should use CONFIG_PRINTK

2008-02-08T17:22:39Z

Makes an embedded image a bit smaller. Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

Nuke duplicate header from sysctl.c

2008-02-08T17:22:34Z

Don't include linux/security.h twice in kernel/sysctl.c Signed-off-by: Jesper Juhl Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

Pidns: make full use of xxx_vnr() calls

2008-02-08T17:22:29Z

Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were _all_ converted to operate on the current pid namespace. After this each call like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo) one. Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where appropriate. Signed-off-by: Pavel Emelyanov Reviewed-by: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

hugetlb: add locking for overcommit sysctl

2008-02-08T17:22:23Z

When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules require that all counter modifications occur under the hugetlb_lock. Add a callback into the hugetlb code similar to the one for nr_hugepages. Grab the lock around the manipulation of nr_overcommit_hugepages in proc_doulongvec_minmax(). Signed-off-by: Nishanth Aravamudan Acked-by: Adam Litke Cc: David Gibson Cc: William Lee Irwin III Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

oom: add sysctl to enable task memory dump

2008-02-07T16:42:19Z

Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful for determining why there was an OOM condition and which rogue task caused it. It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire. If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup. Cc: Andrea Arcangeli Cc: Christoph Lameter Cc: Balbir Singh Signed-off-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

get rid of NR_OPEN and introduce a sysctl_nr_open

2008-02-06T18:41:06Z

NR_OPEN (historically set to 1024*1024) actually forbids processes to open more than 1024*1024 handles. Unfortunatly some production servers hit the not so 'ridiculously high value' of 1024*1024 file descriptors per process. Changing NR_OPEN is not considered safe because of vmalloc space potential exhaust. This patch introduces a new sysctl (/proc/sys/fs/nr_open) wich defaults to 1024*1024, so that admins can decide to change this limit if their workload needs it. [akpm@linux-foundation.org: export it for sparc64] Signed-off-by: Eric Dumazet Cc: Alan Cox Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: "David S. Miller" Cc: Ralf Baechle Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

capabilities: introduce per-process capability bounding set

2008-02-05T17:44:20Z

The capability bounding set is a set beyond which capabilities cannot grow. Currently cap_bset is per-system. It can be manipulated through sysctl, but only init can add capabilities. Root can remove capabilities. By default it includes all caps except CAP_SETPCAP. This patch makes the bounding set per-process when file capabilities are enabled. It is inherited at fork from parent. Noone can add elements, CAP_SETPCAP is required to remove them. One example use of this is to start a safer container. For instance, until device namespaces or per-container device whitelists are introduced, it is best to take CAP_MKNOD away from a container. The bounding set will not affect pP and pE immediately. It will only affect pP' and pE' after subsequent exec()s. It also does not affect pI, and exec() does not constrain pI'. So to really start a shell with no way of regain CAP_MKNOD, you would do prctl(PR_CAPBSET_DROP, CAP_MKNOD); cap_t cap = cap_get_proc(); cap_value_t caparray[1]; caparray[0] = CAP_MKNOD; cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP); cap_set_proc(cap); cap_free(cap); The following test program will get and set the bounding set (but not pI). For instance ./bset get (lists capabilities in bset) ./bset drop cap_net_raw (starts shell with new bset) (use capset, setuid binary, or binary with file capabilities to try to increase caps) ************************************************************ cap_bound.c ************************************************************ #include #include #include #include #include #include #include #ifndef PR_CAPBSET_READ #define PR_CAPBSET_READ 23 #endif #ifndef PR_CAPBSET_DROP #define PR_CAPBSET_DROP 24 #endif int usage(char *me) { printf("Usage: %s get\n", me); printf(" %s drop \n", me); return 1; } #define numcaps 32 char *captable[numcaps] = { "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner", "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap", "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast", "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner", "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace", "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice", "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod", "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap" }; int getbcap(void) { int comma=0; unsigned long i; int ret; printf("i know of %d capabilities\n", numcaps); printf("capability bounding set:"); for (i=0; i Signed-off-by: Andrew G. Morgan Cc: Stephen Smalley Cc: James Morris Cc: Chris Wright Cc: Casey Schaufler a Signed-off-by: "Serge E. Hallyn" Tested-by: Jiri Slaby Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds