linux/fs/proc/base.c, branch v4.4

proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter

2015-12-18T22:25:40Z

Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") removed the setting of ret after the get_proc_task call and incorrectly left it as -ESRCH. Instead, return 0 when successful. Example breakage: echo 0 > /proc/self/coredump_filter bash: echo: write error: No such process Fixes: 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") Signed-off-by: Colin Ian King Acked-by: Kees Cook Cc: [4.3+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm, oom: add comment for why oom_adj exists

2015-11-06T03:34:48Z

/proc/pid/oom_adj exists solely to avoid breaking existing userspace binaries that write to the tunable. Add a comment in the only possible location within the kernel tree to describe the situation and motivation for keeping it around. Signed-off-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

fs/proc, core/debug: Don't expose absolute kernel addresses via wchan

2015-10-01T10:55:34Z

So the /proc/PID/stat 'wchan' field (the 30th field, which contains the absolute kernel address of the kernel function a task is blocked in) leaks absolute kernel addresses to unprivileged user-space: seq_put_decimal_ull(m, ' ', wchan); The absolute address might also leak via /proc/PID/wchan as well, if KALLSYMS is turned off or if the symbol lookup fails for some reason: static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { unsigned long wchan; char symname[KSYM_NAME_LEN]; wchan = get_wchan(task); if (lookup_symbol_name(wchan, symname) < 0) { if (!ptrace_may_access(task, PTRACE_MODE_READ)) return 0; seq_printf(m, "%lu", wchan); } else { seq_printf(m, "%s", symname); } return 0; } This isn't ideal, because for example it trivially leaks the KASLR offset to any local attacker: fomalhaut:~> printf "%016lx\n" $(cat /proc/$$/stat | cut -d' ' -f35) ffffffff8123b380 Most real-life uses of wchan are symbolic: ps -eo pid:10,tid:10,wchan:30,comm and procps uses /proc/PID/wchan, not the absolute address in /proc/PID/stat: triton:~/tip> strace -f ps -eo pid:10,tid:10,wchan:30,comm 2>&1 | grep wchan | tail -1 open("/proc/30833/wchan", O_RDONLY) = 6 There's one compatibility quirk here: procps relies on whether the absolute value is non-zero - and we can provide that functionality by outputing "0" or "1" depending on whether the task is blocked (whether there's a wchan address). These days there appears to be very little legitimate reason user-space would be interested in the absolute address. The absolute address is mostly historic: from the days when we didn't have kallsyms and user-space procps had to do the decoding itself via the System.map. So this patch sets all numeric output to "0" or "1" and keeps only symbolic output, in /proc/PID/wchan. ( The absolute sleep address can generally still be profiled via perf, by tasks with sufficient privileges. ) Reviewed-by: Thomas Gleixner Acked-by: Kees Cook Acked-by: Linus Torvalds Cc: Cc: Al Viro Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Denys Vlasenko Cc: Dmitry Vyukov Cc: Kostya Serebryany Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Peter Zijlstra Cc: Sasha Levin Cc: kasan-dev Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150930135917.GA3285@gmail.com Signed-off-by: Ingo Molnar

proc: convert to kstrto()/kstrto_from_user()

2015-09-10T20:29:01Z

Convert from manual allocation/copy_from_user/... to kstrto*() family which were designed for exactly that. One case can not be converted to kstrto*_from_user() to make code even more simpler because of whitespace stripping, oh well... Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

procfs: always expose /proc//map_files/ and make it readable

2015-09-10T20:29:01Z

Currently, /proc//map_files/ is restricted to CAP_SYS_ADMIN, and is only exposed if CONFIG_CHECKPOINT_RESTORE is set. Each mapped file region gets a symlink in /proc//map_files/ corresponding to the virtual address range at which it is mapped. The symlinks work like the symlinks in /proc//fd/, so you can follow them to the backing file even if that backing file has been unlinked. Currently, files which are mapped, unlinked, and closed are impossible to stat() from userspace. Exposing /proc//map_files/ closes this functionality "hole". Not being able to stat() such files makes noticing and explicitly accounting for the space they use on the filesystem impossible. You can work around this by summing up the space used by every file in the filesystem and subtracting that total from what statfs() tells you, but that obviously isn't great, and it becomes unworkable once your filesystem becomes large enough. This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and adjusts the permissions enforced on it as follows: * proc_map_files_lookup() * proc_map_files_readdir() * map_files_d_revalidate() Remove the CAP_SYS_ADMIN restriction, leaving only the current restriction requiring PTRACE_MODE_READ. The information made available to userspace by these three functions is already available in /proc/PID/maps with MODE_READ, so I don't see any reason to limit them any further (see below for more detail). * proc_map_files_follow_link() This stub has been added, and requires that the user have CAP_SYS_ADMIN in order to follow the links in map_files/, since there was concern on LKML both about the potential for bypassing permissions on ancestor directories in the path to files pointed to, and about what happens with more exotic memory mappings created by some drivers (ie dma-buf). In older versions of this patch, I changed every permission check in the four functions above to enforce MODE_ATTACH instead of MODE_READ. This was an oversight on my part, and after revisiting the discussion it seems that nobody was concerned about anything outside of what is made possible by ->follow_link(). So in this version, I've left the checks for PTRACE_MODE_READ as-is. [akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes] Signed-off-by: Calvin Owens Reviewed-by: Kees Cook Cc: Andy Lutomirski Cc: Cyrill Gorcunov Cc: Joe Perches Cc: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

/proc/$PID/cmdline: fixup empty ARGV case

2015-07-17T23:39:54Z

/proc/*/cmdline code checks if it should look at ENVP area by checking last byte of ARGV area: rv = access_remote_vm(mm, arg_end - 1, &c, 1, 0); if (rv <= 0) goto out_free_page; If ARGV is somehow made empty (by doing execve(..., NULL, ...) or manually setting ->arg_start and ->arg_end to equal values), the decision will be based on byte which doesn't even belong to ARGV/ENVP. So, quickly check if ARGV area is empty and report 0 to match previous behaviour. Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2015-07-04T15:56:53Z

Pull scheduler fixes from Ingo Molnar: "Debug info and other statistics fixes and related enhancements" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/numa: Fix numa balancing stats in /proc/pid/sched sched/numa: Show numa_group ID in /proc/sched_debug task listings sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y sched/stat: Simplify the sched_info accounting dependency

sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y

2015-07-04T08:04:31Z

Expand /proc/pid/schedstat output: - enable it on CONFIG_TASK_DELAY_ACCT=y && !CONFIG_SCHEDSTATS kernels. - dump all zeroes on kernels that are booted with the 'nodelayacct' option, which boot option disables delay accounting on CONFIG_TASK_DELAY_ACCT=y kernels. Signed-off-by: Naveen N. Rao Cc: Balbir Singh Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Srikar Dronamraju Cc: Thomas Gleixner Cc: a.p.zijlstra@chello.nl Cc: ricklind@us.ibm.com Link: http://lkml.kernel.org/r/5ccbef17d4bc841084ea6e6421d4e4a23b7b806f.1435654789.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Ingo Molnar

fs, proc: introduce CONFIG_PROC_CHILDREN

2015-06-26T00:00:37Z

Commit 818411616baf ("fs, proc: introduce /proc//task//children entry") introduced the children entry for checkpoint restore and the file is only available on kernels configured with CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE. This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS) because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE. But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE. However, the children proc file is useful outside of checkpoint restore. I would like to use it in rkt. The rkt process exec() another program it does not control, and that other program will fork()+exec() a child process. I would like to find the pid of the child process from an external tool without iterating in /proc over all processes to find which one has a parent pid equal to rkt. This commit introduces CONFIG_PROC_CHILDREN and makes CONFIG_CHECKPOINT_RESTORE select it. This allows enabling /proc//task//children without needing to enable CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT. Alban tested that /proc//task//children is present when the kernel is configured with CONFIG_PROC_CHILDREN=y but without CONFIG_CHECKPOINT_RESTORE Signed-off-by: Iago López Galeiras Tested-by: Alban Crequy Reviewed-by: Cyrill Gorcunov Cc: Oleg Nesterov Cc: Kees Cook Cc: Pavel Emelyanov Cc: Serge Hallyn Cc: KAMEZAWA Hiroyuki Cc: Alexander Viro Cc: Andy Lutomirski Cc: Djalal Harouni Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

proc: fix PAGE_SIZE limit of /proc/$PID/cmdline

2015-06-26T00:00:37Z

/proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with $ cat /proc/self/cmdline $(seq 1037) 2>/dev/null However, command line size was never limited to PAGE_SIZE but to 128 KB and relatively recently limitation was removed altogether. People noticed and ask questions: http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit seq file interface is not OK, because it kmalloc's for whole output and open + read(, 1) + sleep will pin arbitrary amounts of kernel memory. To not do that, limit must be imposed which is incompatible with arbitrary sized command lines. I apologize for hairy code, but this it direct consequence of command line layout in memory and hacks to support things like "init [3]". The loops are "unrolled" otherwise it is either macros which hide control flow or functions with 7-8 arguments with equal line count. There should be real setproctitle(2) or something. [akpm@linux-foundation.org: fix a billion min() warnings] Signed-off-by: Alexey Dobriyan Tested-by: Jarod Wilson Acked-by: Jarod Wilson Cc: Cyrill Gorcunov Cc: Jan Stancek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds