2005-04-17 02:20:36 +04:00
/*
 * sysctl.c: General linux system control interface
 *
 * Begun 24 March 1995, Stephen Tweedie
 * Added /proc support, Dec 1995
 * Added bdflush entry and intvec min/max checking, 2/23/96, Tom Dyas.
 * Added hooks for /proc/sys/net (minor, minor patch), 96/4/1, Mike Shaver.
 * Added kernel/java-{interpreter,appletviewer}, 96/5/10, Mike Shaver.
 * Dynamic registration fixes, Stephen Tweedie.
 * Added kswapd-interval, ctrl-alt-del, printk stuff, 1/8/97, Chris Horn.
 * Made sysctl support optional via CONFIG_SYSCTL, 1/10/97, Chris
 *  Horn.
 * Added proc_doulongvec_ms_jiffies_minmax, 09/08/99, Carlos H. Bauer.
 * Added proc_doulongvec_minmax, 09/08/99, Carlos H. Bauer.
 * Changed linked lists to use list.h instead of lists.h, 02/24/00, Bill
 *  Wendling.
 * The list_for_each() macro wasn't appropriate for the sysctl loop.
 *  Removed it and replaced it with older style, 03/23/00, Bill Wendling
 */

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/slab.h>
#include <linux/sysctl.h>
|
2010-03-11 02:23:59 +03:00
|
|
|
#include <linux/signal.h>
|
kptr_restrict for hiding kernel pointers from unprivileged users
Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict
sysctl.
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.
If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs. If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".
[akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/]
[akpm@linux-foundation.org: coding-style fixup]
[randy.dunlap@oracle.com: fix kernel/sysctl.c warning]
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
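The three kptr_restrict settings described in the commit message above can be modelled in a few lines of userspace C. This is only an illustrative sketch, not the kernel's vsprintf code: the enum names and the function `kptr_visible_value()` are hypothetical, and the capability check is reduced to a boolean passed in by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three documented kptr_restrict values. */
enum kptr_mode {
	KPTR_UNRESTRICTED = 0,	/* %pK behaves like plain %p */
	KPTR_NEED_CAP_SYSLOG = 1,	/* zeros unless reader has CAP_SYSLOG */
	KPTR_ALWAYS_HIDDEN = 2	/* zeros regardless of privileges */
};

/*
 * Model of the value a %pK conversion would print: the real pointer
 * when the policy allows it, 0 otherwise. Values above 2 are treated
 * like 2 here, which is an assumption of this sketch.
 */
uintptr_t kptr_visible_value(uintptr_t ptr, int kptr_restrict,
			     bool has_cap_syslog)
{
	switch (kptr_restrict) {
	case KPTR_UNRESTRICTED:
		return ptr;
	case KPTR_NEED_CAP_SYSLOG:
		return has_cap_syslog ? ptr : 0;
	default:
		return 0;
	}
}
```

Printing zeros rather than "(null)" keeps the output parseable by userland tools that read the field back with %p, which is the design choice the commit message calls out.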
2011-01-13 03:59:41 +03:00
|
|
|
#include <linux/printk.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/proc_fs.h>
|
V3 file capabilities: alter behavior of cap_setpcap
The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
can change the capabilities of another process, p2. This is not the
meaning that was intended for this capability at all, and this
implementation came about purely because, without filesystem capabilities,
there was no way to use capabilities without one process bestowing them on
another.
Since we now have filesystem support for capabilities we can fix the
implementation of CAP_SETPCAP.
The most significant thing about this change is that, with it in effect, no
process can set the capabilities of another process.
The capabilities of a program are set via the capability convolution
rules:
pI(post-exec) = pI(pre-exec)
pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
pE(post-exec) = fE ? pP(post-exec) : 0
at exec() time. As such, the only influence the pre-exec() program can
have on the post-exec() program's capabilities is through the pI
capability set.
The correct implementation for CAP_SETPCAP (and that enabled by this patch)
is that it can be used to add extra pI capabilities to the current process
- to be picked up by subsequent exec()s when the above convolution rules
are applied.
Here is how it works:
Let's say we have a process, p. It has capability sets, pE, pP and pI.
Generally, p can change the value of its own pI to pI' where
(pI' & ~pI) & ~pP = 0.
That is, the only new things in pI' that were not present in pI need to
be present in pP.
The role of CAP_SETPCAP is basically to permit changes to pI beyond
the above:
if (pE & CAP_SETPCAP) {
pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */
}
This capability is useful for things like login, which (say, via
pam_cap) might want to raise certain inheritable capabilities for use
by the children of the logged-in user's shell, but those capabilities
are not useful to or needed by the login program itself.
One such use might be to limit who can run ping. You set the
capabilities of the 'ping' program to be "= cap_net_raw+i", and then
only shells that have (pI & CAP_NET_RAW) will be able to run
it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
would have to also have (pP & CAP_NET_RAW) in order to raise this
capability and pass it on through the inheritable set.
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
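The convolution rules and the pI update rule from the commit message above can be written out as plain bit operations. The sketch below is a userspace model under stated assumptions: `struct caps`, `caps_after_exec()` and `may_set_inheritable()` are hypothetical names, capability sets are squeezed into a single 64-bit word, and only the capability numbers (CAP_SETPCAP is 8, CAP_NET_RAW is 13) are taken from the kernel's capability header.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cap_set_t;

#define CAP_BIT(n)		((cap_set_t)1 << (n))
#define CAP_SETPCAP_BIT		CAP_BIT(8)	/* CAP_SETPCAP */
#define CAP_NET_RAW_BIT		CAP_BIT(13)	/* CAP_NET_RAW */

/* The three per-process sets: pI, pP, pE. */
struct caps {
	cap_set_t inheritable;
	cap_set_t permitted;
	cap_set_t effective;
};

/*
 * Apply the exec() rules:
 *   pI' = pI
 *   pP' = (X & fP) | (pI' & fI)
 *   pE' = fE ? pP' : 0
 * where X is the bounding set and fP/fI/fE come from the file.
 */
struct caps caps_after_exec(struct caps pre, cap_set_t bset,
			    cap_set_t fP, cap_set_t fI, bool fE)
{
	struct caps post;

	post.inheritable = pre.inheritable;
	post.permitted = (bset & fP) | (post.inheritable & fI);
	post.effective = fE ? post.permitted : 0;
	return post;
}

/*
 * A process may move to new_pI if every newly added bit is already
 * in pP, or unconditionally if it holds CAP_SETPCAP in pE.
 */
bool may_set_inheritable(struct caps cur, cap_set_t new_pI)
{
	if (cur.effective & CAP_SETPCAP_BIT)
		return true;
	return ((new_pI & ~cur.inheritable) & ~cur.permitted) == 0;
}
```

In the ping example from the message, a shell with CAP_NET_RAW in pI that execs a file marked "= cap_net_raw+i" ends up with CAP_NET_RAW in pP, while a login process holding only CAP_SETPCAP can still add CAP_NET_RAW to its own pI for the shell to inherit.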
2007-10-18 14:05:59 +04:00
|
|
|
#include <linux/security.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/ctype.h>
|
2008-04-04 02:51:41 +04:00
|
|
|
#include <linux/kmemcheck.h>
|
2007-07-17 15:03:45 +04:00
|
|
|
#include <linux/fs.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/kernel.h>
|
2005-11-11 07:33:52 +03:00
|
|
|
#include <linux/kobject.h>
|
2005-08-16 09:18:02 +04:00
|
|
|
#include <linux/net.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/sysrq.h>
|
|
|
|
#include <linux/highuid.h>
|
|
|
|
#include <linux/writeback.h>
|
2009-09-22 18:18:09 +04:00
|
|
|
#include <linux/ratelimit.h>
|
2010-05-25 01:32:28 +04:00
|
|
|
#include <linux/compaction.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/hugetlb.h>
|
|
|
|
#include <linux/initrd.h>
|
2008-04-29 12:01:32 +04:00
|
|
|
#include <linux/key.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/times.h>
|
|
|
|
#include <linux/limits.h>
|
|
|
|
#include <linux/dcache.h>
|
2010-01-20 23:27:56 +03:00
|
|
|
#include <linux/dnotify.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/syscalls.h>
|
2008-07-24 08:27:03 +04:00
|
|
|
#include <linux/vmstat.h>
|
2006-02-21 05:27:58 +03:00
|
|
|
#include <linux/nfs_fs.h>
|
|
|
|
#include <linux/acpi.h>
|
2007-07-18 05:37:02 +04:00
|
|
|
#include <linux/reboot.h>
|
2008-05-12 23:20:43 +04:00
|
|
|
#include <linux/ftrace.h>
|
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has outgrown its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in all, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks go to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 14:02:48 +04:00
|
|
|
#include <linux/perf_event.h>
|
2010-02-25 16:34:15 +03:00
|
|
|
#include <linux/kprobes.h>
|
2010-05-19 23:03:16 +04:00
|
|
|
#include <linux/pipe_fs_i.h>
|
2010-08-10 04:18:56 +04:00
|
|
|
#include <linux/oom.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
#include <asm/uaccess.h>
|
|
|
|
#include <asm/processor.h>
|
|
|
|
|
2006-09-30 03:47:55 +04:00
|
|
|
#ifdef CONFIG_X86
|
|
|
|
#include <asm/nmi.h>
|
2006-12-07 04:14:11 +03:00
|
|
|
#include <asm/stacktrace.h>
|
2008-01-30 15:30:05 +03:00
|
|
|
#include <asm/io.h>
|
2006-09-30 03:47:55 +04:00
|
|
|
#endif
|
2010-03-11 02:24:08 +03:00
|
|
|
#ifdef CONFIG_BSD_PROCESS_ACCT
|
|
|
|
#include <linux/acct.h>
|
|
|
|
#endif
|
2010-03-11 02:24:09 +03:00
|
|
|
#ifdef CONFIG_RT_MUTEXES
|
|
|
|
#include <linux/rtmutex.h>
|
|
|
|
#endif
|
2010-03-11 02:24:10 +03:00
|
|
|
#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_LOCK_STAT)
|
|
|
|
#include <linux/lockdep.h>
|
|
|
|
#endif
|
2010-03-11 02:24:07 +03:00
|
|
|
#ifdef CONFIG_CHR_DEV_SG
|
|
|
|
#include <scsi/sg.h>
|
|
|
|
#endif
|
2006-09-30 03:47:55 +04:00
|
|
|
|
2010-05-08 01:11:44 +04:00
|
|
|
#ifdef CONFIG_LOCKUP_DETECTOR
|
2010-02-13 01:19:19 +03:00
|
|
|
#include <linux/nmi.h>
|
|
|
|
#endif
|
|
|
|
|
2007-10-18 14:05:58 +04:00
|
|
|
|
2005-04-17 02:20:36 +04:00
#if defined(CONFIG_SYSCTL)

/* External variables not in a header file. */
extern int sysctl_overcommit_memory;
extern int sysctl_overcommit_ratio;
extern int max_threads;
extern int core_uses_pid;
|
2005-06-23 11:09:43 +04:00
|
|
|
extern int suid_dumpable;
|
2005-04-17 02:20:36 +04:00
|
|
|
extern char core_pattern[];
|
2009-09-24 02:56:56 +04:00
|
|
|
extern unsigned int core_pipe_limit;
|
2005-04-17 02:20:36 +04:00
|
|
|
extern int pid_max;
|
|
|
|
extern int min_free_kbytes;
|
|
|
|
extern int pid_max_min, pid_max_max;
|
2006-01-08 12:00:39 +03:00
|
|
|
extern int sysctl_drop_caches;
|
2006-01-08 12:00:40 +03:00
|
|
|
extern int percpu_pagelist_fraction;
|
2006-06-26 15:56:52 +04:00
|
|
|
extern int compat_log;
|
2008-01-25 23:08:34 +03:00
|
|
|
extern int latencytop_enabled;
|
2008-05-10 18:08:32 +04:00
|
|
|
extern int sysctl_nr_open_min, sysctl_nr_open_max;
|
2009-01-08 15:04:47 +03:00
|
|
|
#ifndef CONFIG_MMU
|
|
|
|
extern int sysctl_nr_trim_pages;
|
|
|
|
#endif
|
2009-09-15 23:53:11 +04:00
|
|
|
#ifdef CONFIG_BLOCK
|
2009-08-05 11:07:21 +04:00
|
|
|
extern int blk_iopoll_enabled;
|
2009-09-15 23:53:11 +04:00
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2007-10-17 10:26:09 +04:00
|
|
|
/* Constants used for minimum and maximum */
|
2010-05-08 01:11:46 +04:00
|
|
|
#ifdef CONFIG_LOCKUP_DETECTOR
|
2007-10-17 10:26:09 +04:00
|
|
|
static int sixty = 60;
|
2008-05-12 23:21:14 +04:00
|
|
|
static int neg_one = -1;
|
2007-10-17 10:26:09 +04:00
|
|
|
#endif
|
|
|
|
|
|
|
|
static int zero;
|
2009-04-07 00:38:46 +04:00
|
|
|
static int __maybe_unused one = 1;
|
|
|
|
static int __maybe_unused two = 2;
|
2009-02-12 00:04:23 +03:00
|
|
|
static unsigned long one_ul = 1;
|
2007-10-17 10:26:09 +04:00
|
|
|
static int one_hundred = 100;
|
2009-09-23 03:43:33 +04:00
|
|
|
#ifdef CONFIG_PRINTK
|
|
|
|
static int ten_thousand = 10000;
|
|
|
|
#endif
|
2007-10-17 10:26:09 +04:00
|
|
|
|
2009-05-01 02:08:57 +04:00
|
|
|
/* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
|
|
|
|
static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
|
|
|
|
static int maxolduid = 65535;
|
|
|
|
static int minolduid;
|
2006-01-08 12:00:40 +03:00
|
|
|
static int min_percpu_pagelist_fract = 8;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
static int ngroups_max = NGROUPS_MAX;
|
|
|
|
|
2010-02-26 04:28:57 +03:00
|
|
|
#ifdef CONFIG_INOTIFY_USER
|
|
|
|
#include <linux/inotify.h>
|
|
|
|
#endif
|
2008-09-12 10:29:54 +04:00
|
|
|
#ifdef CONFIG_SPARC
|
2008-09-12 10:33:53 +04:00
|
|
|
#include <asm/system.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#endif
|
|
|
|
|
2008-11-17 10:49:24 +03:00
|
|
|
#ifdef CONFIG_SPARC64
|
|
|
|
extern int sysctl_tsb_ratio;
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
#ifdef __hppa__
|
|
|
|
extern int pwrsw_enabled;
|
|
|
|
extern int unaligned_enabled;
|
|
|
|
#endif
|
|
|
|
|
2006-01-06 11:19:28 +03:00
|
|
|
#ifdef CONFIG_S390
|
2005-04-17 02:20:36 +04:00
|
|
|
#ifdef CONFIG_MATHEMU
|
|
|
|
extern int sysctl_ieee_emulation_warnings;
|
|
|
|
#endif
|
|
|
|
extern int sysctl_userprocess_debug;
|
2005-07-27 22:44:57 +04:00
|
|
|
extern int spin_retry;
|
2005-04-17 02:20:36 +04:00
|
|
|
#endif
|
|
|
|
|
2006-02-28 20:42:23 +03:00
|
|
|
#ifdef CONFIG_IA64
|
|
|
|
extern int no_unaligned_warning;
|
2009-01-15 21:38:56 +03:00
|
|
|
extern int unaligned_dump_stack;
|
2006-02-28 20:42:23 +03:00
|
|
|
#endif
|
|
|
|
|
2006-10-20 10:28:34 +04:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
2009-09-24 02:57:19 +04:00
|
|
|
static int proc_do_cad_pid(struct ctl_table *table, int write,
|
2006-10-02 13:19:00 +04:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos);
|
2009-09-24 02:57:19 +04:00
|
|
|
static int proc_taint(struct ctl_table *table, int write,
|
2007-02-10 12:45:24 +03:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos);
|
2006-10-20 10:28:34 +04:00
|
|
|
#endif
|
2006-10-02 13:19:00 +04:00
|
|
|
|
2010-03-22 08:31:26 +03:00
#ifdef CONFIG_MAGIC_SYSRQ
static int __sysrq_enabled; /* Note: sysrq code uses its own private copy */

static int sysrq_sysctl_handler(ctl_table *table, int write,
				void __user *buffer, size_t *lenp,
				loff_t *ppos)
{
	int error;

	error = proc_dointvec(table, write, buffer, lenp, ppos);
	if (error)
		return error;

	if (write)
		sysrq_toggle_support(__sysrq_enabled);

	return 0;
}

#endif

|
2007-10-18 14:05:22 +04:00
|
|
|
static struct ctl_table root_table[];
|
2007-11-30 15:54:00 +03:00
|
|
|
static struct ctl_table_root sysctl_table_root;
|
|
|
|
static struct ctl_table_header root_table_header = {
|
2008-09-04 20:05:57 +04:00
|
|
|
.count = 1,
|
2007-11-30 15:54:00 +03:00
|
|
|
.ctl_table = root_table,
|
2008-07-15 05:22:20 +04:00
|
|
|
.ctl_entry = LIST_HEAD_INIT(sysctl_table_root.default_set.list),
|
2007-11-30 15:54:00 +03:00
|
|
|
.root = &sysctl_table_root,
|
2008-07-15 05:22:20 +04:00
|
|
|
.set = &sysctl_table_root.default_set,
|
2007-11-30 15:54:00 +03:00
|
|
|
};
|
|
|
|
static struct ctl_table_root sysctl_table_root = {
|
|
|
|
.root_list = LIST_HEAD_INIT(sysctl_table_root.root_list),
|
2008-07-15 05:22:20 +04:00
|
|
|
.default_set.list = LIST_HEAD_INIT(root_table_header.ctl_entry),
|
2007-11-30 15:54:00 +03:00
|
|
|
};
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2007-10-18 14:05:22 +04:00
static struct ctl_table kern_table[];
static struct ctl_table vm_table[];
static struct ctl_table fs_table[];
static struct ctl_table debug_table[];
static struct ctl_table dev_table[];
extern struct ctl_table random_table[];
|
epoll: introduce resource usage limits
It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that lets a normal user request a pretty large
amount of kernel memory, well within its maximum number of fds. To
solve such problem, default limits are now imposed, and /proc based
configuration has been introduced. A new directory has been created,
named /proc/sys/fs/epoll/ and inside there, there are two configuration
points:
max_user_instances = Maximum number of devices - per user
max_user_watches = Maximum number of "watched" fds - per user
The current default for "max_user_watches" limits the memory used by epoll
to store "watches", to 1/32 of the amount of the low RAM. As an example, a
256MB 32bit machine will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, that should be
enough too.
This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
listed, so that should be ok.
[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
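The default described in the commit message above is simple arithmetic: a memory budget of 1/32 of low RAM, divided by the cost of one watch. The sketch below reproduces that calculation in userspace C. The function names are hypothetical, and the per-watch cost is a caller-supplied assumption, since the message does not state it; a figure of about 93 bytes is merely what the quoted "roughly 90000" for a 256MB machine implies.

```c
#include <stdint.h>

/* Memory budget for epoll watches: 1/32 of low RAM, per the message. */
uint64_t epoll_watch_budget_bytes(uint64_t low_ram_bytes)
{
	return low_ram_bytes / 32;
}

/*
 * Number of watches that fit in the budget. bytes_per_watch is an
 * assumed value supplied by the caller, not a kernel constant.
 */
uint64_t epoll_max_user_watches(uint64_t low_ram_bytes,
				uint64_t bytes_per_watch)
{
	return epoll_watch_budget_bytes(low_ram_bytes) / bytes_per_watch;
}
```

For 256 MiB of low RAM the budget is 8 MiB; with the assumed 93 bytes per watch that gives 90200 watches, in line with the "roughly 90000" quoted above.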
2008-12-02 00:13:55 +03:00
|
|
|
#ifdef CONFIG_EPOLL
|
|
|
|
extern struct ctl_table epoll_table[];
|
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
|
|
|
|
int sysctl_legacy_va_layout;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/* The default sysctl tables: */
|
|
|
|
|
2007-10-18 14:05:22 +04:00
|
|
|
static struct ctl_table root_table[] = {
|
2005-04-17 02:20:36 +04:00
	{
		.procname	= "kernel",
		.mode		= 0555,
		.child		= kern_table,
	},
	{
		.procname	= "vm",
		.mode		= 0555,
		.child		= vm_table,
	},
	{
		.procname	= "fs",
		.mode		= 0555,
		.child		= fs_table,
	},
	{
		.procname	= "debug",
		.mode		= 0555,
		.child		= debug_table,
	},
	{
		.procname	= "dev",
		.mode		= 0555,
		.child		= dev_table,
	},
|
2009-04-03 13:30:53 +04:00
|
|
|
{ }
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
2007-07-09 20:52:00 +04:00
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
2007-12-18 17:21:13 +03:00
|
|
|
static int min_sched_granularity_ns = 100000; /* 100 usecs */
|
|
|
|
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
|
|
|
|
static int min_wakeup_granularity_ns; /* 0 usecs */
|
|
|
|
static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
|
2009-11-30 14:16:47 +03:00
|
|
|
static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
|
|
|
|
static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
|
2007-07-09 20:52:00 +04:00
|
|
|
#endif
|
|
|
|
|
2010-05-25 01:32:31 +04:00
|
|
|
#ifdef CONFIG_COMPACTION
|
|
|
|
static int min_extfrag_threshold;
|
|
|
|
static int max_extfrag_threshold = 1000;
|
|
|
|
#endif
|
|
|
|
|
2007-10-18 14:05:22 +04:00
|
|
|
static struct ctl_table kern_table[] = {
|
2009-09-09 17:41:37 +04:00
|
|
|
{
|
|
|
|
.procname = "sched_child_runs_first",
|
|
|
|
.data = &sysctl_sched_child_runs_first,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-09-09 17:41:37 +04:00
|
|
|
},
|
2007-07-09 20:52:00 +04:00
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
|
|
|
{
|
2007-11-10 00:39:37 +03:00
|
|
|
.procname = "sched_min_granularity_ns",
|
|
|
|
.data = &sysctl_sched_min_granularity,
|
2007-07-09 20:52:00 +04:00
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-12-12 22:34:10 +03:00
|
|
|
.proc_handler = sched_proc_update_handler,
|
2007-11-10 00:39:37 +03:00
|
|
|
.extra1 = &min_sched_granularity_ns,
|
|
|
|
.extra2 = &max_sched_granularity_ns,
|
2007-07-09 20:52:00 +04:00
|
|
|
},
|
2007-08-25 20:41:53 +04:00
|
|
|
{
|
|
|
|
.procname = "sched_latency_ns",
|
|
|
|
.data = &sysctl_sched_latency,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-12-12 22:34:10 +03:00
|
|
|
.proc_handler = sched_proc_update_handler,
|
2007-08-25 20:41:53 +04:00
|
|
|
.extra1 = &min_sched_granularity_ns,
|
|
|
|
.extra2 = &max_sched_granularity_ns,
|
|
|
|
},
|
2007-07-09 20:52:00 +04:00
|
|
|
{
|
|
|
|
.procname = "sched_wakeup_granularity_ns",
|
|
|
|
.data = &sysctl_sched_wakeup_granularity,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-12-12 22:34:10 +03:00
|
|
|
.proc_handler = sched_proc_update_handler,
|
2007-07-09 20:52:00 +04:00
|
|
|
.extra1 = &min_wakeup_granularity_ns,
|
|
|
|
.extra2 = &max_wakeup_granularity_ns,
|
|
|
|
},
|
2009-11-30 14:16:47 +03:00
|
|
|
{
|
|
|
|
.procname = "sched_tunable_scaling",
|
|
|
|
.data = &sysctl_sched_tunable_scaling,
|
|
|
|
.maxlen = sizeof(enum sched_tunable_scaling),
|
|
|
|
.mode = 0644,
|
2009-12-12 22:34:10 +03:00
|
|
|
.proc_handler = sched_proc_update_handler,
|
2009-11-30 14:16:47 +03:00
|
|
|
.extra1 = &min_sched_tunable_scaling,
|
|
|
|
.extra2 = &max_sched_tunable_scaling,
|
2008-06-27 15:41:35 +04:00
|
|
|
},
|
2007-10-15 19:00:18 +04:00
|
|
|
{
|
|
|
|
.procname = "sched_migration_cost",
|
|
|
|
.data = &sysctl_sched_migration_cost,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-10-15 19:00:18 +04:00
|
|
|
},
|
2007-11-10 00:39:39 +03:00
|
|
|
{
|
|
|
|
.procname = "sched_nr_migrate",
|
|
|
|
.data = &sysctl_sched_nr_migrate,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
2008-01-25 23:08:29 +03:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-01-25 23:08:29 +03:00
|
|
|
},
|
2009-09-01 12:34:37 +04:00
|
|
|
{
|
|
|
|
.procname = "sched_time_avg",
|
|
|
|
.data = &sysctl_sched_time_avg,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-09-01 12:34:37 +04:00
|
|
|
},
|
2010-11-16 02:47:06 +03:00
|
|
|
{
|
|
|
|
.procname = "sched_shares_window",
|
|
|
|
.data = &sysctl_sched_shares_window,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
2009-04-16 10:45:34 +04:00
|
|
|
{
|
|
|
|
.procname = "timer_migration",
|
|
|
|
.data = &sysctl_timer_migration,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-06-23 08:30:58 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
2008-01-25 23:08:29 +03:00
|
|
|
},
|
2007-08-25 20:41:52 +04:00
|
|
|
#endif
|
2008-02-13 17:45:39 +03:00
|
|
|
{
|
|
|
|
.procname = "sched_rt_period_us",
|
|
|
|
.data = &sysctl_sched_rt_period,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = sched_rt_handler,
|
2008-02-13 17:45:39 +03:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "sched_rt_runtime_us",
|
|
|
|
.data = &sysctl_sched_rt_runtime,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = sched_rt_handler,
|
2008-02-13 17:45:39 +03:00
|
|
|
},
|
2007-09-20 01:34:46 +04:00
|
|
|
{
|
|
|
|
.procname = "sched_compat_yield",
|
|
|
|
.data = &sysctl_sched_compat_yield,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-09-20 01:34:46 +04:00
|
|
|
},
|
sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the previous task group is dropped. Child
processes inherit this task group thereafter, and increase its
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable via setting its nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 16:18:03 +03:00
#ifdef CONFIG_SCHED_AUTOGROUP
	{
		.procname	= "sched_autogroup_enabled",
		.data		= &sysctl_sched_autogroup_enabled,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		/* extra1/extra2 bounds are only honoured by the _minmax handler */
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &one,
	},
#endif
|
2007-07-19 12:48:56 +04:00
|
|
|
#ifdef CONFIG_PROVE_LOCKING
|
|
|
|
{
|
|
|
|
.procname = "prove_locking",
|
|
|
|
.data = &prove_locking,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-07-19 12:48:56 +04:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_LOCK_STAT
|
|
|
|
{
|
|
|
|
.procname = "lock_stat",
|
|
|
|
.data = &lock_stat,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-07-19 12:48:56 +04:00
|
|
|
},
|
2007-07-09 20:52:00 +04:00
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "panic",
|
|
|
|
.data = &panic_timeout,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "core_uses_pid",
|
|
|
|
.data = &core_uses_pid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "core_pattern",
|
|
|
|
.data = core_pattern,
|
2007-05-17 09:11:16 +04:00
|
|
|
.maxlen = CORENAME_MAX_SIZE,
|
2005-04-17 02:20:36 +04:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2009-09-24 02:56:56 +04:00
|
|
|
{
|
|
|
|
.procname = "core_pipe_limit",
|
|
|
|
.data = &core_pipe_limit,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-09-24 02:56:56 +04:00
|
|
|
},
|
2007-02-10 12:45:24 +03:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "tainted",
|
2008-10-16 09:01:41 +04:00
|
|
|
.maxlen = sizeof(long),
|
2007-02-10 12:45:24 +03:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_taint,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2007-02-10 12:45:24 +03:00
|
|
|
#endif
|
2008-01-25 23:08:34 +03:00
|
|
|
#ifdef CONFIG_LATENCYTOP
|
|
|
|
{
|
|
|
|
.procname = "latencytop",
|
|
|
|
.data = &latencytop_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-01-25 23:08:34 +03:00
|
|
|
},
|
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
#ifdef CONFIG_BLK_DEV_INITRD
|
|
|
|
{
|
|
|
|
.procname = "real-root-dev",
|
|
|
|
.data = &real_root_dev,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
2007-07-16 10:40:10 +04:00
|
|
|
{
|
|
|
|
.procname = "print-fatal-signals",
|
|
|
|
.data = &print_fatal_signals,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-07-16 10:40:10 +04:00
|
|
|
},
|
2008-09-12 10:29:54 +04:00
|
|
|
#ifdef CONFIG_SPARC
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "reboot-cmd",
|
|
|
|
.data = reboot_command,
|
|
|
|
.maxlen = 256,
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "stop-a",
|
|
|
|
.data = &stop_a_enabled,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "scons-poweroff",
|
|
|
|
.data = &scons_pwroff,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
2008-11-17 10:49:24 +03:00
|
|
|
#ifdef CONFIG_SPARC64
|
|
|
|
{
|
|
|
|
.procname = "tsb-ratio",
|
|
|
|
.data = &sysctl_tsb_ratio,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-11-17 10:49:24 +03:00
|
|
|
},
|
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
#ifdef __hppa__
|
|
|
|
{
|
|
|
|
.procname = "soft-power",
|
|
|
|
.data = &pwrsw_enabled,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "unaligned-trap",
|
|
|
|
.data = &unaligned_enabled,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "ctrl-alt-del",
|
|
|
|
.data = &C_A_D,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2008-10-07 03:06:12 +04:00
|
|
|
#ifdef CONFIG_FUNCTION_TRACER
|
2008-05-12 23:20:43 +04:00
|
|
|
{
|
|
|
|
.procname = "ftrace_enabled",
|
|
|
|
.data = &ftrace_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = ftrace_enable_sysctl,
|
2008-05-12 23:20:43 +04:00
|
|
|
},
|
|
|
|
#endif
|
2008-12-17 07:06:40 +03:00
|
|
|
#ifdef CONFIG_STACK_TRACER
|
|
|
|
{
|
|
|
|
.procname = "stack_tracer_enabled",
|
|
|
|
.data = &stack_tracer_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = stack_trace_sysctl,
|
2008-12-17 07:06:40 +03:00
|
|
|
},
|
|
|
|
#endif
|
2008-10-24 03:26:08 +04:00
|
|
|
#ifdef CONFIG_TRACING
|
|
|
|
{
|
2008-11-04 13:58:21 +03:00
|
|
|
.procname = "ftrace_dump_on_oops",
|
2008-10-24 03:26:08 +04:00
|
|
|
.data = &ftrace_dump_on_oops,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-10-24 03:26:08 +04:00
|
|
|
},
|
|
|
|
#endif
|
2008-07-08 21:00:17 +04:00
|
|
|
#ifdef CONFIG_MODULES
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "modprobe",
|
|
|
|
.data = &modprobe_path,
|
|
|
|
.maxlen = KMOD_PATH_LEN,
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2009-04-03 02:49:29 +04:00
|
|
|
{
|
|
|
|
.procname = "modules_disabled",
|
|
|
|
.data = &modules_disabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
/* only handle a transition from default "0" to "1" */
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-04-03 02:49:29 +04:00
|
|
|
.extra1 = &one,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
#endif
|
2010-06-07 15:57:12 +04:00
|
|
|
#ifdef CONFIG_HOTPLUG
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "hotplug",
|
2005-11-16 11:00:00 +03:00
|
|
|
.data = &uevent_helper,
|
|
|
|
.maxlen = UEVENT_HELPER_PATH_LEN,
|
2005-04-17 02:20:36 +04:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dostring,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_CHR_DEV_SG
|
|
|
|
{
|
|
|
|
.procname = "sg-big-buff",
|
|
|
|
.data = &sg_big_buff,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0444,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_BSD_PROCESS_ACCT
|
|
|
|
{
|
|
|
|
.procname = "acct",
|
|
|
|
.data = &acct_parm,
|
|
|
|
.maxlen = 3*sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_MAGIC_SYSRQ
|
|
|
|
{
|
|
|
|
.procname = "sysrq",
|
2006-12-13 11:34:36 +03:00
|
|
|
.data = &__sysrq_enabled,
|
2005-04-17 02:20:36 +04:00
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2010-03-22 08:31:26 +03:00
|
|
|
.proc_handler = sysrq_sysctl_handler,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
2006-10-20 10:28:34 +04:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "cad_pid",
|
2006-10-02 13:19:00 +04:00
|
|
|
.data = NULL,
|
2005-04-17 02:20:36 +04:00
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0600,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_do_cad_pid,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2006-10-20 10:28:34 +04:00
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "threads-max",
|
|
|
|
.data = &max_threads,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "random",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = random_table,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "overflowuid",
|
|
|
|
.data = &overflowuid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &minolduid,
|
|
|
|
.extra2 = &maxolduid,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "overflowgid",
|
|
|
|
.data = &overflowgid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &minolduid,
|
|
|
|
.extra2 = &maxolduid,
|
|
|
|
},
|
2006-01-06 11:19:28 +03:00
|
|
|
#ifdef CONFIG_S390
|
2005-04-17 02:20:36 +04:00
|
|
|
#ifdef CONFIG_MATHEMU
|
|
|
|
{
|
|
|
|
.procname = "ieee_emulation_warnings",
|
|
|
|
.data = &sysctl_ieee_emulation_warnings,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "userprocess_debug",
|
2010-05-17 12:00:21 +04:00
|
|
|
.data = &show_unhandled_signals,
|
2005-04-17 02:20:36 +04:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "pid_max",
|
|
|
|
.data = &pid_max,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &pid_max_min,
|
|
|
|
.extra2 = &pid_max_max,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "panic_on_oops",
|
|
|
|
.data = &panic_on_oops,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2008-02-08 15:21:25 +03:00
|
|
|
#if defined CONFIG_PRINTK
|
|
|
|
{
|
|
|
|
.procname = "printk",
|
|
|
|
.data = &console_loglevel,
|
|
|
|
.maxlen = 4*sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-02-08 15:21:25 +03:00
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "printk_ratelimit",
|
2008-07-25 12:45:58 +04:00
|
|
|
.data = &printk_ratelimit_state.interval,
|
2005-04-17 02:20:36 +04:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "printk_ratelimit_burst",
|
2008-07-25 12:45:58 +04:00
|
|
|
.data = &printk_ratelimit_state.burst,
|
2005-04-17 02:20:36 +04:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2009-09-23 03:43:33 +04:00
|
|
|
{
|
|
|
|
.procname = "printk_delay",
|
|
|
|
.data = &printk_delay_msec,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-09-23 03:43:33 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &ten_thousand,
|
|
|
|
},
|
2010-11-12 01:05:18 +03:00
|
|
|
{
|
|
|
|
.procname = "dmesg_restrict",
|
|
|
|
.data = &dmesg_restrict,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
kptr_restrict for hiding kernel pointers from unprivileged users

Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict
sysctl.

The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.

If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs. If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".

[akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/]
[akpm@linux-foundation.org: coding-style fixup]
[randy.dunlap@oracle.com: fix kernel/sysctl.c warning]
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
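The three kptr_restrict settings described in the commit message above can be modelled in a few lines of user-space C. This is only a sketch of the policy as the message states it, not the kernel's vsprintf code: `pk_visible_value()` and its `has_cap_syslog` parameter are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of the %pK policy.
 * Returns the value a reader would see for a kernel pointer:
 * the pointer itself, or 0 when it must be hidden.
 */
static uintptr_t pk_visible_value(int kptr_restrict, bool has_cap_syslog,
                                  uintptr_t ptr)
{
    switch (kptr_restrict) {
    case 0:
        /* No deviation from plain %p. */
        return ptr;
    case 1:
        /* Hidden unless the reader holds CAP_SYSLOG. */
        return has_cap_syslog ? ptr : 0;
    default:
        /* 2: hidden regardless of privileges. */
        return 0;
    }
}
```

The table entry that follows bounds the sysctl to the range 0..2 through .extra1 and .extra2, so the default branch here stands for the value 2 only.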
2011-01-13 03:59:41 +03:00
|
|
|
{
|
|
|
|
.procname = "kptr_restrict",
|
|
|
|
.data = &kptr_restrict,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &two,
|
|
|
|
},
|
2010-11-16 08:17:27 +03:00
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "ngroups_max",
|
|
|
|
.data = &ngroups_max,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0444,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2010-05-08 01:11:44 +04:00
|
|
|
#if defined(CONFIG_LOCKUP_DETECTOR)
|
2010-02-13 01:19:19 +03:00
|
|
|
{
|
2010-05-08 01:11:44 +04:00
|
|
|
.procname = "watchdog",
|
|
|
|
.data = &watchdog_enabled,
|
2010-02-13 01:19:19 +03:00
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2010-05-08 01:11:44 +04:00
|
|
|
.proc_handler = proc_dowatchdog_enabled,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "watchdog_thresh",
|
|
|
|
.data = &softlockup_thresh,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dowatchdog_thresh,
|
|
|
|
.extra1 = &neg_one,
|
|
|
|
.extra2 = &sixty,
|
2010-02-13 01:19:19 +03:00
|
|
|
},
|
2010-05-08 01:11:46 +04:00
|
|
|
{
|
|
|
|
.procname = "softlockup_panic",
|
|
|
|
.data = &softlockup_panic,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2010-11-30 01:07:17 +03:00
|
|
|
{
|
|
|
|
.procname = "nmi_watchdog",
|
|
|
|
.data = &watchdog_enabled,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dowatchdog_enabled,
|
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
|
|
|
|
{
|
|
|
|
.procname = "unknown_nmi_panic",
|
|
|
|
.data = &unknown_nmi_panic,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
2010-02-13 01:19:19 +03:00
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
#if defined(CONFIG_X86)
|
2006-09-26 12:52:27 +04:00
|
|
|
{
|
|
|
|
.procname = "panic_on_unrecovered_nmi",
|
|
|
|
.data = &panic_on_unrecovered_nmi,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-09-26 12:52:27 +04:00
|
|
|
},
|
2009-06-25 01:32:11 +04:00
|
|
|
{
|
|
|
|
.procname = "panic_on_io_nmi",
|
|
|
|
.data = &panic_on_io_nmi,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-06-25 01:32:11 +04:00
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "bootloader_type",
|
|
|
|
.data = &bootloader_type,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0444,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2009-05-08 03:54:11 +04:00
|
|
|
{
|
|
|
|
.procname = "bootloader_version",
|
|
|
|
.data = &bootloader_version,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0444,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-05-08 03:54:11 +04:00
|
|
|
},
|
2006-12-07 04:14:11 +03:00
|
|
|
{
|
|
|
|
.procname = "kstack_depth_to_print",
|
|
|
|
.data = &kstack_depth_to_print,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-12-07 04:14:11 +03:00
|
|
|
},
|
2008-01-30 15:30:05 +03:00
|
|
|
{
|
|
|
|
.procname = "io_delay_type",
|
|
|
|
.data = &io_delay_type,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-01-30 15:30:05 +03:00
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
#endif
|
2006-02-21 05:28:07 +03:00
|
|
|
#if defined(CONFIG_MMU)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "randomize_va_space",
|
|
|
|
.data = &randomize_va_space,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2006-02-21 05:28:07 +03:00
|
|
|
#endif
|
2006-01-15 00:21:00 +03:00
|
|
|
#if defined(CONFIG_S390) && defined(CONFIG_SMP)
|
2005-07-27 22:44:57 +04:00
|
|
|
{
|
|
|
|
.procname = "spin_retry",
|
|
|
|
.data = &spin_retry,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-07-27 22:44:57 +04:00
|
|
|
},
|
2006-02-21 05:27:58 +03:00
|
|
|
#endif
|
2007-07-28 11:33:16 +04:00
|
|
|
#if defined(CONFIG_ACPI_SLEEP) && defined(CONFIG_X86)
|
2006-02-21 05:27:58 +03:00
|
|
|
{
|
|
|
|
.procname = "acpi_video_flags",
|
2007-07-19 12:47:41 +04:00
|
|
|
.data = &acpi_realmode_flags,
|
2006-02-21 05:27:58 +03:00
|
|
|
.maxlen = sizeof (unsigned long),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2006-02-21 05:27:58 +03:00
|
|
|
},
|
2006-02-28 20:42:23 +03:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_IA64
|
|
|
|
{
|
|
|
|
.procname = "ignore-unaligned-usertrap",
|
|
|
|
.data = &no_unaligned_warning,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-02-28 20:42:23 +03:00
|
|
|
},
|
2009-01-15 21:38:56 +03:00
|
|
|
{
|
|
|
|
.procname = "unaligned-dump-stack",
|
|
|
|
.data = &unaligned_dump_stack,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-01-15 21:38:56 +03:00
|
|
|
},
|
2006-06-26 15:56:52 +04:00
|
|
|
#endif
|
2009-01-15 22:08:40 +03:00
|
|
|
#ifdef CONFIG_DETECT_HUNG_TASK
|
|
|
|
{
|
|
|
|
.procname = "hung_task_panic",
|
|
|
|
.data = &sysctl_hung_task_panic,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-01-15 22:08:40 +03:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2008-01-25 23:08:02 +03:00
|
|
|
{
|
|
|
|
.procname = "hung_task_check_count",
|
|
|
|
.data = &sysctl_hung_task_check_count,
|
2008-01-25 23:08:34 +03:00
|
|
|
.maxlen = sizeof(unsigned long),
|
2008-01-25 23:08:02 +03:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2008-01-25 23:08:02 +03:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "hung_task_timeout_secs",
|
|
|
|
.data = &sysctl_hung_task_timeout_secs,
|
2008-01-25 23:08:34 +03:00
|
|
|
.maxlen = sizeof(unsigned long),
|
2008-01-25 23:08:02 +03:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dohung_task_timeout_secs,
|
2008-01-25 23:08:02 +03:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "hung_task_warnings",
|
|
|
|
.data = &sysctl_hung_task_warnings,
|
2008-01-25 23:08:34 +03:00
|
|
|
.maxlen = sizeof(unsigned long),
|
2008-01-25 23:08:02 +03:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2008-01-25 23:08:02 +03:00
|
|
|
},
|
2007-10-17 10:26:09 +04:00
|
|
|
#endif
|
2006-06-26 15:56:52 +04:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
{
|
|
|
|
.procname = "compat-log",
|
|
|
|
.data = &compat_log,
|
|
|
|
.maxlen = sizeof (int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-06-26 15:56:52 +04:00
|
|
|
},
|
2005-07-27 22:44:57 +04:00
|
|
|
#endif
|
2006-06-27 13:54:53 +04:00
|
|
|
#ifdef CONFIG_RT_MUTEXES
|
|
|
|
{
|
|
|
|
.procname = "max_lock_depth",
|
|
|
|
.data = &max_lock_depth,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-06-27 13:54:53 +04:00
|
|
|
},
|
2007-05-08 11:26:04 +04:00
|
|
|
#endif
|
2007-07-18 05:37:02 +04:00
|
|
|
{
|
|
|
|
.procname = "poweroff_cmd",
|
|
|
|
.data = &poweroff_cmd,
|
|
|
|
.maxlen = POWEROFF_CMD_PATH_LEN,
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dostring,
|
2007-07-18 05:37:02 +04:00
|
|
|
},
|
2008-04-29 12:01:32 +04:00
|
|
|
#ifdef CONFIG_KEYS
|
|
|
|
{
|
|
|
|
.procname = "keys",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = key_sysctls,
|
|
|
|
},
|
|
|
|
#endif
|
2008-06-18 20:26:49 +04:00
|
|
|
#ifdef CONFIG_RCU_TORTURE_TEST
|
|
|
|
{
|
|
|
|
.procname = "rcutorture_runnable",
|
|
|
|
.data = &rcutorture_runnable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-06-18 20:26:49 +04:00
|
|
|
},
|
|
|
|
#endif
|
perf: Do the big rename: Performance Counters -> Performance Events

Bye-bye Performance Counters, welcome Performance Events!

In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.

Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.

All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)

The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.

Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.

User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)

This patch has been generated via the following script:

  FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

  sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

  for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
  done

  FILES=$(find . -name perf_event.*)

  sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.

Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.

( NOTE: 'counters' are still the proper terminology when we deal
  with hardware registers - and these sed scripts are a bit
  over-eager in renaming them. I've undone some of that, but
  in case there's something left where 'counter' would be
  better than 'event' we can undo that on an individual basis
  instead of touching an otherwise nicely automated patch. )

Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 14:02:48 +04:00
|
|
|
#ifdef CONFIG_PERF_EVENTS
|
2009-04-09 12:53:45 +04:00
|
|
|
{
|
2009-09-21 14:02:48 +04:00
|
|
|
.procname = "perf_event_paranoid",
|
|
|
|
.data = &sysctl_perf_event_paranoid,
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_paranoid),
|
2009-04-09 12:53:45 +04:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-04-09 12:53:45 +04:00
|
|
|
},
|
2009-05-05 19:50:24 +04:00
|
|
|
{
|
2009-09-21 14:02:48 +04:00
|
|
|
.procname = "perf_event_mlock_kb",
|
|
|
|
.data = &sysctl_perf_event_mlock,
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_mlock),
|
2009-05-05 19:50:24 +04:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-05-05 19:50:24 +04:00
|
|
|
},
|
2009-05-25 19:39:05 +04:00
|
|
|
{
|
2009-09-21 14:02:48 +04:00
|
|
|
.procname = "perf_event_max_sample_rate",
|
|
|
|
.data = &sysctl_perf_event_sample_rate,
|
|
|
|
.maxlen = sizeof(sysctl_perf_event_sample_rate),
|
2009-05-25 19:39:05 +04:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-05-25 19:39:05 +04:00
|
|
|
},
|
2009-04-09 12:53:45 +04:00
|
|
|
#endif
|
2008-04-04 02:51:41 +04:00
|
|
|
#ifdef CONFIG_KMEMCHECK
|
|
|
|
{
|
|
|
|
.procname = "kmemcheck",
|
|
|
|
.data = &kmemcheck_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-04-04 02:51:41 +04:00
|
|
|
},
|
|
|
|
#endif
|
2009-09-15 23:53:11 +04:00
|
|
|
#ifdef CONFIG_BLOCK
|
2009-08-05 11:07:21 +04:00
|
|
|
{
|
|
|
|
.procname = "blk_iopoll",
|
|
|
|
.data = &blk_iopoll_enabled,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2009-08-05 11:07:21 +04:00
|
|
|
},
|
2009-09-15 23:53:11 +04:00
|
|
|
#endif
|
2009-04-03 13:30:53 +04:00
|
|
|
{ }
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
2007-10-18 14:05:22 +04:00
|
|
|
static struct ctl_table vm_table[] = {
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "overcommit_memory",
|
|
|
|
.data = &sysctl_overcommit_memory,
|
|
|
|
.maxlen = sizeof(sysctl_overcommit_memory),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2006-06-23 13:03:13 +04:00
|
|
|
{
|
|
|
|
.procname = "panic_on_oom",
|
|
|
|
.data = &sysctl_panic_on_oom,
|
|
|
|
.maxlen = sizeof(sysctl_panic_on_oom),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-06-23 13:03:13 +04:00
|
|
|
},
|
2007-10-17 10:25:56 +04:00
|
|
|
{
|
|
|
|
.procname = "oom_kill_allocating_task",
|
|
|
|
.data = &sysctl_oom_kill_allocating_task,
|
|
|
|
.maxlen = sizeof(sysctl_oom_kill_allocating_task),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2007-10-17 10:25:56 +04:00
|
|
|
},
|
oom: add sysctl to enable task memory dump
Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
dump of all system tasks (excluding kernel threads) when performing an
OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
oom_adj score, and name.
This is helpful for determining why there was an OOM condition and which
rogue task caused it.
It is configurable so that large systems, such as those with several
thousand tasks, do not incur a performance penalty associated with dumping
data they may not desire.
If an OOM was triggered as a result of a memory controller, the tasklist
shall be filtered to exclude tasks that are not a member of the same
cgroup.
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 11:14:07 +03:00
|
|
|
{
|
|
|
|
.procname = "oom_dump_tasks",
|
|
|
|
.data = &sysctl_oom_dump_tasks,
|
|
|
|
.maxlen = sizeof(sysctl_oom_dump_tasks),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-02-07 11:14:07 +03:00
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "overcommit_ratio",
|
|
|
|
.data = &sysctl_overcommit_ratio,
|
|
|
|
.maxlen = sizeof(sysctl_overcommit_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "page-cluster",
|
|
|
|
.data = &page_cluster,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "dirty_background_ratio",
|
|
|
|
.data = &dirty_background_ratio,
|
|
|
|
.maxlen = sizeof(dirty_background_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = dirty_background_ratio_handler,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
mm: add dirty_background_bytes and dirty_bytes sysctls
This change introduces two new sysctls to /proc/sys/vm:
dirty_background_bytes and dirty_bytes.
dirty_background_bytes is the counterpart to dirty_background_ratio and
dirty_bytes is the counterpart to dirty_ratio.
With growing memory capacities of individual machines, it's no longer
sufficient to specify dirty thresholds as a percentage of the amount of
dirtyable memory over the entire system.
dirty_background_bytes and dirty_bytes specify quantities of memory, in
bytes, that represent the dirty limits for the entire system. If either
of these values is set, its value represents the amount of dirty memory
that is needed to commence either background or direct writeback.
When a `bytes' or `ratio' file is written, its counterpart becomes a
function of the written value. For example, if dirty_bytes is written to
be 8096, 8K of memory is required to commence direct writeback.
dirty_ratio is then functionally equivalent to 8K / the amount of
dirtyable memory:
dirtyable_memory = free pages + mapped pages + file cache
dirty_background_bytes = dirty_background_ratio * dirtyable_memory
-or-
dirty_background_ratio = dirty_background_bytes / dirtyable_memory
AND
dirty_bytes = dirty_ratio * dirtyable_memory
-or-
dirty_ratio = dirty_bytes / dirtyable_memory
Only one of dirty_background_bytes and dirty_background_ratio may be
specified at a time, and only one of dirty_bytes and dirty_ratio may be
specified. When one sysctl is written, the other appears as 0 when read.
The `bytes' files operate on a page size granularity since dirty limits
are compared with ZVC values, which are in page units.
Prior to this change, the minimum dirty_ratio was 5 as implemented by
get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
written value between 0 and 100. This restriction is maintained, but
dirty_bytes has a lower limit of only one page.
Also prior to this change, the dirty_background_ratio could not equal or
exceed dirty_ratio. This restriction is maintained in addition to
restricting dirty_background_bytes. If either background threshold equals
or exceeds that of the dirty threshold, it is implicitly set to half the
dirty threshold.
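The rules above can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel's get_dirty_limits(): the struct dirty_params container, the helper names, and the convention that a zero `bytes' value means "fall back to the ratio" are hypothetical, chosen to mirror the description in this message.

```c
#include <assert.h>

/* Hypothetical container for the four tunables; names mirror the sysctls. */
struct dirty_params {
	unsigned long dirty_bytes;            /* 0: use dirty_ratio instead */
	unsigned long dirty_background_bytes; /* 0: use dirty_background_ratio */
	int dirty_ratio;                      /* percent of dirtyable memory */
	int dirty_background_ratio;           /* percent of dirtyable memory */
};

/* Round a byte count up to whole pages, since limits are kept in page units. */
static unsigned long bytes_to_pages(unsigned long bytes, unsigned long page_size)
{
	assert(page_size > 0);
	return (bytes + page_size - 1) / page_size;
}

/* Compute the background and direct writeback thresholds, both in pages. */
static void dirty_limits(const struct dirty_params *p,
			 unsigned long dirtyable_pages,
			 unsigned long page_size,
			 unsigned long *background, unsigned long *dirty)
{
	/* The minimum effective dirty_ratio of 5 is kept for the ratio path. */
	int ratio = p->dirty_ratio < 5 ? 5 : p->dirty_ratio;

	if (p->dirty_bytes)
		*dirty = bytes_to_pages(p->dirty_bytes, page_size);
	else
		*dirty = (unsigned long)ratio * dirtyable_pages / 100;

	if (p->dirty_background_bytes)
		*background = bytes_to_pages(p->dirty_background_bytes, page_size);
	else
		*background = (unsigned long)p->dirty_background_ratio *
			      dirtyable_pages / 100;

	/* A background threshold at or above the dirty one is halved. */
	if (*background >= *dirty)
		*background = *dirty / 2;
}
```

With a 4096-byte page, writing 8096 to dirty_bytes rounds up to two pages in this model, which is the 8K figure used in the example above.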
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-07 01:39:31 +03:00
|
|
|
{
|
|
|
|
.procname = "dirty_background_bytes",
|
|
|
|
.data = &dirty_background_bytes,
|
|
|
|
.maxlen = sizeof(dirty_background_bytes),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = dirty_background_bytes_handler,
|
2009-02-12 00:04:23 +03:00
|
|
|
.extra1 = &one_ul,
|
2009-01-07 01:39:31 +03:00
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "dirty_ratio",
|
|
|
|
.data = &vm_dirty_ratio,
|
|
|
|
.maxlen = sizeof(vm_dirty_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = dirty_ratio_handler,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
2009-01-07 01:39:31 +03:00
|
|
|
{
|
|
|
|
.procname = "dirty_bytes",
|
|
|
|
.data = &vm_dirty_bytes,
|
|
|
|
.maxlen = sizeof(vm_dirty_bytes),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = dirty_bytes_handler,
|
2009-05-01 02:08:57 +04:00
|
|
|
.extra1 = &dirty_bytes_min,
|
2009-01-07 01:39:31 +03:00
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "dirty_writeback_centisecs",
|
2006-03-24 14:15:48 +03:00
|
|
|
.data = &dirty_writeback_interval,
|
|
|
|
.maxlen = sizeof(dirty_writeback_interval),
|
2005-04-17 02:20:36 +04:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = dirty_writeback_centisecs_handler,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "dirty_expire_centisecs",
|
2006-03-24 14:15:48 +03:00
|
|
|
.data = &dirty_expire_interval,
|
|
|
|
.maxlen = sizeof(dirty_expire_interval),
|
2005-04-17 02:20:36 +04:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "nr_pdflush_threads",
|
|
|
|
.data = &nr_pdflush_threads,
|
|
|
|
.maxlen = sizeof nr_pdflush_threads,
|
|
|
|
.mode = 0444 /* read-only*/,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "swappiness",
|
|
|
|
.data = &vm_swappiness,
|
|
|
|
.maxlen = sizeof(vm_swappiness),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
hugetlb: derive huge pages nodes allowed from task mempolicy
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:
* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".
See the updated documentation [next patch] for more information
about the implications of this patch.
Examples:
Starting with:
Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0
Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:
sysctl vm.nr_hugepages[_mempolicy]=32
yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8
Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.
Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:
numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
This yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.
Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:
numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
yields:
Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The 8 huge pages freed were balanced over nodes 0 and 1.
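The three examples above can be reproduced with a toy model. This is a sketch only: set_nr_hugepages() here is a hypothetical user-space function, not the kernel routine, and it assumes pages are added or freed one at a time, round-robin over the nodes in the allowed mask (bit n set means node n is allowed).

```c
#include <assert.h>

#define NR_NODES 4	/* matches the four-node example in the text */

/*
 * Move the total huge page count to `target` by adding or freeing one page
 * per allowed node per pass. Nodes outside `allowed` are never touched.
 */
static void set_nr_hugepages(unsigned long pages[NR_NODES],
			     unsigned int allowed, unsigned long target)
{
	unsigned long total = 0;
	int node;

	assert(allowed != 0 && allowed < (1u << NR_NODES));
	for (node = 0; node < NR_NODES; node++)
		total += pages[node];

	while (total != target) {
		int progressed = 0;

		for (node = 0; node < NR_NODES && total != target; node++) {
			if (!(allowed & (1u << node)))
				continue;
			if (total < target) {
				pages[node]++;
				total++;
				progressed = 1;
			} else if (pages[node] > 0) {
				pages[node]--;
				total--;
				progressed = 1;
			}
		}
		/* Stop if the allowed nodes have nothing left to free. */
		if (!progressed)
			break;
	}
}
```

Starting from an empty pool, calling this with all four nodes allowed and a target of 32, then with only node 2 allowed and a target of 40, then with nodes 0 and 1 allowed and a target of 32, walks through the same per-node totals as the three listings above.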
[rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 04:58:21 +03:00
|
|
|
{
|
2005-04-17 02:20:36 +04:00
|
|
|
.procname = "nr_hugepages",
|
2008-07-24 08:27:42 +04:00
|
|
|
.data = NULL,
|
2005-04-17 02:20:36 +04:00
|
|
|
.maxlen = sizeof(unsigned long),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = hugetlb_sysctl_handler,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = (void *)&hugetlb_zero,
|
|
|
|
.extra2 = (void *)&hugetlb_infinity,
|
2009-12-15 04:58:21 +03:00
|
|
|
},
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
{
|
|
|
|
.procname = "nr_hugepages_mempolicy",
|
|
|
|
.data = NULL,
|
|
|
|
.maxlen = sizeof(unsigned long),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = &hugetlb_mempolicy_sysctl_handler,
|
|
|
|
.extra1 = (void *)&hugetlb_zero,
|
|
|
|
.extra2 = (void *)&hugetlb_infinity,
|
|
|
|
},
|
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "hugetlb_shm_group",
|
|
|
|
.data = &sysctl_hugetlb_shm_group,
|
|
|
|
.maxlen = sizeof(gid_t),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2007-07-17 15:03:13 +04:00
|
|
|
{
|
|
|
|
.procname = "hugepages_treat_as_movable",
|
|
|
|
.data = &hugepages_treat_as_movable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = hugetlb_treat_movable_handler,
|
2007-07-17 15:03:13 +04:00
|
|
|
},
|
hugetlb: introduce nr_overcommit_hugepages sysctl
While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:
1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.
2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.
To ease the administration, and to help make the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition
nr_overcommit_hugepages > 0
indicates the same administrative setting as
hugetlb_dynamic_pool == 1
Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.
A few caveats:
1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has been allocated. Another process could then
try to grow the pool, and fail to convert a surplus huge page to a normal
huge page and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.
2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls is increased
sufficiently, or the surplus huge pages go out of use and are freed.
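The relationship between the two sysctls and the second caveat can be sketched as follows. This is a simplified model, not the hugetlb code: struct hugepage_pool and the three helpers are hypothetical names, and the model assumes that in-use pages dropped from the static pool are re-accounted as surplus pages.

```c
#include <assert.h>

/* Hypothetical snapshot of the pool counters; fields follow the sysctls. */
struct hugepage_pool {
	unsigned long nr_hugepages;            /* static pool size */
	unsigned long nr_overcommit_hugepages; /* limit on surplus pages */
	unsigned long surplus_hugepages;       /* surplus pages currently held */
};

/* Stands in for the old boolean: on whenever the limit is non-zero. */
static int dynamic_pool_enabled(const struct hugepage_pool *p)
{
	return p->nr_overcommit_hugepages > 0;
}

/* Another surplus page is allowed only while the surplus is below the limit. */
static int may_allocate_surplus(const struct hugepage_pool *p)
{
	return p->surplus_hugepages < p->nr_overcommit_hugepages;
}

/*
 * Shrink the static pool by `n` pages that are still in use. Under the
 * model's assumption they become surplus, which may push the surplus
 * past the overcommit value (caveat 2).
 */
static void shrink_static_pool_in_use(struct hugepage_pool *p, unsigned long n)
{
	assert(n <= p->nr_hugepages);
	p->nr_hugepages -= n;
	p->surplus_hugepages += n;
}
```

In this model, a pool with an overcommit value of 4 and 3 surplus pages may still take one more; after shrinking the static pool by 2 in-use pages the surplus is 5, and no further surplus page is allowed until the overcommit value is raised above 5 or surplus pages are freed.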
Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-18 03:20:12 +03:00
|
|
|
{
|
|
|
|
.procname = "nr_overcommit_hugepages",
|
2008-07-24 08:27:42 +04:00
|
|
|
.data = NULL,
|
|
|
|
.maxlen = sizeof(unsigned long),
|
2007-12-18 03:20:12 +03:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = hugetlb_overcommit_handler,
|
2008-07-24 08:27:42 +04:00
|
|
|
.extra1 = (void *)&hugetlb_zero,
|
|
|
|
.extra2 = (void *)&hugetlb_infinity,
|
2007-12-18 03:20:12 +03:00
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "lowmem_reserve_ratio",
|
|
|
|
.data = &sysctl_lowmem_reserve_ratio,
|
|
|
|
.maxlen = sizeof(sysctl_lowmem_reserve_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = lowmem_reserve_ratio_sysctl_handler,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2006-01-08 12:00:39 +03:00
|
|
|
{
|
|
|
|
.procname = "drop_caches",
|
|
|
|
.data = &sysctl_drop_caches,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = drop_caches_sysctl_handler,
|
|
|
|
},
|
2010-05-25 01:32:28 +04:00
|
|
|
#ifdef CONFIG_COMPACTION
|
|
|
|
{
|
|
|
|
.procname = "compact_memory",
|
|
|
|
.data = &sysctl_compact_memory,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0200,
|
|
|
|
.proc_handler = sysctl_compaction_handler,
|
|
|
|
},
|
2010-05-25 01:32:31 +04:00
|
|
|
{
|
|
|
|
.procname = "extfrag_threshold",
|
|
|
|
.data = &sysctl_extfrag_threshold,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = sysctl_extfrag_handler,
|
|
|
|
.extra1 = &min_extfrag_threshold,
|
|
|
|
.extra2 = &max_extfrag_threshold,
|
|
|
|
},
|
|
|
|
|
2010-05-25 01:32:28 +04:00
|
|
|
#endif /* CONFIG_COMPACTION */
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "min_free_kbytes",
|
|
|
|
.data = &min_free_kbytes,
|
|
|
|
.maxlen = sizeof(min_free_kbytes),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = min_free_kbytes_sysctl_handler,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
2006-01-08 12:00:40 +03:00
|
|
|
{
|
|
|
|
.procname = "percpu_pagelist_fraction",
|
|
|
|
.data = &percpu_pagelist_fraction,
|
|
|
|
.maxlen = sizeof(percpu_pagelist_fraction),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = percpu_pagelist_fraction_sysctl_handler,
|
2006-01-08 12:00:40 +03:00
|
|
|
.extra1 = &min_percpu_pagelist_fract,
|
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
{
|
|
|
|
.procname = "max_map_count",
|
|
|
|
.data = &sysctl_max_map_count,
|
|
|
|
.maxlen = sizeof(sysctl_max_map_count),
|
|
|
|
.mode = 0644,
|
2009-12-18 02:27:05 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-12-15 04:59:52 +03:00
|
|
|
.extra1 = &zero,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2009-01-08 15:04:47 +03:00
|
|
|
#else
|
|
|
|
{
|
|
|
|
.procname = "nr_trim_pages",
|
|
|
|
.data = &sysctl_nr_trim_pages,
|
|
|
|
.maxlen = sizeof(sysctl_nr_trim_pages),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-01-08 15:04:47 +03:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
#endif
|
|
|
|
{
|
|
|
|
.procname = "laptop_mode",
|
|
|
|
.data = &laptop_mode,
|
|
|
|
.maxlen = sizeof(laptop_mode),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "block_dump",
|
|
|
|
.data = &block_dump,
|
|
|
|
.maxlen = sizeof(block_dump),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "vfs_cache_pressure",
|
|
|
|
.data = &sysctl_vfs_cache_pressure,
|
|
|
|
.maxlen = sizeof(sysctl_vfs_cache_pressure),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
|
|
|
#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
|
|
|
|
{
|
|
|
|
.procname = "legacy_va_layout",
|
|
|
|
.data = &sysctl_legacy_va_layout,
|
|
|
|
.maxlen = sizeof(sysctl_legacy_va_layout),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
|
|
|
#endif
|
2006-01-19 04:42:32 +03:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
{
|
|
|
|
.procname = "zone_reclaim_mode",
|
|
|
|
.data = &zone_reclaim_mode,
|
|
|
|
.maxlen = sizeof(zone_reclaim_mode),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-02-01 14:05:29 +03:00
|
|
|
.extra1 = &zero,
|
2006-01-19 04:42:32 +03:00
|
|
|
},
|
2006-07-03 11:24:13 +04:00
|
|
|
{
|
|
|
|
.procname = "min_unmapped_ratio",
|
|
|
|
.data = &sysctl_min_unmapped_ratio,
|
|
|
|
.maxlen = sizeof(sysctl_min_unmapped_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = sysctl_min_unmapped_ratio_sysctl_handler,
|
2006-07-03 11:24:13 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
2006-09-26 10:31:52 +04:00
|
|
|
{
|
|
|
|
.procname = "min_slab_ratio",
|
|
|
|
.data = &sysctl_min_slab_ratio,
|
|
|
|
.maxlen = sizeof(sysctl_min_slab_ratio),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = sysctl_min_slab_ratio_sysctl_handler,
|
2006-09-26 10:31:52 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one_hundred,
|
|
|
|
},
|
[PATCH] vdso: randomize the i386 vDSO by moving it into a vma
Move the i386 VDSO down into a vma and thus randomize it.
Besides the security implications, this feature also helps debuggers, which
can COW a vma-backed VDSO just like a normal DSO and can thus do
single-stepping and other debugging features.
It's good for hypervisors (Xen, VMWare) too, which typically live in the same
high-mapped address space as the VDSO, hence whenever the VDSO is used, they
get lots of guest pagefaults and have to fix such guest accesses up - which
slows things down instead of speeding things up (the primary purpose of the
VDSO).
There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
for older glibcs that still rely on a prelinked high-mapped VDSO. Newer
distributions (using glibc 2.3.3 or later) can turn this option off. Turning
it off is also recommended for security reasons: attackers cannot use the
predictable high-mapped VDSO page as syscall trampoline anymore.
There is a new vdso=[0|1] boot option as well, and a runtime
/proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
on/off.
(This version of the VDSO-randomization patch also has working ELF
coredumping; the previous patch crashed in the coredumping code.)
This code is a combined work of the exec-shield VDSO randomization
code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
started this patch and I completed it.
[akpm@osdl.org: cleanups]
[akpm@osdl.org: compile fix]
[akpm@osdl.org: compile fix 2]
[akpm@osdl.org: compile fix 3]
[akpm@osdl.org: revert MAXMEM change]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 13:53:50 +04:00
|
|
|
#endif
|
2007-05-09 13:35:13 +04:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
{
|
|
|
|
.procname = "stat_interval",
|
|
|
|
.data = &sysctl_stat_interval,
|
|
|
|
.maxlen = sizeof(sysctl_stat_interval),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_jiffies,
|
2007-05-09 13:35:13 +04:00
|
|
|
},
|
|
|
|
#endif
|
2009-12-15 22:27:45 +03:00
|
|
|
#ifdef CONFIG_MMU
|
2007-06-28 23:55:21 +04:00
|
|
|
{
|
|
|
|
.procname = "mmap_min_addr",
|
2009-07-31 20:54:11 +04:00
|
|
|
.data = &dac_mmap_min_addr,
|
|
|
|
.maxlen = sizeof(unsigned long),
|
2007-06-28 23:55:21 +04:00
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = mmap_min_addr_handler,
|
2007-06-28 23:55:21 +04:00
|
|
|
},
|
2009-12-15 22:27:45 +03:00
|
|
|
#endif
|
2007-07-16 10:38:01 +04:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
{
|
|
|
|
.procname = "numa_zonelist_order",
|
|
|
|
.data = &numa_zonelist_order,
|
|
|
|
.maxlen = NUMA_ZONELIST_ORDER_LEN,
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = numa_zonelist_order_handler,
|
2007-07-16 10:38:01 +04:00
|
|
|
},
|
|
|
|
#endif
|
2007-10-13 11:16:04 +04:00
|
|
|
#if (defined(CONFIG_X86_32) && !defined(CONFIG_UML)) || \
|
2007-03-01 04:07:42 +03:00
|
|
|
(defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
|
2006-06-27 13:53:50 +04:00
|
|
|
{
|
|
|
|
.procname = "vdso_enabled",
|
|
|
|
.data = &vdso_enabled,
|
|
|
|
.maxlen = sizeof(vdso_enabled),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2006-06-27 13:53:50 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
#endif
|
2008-02-05 09:29:20 +03:00
|
|
|
#ifdef CONFIG_HIGHMEM
|
|
|
|
{
|
|
|
|
.procname = "highmem_is_dirtyable",
|
|
|
|
.data = &vm_highmem_is_dirtyable,
|
|
|
|
.maxlen = sizeof(vm_highmem_is_dirtyable),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2008-02-05 09:29:20 +03:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
#endif
|
2009-04-14 01:39:33 +04:00
|
|
|
{
|
|
|
|
.procname = "scan_unevictable_pages",
|
|
|
|
.data = &scan_unevictable_pages,
|
|
|
|
.maxlen = sizeof(scan_unevictable_pages),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = scan_unevictable_handler,
|
2009-04-14 01:39:33 +04:00
|
|
|
},
|
2009-09-16 13:50:15 +04:00
|
|
|
#ifdef CONFIG_MEMORY_FAILURE
|
|
|
|
{
|
|
|
|
.procname = "memory_failure_early_kill",
|
|
|
|
.data = &sysctl_memory_failure_early_kill,
|
|
|
|
.maxlen = sizeof(sysctl_memory_failure_early_kill),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-09-16 13:50:15 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "memory_failure_recovery",
|
|
|
|
.data = &sysctl_memory_failure_recovery,
|
|
|
|
.maxlen = sizeof(sysctl_memory_failure_recovery),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-09-16 13:50:15 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
|
|
|
#endif
|
2009-04-03 13:30:53 +04:00
|
|
|
{ }
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
2007-02-14 11:34:07 +03:00
|
|
|
#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
|
2007-10-18 14:05:22 +04:00
|
|
|
static struct ctl_table binfmt_misc_table[] = {
|
2009-04-03 13:30:53 +04:00
|
|
|
{ }
|
2007-02-14 11:34:07 +03:00
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
2007-10-18 14:05:22 +04:00
|
|
|
static struct ctl_table fs_table[] = {
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "inode-nr",
|
|
|
|
.data = &inodes_stat,
|
|
|
|
.maxlen = 2*sizeof(int),
|
|
|
|
.mode = 0444,
|
2010-10-23 13:03:02 +04:00
|
|
|
.proc_handler = proc_nr_inodes,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "inode-state",
|
|
|
|
.data = &inodes_stat,
|
|
|
|
.maxlen = 7*sizeof(int),
|
|
|
|
.mode = 0444,
|
2010-10-23 13:03:02 +04:00
|
|
|
.proc_handler = proc_nr_inodes,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "file-nr",
|
|
|
|
.data = &files_stat,
|
2010-10-27 01:22:44 +04:00
|
|
|
.maxlen = sizeof(files_stat),
|
2005-04-17 02:20:36 +04:00
|
|
|
.mode = 0444,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_nr_files,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "file-max",
|
|
|
|
.data = &files_stat.max_files,
|
2010-10-27 01:22:44 +04:00
|
|
|
.maxlen = sizeof(files_stat.max_files),
|
2005-04-17 02:20:36 +04:00
|
|
|
.mode = 0644,
|
2010-10-27 01:22:44 +04:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2008-02-06 12:37:16 +03:00
|
|
|
{
|
|
|
|
.procname = "nr_open",
|
|
|
|
.data = &sysctl_nr_open,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2008-05-10 18:08:32 +04:00
|
|
|
.extra1 = &sysctl_nr_open_min,
|
|
|
|
.extra2 = &sysctl_nr_open_max,
|
2008-02-06 12:37:16 +03:00
|
|
|
},
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "dentry-state",
|
|
|
|
.data = &dentry_stat,
|
|
|
|
.maxlen = 6*sizeof(int),
|
|
|
|
.mode = 0444,
|
2010-10-10 13:36:23 +04:00
|
|
|
.proc_handler = proc_nr_dentry,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "overflowuid",
|
|
|
|
.data = &fs_overflowuid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &minolduid,
|
|
|
|
.extra2 = &maxolduid,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "overflowgid",
|
|
|
|
.data = &fs_overflowgid,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2005-04-17 02:20:36 +04:00
|
|
|
.extra1 = &minolduid,
|
|
|
|
.extra2 = &maxolduid,
|
|
|
|
},
|
2008-08-06 17:12:22 +04:00
|
|
|
#ifdef CONFIG_FILE_LOCKING
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "leases-enable",
|
|
|
|
.data = &leases_enable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2008-08-06 17:12:22 +04:00
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
#ifdef CONFIG_DNOTIFY
|
|
|
|
{
|
|
|
|
.procname = "dir-notify-enable",
|
|
|
|
.data = &dir_notify_enable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_MMU
|
2008-08-06 17:12:22 +04:00
|
|
|
#ifdef CONFIG_FILE_LOCKING
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "lease-break-time",
|
|
|
|
.data = &lease_break_time,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2008-08-06 17:12:22 +04:00
|
|
|
#endif
|
2008-10-16 09:05:12 +04:00
|
|
|
#ifdef CONFIG_AIO
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.procname = "aio-nr",
|
|
|
|
.data = &aio_nr,
|
|
|
|
.maxlen = sizeof(aio_nr),
|
|
|
|
.mode = 0444,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "aio-max-nr",
|
|
|
|
.data = &aio_max_nr,
|
|
|
|
.maxlen = sizeof(aio_max_nr),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
2005-04-17 02:20:36 +04:00
|
|
|
},
|
2008-10-16 09:05:12 +04:00
|
|
|
#endif /* CONFIG_AIO */
|
2006-06-02 00:10:59 +04:00
|
|
|
#ifdef CONFIG_INOTIFY_USER
|
2005-07-13 20:38:18 +04:00
|
|
|
{
|
|
|
|
.procname = "inotify",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = inotify_table,
|
|
|
|
},
|
|
|
|
#endif
|
epoll: introduce resource usage limits
It had been thought that the per-user file descriptor limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that lets a normal user request a fairly large
amount of kernel memory while staying well within its maximum number of
fds. To solve this problem, default limits are now imposed, and
/proc-based configuration has been introduced. A new directory,
/proc/sys/fs/epoll/, has been created, and it contains two configuration
points:
max_user_instances = Maximum number of devices - per user
max_user_watches = Maximum number of "watched" fds - per user
The current default for "max_user_watches" limits the memory used by epoll
to store "watches" to 1/32 of the amount of low RAM. As an example, a
256MB 32-bit machine will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, which should be
enough too.
This also changes userspace-visible behavior, because a new error code can
now come out of EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was
already listed, so that should be ok.
[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
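The default described above (1/32 of low RAM divided among watches) can be worked through with a small helper. This is only a sketch: `default_max_user_watches()` is a hypothetical name, and the per-watch cost is passed in as a parameter because the commit message does not state it. A cost of about 90 bytes per watch is an assumption chosen so that the result lands near the "roughly 90000" quoted for a 256MB machine.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of watches that fit in 1/32 of low RAM.
 * bytes_per_watch is an assumed per-watch kernel memory cost, not a
 * value taken from the epoll source.
 */
static uint64_t default_max_user_watches(uint64_t low_ram_bytes,
					 uint64_t bytes_per_watch)
{
	if (bytes_per_watch == 0)
		return 0;	/* avoid dividing by zero on bad input */
	return (low_ram_bytes / 32) / bytes_per_watch;
}
```

For 256MB of low RAM, 1/32 is 8MB; under the assumed 90 bytes per watch that gives 93206 watches, consistent with the order of magnitude in the message.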
2008-12-02 00:13:55 +03:00
|
|
|
#ifdef CONFIG_EPOLL
|
|
|
|
{
|
|
|
|
.procname = "epoll",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = epoll_table,
|
|
|
|
},
|
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
#endif
|
2005-06-23 11:09:43 +04:00
|
|
|
{
|
|
|
|
.procname = "suid_dumpable",
|
|
|
|
.data = &suid_dumpable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 14:11:48 +03:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2009-04-03 03:58:33 +04:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &two,
|
2005-06-23 11:09:43 +04:00
|
|
|
},
|
2007-02-14 11:34:07 +03:00
|
|
|
#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
|
|
|
|
{
|
|
|
|
.procname = "binfmt_misc",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = binfmt_misc_table,
|
|
|
|
},
|
|
|
|
#endif
|
2010-05-19 23:03:16 +04:00
|
|
|
{
|
2010-06-03 16:54:39 +04:00
|
|
|
.procname = "pipe-max-size",
|
|
|
|
.data = &pipe_max_size,
|
2010-05-19 23:03:16 +04:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2010-06-03 16:54:39 +04:00
|
|
|
.proc_handler = &pipe_proc_fn,
|
|
|
|
.extra1 = &pipe_min_size,
|
2010-05-19 23:03:16 +04:00
|
|
|
},
|
2009-04-03 13:30:53 +04:00
|
|
|
{ }
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
2007-10-18 14:05:22 +04:00
|
|
|
static struct ctl_table debug_table[] = {
|
2010-05-17 12:00:21 +04:00
|
|
|
#if defined(CONFIG_X86) || defined(CONFIG_PPC) || defined(CONFIG_SPARC) || \
|
|
|
|
defined(CONFIG_S390)
|
2007-07-22 13:12:28 +04:00
|
|
|
{
|
|
|
|
.procname = "exception-trace",
|
|
|
|
.data = &show_unhandled_signals,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec,
|
|
|
|
},
|
2010-02-25 16:34:15 +03:00
|
|
|
#endif
|
|
|
|
#if defined(CONFIG_OPTPROBES)
|
|
|
|
{
|
|
|
|
.procname = "kprobes-optimization",
|
|
|
|
.data = &sysctl_kprobes_optimization,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_kprobes_optimization_handler,
|
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &one,
|
|
|
|
},
|
2007-07-22 13:12:28 +04:00
|
|
|
#endif
|
2009-04-03 13:30:53 +04:00
|
|
|
{ }
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
2007-10-18 14:05:22 +04:00
|
|
|
static struct ctl_table dev_table[] = {
|
2009-04-03 13:30:53 +04:00
|
|
|
{ }
|
[PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-13 01:06:03 +04:00
|
|
|
};
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2005-11-04 13:18:40 +03:00
|
|
|
static DEFINE_SPINLOCK(sysctl_lock);
|
|
|
|
|
|
|
|
/* called under sysctl_lock */
|
|
|
|
static int use_table(struct ctl_table_header *p)
|
|
|
|
{
|
|
|
|
if (unlikely(p->unregistering))
|
|
|
|
return 0;
|
|
|
|
p->used++;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* called under sysctl_lock */
|
|
|
|
static void unuse_table(struct ctl_table_header *p)
|
|
|
|
{
|
|
|
|
if (!--p->used)
|
|
|
|
if (unlikely(p->unregistering))
|
|
|
|
complete(p->unregistering);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* called under sysctl_lock; drops and reacquires it if it has to wait */
|
|
|
|
static void start_unregistering(struct ctl_table_header *p)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* if p->used is 0, nobody will ever touch that entry again;
|
|
|
|
* we'll eliminate all paths to it before dropping sysctl_lock
|
|
|
|
*/
|
|
|
|
if (unlikely(p->used)) {
|
|
|
|
struct completion wait;
|
|
|
|
init_completion(&wait);
|
|
|
|
p->unregistering = &wait;
|
|
|
|
spin_unlock(&sysctl_lock);
|
|
|
|
wait_for_completion(&wait);
|
|
|
|
spin_lock(&sysctl_lock);
|
2008-07-15 09:44:23 +04:00
|
|
|
} else {
|
|
|
|
/* anything non-NULL; we'll never dereference it */
|
|
|
|
p->unregistering = ERR_PTR(-EINVAL);
|
2005-11-04 13:18:40 +03:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
* do not remove from the list until nobody holds it; walking the
|
|
|
|
* list in do_sysctl() relies on that.
|
|
|
|
*/
|
|
|
|
list_del_init(&p->ctl_entry);
|
|
|
|
}
|
|
|
|
|
2008-07-15 09:44:23 +04:00
|
|
|
void sysctl_head_get(struct ctl_table_header *head)
|
|
|
|
{
|
|
|
|
spin_lock(&sysctl_lock);
|
|
|
|
head->count++;
|
|
|
|
spin_unlock(&sysctl_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
void sysctl_head_put(struct ctl_table_header *head)
|
|
|
|
{
|
|
|
|
spin_lock(&sysctl_lock);
|
|
|
|
if (!--head->count)
|
|
|
|
kfree(head);
|
|
|
|
spin_unlock(&sysctl_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct ctl_table_header *sysctl_head_grab(struct ctl_table_header *head)
|
|
|
|
{
|
|
|
|
if (!head)
|
|
|
|
BUG();
|
|
|
|
spin_lock(&sysctl_lock);
|
|
|
|
if (!use_table(head))
|
|
|
|
head = ERR_PTR(-ENOENT);
|
|
|
|
spin_unlock(&sysctl_lock);
|
|
|
|
return head;
|
|
|
|
}
|
|
|
|
|
2007-02-14 11:34:11 +03:00
|
|
|
void sysctl_head_finish(struct ctl_table_header *head)
|
|
|
|
{
|
|
|
|
if (!head)
|
|
|
|
return;
|
|
|
|
spin_lock(&sysctl_lock);
|
|
|
|
unuse_table(head);
|
|
|
|
spin_unlock(&sysctl_lock);
|
|
|
|
}
|
|
|
|
|
2008-07-15 05:22:20 +04:00
|
|
|
static struct ctl_table_set *
|
|
|
|
lookup_header_set(struct ctl_table_root *root, struct nsproxy *namespaces)
|
|
|
|
{
|
|
|
|
struct ctl_table_set *set = &root->default_set;
|
|
|
|
if (root->lookup)
|
|
|
|
set = root->lookup(root, namespaces);
|
|
|
|
return set;
|
|
|
|
}
|
|
|
|
|
2007-11-30 15:54:00 +03:00
|
|
|
static struct list_head *
|
|
|
|
lookup_header_list(struct ctl_table_root *root, struct nsproxy *namespaces)
|
2007-02-14 11:34:11 +03:00
|
|
|
{
|
2008-07-15 05:22:20 +04:00
|
|
|
struct ctl_table_set *set = lookup_header_set(root, namespaces);
|
|
|
|
return &set->list;
|
2007-11-30 15:54:00 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
struct ctl_table_header *__sysctl_head_next(struct nsproxy *namespaces,
|
|
|
|
struct ctl_table_header *prev)
|
|
|
|
{
|
|
|
|
struct ctl_table_root *root;
|
|
|
|
struct list_head *header_list;
|
2007-02-14 11:34:11 +03:00
|
|
|
struct ctl_table_header *head;
|
|
|
|
struct list_head *tmp;
|
2007-11-30 15:54:00 +03:00
|
|
|
|
2007-02-14 11:34:11 +03:00
|
|
|
spin_lock(&sysctl_lock);
|
|
|
|
if (prev) {
|
2007-11-30 15:54:00 +03:00
|
|
|
head = prev;
|
2007-02-14 11:34:11 +03:00
|
|
|
tmp = &prev->ctl_entry;
|
|
|
|
unuse_table(prev);
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
tmp = &root_table_header.ctl_entry;
|
|
|
|
for (;;) {
|
|
|
|
head = list_entry(tmp, struct ctl_table_header, ctl_entry);
|
|
|
|
|
|
|
|
if (!use_table(head))
|
|
|
|
goto next;
|
|
|
|
spin_unlock(&sysctl_lock);
|
|
|
|
return head;
|
|
|
|
next:
|
2007-11-30 15:54:00 +03:00
|
|
|
root = head->root;
|
2007-02-14 11:34:11 +03:00
|
|
|
tmp = tmp->next;
|
2007-11-30 15:54:00 +03:00
|
|
|
header_list = lookup_header_list(root, namespaces);
|
|
|
|
if (tmp != header_list)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
do {
|
|
|
|
root = list_entry(root->root_list.next,
|
|
|
|
struct ctl_table_root, root_list);
|
|
|
|
if (root == &sysctl_table_root)
|
|
|
|
goto out;
|
|
|
|
header_list = lookup_header_list(root, namespaces);
|
|
|
|
} while (list_empty(header_list));
|
|
|
|
tmp = header_list->next;
|
2007-02-14 11:34:11 +03:00
|
|
|
}
|
2007-11-30 15:54:00 +03:00
|
|
|
out:
|
2007-02-14 11:34:11 +03:00
|
|
|
spin_unlock(&sysctl_lock);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2007-11-30 15:54:00 +03:00
|
|
|
struct ctl_table_header *sysctl_head_next(struct ctl_table_header *prev)
|
|
|
|
{
|
|
|
|
return __sysctl_head_next(current->nsproxy, prev);
|
|
|
|
}
|
|
|
|
|
|
|
|
void register_sysctl_root(struct ctl_table_root *root)
|
|
|
|
{
|
|
|
|
spin_lock(&sysctl_lock);
|
|
|
|
list_add_tail(&root->root_list, &sysctl_table_root.root_list);
|
|
|
|
spin_unlock(&sysctl_lock);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
2007-02-14 11:34:11 +03:00
|
|
|
* sysctl_perm does NOT grant the superuser all rights automatically, because
|
2005-04-17 02:20:36 +04:00
|
|
|
* some sysctl variables are readonly even to root.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static int test_perm(int mode, int op)
|
|
|
|
{
|
2008-11-14 02:39:12 +03:00
|
|
|
if (!current_euid())
|
2005-04-17 02:20:36 +04:00
|
|
|
mode >>= 6;
|
|
|
|
else if (in_egroup_p(0))
|
|
|
|
mode >>= 3;
|
2008-07-16 05:03:57 +04:00
|
|
|
if ((op & ~mode & (MAY_READ|MAY_WRITE|MAY_EXEC)) == 0)
|
2005-04-17 02:20:36 +04:00
|
|
|
return 0;
|
|
|
|
return -EACCES;
|
|
|
|
}
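
The permission check above can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel's code: the `SK_`/`sk_` names are hypothetical, and the caller's identity is passed in as two flags (an assumption made here) instead of being read via current_euid() and in_egroup_p().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SK_MAY_EXEC  1
#define SK_MAY_WRITE 2
#define SK_MAY_READ  4

/*
 * Hypothetical stand-in for test_perm(): pick the owner, group or
 * "other" permission triplet, then check that every requested bit
 * in op is present in it.
 */
static int sk_test_perm(int mode, int op, bool euid_is_root, bool in_group_zero)
{
	if (euid_is_root)
		mode >>= 6;	/* owner bits */
	else if (in_group_zero)
		mode >>= 3;	/* group bits */
	/* otherwise the low three "other" bits apply unchanged */
	if ((op & ~mode & (SK_MAY_READ | SK_MAY_WRITE | SK_MAY_EXEC)) == 0)
		return 0;
	return -EACCES;
}

/* Usage: a 0444 entry rejects writes even when the effective uid is 0. */
static void sk_test_perm_demo(void)
{
	assert(sk_test_perm(0644, SK_MAY_WRITE, true, false) == 0);
	assert(sk_test_perm(0644, SK_MAY_WRITE, false, false) == -EACCES);
	assert(sk_test_perm(0644, SK_MAY_READ, false, false) == 0);
	assert(sk_test_perm(0444, SK_MAY_WRITE, true, false) == -EACCES);
}
```

Because the check only consults the mode bits, clearing the owner write bit is enough to make an entry read-only for root as well, which is the behaviour the comment above sysctl_perm describes.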
|
|
|
|
|
2008-04-29 12:02:44 +04:00
|
|
|
int sysctl_perm(struct ctl_table_root *root, struct ctl_table *table, int op)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
int error;
|
2008-04-29 12:02:44 +04:00
|
|
|
int mode;
|
|
|
|
|
2008-07-16 05:03:57 +04:00
|
|
|
error = security_sysctl(table, op & (MAY_READ | MAY_WRITE | MAY_EXEC));
|
2005-04-17 02:20:36 +04:00
|
|
|
if (error)
|
|
|
|
return error;
|
2008-04-29 12:02:44 +04:00
|
|
|
|
|
|
|
if (root->permissions)
|
|
|
|
mode = root->permissions(root, current->nsproxy, table);
|
|
|
|
else
|
|
|
|
mode = table->mode;
|
|
|
|
|
|
|
|
return test_perm(mode, op);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2007-02-14 11:34:13 +03:00
|
|
|
static void sysctl_set_parent(struct ctl_table *parent, struct ctl_table *table)
|
|
|
|
{
|
2009-04-03 14:18:02 +04:00
|
|
|
for (; table->procname; table++) {
|
2007-02-14 11:34:13 +03:00
|
|
|
table->parent = parent;
|
|
|
|
if (table->child)
|
|
|
|
sysctl_set_parent(table, table->child);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static __init int sysctl_init(void)
|
|
|
|
{
|
|
|
|
sysctl_set_parent(NULL, root_table);
|
2008-04-29 12:02:36 +04:00
|
|
|
#ifdef CONFIG_SYSCTL_SYSCALL_CHECK
|
2010-08-11 01:17:51 +04:00
|
|
|
sysctl_check_table(current->nsproxy, root_table);
|
2008-04-29 12:02:36 +04:00
|
|
|
#endif
|
2007-02-14 11:34:13 +03:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
core_initcall(sysctl_init);
|
|
|
|
|
2008-07-27 09:31:22 +04:00
|
|
|
static struct ctl_table *is_branch_in(struct ctl_table *branch,
|
|
|
|
struct ctl_table *table)
|
2008-07-15 14:33:31 +04:00
|
|
|
{
|
|
|
|
struct ctl_table *p;
|
|
|
|
const char *s = branch->procname;
|
|
|
|
|
|
|
|
/* branch should have named subdirectory as its first element */
|
|
|
|
if (!s || !branch->child)
|
2008-07-27 09:31:22 +04:00
|
|
|
return NULL;
|
2008-07-15 14:33:31 +04:00
|
|
|
|
|
|
|
/* ... and nothing else */
|
2009-04-03 14:18:02 +04:00
|
|
|
if (branch[1].procname)
|
2008-07-27 09:31:22 +04:00
|
|
|
return NULL;
|
2008-07-15 14:33:31 +04:00
|
|
|
|
|
|
|
/* table should contain subdirectory with the same name */
|
2009-04-03 14:18:02 +04:00
|
|
|
for (p = table; p->procname; p++) {
|
2008-07-15 14:33:31 +04:00
|
|
|
if (!p->child)
|
|
|
|
continue;
|
|
|
|
if (p->procname && strcmp(p->procname, s) == 0)
|
2008-07-27 09:31:22 +04:00
|
|
|
return p;
|
2008-07-15 14:33:31 +04:00
|
|
|
}
|
2008-07-27 09:31:22 +04:00
|
|
|
return NULL;
|
2008-07-15 14:33:31 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* see if attaching q to p would be an improvement */
|
|
|
|
static void try_attach(struct ctl_table_header *p, struct ctl_table_header *q)
|
|
|
|
{
|
|
|
|
struct ctl_table *to = p->ctl_table, *by = q->ctl_table;
|
2008-07-27 09:31:22 +04:00
|
|
|
struct ctl_table *next;
|
2008-07-15 14:33:31 +04:00
|
|
|
int is_better = 0;
|
|
|
|
int not_in_parent = !p->attached_by;
|
|
|
|
|
2008-07-27 09:31:22 +04:00
|
|
|
while ((next = is_branch_in(by, to)) != NULL) {
|
2008-07-15 14:33:31 +04:00
|
|
|
if (by == q->attached_by)
|
|
|
|
is_better = 1;
|
|
|
|
if (to == p->attached_by)
|
|
|
|
not_in_parent = 1;
|
|
|
|
by = by->child;
|
2008-07-27 09:31:22 +04:00
|
|
|
to = next->child;
|
2008-07-15 14:33:31 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (is_better && not_in_parent) {
|
|
|
|
q->attached_by = by;
|
|
|
|
q->attached_to = to;
|
|
|
|
q->parent = p;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/**
|
2007-11-30 15:54:00 +03:00
|
|
|
* __register_sysctl_paths - register a sysctl hierarchy
|
|
|
|
* @root: List of sysctl headers to register on
|
|
|
|
* @namespaces: Data to compute which lists of sysctl entries are visible
|
2007-11-30 15:50:18 +03:00
|
|
|
* @path: The path to the directory the sysctl table is in.
|
2005-04-17 02:20:36 +04:00
|
|
|
* @table: the top-level table structure
|
|
|
|
*
|
|
|
|
* Register a sysctl table hierarchy. @table should be a filled in ctl_table
|
2007-11-30 15:50:18 +03:00
|
|
|
* array. A completely 0 filled entry terminates the table.
|
2005-04-17 02:20:36 +04:00
|
|
|
*
|
2007-10-18 14:05:22 +04:00
|
|
|
* The members of the &struct ctl_table structure are used as follows:
|
2005-04-17 02:20:36 +04:00
|
|
|
*
|
|
|
|
* procname - the name of the sysctl file under /proc/sys. Set to %NULL to not
|
|
|
|
* enter a sysctl file
|
|
|
|
*
|
|
|
|
* data - a pointer to data for use by proc_handler
|
|
|
|
*
|
|
|
|
* maxlen - the maximum size in bytes of the data
|
|
|
|
*
|
|
|
|
* mode - the file permissions for the /proc/sys file, and for sysctl(2)
|
|
|
|
*
|
|
|
|
* child - a pointer to the child sysctl table if this entry is a directory, or
|
|
|
|
* %NULL.
|
|
|
|
*
|
|
|
|
* proc_handler - the text handler routine (described below)
|
|
|
|
*
|
|
|
|
* de - for internal use by the sysctl routines
|
|
|
|
*
|
|
|
|
* extra1, extra2 - extra pointers usable by the proc handler routines
|
|
|
|
*
|
|
|
|
* Leaf nodes in the sysctl tree will be represented by a single file
|
|
|
|
* under /proc; non-leaf nodes will be represented by directories.
|
|
|
|
*
|
|
|
|
* sysctl(2) can automatically manage read and write requests through
|
|
|
|
* the sysctl table. The data and maxlen fields of the ctl_table
|
|
|
|
* struct enable minimal validation of the values being written to be
|
|
|
|
* performed, and the mode field allows minimal authentication.
|
|
|
|
*
|
|
|
|
* There must be a proc_handler routine for any terminal nodes
|
|
|
|
* mirrored under /proc/sys (non-terminals are handled by a built-in
|
|
|
|
* directory handler). Several default handlers are available to
|
|
|
|
* cover common cases -
|
|
|
|
*
|
|
|
|
* proc_dostring(), proc_dointvec(), proc_dointvec_jiffies(),
|
|
|
|
* proc_dointvec_userhz_jiffies(), proc_dointvec_minmax(),
|
|
|
|
* proc_doulongvec_ms_jiffies_minmax(), proc_doulongvec_minmax()
|
|
|
|
*
|
|
|
|
* It is the handler's job to read the input buffer from user memory
|
|
|
|
* and process it. The handler should return 0 on success.
|
|
|
|
*
|
|
|
|
* This routine returns %NULL on a failure to register, and a pointer
|
|
|
|
* to the table header on success.
|
|
|
|
*/
|
2007-11-30 15:54:00 +03:00
|
|
|
struct ctl_table_header *__register_sysctl_paths(
|
|
|
|
struct ctl_table_root *root,
|
|
|
|
struct nsproxy *namespaces,
|
|
|
|
const struct ctl_path *path, struct ctl_table *table)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2007-11-30 15:50:18 +03:00
|
|
|
struct ctl_table_header *header;
|
|
|
|
struct ctl_table *new, **prevp;
|
|
|
|
unsigned int n, npath;
|
2008-07-15 14:33:31 +04:00
|
|
|
struct ctl_table_set *set;
|
2007-11-30 15:50:18 +03:00
|
|
|
|
|
|
|
/* Count the path components */
|
2009-04-03 14:18:02 +04:00
|
|
|
for (npath = 0; path[npath].procname; ++npath)
|
2007-11-30 15:50:18 +03:00
|
|
|
;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For each path component, allocate a 2-element ctl_table array.
|
|
|
|
* The first array element will be filled with the sysctl entry
|
2009-04-03 14:18:02 +04:00
|
|
|
* for this, the second will be the sentinel (procname == 0).
|
2007-11-30 15:50:18 +03:00
|
|
|
*
|
|
|
|
* We allocate everything in one go so that we don't have to
|
|
|
|
* worry about freeing additional memory in unregister_sysctl_table.
|
|
|
|
*/
|
|
|
|
header = kzalloc(sizeof(struct ctl_table_header) +
|
|
|
|
(2 * npath * sizeof(struct ctl_table)), GFP_KERNEL);
|
|
|
|
if (!header)
|
2005-04-17 02:20:36 +04:00
|
|
|
return NULL;
|
2007-11-30 15:50:18 +03:00
|
|
|
|
|
|
|
new = (struct ctl_table *) (header + 1);
|
|
|
|
|
|
|
|
/* Now connect the dots */
|
|
|
|
prevp = &header->ctl_table;
|
|
|
|
for (n = 0; n < npath; ++n, ++path) {
|
|
|
|
/* Copy the procname */
|
|
|
|
new->procname = path->procname;
|
|
|
|
new->mode = 0555;
|
|
|
|
|
|
|
|
*prevp = new;
|
|
|
|
prevp = &new->child;
|
|
|
|
|
|
|
|
new += 2;
|
|
|
|
}
|
|
|
|
*prevp = table;
|
2007-11-30 15:52:10 +03:00
|
|
|
header->ctl_table_arg = table;
|
2007-11-30 15:50:18 +03:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&header->ctl_entry);
|
|
|
|
header->used = 0;
|
|
|
|
header->unregistering = NULL;
|
2007-11-30 15:54:00 +03:00
|
|
|
header->root = root;
|
2007-11-30 15:50:18 +03:00
|
|
|
sysctl_set_parent(NULL, header->ctl_table);
|
2008-07-15 09:44:23 +04:00
|
|
|
header->count = 1;
|
2008-04-29 12:02:36 +04:00
|
|
|
#ifdef CONFIG_SYSCTL_SYSCALL_CHECK
|
2007-11-30 15:54:00 +03:00
|
|
|
if (sysctl_check_table(namespaces, header->ctl_table)) {
|
2007-11-30 15:50:18 +03:00
|
|
|
kfree(header);
|
2007-10-18 14:05:54 +04:00
|
|
|
return NULL;
|
|
|
|
}
|
2008-04-29 12:02:36 +04:00
|
|
|
#endif
|
2005-11-04 13:18:40 +03:00
|
|
|
spin_lock(&sysctl_lock);
|
2008-07-15 05:22:20 +04:00
|
|
|
header->set = lookup_header_set(root, namespaces);
|
2008-07-15 14:33:31 +04:00
|
|
|
header->attached_by = header->ctl_table;
|
|
|
|
header->attached_to = root_table;
|
|
|
|
header->parent = &root_table_header;
|
|
|
|
for (set = header->set; set; set = set->parent) {
|
|
|
|
struct ctl_table_header *p;
|
|
|
|
list_for_each_entry(p, &set->list, ctl_entry) {
|
|
|
|
if (p->unregistering)
|
|
|
|
continue;
|
|
|
|
try_attach(p, header);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
header->parent->count++;
|
2008-07-15 05:22:20 +04:00
|
|
|
list_add_tail(&header->ctl_entry, &header->set->list);
|
2005-11-04 13:18:40 +03:00
|
|
|
spin_unlock(&sysctl_lock);
|
2007-11-30 15:50:18 +03:00
|
|
|
|
|
|
|
return header;
|
|
|
|
}
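
The single-allocation trick used by __register_sysctl_paths (header plus one 2-element array per path component, chained through the child pointers) can be sketched in userspace roughly as follows. The `sk_` types are simplified, hypothetical stand-ins for ctl_table, ctl_path and ctl_table_header, and calloc() stands in for kzalloc(); locking, attachment and namespace handling are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced versions of the kernel structures. */
struct sk_table { const char *procname; int mode; struct sk_table *child; };
struct sk_path { const char *procname; };
struct sk_header { struct sk_table *ctl_table; };

static struct sk_header *sk_build_chain(const struct sk_path *path,
					struct sk_table *table)
{
	struct sk_header *header;
	struct sk_table *new, **prevp;
	size_t n, npath;

	/* Count the path components. */
	for (npath = 0; path[npath].procname; ++npath)
		;

	/* One zeroed block: the header, then 2 entries per component. */
	header = calloc(1, sizeof(*header) + 2 * npath * sizeof(*new));
	if (!header)
		return NULL;
	new = (struct sk_table *)(header + 1);

	/* Link each directory entry to the next; the zeroed second
	 * element of every pair acts as the sentinel. */
	prevp = &header->ctl_table;
	for (n = 0; n < npath; ++n, ++path) {
		new->procname = path->procname;
		new->mode = 0555;
		*prevp = new;
		prevp = &new->child;
		new += 2;
	}
	*prevp = table;	/* the caller's table hangs off the last directory */
	return header;
}

/* Usage: build net/ipv4 and attach a one-entry leaf table. */
static void sk_build_chain_demo(void)
{
	struct sk_path path[] = { { "net" }, { "ipv4" }, { NULL } };
	struct sk_table leaf[] = { { "ip_forward", 0644, NULL }, { NULL, 0, NULL } };
	struct sk_header *h = sk_build_chain(path, leaf);

	assert(h != NULL);
	assert(strcmp(h->ctl_table[0].procname, "net") == 0);
	assert(h->ctl_table[1].procname == NULL);
	assert(strcmp(h->ctl_table[0].child[0].procname, "ipv4") == 0);
	assert(h->ctl_table[0].child[0].child == leaf);
	free(h);	/* a single free releases header and directories */
}
```

Keeping everything in one block is what lets unregistration release the generated directory entries with a single free of the header.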
|
|
|
|
|
2007-11-30 15:54:00 +03:00
|
|
|
/**
|
|
|
|
* register_sysctl_paths - register a sysctl table hierarchy
|
|
|
|
* @path: The path to the directory the sysctl table is in.
|
|
|
|
* @table: the top-level table structure
|
|
|
|
*
|
|
|
|
* Register a sysctl table hierarchy. @table should be a filled in ctl_table
|
|
|
|
* array. A completely 0 filled entry terminates the table.
|
|
|
|
*
|
|
|
|
* See __register_sysctl_paths for more details.
|
|
|
|
*/
|
|
|
|
struct ctl_table_header *register_sysctl_paths(const struct ctl_path *path,
|
|
|
|
struct ctl_table *table)
|
|
|
|
{
|
|
|
|
return __register_sysctl_paths(&sysctl_table_root, current->nsproxy,
|
|
|
|
path, table);
|
|
|
|
}
|
|
|
|
|
2007-11-30 15:50:18 +03:00
|
|
|
/**
|
|
|
|
* register_sysctl_table - register a sysctl table hierarchy
|
|
|
|
* @table: the top-level table structure
|
|
|
|
*
|
|
|
|
* Register a sysctl table hierarchy. @table should be a filled in ctl_table
|
|
|
|
* array. A completely 0 filled entry terminates the table.
|
|
|
|
*
|
|
|
|
* See register_sysctl_paths for more details.
|
|
|
|
*/
|
|
|
|
struct ctl_table_header *register_sysctl_table(struct ctl_table *table)
|
|
|
|
{
|
|
|
|
static const struct ctl_path null_path[] = { {} };
|
|
|
|
|
|
|
|
return register_sysctl_paths(null_path, table);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* unregister_sysctl_table - unregister a sysctl table hierarchy
|
|
|
|
* @header: the header returned from register_sysctl_table
|
|
|
|
*
|
|
|
|
* Unregisters the sysctl table and all children. proc entries may not
|
|
|
|
* actually be removed until they are no longer used by anyone.
|
|
|
|
*/
|
|
|
|
void unregister_sysctl_table(struct ctl_table_header *header)
|
|
|
|
{
|
2005-11-04 13:18:40 +03:00
|
|
|
might_sleep();
|
2007-12-05 10:45:24 +03:00
|
|
|
|
|
|
|
if (header == NULL)
|
|
|
|
return;
|
|
|
|
|
2005-11-04 13:18:40 +03:00
|
|
|
spin_lock(&sysctl_lock);
|
|
|
|
start_unregistering(header);
|
2008-07-15 14:33:31 +04:00
|
|
|
if (!--header->parent->count) {
|
|
|
|
WARN_ON(1);
|
|
|
|
kfree(header->parent);
|
|
|
|
}
|
2008-07-15 09:44:23 +04:00
|
|
|
if (!--header->count)
|
|
|
|
kfree(header);
|
2005-11-04 13:18:40 +03:00
|
|
|
spin_unlock(&sysctl_lock);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2008-07-15 16:54:06 +04:00
|
|
|
int sysctl_is_seen(struct ctl_table_header *p)
|
|
|
|
{
|
|
|
|
struct ctl_table_set *set = p->set;
|
|
|
|
int res;
|
|
|
|
spin_lock(&sysctl_lock);
|
|
|
|
if (p->unregistering)
|
|
|
|
res = 0;
|
|
|
|
else if (!set->is_seen)
|
|
|
|
res = 1;
|
|
|
|
else
|
|
|
|
res = set->is_seen(set);
|
|
|
|
spin_unlock(&sysctl_lock);
|
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
2008-07-15 05:22:20 +04:00
|
|
|
void setup_sysctl_set(struct ctl_table_set *p,
|
|
|
|
struct ctl_table_set *parent,
|
|
|
|
int (*is_seen)(struct ctl_table_set *))
|
|
|
|
{
|
|
|
|
INIT_LIST_HEAD(&p->list);
|
|
|
|
p->parent = parent ? parent : &sysctl_table_root.default_set;
|
|
|
|
p->is_seen = is_seen;
|
|
|
|
}
|
|
|
|
|
2006-09-27 12:51:04 +04:00
|
|
|
#else /* !CONFIG_SYSCTL */
|
2007-10-18 14:05:22 +04:00
|
|
|
struct ctl_table_header *register_sysctl_table(struct ctl_table * table)
|
2006-09-27 12:51:04 +04:00
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2007-11-30 15:50:18 +03:00
|
|
|
struct ctl_table_header *register_sysctl_paths(const struct ctl_path *path,
|
|
|
|
struct ctl_table *table)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2006-09-27 12:51:04 +04:00
|
|
|
void unregister_sysctl_table(struct ctl_table_header * table)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2008-07-15 05:22:20 +04:00
|
|
|
void setup_sysctl_set(struct ctl_table_set *p,
|
|
|
|
struct ctl_table_set *parent,
|
|
|
|
int (*is_seen)(struct ctl_table_set *))
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2008-07-15 09:44:23 +04:00
|
|
|
void sysctl_head_put(struct ctl_table_header *head)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2006-09-27 12:51:04 +04:00
|
|
|
#endif /* CONFIG_SYSCTL */
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* /proc/sys support
|
|
|
|
*/
|
|
|
|
|
2006-09-27 12:51:04 +04:00
|
|
|
#ifdef CONFIG_PROC_SYSCTL
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2006-10-02 13:18:05 +04:00
|
|
|
static int _proc_do_string(void *data, int maxlen, int write,
|
2009-09-24 02:57:19 +04:00
|
|
|
void __user *buffer,
|
2006-10-02 13:18:05 +04:00
|
|
|
size_t *lenp, loff_t *ppos)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
size_t len;
|
|
|
|
char __user *p;
|
|
|
|
char c;
|
2007-02-10 12:46:38 +03:00
|
|
|
|
|
|
|
if (!data || !maxlen || !*lenp) {
|
2005-04-17 02:20:36 +04:00
|
|
|
*lenp = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
2007-02-10 12:46:38 +03:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
if (write) {
|
|
|
|
len = 0;
|
|
|
|
p = buffer;
|
|
|
|
while (len < *lenp) {
|
|
|
|
if (get_user(c, p++))
|
|
|
|
return -EFAULT;
|
|
|
|
if (c == 0 || c == '\n')
|
|
|
|
break;
|
|
|
|
len++;
|
|
|
|
}
|
2006-10-02 13:18:04 +04:00
|
|
|
if (len >= maxlen)
|
|
|
|
len = maxlen-1;
|
|
|
|
if (copy_from_user(data, buffer, len))
|
2005-04-17 02:20:36 +04:00
|
|
|
return -EFAULT;
|
2006-10-02 13:18:04 +04:00
|
|
|
((char *) data)[len] = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
*ppos += *lenp;
|
|
|
|
} else {
|
2006-10-02 13:18:04 +04:00
|
|
|
len = strlen(data);
|
|
|
|
if (len > maxlen)
|
|
|
|
len = maxlen;
|
2007-02-10 12:46:38 +03:00
|
|
|
|
|
|
|
if (*ppos > len) {
|
|
|
|
*lenp = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
data += *ppos;
|
|
|
|
len -= *ppos;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
if (len > *lenp)
|
|
|
|
len = *lenp;
|
|
|
|
if (len)
|
2006-10-02 13:18:04 +04:00
|
|
|
if (copy_to_user(buffer, data, len))
|
2005-04-17 02:20:36 +04:00
|
|
|
return -EFAULT;
|
|
|
|
if (len < *lenp) {
|
|
|
|
if (put_user('\n', ((char __user *) buffer) + len))
|
|
|
|
return -EFAULT;
|
|
|
|
len++;
|
|
|
|
}
|
|
|
|
*lenp = len;
|
|
|
|
*ppos += len;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-10-02 13:18:04 +04:00
|
|
|
/**
|
|
|
|
* proc_dostring - read a string sysctl
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* Reads/writes a string from/to the user buffer. If the kernel
|
|
|
|
* buffer provided is not large enough to hold the string, the
|
|
|
|
* string is truncated. The copied string is NUL-terminated.
|
|
|
|
* If the string is being read by the user process, it is copied
|
|
|
|
* and a newline '\n' is added. It is truncated if the buffer is
|
|
|
|
* not large enough.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-24 02:57:19 +04:00
|
|
|
int proc_dostring(struct ctl_table *table, int write,
|
2006-10-02 13:18:04 +04:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2009-09-24 02:57:19 +04:00
|
|
|
return _proc_do_string(table->data, table->maxlen, write,
|
2006-10-02 13:18:04 +04:00
|
|
|
buffer, lenp, ppos);
|
|
|
|
}
|
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
static size_t proc_skip_spaces(char **buf)
|
|
|
|
{
|
|
|
|
size_t ret;
|
|
|
|
char *tmp = skip_spaces(*buf);
|
|
|
|
ret = tmp - *buf;
|
|
|
|
*buf = tmp;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2010-05-05 04:26:55 +04:00
|
|
|
static void proc_skip_char(char **buf, size_t *size, const char v)
|
|
|
|
{
|
|
|
|
while (*size) {
|
|
|
|
if (**buf != v)
|
|
|
|
break;
|
|
|
|
(*size)--;
|
|
|
|
(*buf)++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
#define TMPBUFLEN 22
|
|
|
|
/**
|
2010-05-21 22:29:53 +04:00
|
|
|
* proc_get_long - reads an ASCII formatted integer from a kernel buffer
|
2010-05-05 04:26:45 +04:00
|
|
|
*
|
2010-05-21 22:29:53 +04:00
|
|
|
* @buf: a kernel buffer
|
|
|
|
* @size: size of the kernel buffer
|
|
|
|
* @val: this is where the number will be stored
|
|
|
|
* @neg: set to %TRUE if number is negative
|
|
|
|
* @perm_tr: a vector which contains the allowed trailers
|
|
|
|
* @perm_tr_len: size of the perm_tr vector
|
|
|
|
* @tr: pointer to store the trailer character
|
2010-05-05 04:26:45 +04:00
|
|
|
*
|
2010-05-21 22:29:53 +04:00
|
|
|
* In case of success %0 is returned and @buf and @size are updated with
|
|
|
|
* the amount of bytes read. If @tr is non-NULL and a trailing
|
|
|
|
* character exists (size is non-zero after returning from this
|
|
|
|
* function), @tr is updated with the trailing character.
|
2010-05-05 04:26:45 +04:00
|
|
|
*/
|
|
|
|
static int proc_get_long(char **buf, size_t *size,
|
|
|
|
unsigned long *val, bool *neg,
|
|
|
|
const char *perm_tr, unsigned perm_tr_len, char *tr)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
char *p, tmp[TMPBUFLEN];
|
|
|
|
|
|
|
|
if (!*size)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
len = *size;
|
|
|
|
if (len > TMPBUFLEN - 1)
|
|
|
|
len = TMPBUFLEN - 1;
|
|
|
|
|
|
|
|
memcpy(tmp, *buf, len);
|
|
|
|
|
|
|
|
tmp[len] = 0;
|
|
|
|
p = tmp;
|
|
|
|
if (*p == '-' && *size > 1) {
|
|
|
|
*neg = true;
|
|
|
|
p++;
|
|
|
|
} else
|
|
|
|
*neg = false;
|
|
|
|
if (!isdigit(*p))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
*val = simple_strtoul(p, &p, 0);
|
|
|
|
|
|
|
|
len = p - tmp;
|
|
|
|
|
|
|
|
/* We don't know if the next char is whitespace thus we may accept
|
|
|
|
* invalid integers (e.g. 1234...a) or two integers instead of one
|
|
|
|
* (e.g. 123...1). So let's not allow such large numbers. */
|
|
|
|
if (len == TMPBUFLEN - 1)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (len < *size && perm_tr_len && !memchr(perm_tr, *p, perm_tr_len))
|
|
|
|
return -EINVAL;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
if (tr && (len < *size))
|
|
|
|
*tr = *p;
|
|
|
|
|
|
|
|
*buf += len;
|
|
|
|
*size -= len;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
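
To see how the parsing rules of proc_get_long behave on concrete input, here is a userspace sketch under stated assumptions: strtoul() replaces simple_strtoul(), the `sk_`/`SK_` names are hypothetical, and the buffer is an ordinary C string instead of a kernel copy of user data.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SK_TMPBUFLEN 22

/* Parse one optionally negative integer; on success advance *buf and
 * shrink *size by the number of bytes consumed. */
static int sk_get_long(const char **buf, size_t *size, unsigned long *val,
		       bool *neg, const char *perm_tr, size_t perm_tr_len,
		       char *tr)
{
	char tmp[SK_TMPBUFLEN], *p;
	size_t len;

	if (!*size)
		return -EINVAL;
	len = *size;
	if (len > SK_TMPBUFLEN - 1)
		len = SK_TMPBUFLEN - 1;
	memcpy(tmp, *buf, len);
	tmp[len] = 0;

	p = tmp;
	if (*p == '-' && *size > 1) {
		*neg = true;
		p++;
	} else
		*neg = false;
	if (!isdigit((unsigned char)*p))
		return -EINVAL;

	*val = strtoul(p, &p, 0);	/* base 0: decimal, 0x hex, 0 octal */
	len = p - tmp;

	/* A number filling the whole temporary may have been cut short. */
	if (len == SK_TMPBUFLEN - 1)
		return -EINVAL;
	/* Whatever follows the digits must be a permitted trailer. */
	if (len < *size && perm_tr_len && !memchr(perm_tr, *p, perm_tr_len))
		return -EINVAL;
	if (tr && len < *size)
		*tr = *p;

	*buf += len;
	*size -= len;
	return 0;
}

/* Usage: parse "42", step over the separator, then parse "-7". */
static void sk_get_long_demo(void)
{
	const char *s = "42 -7\n";
	size_t left = strlen(s);
	unsigned long v;
	bool neg;
	char tr = 0;

	assert(sk_get_long(&s, &left, &v, &neg, " \t\n", 3, &tr) == 0);
	assert(v == 42 && !neg && tr == ' ' && left == 4);
	s++, left--;	/* the callers use proc_skip_spaces() here */
	assert(sk_get_long(&s, &left, &v, &neg, " \t\n", 3, &tr) == 0);
	assert(v == 7 && neg && tr == '\n' && left == 1);
}
```

The trailer check is what rejects input such as "12a": the digits parse, but 'a' is not in the permitted set.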
|
|
|
|
|
|
|
|
/**
|
2010-05-21 22:29:53 +04:00
|
|
|
* proc_put_long - converts an integer to a decimal ASCII formatted string
|
2010-05-05 04:26:45 +04:00
|
|
|
*
|
2010-05-21 22:29:53 +04:00
|
|
|
* @buf: the user buffer
|
|
|
|
* @size: the size of the user buffer
|
|
|
|
* @val: the integer to be converted
|
|
|
|
* @neg: sign of the number, %TRUE for negative
|
2010-05-05 04:26:45 +04:00
|
|
|
*
|
2010-05-21 22:29:53 +04:00
|
|
|
* In case of success %0 is returned and @buf and @size are updated with
|
|
|
|
* the amount of bytes written.
|
2010-05-05 04:26:45 +04:00
|
|
|
*/
|
|
|
|
static int proc_put_long(void __user **buf, size_t *size, unsigned long val,
|
|
|
|
bool neg)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
char tmp[TMPBUFLEN], *p = tmp;
|
|
|
|
|
|
|
|
sprintf(p, "%s%lu", neg ? "-" : "", val);
|
|
|
|
len = strlen(tmp);
|
|
|
|
if (len > *size)
|
|
|
|
len = *size;
|
|
|
|
if (copy_to_user(*buf, tmp, len))
|
|
|
|
return -EFAULT;
|
|
|
|
*size -= len;
|
|
|
|
*buf += len;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#undef TMPBUFLEN
|
|
|
|
|
|
|
|
static int proc_put_char(void __user **buf, size_t *size, char c)
|
|
|
|
{
|
|
|
|
if (*size) {
|
|
|
|
char __user **buffer = (char __user **)buf;
|
|
|
|
if (put_user(c, *buffer))
|
|
|
|
return -EFAULT;
|
|
|
|
(*size)--, (*buffer)++;
|
|
|
|
*buf = *buffer;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
|
2005-04-17 02:20:36 +04:00
|
|
|
int *valp,
|
|
|
|
int write, void *data)
|
|
|
|
{
|
|
|
|
if (write) {
|
|
|
|
*valp = *negp ? -*lvalp : *lvalp;
|
|
|
|
} else {
|
|
|
|
int val = *valp;
|
|
|
|
if (val < 0) {
|
2010-05-05 04:26:45 +04:00
|
|
|
*negp = true;
|
2005-04-17 02:20:36 +04:00
|
|
|
*lvalp = (unsigned long)-val;
|
|
|
|
} else {
|
2010-05-05 04:26:45 +04:00
|
|
|
*negp = false;
|
2005-04-17 02:20:36 +04:00
|
|
|
*lvalp = (unsigned long)val;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
static const char proc_wspace_sep[] = { ' ', '\t', '\n' };
|
|
|
|
|
2007-10-18 14:05:22 +04:00
|
|
|
static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
|
2009-09-24 02:57:19 +04:00
|
|
|
int write, void __user *buffer,
|
2006-10-02 13:18:23 +04:00
|
|
|
size_t *lenp, loff_t *ppos,
|
2010-05-05 04:26:45 +04:00
|
|
|
int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
|
2005-04-17 02:20:36 +04:00
|
|
|
int write, void *data),
|
|
|
|
void *data)
|
|
|
|
{
|
2010-05-05 04:26:45 +04:00
|
|
|
int *i, vleft, first = 1, err = 0;
|
|
|
|
unsigned long page = 0;
|
|
|
|
size_t left;
|
|
|
|
char *kbuf;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
if (!tbl_data || !table->maxlen || !*lenp || (*ppos && !write)) {
|
2005-04-17 02:20:36 +04:00
|
|
|
*lenp = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-10-02 13:18:23 +04:00
|
|
|
i = (int *) tbl_data;
|
2005-04-17 02:20:36 +04:00
|
|
|
vleft = table->maxlen / sizeof(*i);
|
|
|
|
left = *lenp;
|
|
|
|
|
|
|
|
if (!conv)
|
|
|
|
conv = do_proc_dointvec_conv;
|
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
if (write) {
|
|
|
|
if (left > PAGE_SIZE - 1)
|
|
|
|
left = PAGE_SIZE - 1;
|
|
|
|
page = __get_free_page(GFP_TEMPORARY);
|
|
|
|
kbuf = (char *) page;
|
|
|
|
if (!kbuf)
|
|
|
|
return -ENOMEM;
|
|
|
|
if (copy_from_user(kbuf, buffer, left)) {
|
|
|
|
err = -EFAULT;
|
|
|
|
goto free;
|
|
|
|
}
|
|
|
|
kbuf[left] = 0;
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
for (; left && vleft--; i++, first=0) {
|
2010-05-05 04:26:45 +04:00
|
|
|
unsigned long lval;
|
|
|
|
bool neg;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
if (write) {
|
|
|
|
left -= proc_skip_spaces(&kbuf);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2010-05-26 03:10:14 +04:00
|
|
|
if (!left)
|
|
|
|
break;
|
2010-05-05 04:26:45 +04:00
|
|
|
err = proc_get_long(&kbuf, &left, &lval, &neg,
|
|
|
|
proc_wspace_sep,
|
|
|
|
sizeof(proc_wspace_sep), NULL);
|
|
|
|
if (err)
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
2010-05-05 04:26:45 +04:00
|
|
|
if (conv(&neg, &lval, i, 1, data)) {
|
|
|
|
err = -EINVAL;
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
2010-05-05 04:26:45 +04:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
} else {
|
2010-05-05 04:26:45 +04:00
|
|
|
if (conv(&neg, &lval, i, 0, data)) {
|
|
|
|
err = -EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
if (!first)
|
2010-05-05 04:26:45 +04:00
|
|
|
err = proc_put_char(&buffer, &left, '\t');
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
err = proc_put_long(&buffer, &left, lval, neg);
|
|
|
|
if (err)
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
if (!write && !first && left && !err)
|
|
|
|
err = proc_put_char(&buffer, &left, '\n');
|
2010-05-26 03:10:14 +04:00
|
|
|
if (write && !err && left)
|
2010-05-05 04:26:45 +04:00
|
|
|
left -= proc_skip_spaces(&kbuf);
|
|
|
|
free:
|
2005-04-17 02:20:36 +04:00
|
|
|
if (write) {
|
2010-05-05 04:26:45 +04:00
|
|
|
free_page(page);
|
|
|
|
if (first)
|
|
|
|
return err ? : -EINVAL;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
*lenp -= left;
|
|
|
|
*ppos += *lenp;
|
2010-05-05 04:26:45 +04:00
|
|
|
return err;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2009-09-24 02:57:19 +04:00
|
|
|
static int do_proc_dointvec(struct ctl_table *table, int write,
|
2006-10-02 13:18:23 +04:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos,
|
2010-05-05 04:26:45 +04:00
|
|
|
int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
|
2006-10-02 13:18:23 +04:00
|
|
|
int write, void *data),
|
|
|
|
void *data)
|
|
|
|
{
|
2009-09-24 02:57:19 +04:00
|
|
|
return __do_proc_dointvec(table->data, table, write,
|
2006-10-02 13:18:23 +04:00
|
|
|
buffer, lenp, ppos, conv, data);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/**
|
|
|
|
* proc_dointvec - read a vector of integers
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-24 02:57:19 +04:00
|
|
|
int proc_dointvec(struct ctl_table *table, int write,
|
2005-04-17 02:20:36 +04:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2009-09-24 02:57:19 +04:00
|
|
|
return do_proc_dointvec(table, write, buffer, lenp, ppos,
|
2005-04-17 02:20:36 +04:00
|
|
|
NULL, NULL);
|
|
|
|
}
|
|
|
|
|
2007-02-10 12:45:24 +03:00
|
|
|
/*
|
2008-10-16 09:01:41 +04:00
|
|
|
* Taint values can only be increased.
|
|
|
|
* This means we can safely use a temporary.
|
2007-02-10 12:45:24 +03:00
|
|
|
*/
|
2009-09-24 02:57:19 +04:00
|
|
|
static int proc_taint(struct ctl_table *table, int write,
|
2007-02-10 12:45:24 +03:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
2008-10-16 09:01:41 +04:00
|
|
|
struct ctl_table t;
|
|
|
|
unsigned long tmptaint = get_taint();
|
|
|
|
int err;
|
2007-02-10 12:45:24 +03:00
|
|
|
|
2007-04-24 01:41:14 +04:00
|
|
|
if (write && !capable(CAP_SYS_ADMIN))
|
2007-02-10 12:45:24 +03:00
|
|
|
return -EPERM;
|
|
|
|
|
2008-10-16 09:01:41 +04:00
|
|
|
t = *table;
|
|
|
|
t.data = &tmptaint;
|
2009-09-24 02:57:19 +04:00
|
|
|
err = proc_doulongvec_minmax(&t, write, buffer, lenp, ppos);
|
2008-10-16 09:01:41 +04:00
|
|
|
if (err < 0)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
if (write) {
|
|
|
|
/*
|
|
|
|
* Poor man's atomic or. Not worth adding a primitive
|
|
|
|
* to everyone's atomic.h for this
|
|
|
|
*/
|
|
|
|
int i;
|
|
|
|
for (i = 0; i < BITS_PER_LONG && tmptaint >> i; i++) {
|
|
|
|
if ((tmptaint >> i) & 1)
|
|
|
|
add_taint(i);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return err;
|
2007-02-10 12:45:24 +03:00
|
|
|
}
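
The write path of proc_taint only ever sets bits: each set bit of the value written is handed to add_taint() individually. A userspace sketch of that loop, with a hypothetical sk_add_taint() and a plain global standing in for the kernel's taint mask, could look like this:

```c
#include <assert.h>
#include <limits.h>

#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static unsigned long sk_taint_mask;	/* stand-in for the kernel's taint state */
static int sk_add_calls;		/* counts calls, for the demo only */

/* Hypothetical counterpart of add_taint(): set one flag bit. */
static void sk_add_taint(unsigned int flag)
{
	sk_taint_mask |= 1UL << flag;
	sk_add_calls++;
}

/* OR the requested bits into the mask one flag at a time. The loop
 * stops as soon as no higher bits remain set. */
static void sk_or_in_taint(unsigned long request)
{
	unsigned int i;

	for (i = 0; i < SK_BITS_PER_LONG && request >> i; i++) {
		if ((request >> i) & 1)
			sk_add_taint(i);
	}
}

/* Usage: writing 0x5 on top of 0x2 yields 0x7; writing 0 changes nothing. */
static void sk_or_in_taint_demo(void)
{
	sk_taint_mask = 0x2;
	sk_add_calls = 0;
	sk_or_in_taint(0x5);
	assert(sk_taint_mask == 0x7);
	assert(sk_add_calls == 2);
	sk_or_in_taint(0);
	assert(sk_taint_mask == 0x7);
}
```

Since existing bits are never cleared, working on a temporary copy of the mask is safe, as the comment above proc_taint notes.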
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
struct do_proc_dointvec_minmax_conv_param {
|
|
|
|
int *min;
|
|
|
|
int *max;
|
|
|
|
};
|
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
|
|
|
|
int *valp,
|
2005-04-17 02:20:36 +04:00
|
|
|
int write, void *data)
|
|
|
|
{
|
|
|
|
struct do_proc_dointvec_minmax_conv_param *param = data;
|
|
|
|
if (write) {
|
|
|
|
int val = *negp ? -*lvalp : *lvalp;
|
|
|
|
if ((param->min && *param->min > val) ||
|
|
|
|
(param->max && *param->max < val))
|
|
|
|
return -EINVAL;
|
|
|
|
*valp = val;
|
|
|
|
} else {
|
|
|
|
int val = *valp;
|
|
|
|
if (val < 0) {
|
2010-05-05 04:26:45 +04:00
|
|
|
*negp = true;
|
2005-04-17 02:20:36 +04:00
|
|
|
*lvalp = (unsigned long)-val;
|
|
|
|
} else {
|
2010-05-05 04:26:45 +04:00
|
|
|
*negp = false;
|
2005-04-17 02:20:36 +04:00
|
|
|
*lvalp = (unsigned long)val;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* proc_dointvec_minmax - read a vector of integers with min/max values
|
|
|
|
* @table: the sysctl table
|
|
|
|
* @write: %TRUE if this is a write to the sysctl file
|
|
|
|
* @buffer: the user buffer
|
|
|
|
* @lenp: the size of the user buffer
|
|
|
|
* @ppos: file position
|
|
|
|
*
|
|
|
|
* Reads/writes up to table->maxlen/sizeof(unsigned int) integer
|
|
|
|
* values from/to the user buffer, treated as an ASCII string.
|
|
|
|
*
|
|
|
|
* This routine will ensure the values are within the range specified by
|
|
|
|
* table->extra1 (min) and table->extra2 (max).
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2009-09-24 02:57:19 +04:00
|
|
|
int proc_dointvec_minmax(struct ctl_table *table, int write,
|
2005-04-17 02:20:36 +04:00
|
|
|
void __user *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
struct do_proc_dointvec_minmax_conv_param param = {
|
|
|
|
.min = (int *) table->extra1,
|
|
|
|
.max = (int *) table->extra2,
|
|
|
|
};
|
2009-09-24 02:57:19 +04:00
|
|
|
return do_proc_dointvec(table, write, buffer, lenp, ppos,
|
2005-04-17 02:20:36 +04:00
|
|
|
do_proc_dointvec_minmax_conv, &param);
|
|
|
|
}
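
The range check applied on writes by do_proc_dointvec_minmax_conv can be illustrated with a small userspace sketch. The `sk_` names are hypothetical, and the sign handling is written with explicit int casts (an assumption made here for portability; the kernel code negates the unsigned long directly).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Either bound may be NULL, meaning "no limit on that side". */
struct sk_minmax {
	int *min;
	int *max;
};

/* Write-side conversion: combine sign and magnitude, reject values
 * outside [min, max], and store only when the value is acceptable. */
static int sk_minmax_write(bool neg, unsigned long lval, int *valp,
			   const struct sk_minmax *param)
{
	int val = neg ? -(int)lval : (int)lval;

	if ((param->min && *param->min > val) ||
	    (param->max && *param->max < val))
		return -EINVAL;
	*valp = val;
	return 0;
}

/* Usage: with bounds 0..100, 50 is stored while 101 and -1 are refused
 * and leave the previous value untouched. */
static void sk_minmax_demo(void)
{
	int lo = 0, hi = 100, out = -1;
	struct sk_minmax p = { &lo, &hi };

	assert(sk_minmax_write(false, 50, &out, &p) == 0 && out == 50);
	assert(sk_minmax_write(false, 101, &out, &p) == -EINVAL && out == 50);
	assert(sk_minmax_write(true, 1, &out, &p) == -EINVAL && out == 50);
}
```

In the handler above, the two bounds come from table->extra1 and table->extra2.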
|
|
|
|
|
2007-10-18 14:05:22 +04:00
|
|
|
static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int write,
|
2005-04-17 02:20:36 +04:00
|
|
|
void __user *buffer,
|
|
|
|
size_t *lenp, loff_t *ppos,
|
|
|
|
unsigned long convmul,
|
|
|
|
unsigned long convdiv)
|
|
|
|
{
|
2010-05-05 04:26:45 +04:00
|
|
|
unsigned long *i, *min, *max;
|
|
|
|
int vleft, first = 1, err = 0;
|
|
|
|
unsigned long page = 0;
|
|
|
|
size_t left;
|
|
|
|
char *kbuf;
|
|
|
|
|
|
|
|
if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
|
2005-04-17 02:20:36 +04:00
|
|
|
*lenp = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
2010-05-05 04:26:45 +04:00
|
|
|
|
2006-10-02 13:18:23 +04:00
|
|
|
i = (unsigned long *) data;
|
2005-04-17 02:20:36 +04:00
|
|
|
min = (unsigned long *) table->extra1;
|
|
|
|
max = (unsigned long *) table->extra2;
|
|
|
|
vleft = table->maxlen / sizeof(unsigned long);
|
|
|
|
left = *lenp;
|
2010-05-05 04:26:45 +04:00
|
|
|
|
|
|
|
if (write) {
|
|
|
|
if (left > PAGE_SIZE - 1)
|
|
|
|
left = PAGE_SIZE - 1;
|
|
|
|
page = __get_free_page(GFP_TEMPORARY);
|
|
|
|
kbuf = (char *) page;
|
|
|
|
if (!kbuf)
|
|
|
|
return -ENOMEM;
|
|
|
|
if (copy_from_user(kbuf, buffer, left)) {
|
|
|
|
err = -EFAULT;
|
|
|
|
goto free;
|
|
|
|
}
|
|
|
|
kbuf[left] = 0;
|
|
|
|
}
|
|
|
|
|
2010-10-07 23:59:29 +04:00
|
|
|
for (; left && vleft--; i++, first = 0) {
|
2010-05-05 04:26:45 +04:00
|
|
|
unsigned long val;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
if (write) {
|
2010-05-05 04:26:45 +04:00
|
|
|
bool neg;
|
|
|
|
|
|
|
|
left -= proc_skip_spaces(&kbuf);
|
|
|
|
|
|
|
|
err = proc_get_long(&kbuf, &left, &val, &neg,
|
|
|
|
proc_wspace_sep,
|
|
|
|
sizeof(proc_wspace_sep), NULL);
|
|
|
|
if (err)
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
if (neg)
|
|
|
|
continue;
|
|
|
|
if ((min && val < *min) || (max && val > *max))
|
|
|
|
continue;
|
|
|
|
*i = val;
|
|
|
|
} else {
|
2010-05-05 04:26:45 +04:00
|
|
|
val = convdiv * (*i) / convmul;
|
2005-04-17 02:20:36 +04:00
|
|
|
if (!first)
|
2010-05-05 04:26:45 +04:00
|
|
|
err = proc_put_char(&buffer, &left, '\t');
|
|
|
|
err = err ? : proc_put_long(&buffer, &left, val, false);
|
|
|
|
if (err)
|
|
|
|
break;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-05 04:26:45 +04:00
|
|
|
if (!write && !first && left && !err)
|
|
|
|
err = proc_put_char(&buffer, &left, '\n');
|
|
|
|
if (write && !err)
|
|
|
|
left -= proc_skip_spaces(&kbuf);
|
|
|
|
free:
|
2005-04-17 02:20:36 +04:00
|
|
|
if (write) {
|
2010-05-05 04:26:45 +04:00
|
|
|
free_page(page);
|
|
|
|
if (first)
|
|
|
|
return err ? : -EINVAL;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
*lenp -= left;
|
|
|
|
*ppos += *lenp;
|
2010-05-05 04:26:45 +04:00
|
|
|
return err;
|
2005-04-17 02:20:36 +04:00
|
|
|
}

static int do_proc_doulongvec_minmax(struct ctl_table *table, int write,
				     void __user *buffer,
				     size_t *lenp, loff_t *ppos,
				     unsigned long convmul,
				     unsigned long convdiv)
{
	return __do_proc_doulongvec_minmax(table->data, table, write,
			buffer, lenp, ppos, convmul, convdiv);
}

/**
 * proc_doulongvec_minmax - read a vector of long integers with min/max values
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned long) unsigned long
 * values from/to the user buffer, treated as an ASCII string.
 *
 * This routine will ensure the values are within the range specified by
 * table->extra1 (min) and table->extra2 (max).
 *
 * Returns 0 on success.
 */
int proc_doulongvec_minmax(struct ctl_table *table, int write,
			   void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return do_proc_doulongvec_minmax(table, write, buffer, lenp, ppos, 1l, 1l);
}

/**
 * proc_doulongvec_ms_jiffies_minmax - read a vector of millisecond values with min/max values
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned long) unsigned long
 * values from/to the user buffer, treated as an ASCII string. The values
 * are treated as milliseconds, and converted to jiffies when they are stored.
 *
 * This routine will ensure the values are within the range specified by
 * table->extra1 (min) and table->extra2 (max).
 *
 * Returns 0 on success.
 */
int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
				      void __user *buffer,
				      size_t *lenp, loff_t *ppos)
{
	return do_proc_doulongvec_minmax(table, write, buffer,
					 lenp, ppos, HZ, 1000l);
}
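
/*
 * Illustration only, not part of sysctl.c: a minimal userspace sketch of the
 * convmul/convdiv scaling that the HZ and 1000l arguments select. The read
 * direction follows the expression in __do_proc_doulongvec_minmax()
 * (convdiv * stored / convmul); the inverse direction is simply the
 * arithmetic implied by the kernel-doc text. SKETCH_HZ and every sketch_*
 * name are assumptions made for this example, and the real HZ depends on the
 * kernel configuration.
 */

```c
#include <assert.h>

/* Assumed tick rate for illustration; the real HZ is a kernel build option. */
#define SKETCH_HZ 250UL

/* Read direction: a stored jiffies count is reported in milliseconds. */
unsigned long sketch_jiffies_to_ms(unsigned long jiffies_val)
{
	const unsigned long convmul = SKETCH_HZ, convdiv = 1000UL;

	return convdiv * jiffies_val / convmul;
}

/* Inverse direction: a millisecond value scaled into ticks. */
unsigned long sketch_ms_to_jiffies(unsigned long ms)
{
	const unsigned long convmul = SKETCH_HZ, convdiv = 1000UL;

	return convmul * ms / convdiv;
}

/* Small self-check of the arithmetic under the assumed tick rate. */
void sketch_ms_selfcheck(void)
{
	assert(sketch_jiffies_to_ms(SKETCH_HZ) == 1000UL);
	assert(sketch_ms_to_jiffies(1000UL) == SKETCH_HZ);
}
```

/*
 * Integer division truncates, so with this assumed tick rate any value
 * below 4 ms scales to zero ticks.
 */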


static int do_proc_dointvec_jiffies_conv(bool *negp, unsigned long *lvalp,
					 int *valp,
					 int write, void *data)
{
	if (write) {
		if (*lvalp > LONG_MAX / HZ)
			return 1;
		*valp = *negp ? -(*lvalp*HZ) : (*lvalp*HZ);
	} else {
		int val = *valp;
		unsigned long lval;
		if (val < 0) {
			*negp = true;
			lval = (unsigned long)-val;
		} else {
			*negp = false;
			lval = (unsigned long)val;
		}
		*lvalp = lval / HZ;
	}
	return 0;
}

static int do_proc_dointvec_userhz_jiffies_conv(bool *negp, unsigned long *lvalp,
						int *valp,
						int write, void *data)
{
	if (write) {
		if (USER_HZ < HZ && *lvalp > (LONG_MAX / HZ) * USER_HZ)
			return 1;
		*valp = clock_t_to_jiffies(*negp ? -*lvalp : *lvalp);
	} else {
		int val = *valp;
		unsigned long lval;
		if (val < 0) {
			*negp = true;
			lval = (unsigned long)-val;
		} else {
			*negp = false;
			lval = (unsigned long)val;
		}
		*lvalp = jiffies_to_clock_t(lval);
	}
	return 0;
}

static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
					    int *valp,
					    int write, void *data)
{
	if (write) {
		*valp = msecs_to_jiffies(*negp ? -*lvalp : *lvalp);
	} else {
		int val = *valp;
		unsigned long lval;
		if (val < 0) {
			*negp = true;
			lval = (unsigned long)-val;
		} else {
			*negp = false;
			lval = (unsigned long)val;
		}
		*lvalp = jiffies_to_msecs(lval);
	}
	return 0;
}

/**
 * proc_dointvec_jiffies - read a vector of integers as seconds
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
 * values from/to the user buffer, treated as an ASCII string.
 * The values read are assumed to be in seconds, and are converted into
 * jiffies.
 *
 * Returns 0 on success.
 */
int proc_dointvec_jiffies(struct ctl_table *table, int write,
			  void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return do_proc_dointvec(table, write, buffer, lenp, ppos,
				do_proc_dointvec_jiffies_conv, NULL);
}

/**
 * proc_dointvec_userhz_jiffies - read a vector of integers as 1/USER_HZ seconds
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: pointer to the file position
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
 * values from/to the user buffer, treated as an ASCII string.
 * The values read are assumed to be in 1/USER_HZ seconds, and
 * are converted into jiffies.
 *
 * Returns 0 on success.
 */
int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
				 void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return do_proc_dointvec(table, write, buffer, lenp, ppos,
				do_proc_dointvec_userhz_jiffies_conv, NULL);
}

/**
 * proc_dointvec_ms_jiffies - read a vector of integers as milliseconds
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: the current position in the file
 *
 * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
 * values from/to the user buffer, treated as an ASCII string.
 * The values read are assumed to be in 1/1000 seconds, and
 * are converted into jiffies.
 *
 * Returns 0 on success.
 */
int proc_dointvec_ms_jiffies(struct ctl_table *table, int write,
			     void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return do_proc_dointvec(table, write, buffer, lenp, ppos,
				do_proc_dointvec_ms_jiffies_conv, NULL);
}

static int proc_do_cad_pid(struct ctl_table *table, int write,
			   void __user *buffer, size_t *lenp, loff_t *ppos)
{
	struct pid *new_pid;
	pid_t tmp;
	int r;

	tmp = pid_vnr(cad_pid);

	r = __do_proc_dointvec(&tmp, table, write, buffer,
			       lenp, ppos, NULL, NULL);
	if (r || !write)
		return r;

	new_pid = find_get_pid(tmp);
	if (!new_pid)
		return -ESRCH;

	put_pid(xchg(&cad_pid, new_pid));
	return 0;
}

/**
 * proc_do_large_bitmap - read/write from/to a large bitmap
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * The bitmap is stored at table->data and the bitmap length (in bits)
 * in table->maxlen.
 *
 * We use a comma-separated range format (e.g. 1,3-4,10-10) so that
 * large bitmaps may be represented in a compact manner. Writing into
 * the file will clear the bitmap then update it with the given input.
 *
 * Returns 0 on success.
 */
int proc_do_large_bitmap(struct ctl_table *table, int write,
			 void __user *buffer, size_t *lenp, loff_t *ppos)
{
	int err = 0;
	bool first = true;
	size_t left = *lenp;
	unsigned long bitmap_len = table->maxlen;
	unsigned long *bitmap = (unsigned long *) table->data;
	unsigned long *tmp_bitmap = NULL;
	char tr_a[] = { '-', ',', '\n' }, tr_b[] = { ',', '\n', 0 }, c;

	if (!bitmap_len || !left || (*ppos && !write)) {
		*lenp = 0;
		return 0;
	}

	if (write) {
		unsigned long page = 0;
		char *kbuf;

		if (left > PAGE_SIZE - 1)
			left = PAGE_SIZE - 1;

		page = __get_free_page(GFP_TEMPORARY);
		kbuf = (char *) page;
		if (!kbuf)
			return -ENOMEM;
		if (copy_from_user(kbuf, buffer, left)) {
			free_page(page);
			return -EFAULT;
		}
		kbuf[left] = 0;

		tmp_bitmap = kzalloc(BITS_TO_LONGS(bitmap_len) * sizeof(unsigned long),
				     GFP_KERNEL);
		if (!tmp_bitmap) {
			free_page(page);
			return -ENOMEM;
		}
		proc_skip_char(&kbuf, &left, '\n');
		while (!err && left) {
			unsigned long val_a, val_b;
			bool neg;

			err = proc_get_long(&kbuf, &left, &val_a, &neg, tr_a,
					     sizeof(tr_a), &c);
			if (err)
				break;
			if (val_a >= bitmap_len || neg) {
				err = -EINVAL;
				break;
			}

			val_b = val_a;
			if (left) {
				kbuf++;
				left--;
			}

			if (c == '-') {
				err = proc_get_long(&kbuf, &left, &val_b,
						     &neg, tr_b, sizeof(tr_b),
						     &c);
				if (err)
					break;
				if (val_b >= bitmap_len || neg ||
				    val_a > val_b) {
					err = -EINVAL;
					break;
				}
				if (left) {
					kbuf++;
					left--;
				}
			}

			while (val_a <= val_b)
				set_bit(val_a++, tmp_bitmap);

			first = false;
			proc_skip_char(&kbuf, &left, '\n');
		}
		free_page(page);
	} else {
		unsigned long bit_a, bit_b = 0;

		while (left) {
			bit_a = find_next_bit(bitmap, bitmap_len, bit_b);
			if (bit_a >= bitmap_len)
				break;
			bit_b = find_next_zero_bit(bitmap, bitmap_len,
						   bit_a + 1) - 1;

			if (!first) {
				err = proc_put_char(&buffer, &left, ',');
				if (err)
					break;
			}
			err = proc_put_long(&buffer, &left, bit_a, false);
			if (err)
				break;
			if (bit_a != bit_b) {
				err = proc_put_char(&buffer, &left, '-');
				if (err)
					break;
				err = proc_put_long(&buffer, &left, bit_b, false);
				if (err)
					break;
			}

			first = false;
			bit_b++;
		}
		if (!err)
			err = proc_put_char(&buffer, &left, '\n');
	}

	if (!err) {
		if (write) {
			if (*ppos)
				bitmap_or(bitmap, bitmap, tmp_bitmap, bitmap_len);
			else
				memcpy(bitmap, tmp_bitmap,
					BITS_TO_LONGS(bitmap_len) * sizeof(unsigned long));
		}
		kfree(tmp_bitmap);
		*lenp -= left;
		*ppos += *lenp;
		return 0;
	} else {
		kfree(tmp_bitmap);
		return err;
	}
}
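
/*
 * Illustration only, not part of sysctl.c: a userspace sketch of the range
 * format that the read path of proc_do_large_bitmap() emits, reduced to a
 * single 64-bit word. It stands in for find_next_bit() and
 * find_next_zero_bit() with a plain bit scan and for proc_put_char() and
 * proc_put_long() with snprintf(). All sketch_* names and SKETCH_NBITS are
 * assumptions for this example, and the trailing newline written by the
 * kernel code is omitted.
 */

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Number of bits in the single-word bitmap used by this sketch. */
#define SKETCH_NBITS 64

/* Test bit n of the word (stand-in for the kernel bitmap helpers). */
static int sketch_test_bit(unsigned long long map, unsigned int n)
{
	return (int)((map >> n) & 1ULL);
}

/*
 * Format the set bits of map as comma-separated ranges, e.g. "1,3-4,10".
 * Returns the number of characters written (excluding the terminator),
 * or -1 if buf is too small.
 */
int sketch_format_ranges(unsigned long long map, char *buf, size_t len)
{
	size_t used = 0;
	unsigned int a = 0;
	int first = 1;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	while (a < SKETCH_NBITS) {
		unsigned int b;
		int n;

		if (!sketch_test_bit(map, a)) {
			a++;
			continue;
		}
		/* Extend b to the last bit of the current run of set bits. */
		for (b = a; b + 1 < SKETCH_NBITS && sketch_test_bit(map, b + 1); b++)
			;
		if (a == b)
			n = snprintf(buf + used, len - used, "%s%u",
				     first ? "" : ",", a);
		else
			n = snprintf(buf + used, len - used, "%s%u-%u",
				     first ? "" : ",", a, b);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
		first = 0;
		a = b + 1;
	}
	return (int)used;
}
```

/*
 * A single set bit is printed as a bare number and a run of adjacent set
 * bits as "first-last", which is the same choice the read path above makes
 * with its bit_a != bit_b test.
 */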

#else /* CONFIG_PROC_SYSCTL */

int proc_dostring(struct ctl_table *table, int write,
		  void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return -ENOSYS;
}

int proc_dointvec(struct ctl_table *table, int write,
		  void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return -ENOSYS;
}

int proc_dointvec_minmax(struct ctl_table *table, int write,
			 void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return -ENOSYS;
}

int proc_dointvec_jiffies(struct ctl_table *table, int write,
			  void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return -ENOSYS;
}

int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
				 void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return -ENOSYS;
}

int proc_dointvec_ms_jiffies(struct ctl_table *table, int write,
			     void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return -ENOSYS;
}

int proc_doulongvec_minmax(struct ctl_table *table, int write,
			   void __user *buffer, size_t *lenp, loff_t *ppos)
{
	return -ENOSYS;
}

int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
				      void __user *buffer,
				      size_t *lenp, loff_t *ppos)
{
	return -ENOSYS;
}


#endif /* CONFIG_PROC_SYSCTL */

/*
 * No sense putting this after each symbol definition, twice,
 * exception granted :-)
 */
EXPORT_SYMBOL(proc_dointvec);
EXPORT_SYMBOL(proc_dointvec_jiffies);
EXPORT_SYMBOL(proc_dointvec_minmax);
EXPORT_SYMBOL(proc_dointvec_userhz_jiffies);
EXPORT_SYMBOL(proc_dointvec_ms_jiffies);
EXPORT_SYMBOL(proc_dostring);
EXPORT_SYMBOL(proc_doulongvec_minmax);
EXPORT_SYMBOL(proc_doulongvec_ms_jiffies_minmax);
EXPORT_SYMBOL(register_sysctl_table);
EXPORT_SYMBOL(register_sysctl_paths);
EXPORT_SYMBOL(unregister_sysctl_table);