2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* POSIX message queues filesystem for Linux.
|
|
|
|
*
|
|
|
|
* Copyright (C) 2003,2004 Krzysztof Benedyczak (golbi@mat.uni.torun.pl)
|
2006-10-04 01:23:27 +04:00
|
|
|
* Michal Wronski (michal.wronski@gmail.com)
|
2005-04-17 02:20:36 +04:00
|
|
|
*
|
|
|
|
* Spinlocks: Mohamed Abbas (abbas.mohamed@intel.com)
|
|
|
|
* Lockless receive & send, fd based notify:
|
2014-01-28 05:07:04 +04:00
|
|
|
* Manfred Spraul (manfred@colorfullife.com)
|
2005-04-17 02:20:36 +04:00
|
|
|
*
|
2006-05-25 01:09:55 +04:00
|
|
|
* Audit: George Wilson (ltcgcw@us.ibm.com)
|
|
|
|
*
|
2005-04-17 02:20:36 +04:00
|
|
|
* This file is released under the GPL.
|
|
|
|
*/
|
|
|
|
|
2006-01-11 23:17:46 +03:00
|
|
|
#include <linux/capability.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/pagemap.h>
|
|
|
|
#include <linux/file.h>
|
|
|
|
#include <linux/mount.h>
|
2018-11-02 02:07:25 +03:00
|
|
|
#include <linux/fs_context.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/namei.h>
|
|
|
|
#include <linux/sysctl.h>
|
|
|
|
#include <linux/poll.h>
|
|
|
|
#include <linux/mqueue.h>
|
|
|
|
#include <linux/msg.h>
|
|
|
|
#include <linux/skbuff.h>
|
2012-06-01 03:26:30 +04:00
|
|
|
#include <linux/vmalloc.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/netlink.h>
|
|
|
|
#include <linux/syscalls.h>
|
2006-05-25 01:09:55 +04:00
|
|
|
#include <linux/audit.h>
|
2005-05-01 19:59:14 +04:00
|
|
|
#include <linux/signal.h>
|
2006-03-26 13:37:17 +04:00
|
|
|
#include <linux/mutex.h>
|
2007-10-19 10:40:14 +04:00
|
|
|
#include <linux/nsproxy.h>
|
|
|
|
#include <linux/pid.h>
|
2009-04-07 06:01:08 +04:00
|
|
|
#include <linux/ipc_namespace.h>
|
user namespace: make signal.c respect user namespaces
ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
user namespace. (new, thanks Oleg)
__send_signal: convert current's uid to the recipient's user namespace
for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
again :)
do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
user namespace
ptrace_signal maps parent's uid into current's user namespace before
including in signal to current. IIUC Oleg has argued that this shouldn't
matter as the debugger will play with it, but it seems like not converting
the value currently being set is misleading.
Changelog:
Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
simplify callers and help make clear what we are translating
(which uid into which namespace). Passing the target task would
make callers even easier to read, but we pass in user_ns because
current_user_ns() != task_cred_xxx(current, user_ns).
Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
in ptrace_signal().
Sep 23: In send_signal(), detect when (user) signal is coming from an
ancestor or unrelated user namespace. Pass that on to __send_signal,
which sets si_uid to 0 or overflowuid if needed.
Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all
SI_FROMKERNEL cases at callers, because we can't assume sender is
current in those cases.
Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid
Nov 10: (akpm) make the !CONFIG_USER_NS case clearer
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Serge Hallyn <serge.hallyn@canonical.com>
Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)
Eric Biederman pointed out that passing info is a bug and could lead to a
NULL pointer deref to boot.
A collection of signal, securebits, filecaps, cap_bounds, and a few other
ltp tests passed with this kernel.
Changelog:
Nov 18: previous patch missed a leading '&'
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: ipc/mqueue: lock() => unlock() typo
There was a double lock typo introduced in b085f4bd6b21 "user namespace:
make signal.c respect user namespaces"
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:11:37 +04:00
|
|
|
#include <linux/user_namespace.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
|
|
|
#include <linux/slab.h>
|
2017-02-01 18:36:40 +03:00
|
|
|
#include <linux/sched/wake_q.h>
|
2017-02-08 20:51:30 +03:00
|
|
|
#include <linux/sched/signal.h>
|
2017-02-08 20:51:30 +03:00
|
|
|
#include <linux/sched/user.h>
|
2006-03-26 13:37:17 +04:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <net/sock.h>
|
|
|
|
#include "util.h"
|
|
|
|
|
2018-11-02 02:07:25 +03:00
|
|
|
struct mqueue_fs_context {
|
|
|
|
struct ipc_namespace *ipc_ns;
|
ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
When running the stress-ng clone benchmark with multiple testing threads,
it was found that there were significant spinlock contention in sget_fc().
The contended spinlock was the sb_lock. It is under heavy contention
because the following code in the critcal section of sget_fc():
hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
if (test(old, fc))
goto share_extant_sb;
}
After testing with added instrumentation code, it was found that the
benchmark could generate thousands of ipc namespaces with the
corresponding number of entries in the mqueue's fs_supers list where the
namespaces are the key for the search. This leads to excessive time in
scanning the list for a match.
Looking back at the mqueue calling sequence leading to sget_fc():
mq_init_ns()
=> mq_create_mount()
=> fc_mount()
=> vfs_get_tree()
=> mqueue_get_tree()
=> get_tree_keyed()
=> vfs_get_super()
=> sget_fc()
Currently, mq_init_ns() is the only mqueue function that will indirectly
call mqueue_get_tree() with a newly allocated ipc namespace as the key for
searching. As a result, there will never be a match with the exising ipc
namespaces stored in the mqueue's fs_supers list.
So using get_tree_keyed() to do an existing ipc namespace search is just a
waste of time. Instead, we could use get_tree_nodev() to eliminate the
useless search. By doing so, we can greatly reduce the sb_lock hold time
and avoid the spinlock contention problem in case a large number of ipc
namespaces are present.
Of course, if the code is modified in the future to allow
mqueue_get_tree() to be called with an existing ipc namespace instead of a
new one, we will have to use get_tree_keyed() in this case.
The following stress-ng clone benchmark command was run on a 2-socket
48-core Intel system:
./stress-ng --clone 32 --verbose --oomable --metrics-brief -t 20
The "bogo ops/s" increased from 5948.45 before patch to 9137.06 after
patch. This is an increase of 54% in performance.
Link: https://lkml.kernel.org/r/20220121172315.19652-1-longman@redhat.com
Fixes: 935c6912b198 ("ipc: Convert mqueue fs to fs_context")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-10 04:29:21 +03:00
|
|
|
bool newns; /* Set if newly created ipc namespace */
|
2018-11-02 02:07:25 +03:00
|
|
|
};
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
#define MQUEUE_MAGIC 0x19800202
|
|
|
|
#define DIRENT_SIZE 20
|
|
|
|
#define FILENT_SIZE 80
|
|
|
|
|
|
|
|
#define SEND 0
|
|
|
|
#define RECV 1
|
|
|
|
|
|
|
|
#define STATE_NONE 0
|
2015-05-04 17:02:46 +03:00
|
|
|
#define STATE_READY 1
|
2005-04-17 02:20:36 +04:00
|
|
|
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
struct posix_msg_tree_node {
|
|
|
|
struct rb_node rb_node;
|
|
|
|
struct list_head msg_list;
|
|
|
|
int priority;
|
|
|
|
};
|
|
|
|
|
2020-02-04 04:34:36 +03:00
|
|
|
/*
|
|
|
|
* Locking:
|
|
|
|
*
|
|
|
|
* Accesses to a message queue are synchronized by acquiring info->lock.
|
|
|
|
*
|
|
|
|
* There are two notable exceptions:
|
|
|
|
* - The actual wakeup of a sleeping task is performed using the wake_q
|
|
|
|
* framework. info->lock is already released when wake_up_q is called.
|
|
|
|
* - The exit codepaths after sleeping check ext_wait_queue->state without
|
|
|
|
* any locks. If it is STATE_READY, then the syscall is completed without
|
|
|
|
* acquiring info->lock.
|
|
|
|
*
|
|
|
|
* MQ_BARRIER:
|
|
|
|
* To achieve proper release/acquire memory barrier pairing, the state is set to
|
|
|
|
* STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
|
|
|
|
* by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
|
|
|
|
*
|
|
|
|
* This prevents the following races:
|
|
|
|
*
|
|
|
|
* 1) With the simple wake_q_add(), the task could be gone already before
|
|
|
|
* the increase of the reference happens
|
|
|
|
* Thread A
|
|
|
|
* Thread B
|
|
|
|
* WRITE_ONCE(wait.state, STATE_NONE);
|
|
|
|
* schedule_hrtimeout()
|
|
|
|
* wake_q_add(A)
|
|
|
|
* if (cmpxchg()) // success
|
|
|
|
* ->state = STATE_READY (reordered)
|
|
|
|
* <timeout returns>
|
|
|
|
* if (wait.state == STATE_READY) return;
|
|
|
|
* sysret to user space
|
|
|
|
* sys_exit()
|
|
|
|
* get_task_struct() // UaF
|
|
|
|
*
|
|
|
|
* Solution: Use wake_q_add_safe() and perform the get_task_struct() before
|
|
|
|
* the smp_store_release() that does ->state = STATE_READY.
|
|
|
|
*
|
|
|
|
* 2) Without proper _release/_acquire barriers, the woken up task
|
|
|
|
* could read stale data
|
|
|
|
*
|
|
|
|
* Thread A
|
|
|
|
* Thread B
|
|
|
|
* do_mq_timedreceive
|
|
|
|
* WRITE_ONCE(wait.state, STATE_NONE);
|
|
|
|
* schedule_hrtimeout()
|
|
|
|
* state = STATE_READY;
|
|
|
|
* <timeout returns>
|
|
|
|
* if (wait.state == STATE_READY) return;
|
|
|
|
* msg_ptr = wait.msg; // Access to stale data!
|
|
|
|
* receiver->msg = message; (reordered)
|
|
|
|
*
|
|
|
|
* Solution: use _release and _acquire barriers.
|
|
|
|
*
|
|
|
|
* 3) There is intentionally no barrier when setting current->state
|
|
|
|
* to TASK_INTERRUPTIBLE: spin_unlock(&info->lock) provides the
|
|
|
|
* release memory barrier, and the wakeup is triggered when holding
|
|
|
|
* info->lock, i.e. spin_lock(&info->lock) provided a pairing
|
|
|
|
* acquire memory barrier.
|
|
|
|
*/
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
struct ext_wait_queue { /* queue of sleeping tasks */
|
|
|
|
struct task_struct *task;
|
|
|
|
struct list_head list;
|
|
|
|
struct msg_msg *msg; /* ptr of loaded message */
|
|
|
|
int state; /* one of STATE_* values */
|
|
|
|
};
|
|
|
|
|
|
|
|
struct mqueue_inode_info {
|
|
|
|
spinlock_t lock;
|
|
|
|
struct inode vfs_inode;
|
|
|
|
wait_queue_head_t wait_q;
|
|
|
|
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
struct rb_root msg_tree;
|
2019-05-15 01:46:26 +03:00
|
|
|
struct rb_node *msg_tree_rightmost;
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
struct posix_msg_tree_node *node_cache;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct mq_attr attr;
|
|
|
|
|
|
|
|
struct sigevent notify;
|
2014-01-28 05:07:04 +04:00
|
|
|
struct pid *notify_owner;
|
ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
longer works if the sender doesn't have rights to send a signal.
Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
to avoid check_kill_permission().
This needs the additional notify.sigev_signo != 0 check, shouldn't we
change do_mq_notify() to deny sigev_signo == 0 ?
Test-case:
#include <signal.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/wait.h>
#include <assert.h>
static int notified;
static void sigh(int sig)
{
notified = 1;
}
int main(void)
{
signal(SIGIO, sigh);
int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
assert(fd >= 0);
struct sigevent se = {
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGIO,
};
assert(mq_notify(fd, &se) == 0);
if (!fork()) {
assert(setuid(1) == 0);
mq_send(fd, "",1,0);
return 0;
}
wait(NULL);
mq_unlink("/mq");
assert(notified);
return 0;
}
[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
Reported-by: Yoji <yoji.fujihar.min@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-08 04:35:39 +03:00
|
|
|
u32 notify_self_exec_id;
|
2011-11-17 10:57:55 +04:00
|
|
|
struct user_namespace *notify_user_ns;
|
2021-04-22 15:27:12 +03:00
|
|
|
struct ucounts *ucounts; /* user who created, for accounting */
|
2005-04-17 02:20:36 +04:00
|
|
|
struct sock *notify_sock;
|
|
|
|
struct sk_buff *notify_cookie;
|
|
|
|
|
|
|
|
/* for tasks waiting for free space and messages, respectively */
|
|
|
|
struct ext_wait_queue e_wait_q[2];
|
|
|
|
|
|
|
|
unsigned long qsize; /* size of queue in memory (sum of all msgs) */
|
|
|
|
};
|
|
|
|
|
2018-11-02 02:07:25 +03:00
|
|
|
static struct file_system_type mqueue_fs_type;
|
2007-02-12 11:55:39 +03:00
|
|
|
static const struct inode_operations mqueue_dir_inode_operations;
|
2007-02-12 11:55:35 +03:00
|
|
|
static const struct file_operations mqueue_file_operations;
|
2009-09-22 04:01:09 +04:00
|
|
|
static const struct super_operations mqueue_super_ops;
|
2018-11-02 02:07:25 +03:00
|
|
|
static const struct fs_context_operations mqueue_fs_context_ops;
|
2005-04-17 02:20:36 +04:00
|
|
|
static void remove_notification(struct mqueue_inode_info *info);
|
|
|
|
|
2006-12-07 07:33:20 +03:00
|
|
|
static struct kmem_cache *mqueue_inode_cachep;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
static inline struct mqueue_inode_info *MQUEUE_I(struct inode *inode)
|
|
|
|
{
|
|
|
|
return container_of(inode, struct mqueue_inode_info, vfs_inode);
|
|
|
|
}
|
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
/*
|
|
|
|
* This routine should be called with the mq_lock held.
|
|
|
|
*/
|
|
|
|
static inline struct ipc_namespace *__get_ns_from_inode(struct inode *inode)
|
2009-04-07 06:01:08 +04:00
|
|
|
{
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
return get_ipc_ns(inode->i_sb->s_fs_info);
|
2009-04-07 06:01:08 +04:00
|
|
|
}
|
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
static struct ipc_namespace *get_ns_from_inode(struct inode *inode)
|
2009-04-07 06:01:08 +04:00
|
|
|
{
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
struct ipc_namespace *ns;
|
|
|
|
|
|
|
|
spin_lock(&mq_lock);
|
|
|
|
ns = __get_ns_from_inode(inode);
|
|
|
|
spin_unlock(&mq_lock);
|
|
|
|
return ns;
|
2009-04-07 06:01:08 +04:00
|
|
|
}
|
|
|
|
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
/* Auxiliary functions to manipulate messages' list */
|
|
|
|
static int msg_insert(struct msg_msg *msg, struct mqueue_inode_info *info)
|
|
|
|
{
|
|
|
|
struct rb_node **p, *parent = NULL;
|
|
|
|
struct posix_msg_tree_node *leaf;
|
2019-05-15 01:46:26 +03:00
|
|
|
bool rightmost = true;
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
|
|
|
|
p = &info->msg_tree.rb_node;
|
|
|
|
while (*p) {
|
|
|
|
parent = *p;
|
|
|
|
leaf = rb_entry(parent, struct posix_msg_tree_node, rb_node);
|
|
|
|
|
|
|
|
if (likely(leaf->priority == msg->m_type))
|
|
|
|
goto insert_msg;
|
2019-05-15 01:46:26 +03:00
|
|
|
else if (msg->m_type < leaf->priority) {
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
p = &(*p)->rb_left;
|
2019-05-15 01:46:26 +03:00
|
|
|
rightmost = false;
|
|
|
|
} else
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
p = &(*p)->rb_right;
|
|
|
|
}
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
if (info->node_cache) {
|
|
|
|
leaf = info->node_cache;
|
|
|
|
info->node_cache = NULL;
|
|
|
|
} else {
|
|
|
|
leaf = kmalloc(sizeof(*leaf), GFP_ATOMIC);
|
|
|
|
if (!leaf)
|
|
|
|
return -ENOMEM;
|
|
|
|
INIT_LIST_HEAD(&leaf->msg_list);
|
|
|
|
}
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
leaf->priority = msg->m_type;
|
2019-05-15 01:46:26 +03:00
|
|
|
|
|
|
|
if (rightmost)
|
|
|
|
info->msg_tree_rightmost = &leaf->rb_node;
|
|
|
|
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
rb_link_node(&leaf->rb_node, parent, p);
|
|
|
|
rb_insert_color(&leaf->rb_node, &info->msg_tree);
|
|
|
|
insert_msg:
|
|
|
|
info->attr.mq_curmsgs++;
|
|
|
|
info->qsize += msg->m_ts;
|
|
|
|
list_add_tail(&msg->m_list, &leaf->msg_list);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-05-15 01:46:26 +03:00
|
|
|
static inline void msg_tree_erase(struct posix_msg_tree_node *leaf,
|
|
|
|
struct mqueue_inode_info *info)
|
|
|
|
{
|
|
|
|
struct rb_node *node = &leaf->rb_node;
|
|
|
|
|
|
|
|
if (info->msg_tree_rightmost == node)
|
|
|
|
info->msg_tree_rightmost = rb_prev(node);
|
|
|
|
|
|
|
|
rb_erase(node, &info->msg_tree);
|
2020-04-07 06:12:53 +03:00
|
|
|
if (info->node_cache)
|
2019-05-15 01:46:26 +03:00
|
|
|
kfree(leaf);
|
2020-04-07 06:12:53 +03:00
|
|
|
else
|
2019-05-15 01:46:26 +03:00
|
|
|
info->node_cache = leaf;
|
|
|
|
}
|
|
|
|
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
static inline struct msg_msg *msg_get(struct mqueue_inode_info *info)
|
|
|
|
{
|
2019-05-15 01:46:26 +03:00
|
|
|
struct rb_node *parent = NULL;
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
struct posix_msg_tree_node *leaf;
|
|
|
|
struct msg_msg *msg;
|
|
|
|
|
|
|
|
try_again:
|
2019-05-15 01:46:26 +03:00
|
|
|
/*
|
|
|
|
* During insert, low priorities go to the left and high to the
|
|
|
|
* right. On receive, we want the highest priorities first, so
|
|
|
|
* walk all the way to the right.
|
|
|
|
*/
|
|
|
|
parent = info->msg_tree_rightmost;
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
if (!parent) {
|
|
|
|
if (info->attr.mq_curmsgs) {
|
|
|
|
pr_warn_once("Inconsistency in POSIX message queue, "
|
|
|
|
"no tree element, but supposedly messages "
|
|
|
|
"should exist!\n");
|
|
|
|
info->attr.mq_curmsgs = 0;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
leaf = rb_entry(parent, struct posix_msg_tree_node, rb_node);
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
if (unlikely(list_empty(&leaf->msg_list))) {
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
pr_warn_once("Inconsistency in POSIX message queue, "
|
|
|
|
"empty leaf node but we haven't implemented "
|
|
|
|
"lazy leaf delete!\n");
|
2019-05-15 01:46:26 +03:00
|
|
|
msg_tree_erase(leaf, info);
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
goto try_again;
|
|
|
|
} else {
|
|
|
|
msg = list_first_entry(&leaf->msg_list,
|
|
|
|
struct msg_msg, m_list);
|
|
|
|
list_del(&msg->m_list);
|
|
|
|
if (list_empty(&leaf->msg_list)) {
|
2019-05-15 01:46:26 +03:00
|
|
|
msg_tree_erase(leaf, info);
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
info->attr.mq_curmsgs--;
|
|
|
|
info->qsize -= msg->m_ts;
|
|
|
|
return msg;
|
|
|
|
}
|
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
static struct inode *mqueue_get_inode(struct super_block *sb,
|
2011-07-24 22:18:20 +04:00
|
|
|
struct ipc_namespace *ipc_ns, umode_t mode,
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
struct mq_attr *attr)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
struct inode *inode;
|
2011-07-27 03:08:47 +04:00
|
|
|
int ret = -ENOMEM;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
inode = new_inode(sb);
|
2011-07-27 03:08:46 +04:00
|
|
|
if (!inode)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
inode->i_ino = get_next_ino();
|
|
|
|
inode->i_mode = mode;
|
|
|
|
inode->i_uid = current_fsuid();
|
|
|
|
inode->i_gid = current_fsgid();
|
2016-09-14 17:48:04 +03:00
|
|
|
inode->i_mtime = inode->i_ctime = inode->i_atime = current_time(inode);
|
2011-07-27 03:08:46 +04:00
|
|
|
|
|
|
|
if (S_ISREG(mode)) {
|
|
|
|
struct mqueue_inode_info *info;
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
unsigned long mq_bytes, mq_treesize;
|
2011-07-27 03:08:46 +04:00
|
|
|
|
|
|
|
inode->i_fop = &mqueue_file_operations;
|
|
|
|
inode->i_size = FILENT_SIZE;
|
|
|
|
/* mqueue specific info */
|
|
|
|
info = MQUEUE_I(inode);
|
|
|
|
spin_lock_init(&info->lock);
|
|
|
|
init_waitqueue_head(&info->wait_q);
|
|
|
|
INIT_LIST_HEAD(&info->e_wait_q[0].list);
|
|
|
|
INIT_LIST_HEAD(&info->e_wait_q[1].list);
|
|
|
|
info->notify_owner = NULL;
|
2011-11-17 10:57:55 +04:00
|
|
|
info->notify_user_ns = NULL;
|
2011-07-27 03:08:46 +04:00
|
|
|
info->qsize = 0;
|
2021-04-22 15:27:12 +03:00
|
|
|
info->ucounts = NULL; /* set when all is ok */
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
info->msg_tree = RB_ROOT;
|
2019-05-15 01:46:26 +03:00
|
|
|
info->msg_tree_rightmost = NULL;
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
info->node_cache = NULL;
|
2011-07-27 03:08:46 +04:00
|
|
|
memset(&info->attr, 0, sizeof(info->attr));
|
2012-06-01 03:26:33 +04:00
|
|
|
info->attr.mq_maxmsg = min(ipc_ns->mq_msg_max,
|
|
|
|
ipc_ns->mq_msg_default);
|
|
|
|
info->attr.mq_msgsize = min(ipc_ns->mq_msgsize_max,
|
|
|
|
ipc_ns->mq_msgsize_default);
|
2011-07-27 03:08:46 +04:00
|
|
|
if (attr) {
|
|
|
|
info->attr.mq_maxmsg = attr->mq_maxmsg;
|
|
|
|
info->attr.mq_msgsize = attr->mq_msgsize;
|
|
|
|
}
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
/*
|
|
|
|
* We used to allocate a static array of pointers and account
|
|
|
|
* the size of that array as well as one msg_msg struct per
|
|
|
|
* possible message into the queue size. That's no longer
|
|
|
|
* accurate as the queue is now an rbtree and will grow and
|
|
|
|
* shrink depending on usage patterns. We can, however, still
|
|
|
|
* account one msg_msg struct per message, but the nodes are
|
|
|
|
* allocated depending on priority usage, and most programs
|
|
|
|
* only use one, or a handful, of priorities. However, since
|
|
|
|
* this is pinned memory, we need to assume worst case, so
|
|
|
|
* that means the min(mq_maxmsg, max_priorities) * struct
|
|
|
|
* posix_msg_tree_node.
|
|
|
|
*/
|
2017-12-02 01:43:43 +03:00
|
|
|
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (info->attr.mq_maxmsg <= 0 || info->attr.mq_msgsize <= 0)
|
|
|
|
goto out_inode;
|
|
|
|
if (capable(CAP_SYS_RESOURCE)) {
|
|
|
|
if (info->attr.mq_maxmsg > HARD_MSGMAX ||
|
|
|
|
info->attr.mq_msgsize > HARD_MSGSIZEMAX)
|
|
|
|
goto out_inode;
|
|
|
|
} else {
|
|
|
|
if (info->attr.mq_maxmsg > ipc_ns->mq_msg_max ||
|
|
|
|
info->attr.mq_msgsize > ipc_ns->mq_msgsize_max)
|
|
|
|
goto out_inode;
|
|
|
|
}
|
|
|
|
ret = -EOVERFLOW;
|
|
|
|
/* check for overflow */
|
|
|
|
if (info->attr.mq_msgsize > ULONG_MAX/info->attr.mq_maxmsg)
|
|
|
|
goto out_inode;
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
mq_treesize = info->attr.mq_maxmsg * sizeof(struct msg_msg) +
|
|
|
|
min_t(unsigned int, info->attr.mq_maxmsg, MQ_PRIO_MAX) *
|
|
|
|
sizeof(struct posix_msg_tree_node);
|
2017-12-02 01:43:43 +03:00
|
|
|
mq_bytes = info->attr.mq_maxmsg * info->attr.mq_msgsize;
|
|
|
|
if (mq_bytes + mq_treesize < mq_bytes)
|
|
|
|
goto out_inode;
|
|
|
|
mq_bytes += mq_treesize;
|
2021-04-22 15:27:12 +03:00
|
|
|
info->ucounts = get_ucounts(current_ucounts());
|
|
|
|
if (info->ucounts) {
|
|
|
|
long msgqueue;
|
|
|
|
|
|
|
|
spin_lock(&mq_lock);
|
|
|
|
msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
|
|
|
|
if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
|
|
|
|
dec_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
|
|
|
|
spin_unlock(&mq_lock);
|
|
|
|
put_ucounts(info->ucounts);
|
|
|
|
info->ucounts = NULL;
|
|
|
|
/* mqueue_evict_inode() releases info->messages */
|
|
|
|
ret = -EMFILE;
|
|
|
|
goto out_inode;
|
|
|
|
}
|
2011-07-27 03:08:46 +04:00
|
|
|
spin_unlock(&mq_lock);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2011-07-27 03:08:46 +04:00
|
|
|
} else if (S_ISDIR(mode)) {
|
|
|
|
inc_nlink(inode);
|
|
|
|
/* Some things misbehave if size == 0 on a directory */
|
|
|
|
inode->i_size = 2 * DIRENT_SIZE;
|
|
|
|
inode->i_op = &mqueue_dir_inode_operations;
|
|
|
|
inode->i_fop = &simple_dir_operations;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2011-07-27 03:08:46 +04:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
return inode;
|
|
|
|
out_inode:
|
|
|
|
iput(inode);
|
2011-07-27 03:08:46 +04:00
|
|
|
err:
|
2011-07-27 03:08:47 +04:00
|
|
|
return ERR_PTR(ret);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2018-11-02 02:07:25 +03:00
|
|
|
static int mqueue_fill_super(struct super_block *sb, struct fs_context *fc)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
struct inode *inode;
|
2018-03-24 19:28:14 +03:00
|
|
|
struct ipc_namespace *ns = sb->s_fs_info;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-06-09 23:34:02 +03:00
|
|
|
sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
|
|
|
sb->s_blocksize = PAGE_SIZE;
|
|
|
|
sb->s_blocksize_bits = PAGE_SHIFT;
|
2005-04-17 02:20:36 +04:00
|
|
|
sb->s_magic = MQUEUE_MAGIC;
|
|
|
|
sb->s_op = &mqueue_super_ops;
|
|
|
|
|
2012-01-09 07:15:13 +04:00
|
|
|
inode = mqueue_get_inode(sb, ns, S_IFDIR | S_ISVTX | S_IRWXUGO, NULL);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return PTR_ERR(inode);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-01-09 07:15:13 +04:00
|
|
|
sb->s_root = d_make_root(inode);
|
|
|
|
if (!sb->s_root)
|
|
|
|
return -ENOMEM;
|
|
|
|
return 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2018-11-02 02:07:25 +03:00
|
|
|
static int mqueue_get_tree(struct fs_context *fc)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2018-11-02 02:07:25 +03:00
|
|
|
struct mqueue_fs_context *ctx = fc->fs_private;
|
|
|
|
|
ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
When running the stress-ng clone benchmark with multiple testing threads,
it was found that there were significant spinlock contention in sget_fc().
The contended spinlock was the sb_lock. It is under heavy contention
because the following code in the critcal section of sget_fc():
hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
if (test(old, fc))
goto share_extant_sb;
}
After testing with added instrumentation code, it was found that the
benchmark could generate thousands of ipc namespaces with the
corresponding number of entries in the mqueue's fs_supers list where the
namespaces are the key for the search. This leads to excessive time in
scanning the list for a match.
Looking back at the mqueue calling sequence leading to sget_fc():
mq_init_ns()
=> mq_create_mount()
=> fc_mount()
=> vfs_get_tree()
=> mqueue_get_tree()
=> get_tree_keyed()
=> vfs_get_super()
=> sget_fc()
Currently, mq_init_ns() is the only mqueue function that will indirectly
call mqueue_get_tree() with a newly allocated ipc namespace as the key for
searching. As a result, there will never be a match with the exising ipc
namespaces stored in the mqueue's fs_supers list.
So using get_tree_keyed() to do an existing ipc namespace search is just a
waste of time. Instead, we could use get_tree_nodev() to eliminate the
useless search. By doing so, we can greatly reduce the sb_lock hold time
and avoid the spinlock contention problem in case a large number of ipc
namespaces are present.
Of course, if the code is modified in the future to allow
mqueue_get_tree() to be called with an existing ipc namespace instead of a
new one, we will have to use get_tree_keyed() in this case.
The following stress-ng clone benchmark command was run on a 2-socket
48-core Intel system:
./stress-ng --clone 32 --verbose --oomable --metrics-brief -t 20
The "bogo ops/s" increased from 5948.45 before patch to 9137.06 after
patch. This is an increase of 54% in performance.
Link: https://lkml.kernel.org/r/20220121172315.19652-1-longman@redhat.com
Fixes: 935c6912b198 ("ipc: Convert mqueue fs to fs_context")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-10 04:29:21 +03:00
|
|
|
/*
|
|
|
|
* With a newly created ipc namespace, we don't need to do a search
|
|
|
|
* for an ipc namespace match, but we still need to set s_fs_info.
|
|
|
|
*/
|
|
|
|
if (ctx->newns) {
|
|
|
|
fc->s_fs_info = ctx->ipc_ns;
|
|
|
|
return get_tree_nodev(fc, mqueue_fill_super);
|
|
|
|
}
|
2019-09-04 02:05:48 +03:00
|
|
|
return get_tree_keyed(fc, mqueue_fill_super, ctx->ipc_ns);
|
2018-11-02 02:07:25 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mqueue_fs_context_free(struct fs_context *fc)
|
|
|
|
{
|
|
|
|
struct mqueue_fs_context *ctx = fc->fs_private;
|
|
|
|
|
2019-05-13 00:46:05 +03:00
|
|
|
put_ipc_ns(ctx->ipc_ns);
|
2018-11-02 02:07:25 +03:00
|
|
|
kfree(ctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int mqueue_init_fs_context(struct fs_context *fc)
|
|
|
|
{
|
|
|
|
struct mqueue_fs_context *ctx;
|
|
|
|
|
|
|
|
ctx = kzalloc(sizeof(struct mqueue_fs_context), GFP_KERNEL);
|
|
|
|
if (!ctx)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ctx->ipc_ns = get_ipc_ns(current->nsproxy->ipc_ns);
|
2019-05-13 00:46:05 +03:00
|
|
|
put_user_ns(fc->user_ns);
|
|
|
|
fc->user_ns = get_user_ns(ctx->ipc_ns->user_ns);
|
2018-11-02 02:07:25 +03:00
|
|
|
fc->fs_private = ctx;
|
|
|
|
fc->ops = &mqueue_fs_context_ops;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
When running the stress-ng clone benchmark with multiple testing threads,
it was found that there were significant spinlock contention in sget_fc().
The contended spinlock was the sb_lock. It is under heavy contention
because the following code in the critcal section of sget_fc():
hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
if (test(old, fc))
goto share_extant_sb;
}
After testing with added instrumentation code, it was found that the
benchmark could generate thousands of ipc namespaces with the
corresponding number of entries in the mqueue's fs_supers list where the
namespaces are the key for the search. This leads to excessive time in
scanning the list for a match.
Looking back at the mqueue calling sequence leading to sget_fc():
mq_init_ns()
=> mq_create_mount()
=> fc_mount()
=> vfs_get_tree()
=> mqueue_get_tree()
=> get_tree_keyed()
=> vfs_get_super()
=> sget_fc()
Currently, mq_init_ns() is the only mqueue function that will indirectly
call mqueue_get_tree() with a newly allocated ipc namespace as the key for
searching. As a result, there will never be a match with the exising ipc
namespaces stored in the mqueue's fs_supers list.
So using get_tree_keyed() to do an existing ipc namespace search is just a
waste of time. Instead, we could use get_tree_nodev() to eliminate the
useless search. By doing so, we can greatly reduce the sb_lock hold time
and avoid the spinlock contention problem in case a large number of ipc
namespaces are present.
Of course, if the code is modified in the future to allow
mqueue_get_tree() to be called with an existing ipc namespace instead of a
new one, we will have to use get_tree_keyed() in this case.
The following stress-ng clone benchmark command was run on a 2-socket
48-core Intel system:
./stress-ng --clone 32 --verbose --oomable --metrics-brief -t 20
The "bogo ops/s" increased from 5948.45 before patch to 9137.06 after
patch. This is an increase of 54% in performance.
Link: https://lkml.kernel.org/r/20220121172315.19652-1-longman@redhat.com
Fixes: 935c6912b198 ("ipc: Convert mqueue fs to fs_context")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-10 04:29:21 +03:00
|
|
|
/*
|
|
|
|
* mq_init_ns() is currently the only caller of mq_create_mount().
|
|
|
|
* So the ns parameter is always a newly created ipc namespace.
|
|
|
|
*/
|
2018-11-02 02:07:25 +03:00
|
|
|
static struct vfsmount *mq_create_mount(struct ipc_namespace *ns)
|
|
|
|
{
|
|
|
|
struct mqueue_fs_context *ctx;
|
|
|
|
struct fs_context *fc;
|
|
|
|
struct vfsmount *mnt;
|
|
|
|
|
|
|
|
fc = fs_context_for_mount(&mqueue_fs_type, SB_KERNMOUNT);
|
|
|
|
if (IS_ERR(fc))
|
|
|
|
return ERR_CAST(fc);
|
|
|
|
|
|
|
|
ctx = fc->fs_private;
|
ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
When running the stress-ng clone benchmark with multiple testing threads,
it was found that there were significant spinlock contention in sget_fc().
The contended spinlock was the sb_lock. It is under heavy contention
because the following code in the critcal section of sget_fc():
hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
if (test(old, fc))
goto share_extant_sb;
}
After testing with added instrumentation code, it was found that the
benchmark could generate thousands of ipc namespaces with the
corresponding number of entries in the mqueue's fs_supers list where the
namespaces are the key for the search. This leads to excessive time in
scanning the list for a match.
Looking back at the mqueue calling sequence leading to sget_fc():
mq_init_ns()
=> mq_create_mount()
=> fc_mount()
=> vfs_get_tree()
=> mqueue_get_tree()
=> get_tree_keyed()
=> vfs_get_super()
=> sget_fc()
Currently, mq_init_ns() is the only mqueue function that will indirectly
call mqueue_get_tree() with a newly allocated ipc namespace as the key for
searching. As a result, there will never be a match with the exising ipc
namespaces stored in the mqueue's fs_supers list.
So using get_tree_keyed() to do an existing ipc namespace search is just a
waste of time. Instead, we could use get_tree_nodev() to eliminate the
useless search. By doing so, we can greatly reduce the sb_lock hold time
and avoid the spinlock contention problem in case a large number of ipc
namespaces are present.
Of course, if the code is modified in the future to allow
mqueue_get_tree() to be called with an existing ipc namespace instead of a
new one, we will have to use get_tree_keyed() in this case.
The following stress-ng clone benchmark command was run on a 2-socket
48-core Intel system:
./stress-ng --clone 32 --verbose --oomable --metrics-brief -t 20
The "bogo ops/s" increased from 5948.45 before patch to 9137.06 after
patch. This is an increase of 54% in performance.
Link: https://lkml.kernel.org/r/20220121172315.19652-1-longman@redhat.com
Fixes: 935c6912b198 ("ipc: Convert mqueue fs to fs_context")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-10 04:29:21 +03:00
|
|
|
ctx->newns = true;
|
2018-11-02 02:07:25 +03:00
|
|
|
put_ipc_ns(ctx->ipc_ns);
|
|
|
|
ctx->ipc_ns = get_ipc_ns(ns);
|
2019-05-13 00:46:05 +03:00
|
|
|
put_user_ns(fc->user_ns);
|
|
|
|
fc->user_ns = get_user_ns(ctx->ipc_ns->user_ns);
|
2018-11-02 02:07:25 +03:00
|
|
|
|
|
|
|
mnt = fc_mount(fc);
|
|
|
|
put_fs_context(fc);
|
|
|
|
return mnt;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2008-07-26 06:45:34 +04:00
|
|
|
static void init_once(void *foo)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
struct mqueue_inode_info *p = (struct mqueue_inode_info *) foo;
|
|
|
|
|
2007-05-17 09:10:57 +04:00
|
|
|
inode_init_once(&p->vfs_inode);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct inode *mqueue_alloc_inode(struct super_block *sb)
|
|
|
|
{
|
|
|
|
struct mqueue_inode_info *ei;
|
|
|
|
|
2022-03-23 00:41:03 +03:00
|
|
|
ei = alloc_inode_sb(sb, mqueue_inode_cachep, GFP_KERNEL);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (!ei)
|
|
|
|
return NULL;
|
|
|
|
return &ei->vfs_inode;
|
|
|
|
}
|
|
|
|
|
2019-04-16 05:30:30 +03:00
|
|
|
static void mqueue_free_inode(struct inode *inode)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
kmem_cache_free(mqueue_inode_cachep, MQUEUE_I(inode));
|
|
|
|
}
|
|
|
|
|
2010-06-06 00:29:45 +04:00
|
|
|
static void mqueue_evict_inode(struct inode *inode)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
struct mqueue_inode_info *info;
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
struct ipc_namespace *ipc_ns;
|
ipc: prevent lockup on alloc_msg and free_msg
msgctl10 of ltp triggers the following lockup When CONFIG_KASAN is
enabled on large memory SMP systems, the pages initialization can take a
long time, if msgctl10 requests a huge block memory, and it will block
rcu scheduler, so release cpu actively.
After adding schedule() in free_msg, free_msg can not be called when
holding spinlock, so adding msg to a tmp list, and free it out of
spinlock
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
msgctl10 R running task 21608 32505 2794 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:__is_insn_slot_addr+0xfb/0x250
Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
kernel_text_address+0xc1/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
create_object+0x380/0x650
__kmalloc+0x14c/0x2b0
load_msg+0x38/0x1a0
do_msgsnd+0x19e/0xcf0
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
msgctl10 R running task 21608 32170 32155 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:lock_acquire+0x4d/0x340
Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
is_bpf_text_address+0x32/0xe0
kernel_text_address+0xec/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
save_stack+0x32/0xb0
__kasan_slab_free+0x130/0x180
kfree+0xfa/0x2d0
free_msg+0x24/0x50
do_msgrcv+0x508/0xe60
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Davidlohr said:
"So after releasing the lock, the msg rbtree/list is empty and new
calls will not see those in the newly populated tmp_msg list, and
therefore they cannot access the delayed msg freeing pointers, which
is good. Also the fact that the node_cache is now freed before the
actual messages seems to be harmless as this is wanted for
msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
info->lock the thing is freed anyway so it should not change things"
Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-15 01:46:20 +03:00
|
|
|
struct msg_msg *msg, *nmsg;
|
|
|
|
LIST_HEAD(tmp_msg);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-05-03 16:48:02 +04:00
|
|
|
clear_inode(inode);
|
2010-06-06 00:29:45 +04:00
|
|
|
|
|
|
|
if (S_ISDIR(inode->i_mode))
|
2005-04-17 02:20:36 +04:00
|
|
|
return;
|
2010-06-06 00:29:45 +04:00
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
ipc_ns = get_ns_from_inode(inode);
|
2005-04-17 02:20:36 +04:00
|
|
|
info = MQUEUE_I(inode);
|
|
|
|
spin_lock(&info->lock);
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
while ((msg = msg_get(info)) != NULL)
|
ipc: prevent lockup on alloc_msg and free_msg
msgctl10 of ltp triggers the following lockup When CONFIG_KASAN is
enabled on large memory SMP systems, the pages initialization can take a
long time, if msgctl10 requests a huge block memory, and it will block
rcu scheduler, so release cpu actively.
After adding schedule() in free_msg, free_msg can not be called when
holding spinlock, so adding msg to a tmp list, and free it out of
spinlock
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
msgctl10 R running task 21608 32505 2794 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:__is_insn_slot_addr+0xfb/0x250
Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
kernel_text_address+0xc1/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
create_object+0x380/0x650
__kmalloc+0x14c/0x2b0
load_msg+0x38/0x1a0
do_msgsnd+0x19e/0xcf0
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
msgctl10 R running task 21608 32170 32155 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:lock_acquire+0x4d/0x340
Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
is_bpf_text_address+0x32/0xe0
kernel_text_address+0xec/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
save_stack+0x32/0xb0
__kasan_slab_free+0x130/0x180
kfree+0xfa/0x2d0
free_msg+0x24/0x50
do_msgrcv+0x508/0xe60
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Davidlohr said:
"So after releasing the lock, the msg rbtree/list is empty and new
calls will not see those in the newly populated tmp_msg list, and
therefore they cannot access the delayed msg freeing pointers, which
is good. Also the fact that the node_cache is now freed before the
actual messages seems to be harmless as this is wanted for
msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
info->lock the thing is freed anyway so it should not change things"
Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-15 01:46:20 +03:00
|
|
|
list_add_tail(&msg->m_list, &tmp_msg);
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
kfree(info->node_cache);
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
ipc: prevent lockup on alloc_msg and free_msg
msgctl10 of ltp triggers the following lockup When CONFIG_KASAN is
enabled on large memory SMP systems, the pages initialization can take a
long time, if msgctl10 requests a huge block memory, and it will block
rcu scheduler, so release cpu actively.
After adding schedule() in free_msg, free_msg can not be called when
holding spinlock, so adding msg to a tmp list, and free it out of
spinlock
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
msgctl10 R running task 21608 32505 2794 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:__is_insn_slot_addr+0xfb/0x250
Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
kernel_text_address+0xc1/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
create_object+0x380/0x650
__kmalloc+0x14c/0x2b0
load_msg+0x38/0x1a0
do_msgsnd+0x19e/0xcf0
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
msgctl10 R running task 21608 32170 32155 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:lock_acquire+0x4d/0x340
Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
is_bpf_text_address+0x32/0xe0
kernel_text_address+0xec/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
save_stack+0x32/0xb0
__kasan_slab_free+0x130/0x180
kfree+0xfa/0x2d0
free_msg+0x24/0x50
do_msgrcv+0x508/0xe60
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Davidlohr said:
"So after releasing the lock, the msg rbtree/list is empty and new
calls will not see those in the newly populated tmp_msg list, and
therefore they cannot access the delayed msg freeing pointers, which
is good. Also the fact that the node_cache is now freed before the
actual messages seems to be harmless as this is wanted for
msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
info->lock the thing is freed anyway so it should not change things"
Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-15 01:46:20 +03:00
|
|
|
list_for_each_entry_safe(msg, nmsg, &tmp_msg, m_list) {
|
|
|
|
list_del(&msg->m_list);
|
|
|
|
free_msg(msg);
|
|
|
|
}
|
|
|
|
|
2021-04-22 15:27:12 +03:00
|
|
|
if (info->ucounts) {
|
ipc/mqueue.c: only perform resource calculation if user valid
Andreas Christoforou reported:
UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
9 * 2305843009213693951 cannot be represented in type 'long int'
...
Call Trace:
mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
evict+0x472/0x8c0 fs/inode.c:558
iput_final fs/inode.c:1547 [inline]
iput+0x51d/0x8c0 fs/inode.c:1573
mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
vfs_mkobj+0x39e/0x580 fs/namei.c:2892
prepare_open ipc/mqueue.c:731 [inline]
do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771
Which could be triggered by:
struct mq_attr attr = {
.mq_flags = 0,
.mq_maxmsg = 9,
.mq_msgsize = 0x1fffffffffffffff,
.mq_curmsgs = 0,
};
if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
perror("mq_open");
mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
preparing to return -EINVAL. During the cleanup, it calls
mqueue_evict_inode() which performed resource usage tracking math for
updating "user", before checking if there was a valid "user" at all
(which would indicate that the calculations would be sane). Instead,
delay this check to after seeing a valid "user".
The overflow was real, but the results went unused, so while the flaw is
harmless, it's noisy for kernel fuzzers, so just fix it by moving the
calculation under the non-NULL "user" where it actually gets used.
Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Andreas Christoforou <andreaschristofo@gmail.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-17 02:30:21 +03:00
|
|
|
unsigned long mq_bytes, mq_treesize;
|
|
|
|
|
|
|
|
/* Total amount of bytes accounted for the mqueue */
|
|
|
|
mq_treesize = info->attr.mq_maxmsg * sizeof(struct msg_msg) +
|
|
|
|
min_t(unsigned int, info->attr.mq_maxmsg, MQ_PRIO_MAX) *
|
|
|
|
sizeof(struct posix_msg_tree_node);
|
|
|
|
|
|
|
|
mq_bytes = mq_treesize + (info->attr.mq_maxmsg *
|
|
|
|
info->attr.mq_msgsize);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_lock(&mq_lock);
|
2021-04-22 15:27:12 +03:00
|
|
|
dec_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
/*
|
|
|
|
* get_ns_from_inode() ensures that the
|
|
|
|
* (ipc_ns = sb->s_fs_info) is either a valid ipc_ns
|
|
|
|
* to which we now hold a reference, or it is NULL.
|
|
|
|
* We can't put it here under mq_lock, though.
|
|
|
|
*/
|
|
|
|
if (ipc_ns)
|
|
|
|
ipc_ns->mq_queues_count--;
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock(&mq_lock);
|
2021-04-22 15:27:12 +03:00
|
|
|
put_ucounts(info->ucounts);
|
|
|
|
info->ucounts = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
if (ipc_ns)
|
|
|
|
put_ipc_ns(ipc_ns);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2017-12-02 01:26:05 +03:00
|
|
|
static int mqueue_create_attr(struct dentry *dentry, umode_t mode, void *arg)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2017-12-02 01:26:05 +03:00
|
|
|
struct inode *dir = dentry->d_parent->d_inode;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct inode *inode;
|
2017-12-02 01:26:05 +03:00
|
|
|
struct mq_attr *attr = arg;
|
2005-04-17 02:20:36 +04:00
|
|
|
int error;
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
struct ipc_namespace *ipc_ns;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
spin_lock(&mq_lock);
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
ipc_ns = __get_ns_from_inode(dir);
|
|
|
|
if (!ipc_ns) {
|
|
|
|
error = -EACCES;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
2014-02-26 03:01:45 +04:00
|
|
|
|
|
|
|
if (ipc_ns->mq_queues_count >= ipc_ns->mq_queues_max &&
|
|
|
|
!capable(CAP_SYS_RESOURCE)) {
|
2005-04-17 02:20:36 +04:00
|
|
|
error = -ENOSPC;
|
2009-04-07 06:01:08 +04:00
|
|
|
goto out_unlock;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2009-04-07 06:01:08 +04:00
|
|
|
ipc_ns->mq_queues_count++;
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock(&mq_lock);
|
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
inode = mqueue_get_inode(dir->i_sb, ipc_ns, mode, attr);
|
2011-07-27 03:08:47 +04:00
|
|
|
if (IS_ERR(inode)) {
|
|
|
|
error = PTR_ERR(inode);
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_lock(&mq_lock);
|
2009-04-07 06:01:08 +04:00
|
|
|
ipc_ns->mq_queues_count--;
|
|
|
|
goto out_unlock;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
put_ipc_ns(ipc_ns);
|
2005-04-17 02:20:36 +04:00
|
|
|
dir->i_size += DIRENT_SIZE;
|
2016-09-14 17:48:04 +03:00
|
|
|
dir->i_ctime = dir->i_mtime = dir->i_atime = current_time(dir);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
d_instantiate(dentry, inode);
|
|
|
|
dget(dentry);
|
|
|
|
return 0;
|
2009-04-07 06:01:08 +04:00
|
|
|
out_unlock:
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock(&mq_lock);
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
if (ipc_ns)
|
|
|
|
put_ipc_ns(ipc_ns);
|
2005-04-17 02:20:36 +04:00
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2021-01-21 16:19:43 +03:00
|
|
|
static int mqueue_create(struct user_namespace *mnt_userns, struct inode *dir,
|
|
|
|
struct dentry *dentry, umode_t mode, bool excl)
|
2017-12-02 01:26:05 +03:00
|
|
|
{
|
|
|
|
return mqueue_create_attr(dentry, mode, NULL);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
static int mqueue_unlink(struct inode *dir, struct dentry *dentry)
|
|
|
|
{
|
2015-03-18 01:26:12 +03:00
|
|
|
struct inode *inode = d_inode(dentry);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-09-14 17:48:04 +03:00
|
|
|
dir->i_ctime = dir->i_mtime = dir->i_atime = current_time(dir);
|
2005-04-17 02:20:36 +04:00
|
|
|
dir->i_size -= DIRENT_SIZE;
|
2014-01-28 05:07:04 +04:00
|
|
|
drop_nlink(inode);
|
|
|
|
dput(dentry);
|
|
|
|
return 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is routine for system read from queue file.
|
|
|
|
* To avoid mess with doing here some sort of mq_receive we allow
|
|
|
|
* to read only queue size & notification info (the only values
|
|
|
|
* that are interesting from user point of view and aren't accessible
|
|
|
|
* through std routines)
|
|
|
|
*/
|
|
|
|
static ssize_t mqueue_read_file(struct file *filp, char __user *u_data,
|
2008-07-25 12:48:07 +04:00
|
|
|
size_t count, loff_t *off)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2013-01-24 02:07:38 +04:00
|
|
|
struct mqueue_inode_info *info = MQUEUE_I(file_inode(filp));
|
2005-04-17 02:20:36 +04:00
|
|
|
char buffer[FILENT_SIZE];
|
2008-07-25 12:48:07 +04:00
|
|
|
ssize_t ret;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
snprintf(buffer, sizeof(buffer),
|
|
|
|
"QSIZE:%-10lu NOTIFY:%-5d SIGNO:%-5d NOTIFY_PID:%-6d\n",
|
|
|
|
info->qsize,
|
|
|
|
info->notify_owner ? info->notify.sigev_notify : 0,
|
|
|
|
(info->notify_owner &&
|
|
|
|
info->notify.sigev_notify == SIGEV_SIGNAL) ?
|
|
|
|
info->notify.sigev_signo : 0,
|
2008-02-08 15:19:20 +03:00
|
|
|
pid_vnr(info->notify_owner));
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock(&info->lock);
|
|
|
|
buffer[sizeof(buffer)-1] = '\0';
|
|
|
|
|
2008-07-25 12:48:07 +04:00
|
|
|
ret = simple_read_from_buffer(u_data, count, off, buffer,
|
|
|
|
strlen(buffer));
|
|
|
|
if (ret <= 0)
|
|
|
|
return ret;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-09-14 17:48:04 +03:00
|
|
|
file_inode(filp)->i_atime = file_inode(filp)->i_ctime = current_time(file_inode(filp));
|
2008-07-25 12:48:07 +04:00
|
|
|
return ret;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2006-06-23 13:05:12 +04:00
|
|
|
static int mqueue_flush_file(struct file *filp, fl_owner_t id)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2013-01-24 02:07:38 +04:00
|
|
|
struct mqueue_inode_info *info = MQUEUE_I(file_inode(filp));
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
spin_lock(&info->lock);
|
2006-10-02 13:17:26 +04:00
|
|
|
if (task_tgid(current) == info->notify_owner)
|
2005-04-17 02:20:36 +04:00
|
|
|
remove_notification(info);
|
|
|
|
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-07-03 07:42:43 +03:00
|
|
|
static __poll_t mqueue_poll_file(struct file *filp, struct poll_table_struct *poll_tab)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2013-01-24 02:07:38 +04:00
|
|
|
struct mqueue_inode_info *info = MQUEUE_I(file_inode(filp));
|
2017-07-03 07:42:43 +03:00
|
|
|
__poll_t retval = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
poll_wait(filp, &info->wait_q, poll_tab);
|
|
|
|
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
if (info->attr.mq_curmsgs)
|
2018-02-12 01:34:03 +03:00
|
|
|
retval = EPOLLIN | EPOLLRDNORM;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
if (info->attr.mq_curmsgs < info->attr.mq_maxmsg)
|
2018-02-12 01:34:03 +03:00
|
|
|
retval |= EPOLLOUT | EPOLLWRNORM;
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Adds current to info->e_wait_q[sr] before element with smaller prio */
|
|
|
|
static void wq_add(struct mqueue_inode_info *info, int sr,
|
|
|
|
struct ext_wait_queue *ewp)
|
|
|
|
{
|
|
|
|
struct ext_wait_queue *walk;
|
|
|
|
|
|
|
|
list_for_each_entry(walk, &info->e_wait_q[sr].list, list) {
|
2018-02-07 02:40:52 +03:00
|
|
|
if (walk->task->prio <= current->prio) {
|
2005-04-17 02:20:36 +04:00
|
|
|
list_add_tail(&ewp->list, &walk->list);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
list_add_tail(&ewp->list, &info->e_wait_q[sr].list);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Puts current task to sleep. Caller must hold queue lock. After return
|
|
|
|
* lock isn't held.
|
|
|
|
* sr: SEND or RECV
|
|
|
|
*/
|
|
|
|
static int wq_sleep(struct mqueue_inode_info *info, int sr,
|
2010-04-03 00:40:20 +04:00
|
|
|
ktime_t *timeout, struct ext_wait_queue *ewp)
|
2017-02-28 01:28:21 +03:00
|
|
|
__releases(&info->lock)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
int retval;
|
|
|
|
signed long time;
|
|
|
|
|
|
|
|
wq_add(info, sr, ewp);
|
|
|
|
|
|
|
|
for (;;) {
|
2020-02-04 04:34:36 +03:00
|
|
|
/* memory barrier not required, we hold info->lock */
|
2015-05-04 17:02:46 +03:00
|
|
|
__set_current_state(TASK_INTERRUPTIBLE);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
spin_unlock(&info->lock);
|
2011-11-01 04:06:35 +04:00
|
|
|
time = schedule_hrtimeout_range_clock(timeout, 0,
|
|
|
|
HRTIMER_MODE_ABS, CLOCK_REALTIME);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2020-02-04 04:34:36 +03:00
|
|
|
if (READ_ONCE(ewp->state) == STATE_READY) {
|
|
|
|
/* see MQ_BARRIER for purpose/pairing */
|
|
|
|
smp_acquire__after_ctrl_dep();
|
2005-04-17 02:20:36 +04:00
|
|
|
retval = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
spin_lock(&info->lock);
|
2020-02-04 04:34:36 +03:00
|
|
|
|
|
|
|
/* we hold info->lock, so no memory barrier required */
|
|
|
|
if (READ_ONCE(ewp->state) == STATE_READY) {
|
2005-04-17 02:20:36 +04:00
|
|
|
retval = 0;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
if (signal_pending(current)) {
|
|
|
|
retval = -ERESTARTSYS;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (time == 0) {
|
|
|
|
retval = -ETIMEDOUT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
list_del(&ewp->list);
|
|
|
|
out_unlock:
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
out:
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns waiting task that should be serviced first or NULL if none exists
|
|
|
|
*/
|
|
|
|
static struct ext_wait_queue *wq_get_first_waiter(
|
|
|
|
struct mqueue_inode_info *info, int sr)
|
|
|
|
{
|
|
|
|
struct list_head *ptr;
|
|
|
|
|
|
|
|
ptr = info->e_wait_q[sr].list.prev;
|
|
|
|
if (ptr == &info->e_wait_q[sr].list)
|
|
|
|
return NULL;
|
|
|
|
return list_entry(ptr, struct ext_wait_queue, list);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
static inline void set_cookie(struct sk_buff *skb, char code)
|
|
|
|
{
|
2014-01-28 05:07:04 +04:00
|
|
|
((char *)skb->data)[NOTIFY_COOKIE_LEN-1] = code;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The next function is only to split too long sys_mq_timedsend
|
|
|
|
*/
|
|
|
|
static void __do_notify(struct mqueue_inode_info *info)
|
|
|
|
{
|
|
|
|
/* notification
|
|
|
|
* invoked when there is registered process and there isn't process
|
|
|
|
* waiting synchronously for message AND state of queue changed from
|
|
|
|
* empty to not empty. Here we are sure that no one is waiting
|
|
|
|
* synchronously. */
|
|
|
|
if (info->notify_owner &&
|
|
|
|
info->attr.mq_curmsgs == 1) {
|
|
|
|
switch (info->notify.sigev_notify) {
|
|
|
|
case SIGEV_NONE:
|
|
|
|
break;
|
ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
longer works if the sender doesn't have rights to send a signal.
Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
to avoid check_kill_permission().
This needs the additional notify.sigev_signo != 0 check, shouldn't we
change do_mq_notify() to deny sigev_signo == 0 ?
Test-case:
#include <signal.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/wait.h>
#include <assert.h>
static int notified;
static void sigh(int sig)
{
notified = 1;
}
int main(void)
{
signal(SIGIO, sigh);
int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
assert(fd >= 0);
struct sigevent se = {
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGIO,
};
assert(mq_notify(fd, &se) == 0);
if (!fork()) {
assert(setuid(1) == 0);
mq_send(fd, "",1,0);
return 0;
}
wait(NULL);
mq_unlink("/mq");
assert(notified);
return 0;
}
[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
Reported-by: Yoji <yoji.fujihar.min@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-08 04:35:39 +03:00
|
|
|
case SIGEV_SIGNAL: {
|
|
|
|
struct kernel_siginfo sig_i;
|
|
|
|
struct task_struct *task;
|
|
|
|
|
|
|
|
/* do_mq_notify() accepts sigev_signo == 0, why?? */
|
|
|
|
if (!info->notify.sigev_signo)
|
|
|
|
break;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2018-01-06 02:27:42 +03:00
|
|
|
clear_siginfo(&sig_i);
|
2005-04-17 02:20:36 +04:00
|
|
|
sig_i.si_signo = info->notify.sigev_signo;
|
|
|
|
sig_i.si_errno = 0;
|
|
|
|
sig_i.si_code = SI_MESGQ;
|
|
|
|
sig_i.si_value = info->notify.sigev_value;
|
user namespace: make signal.c respect user namespaces
ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
user namespace. (new, thanks Oleg)
__send_signal: convert current's uid to the recipient's user namespace
for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
again :)
do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
user namespace
ptrace_signal maps parent's uid into current's user namespace before
including in signal to current. IIUC Oleg has argued that this shouldn't
matter as the debugger will play with it, but it seems like not converting
the value currently being set is misleading.
Changelog:
Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
simplify callers and help make clear what we are translating
(which uid into which namespace). Passing the target task would
make callers even easier to read, but we pass in user_ns because
current_user_ns() != task_cred_xxx(current, user_ns).
Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
in ptrace_signal().
Sep 23: In send_signal(), detect when (user) signal is coming from an
ancestor or unrelated user namespace. Pass that on to __send_signal,
which sets si_uid to 0 or overflowuid if needed.
Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all
SI_FROMKERNEL cases at callers, because we can't assume sender is
current in those cases.
Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid
Nov 10: (akpm) make the !CONFIG_USER_NS case clearer
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Serge Hallyn <serge.hallyn@canonical.com>
Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)
Eric Biederman pointed out that passing info is a bug and could lead to a
NULL pointer deref to boot.
A collection of signal, securebits, filecaps, cap_bounds, and a few other
ltp tests passed with this kernel.
Changelog:
Nov 18: previous patch missed a leading '&'
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: ipc/mqueue: lock() => unlock() typo
There was a double lock typo introduced in b085f4bd6b21 "user namespace:
make signal.c respect user namespaces"
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:11:37 +04:00
|
|
|
rcu_read_lock();
|
ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
longer works if the sender doesn't have rights to send a signal.
Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
to avoid check_kill_permission().
This needs the additional notify.sigev_signo != 0 check, shouldn't we
change do_mq_notify() to deny sigev_signo == 0 ?
Test-case:
#include <signal.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/wait.h>
#include <assert.h>
static int notified;
static void sigh(int sig)
{
notified = 1;
}
int main(void)
{
signal(SIGIO, sigh);
int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
assert(fd >= 0);
struct sigevent se = {
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGIO,
};
assert(mq_notify(fd, &se) == 0);
if (!fork()) {
assert(setuid(1) == 0);
mq_send(fd, "",1,0);
return 0;
}
wait(NULL);
mq_unlink("/mq");
assert(notified);
return 0;
}
[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
Reported-by: Yoji <yoji.fujihar.min@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-08 04:35:39 +03:00
|
|
|
/* map current pid/uid into info->owner's namespaces */
|
2009-01-08 05:08:50 +03:00
|
|
|
sig_i.si_pid = task_tgid_nr_ns(current,
|
|
|
|
ns_of_pid(info->notify_owner));
|
ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
longer works if the sender doesn't have rights to send a signal.
Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
to avoid check_kill_permission().
This needs the additional notify.sigev_signo != 0 check, shouldn't we
change do_mq_notify() to deny sigev_signo == 0 ?
Test-case:
#include <signal.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/wait.h>
#include <assert.h>
static int notified;
static void sigh(int sig)
{
notified = 1;
}
int main(void)
{
signal(SIGIO, sigh);
int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
assert(fd >= 0);
struct sigevent se = {
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGIO,
};
assert(mq_notify(fd, &se) == 0);
if (!fork()) {
assert(setuid(1) == 0);
mq_send(fd, "",1,0);
return 0;
}
wait(NULL);
mq_unlink("/mq");
assert(notified);
return 0;
}
[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
Reported-by: Yoji <yoji.fujihar.min@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-08 04:35:39 +03:00
|
|
|
sig_i.si_uid = from_kuid_munged(info->notify_user_ns,
|
|
|
|
current_uid());
|
|
|
|
/*
|
|
|
|
* We can't use kill_pid_info(), this signal should
|
|
|
|
* bypass check_kill_permission(). It is from kernel
|
|
|
|
* but si_fromuser() can't know this.
|
|
|
|
* We do check the self_exec_id, to avoid sending
|
|
|
|
* signals to programs that don't expect them.
|
|
|
|
*/
|
|
|
|
task = pid_task(info->notify_owner, PIDTYPE_TGID);
|
|
|
|
if (task && task->self_exec_id ==
|
|
|
|
info->notify_self_exec_id) {
|
|
|
|
do_send_sig_info(info->notify.sigev_signo,
|
|
|
|
&sig_i, task, PIDTYPE_TGID);
|
|
|
|
}
|
user namespace: make signal.c respect user namespaces
ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
user namespace. (new, thanks Oleg)
__send_signal: convert current's uid to the recipient's user namespace
for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
again :)
do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
user namespace
ptrace_signal maps parent's uid into current's user namespace before
including in signal to current. IIUC Oleg has argued that this shouldn't
matter as the debugger will play with it, but it seems like not converting
the value currently being set is misleading.
Changelog:
Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
simplify callers and help make clear what we are translating
(which uid into which namespace). Passing the target task would
make callers even easier to read, but we pass in user_ns because
current_user_ns() != task_cred_xxx(current, user_ns).
Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
in ptrace_signal().
Sep 23: In send_signal(), detect when (user) signal is coming from an
ancestor or unrelated user namespace. Pass that on to __send_signal,
which sets si_uid to 0 or overflowuid if needed.
Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all
SI_FROMKERNEL cases at callers, because we can't assume sender is
current in those cases.
Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid
Nov 10: (akpm) make the !CONFIG_USER_NS case clearer
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Serge Hallyn <serge.hallyn@canonical.com>
Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)
Eric Biederman pointed out that passing info is a bug and could lead to a
NULL pointer deref to boot.
A collection of signal, securebits, filecaps, cap_bounds, and a few other
ltp tests passed with this kernel.
Changelog:
Nov 18: previous patch missed a leading '&'
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: ipc/mqueue: lock() => unlock() typo
There was a double lock typo introduced in b085f4bd6b21 "user namespace:
make signal.c respect user namespaces"
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:11:37 +04:00
|
|
|
rcu_read_unlock();
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
longer works if the sender doesn't have rights to send a signal.
Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
to avoid check_kill_permission().
This needs the additional notify.sigev_signo != 0 check, shouldn't we
change do_mq_notify() to deny sigev_signo == 0 ?
Test-case:
#include <signal.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/wait.h>
#include <assert.h>
static int notified;
static void sigh(int sig)
{
notified = 1;
}
int main(void)
{
signal(SIGIO, sigh);
int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
assert(fd >= 0);
struct sigevent se = {
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGIO,
};
assert(mq_notify(fd, &se) == 0);
if (!fork()) {
assert(setuid(1) == 0);
mq_send(fd, "",1,0);
return 0;
}
wait(NULL);
mq_unlink("/mq");
assert(notified);
return 0;
}
[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
Reported-by: Yoji <yoji.fujihar.min@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-08 04:35:39 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
case SIGEV_THREAD:
|
|
|
|
set_cookie(info->notify_cookie, NOTIFY_WOKENUP);
|
2007-10-11 08:14:03 +04:00
|
|
|
netlink_sendskb(info->notify_sock, info->notify_cookie);
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
/* after notification unregisters process */
|
2006-10-02 13:17:26 +04:00
|
|
|
put_pid(info->notify_owner);
|
2011-11-17 10:57:55 +04:00
|
|
|
put_user_ns(info->notify_user_ns);
|
2006-10-02 13:17:26 +04:00
|
|
|
info->notify_owner = NULL;
|
2011-11-17 10:57:55 +04:00
|
|
|
info->notify_user_ns = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
wake_up(&info->wait_q);
|
|
|
|
}
|
|
|
|
|
2018-04-13 14:58:00 +03:00
|
|
|
static int prepare_timeout(const struct __kernel_timespec __user *u_abs_timeout,
|
2017-08-03 05:51:11 +03:00
|
|
|
struct timespec64 *ts)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2017-08-03 05:51:11 +03:00
|
|
|
if (get_timespec64(ts, u_abs_timeout))
|
2010-04-03 00:40:20 +04:00
|
|
|
return -EFAULT;
|
2017-08-03 05:51:11 +03:00
|
|
|
if (!timespec64_valid(ts))
|
2010-04-03 00:40:20 +04:00
|
|
|
return -EINVAL;
|
|
|
|
return 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void remove_notification(struct mqueue_inode_info *info)
|
|
|
|
{
|
2006-10-02 13:17:26 +04:00
|
|
|
if (info->notify_owner != NULL &&
|
2005-04-17 02:20:36 +04:00
|
|
|
info->notify.sigev_notify == SIGEV_THREAD) {
|
|
|
|
set_cookie(info->notify_cookie, NOTIFY_REMOVED);
|
2007-10-11 08:14:03 +04:00
|
|
|
netlink_sendskb(info->notify_sock, info->notify_cookie);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2006-10-02 13:17:26 +04:00
|
|
|
put_pid(info->notify_owner);
|
2011-11-17 10:57:55 +04:00
|
|
|
put_user_ns(info->notify_user_ns);
|
2006-10-02 13:17:26 +04:00
|
|
|
info->notify_owner = NULL;
|
2011-11-17 10:57:55 +04:00
|
|
|
info->notify_user_ns = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2017-12-02 01:51:39 +03:00
|
|
|
static int prepare_open(struct dentry *dentry, int oflag, int ro,
|
|
|
|
umode_t mode, struct filename *name,
|
2009-04-07 06:01:08 +04:00
|
|
|
struct mq_attr *attr)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2008-11-14 02:39:22 +03:00
|
|
|
static const int oflag2acc[O_ACCMODE] = { MAY_READ, MAY_WRITE,
|
|
|
|
MAY_READ | MAY_WRITE };
|
2012-06-26 21:58:53 +04:00
|
|
|
int acc;
|
2017-12-02 01:51:39 +03:00
|
|
|
|
2017-12-02 01:57:02 +03:00
|
|
|
if (d_really_is_negative(dentry)) {
|
|
|
|
if (!(oflag & O_CREAT))
|
2017-12-02 01:51:39 +03:00
|
|
|
return -ENOENT;
|
2017-12-02 01:57:02 +03:00
|
|
|
if (ro)
|
|
|
|
return ro;
|
|
|
|
audit_inode_parent_hidden(name, dentry->d_parent);
|
|
|
|
return vfs_mkobj(dentry, mode & ~current_umask(),
|
|
|
|
mqueue_create_attr, attr);
|
2017-12-02 01:51:39 +03:00
|
|
|
}
|
2017-12-02 01:57:02 +03:00
|
|
|
/* it already existed */
|
|
|
|
audit_inode(name, dentry, 0);
|
|
|
|
if ((oflag & (O_CREAT|O_EXCL)) == (O_CREAT|O_EXCL))
|
|
|
|
return -EEXIST;
|
2012-06-26 21:58:53 +04:00
|
|
|
if ((oflag & O_ACCMODE) == (O_RDWR | O_WRONLY))
|
2017-12-02 01:34:22 +03:00
|
|
|
return -EINVAL;
|
2012-06-26 21:58:53 +04:00
|
|
|
acc = oflag2acc[oflag & O_ACCMODE];
|
2021-01-21 16:19:24 +03:00
|
|
|
return inode_permission(&init_user_ns, d_inode(dentry), acc);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
|
|
|
|
struct mq_attr *attr)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2018-03-24 19:28:14 +03:00
|
|
|
struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt;
|
|
|
|
struct dentry *root = mnt->mnt_root;
|
2012-10-10 23:25:28 +04:00
|
|
|
struct filename *name;
|
2017-12-02 02:01:09 +03:00
|
|
|
struct path path;
|
2005-04-17 02:20:36 +04:00
|
|
|
int fd, error;
|
2012-08-06 10:18:17 +04:00
|
|
|
int ro;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
audit_mq_open(oflag, mode, attr);
|
2006-05-25 01:09:55 +04:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
if (IS_ERR(name = getname(u_name)))
|
|
|
|
return PTR_ERR(name);
|
|
|
|
|
2008-05-03 23:28:45 +04:00
|
|
|
fd = get_unused_fd_flags(O_CLOEXEC);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (fd < 0)
|
|
|
|
goto out_putname;
|
|
|
|
|
2012-08-06 10:18:17 +04:00
|
|
|
ro = mnt_want_write(mnt); /* we'll drop it in any case */
|
2016-01-22 23:40:57 +03:00
|
|
|
inode_lock(d_inode(root));
|
2012-10-10 23:25:28 +04:00
|
|
|
path.dentry = lookup_one_len(name->name, root, strlen(name->name));
|
2012-06-26 21:58:53 +04:00
|
|
|
if (IS_ERR(path.dentry)) {
|
|
|
|
error = PTR_ERR(path.dentry);
|
2010-02-23 10:04:28 +03:00
|
|
|
goto out_putfd;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2012-08-06 10:18:17 +04:00
|
|
|
path.mnt = mntget(mnt);
|
2017-12-02 01:51:39 +03:00
|
|
|
error = prepare_open(path.dentry, oflag, ro, mode, name, attr);
|
|
|
|
if (!error) {
|
|
|
|
struct file *file = dentry_open(&path, oflag, current_cred());
|
|
|
|
if (!IS_ERR(file))
|
|
|
|
fd_install(fd, file);
|
|
|
|
else
|
|
|
|
error = PTR_ERR(file);
|
2006-01-14 23:29:55 +03:00
|
|
|
}
|
2012-06-26 21:58:53 +04:00
|
|
|
path_put(&path);
|
2006-01-14 23:29:55 +03:00
|
|
|
out_putfd:
|
2012-06-26 21:58:53 +04:00
|
|
|
if (error) {
|
|
|
|
put_unused_fd(fd);
|
|
|
|
fd = error;
|
|
|
|
}
|
2016-01-22 23:40:57 +03:00
|
|
|
inode_unlock(d_inode(root));
|
2013-03-23 02:04:51 +04:00
|
|
|
if (!ro)
|
|
|
|
mnt_drop_write(mnt);
|
2005-04-17 02:20:36 +04:00
|
|
|
out_putname:
|
|
|
|
putname(name);
|
|
|
|
return fd;
|
|
|
|
}
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
SYSCALL_DEFINE4(mq_open, const char __user *, u_name, int, oflag, umode_t, mode,
|
|
|
|
struct mq_attr __user *, u_attr)
|
|
|
|
{
|
|
|
|
struct mq_attr attr;
|
|
|
|
if (u_attr && copy_from_user(&attr, u_attr, sizeof(struct mq_attr)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return do_mq_open(u_name, oflag, mode, u_attr ? &attr : NULL);
|
|
|
|
}
|
|
|
|
|
2009-01-14 16:14:27 +03:00
|
|
|
SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
int err;
|
2012-10-10 23:25:28 +04:00
|
|
|
struct filename *name;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct dentry *dentry;
|
|
|
|
struct inode *inode = NULL;
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
struct ipc_namespace *ipc_ns = current->nsproxy->ipc_ns;
|
2012-08-06 10:18:17 +04:00
|
|
|
struct vfsmount *mnt = ipc_ns->mq_mnt;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
name = getname(u_name);
|
|
|
|
if (IS_ERR(name))
|
|
|
|
return PTR_ERR(name);
|
|
|
|
|
2013-07-09 02:59:36 +04:00
|
|
|
audit_inode_parent_hidden(name, mnt->mnt_root);
|
2012-08-06 10:18:17 +04:00
|
|
|
err = mnt_want_write(mnt);
|
|
|
|
if (err)
|
|
|
|
goto out_name;
|
2016-01-22 23:40:57 +03:00
|
|
|
inode_lock_nested(d_inode(mnt->mnt_root), I_MUTEX_PARENT);
|
2012-10-10 23:25:28 +04:00
|
|
|
dentry = lookup_one_len(name->name, mnt->mnt_root,
|
|
|
|
strlen(name->name));
|
2005-04-17 02:20:36 +04:00
|
|
|
if (IS_ERR(dentry)) {
|
|
|
|
err = PTR_ERR(dentry);
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2015-03-18 01:26:12 +03:00
|
|
|
inode = d_inode(dentry);
|
2012-08-06 10:18:17 +04:00
|
|
|
if (!inode) {
|
|
|
|
err = -ENOENT;
|
|
|
|
} else {
|
2010-10-23 19:11:40 +04:00
|
|
|
ihold(inode);
|
2021-01-21 16:19:33 +03:00
|
|
|
err = vfs_unlink(&init_user_ns, d_inode(dentry->d_parent),
|
|
|
|
dentry, NULL);
|
2012-08-06 10:18:17 +04:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
dput(dentry);
|
|
|
|
|
|
|
|
out_unlock:
|
2016-01-22 23:40:57 +03:00
|
|
|
inode_unlock(d_inode(mnt->mnt_root));
|
2005-04-17 02:20:36 +04:00
|
|
|
if (inode)
|
|
|
|
iput(inode);
|
2012-08-06 10:18:17 +04:00
|
|
|
mnt_drop_write(mnt);
|
|
|
|
out_name:
|
|
|
|
putname(name);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Pipelined send and receive functions.
|
|
|
|
*
|
|
|
|
* If a receiver finds no waiting message, then it registers itself in the
|
|
|
|
* list of waiting receivers. A sender checks that list before adding the new
|
|
|
|
* message into the message array. If there is a waiting receiver, then it
|
|
|
|
* bypasses the message array and directly hands the message over to the
|
2015-05-04 17:02:46 +03:00
|
|
|
* receiver. The receiver accepts the message and returns without grabbing the
|
|
|
|
* queue spinlock:
|
|
|
|
*
|
|
|
|
* - Set pointer to message.
|
|
|
|
* - Queue the receiver task for later wakeup (without the info->lock).
|
|
|
|
* - Update its state to STATE_READY. Now the receiver can continue.
|
|
|
|
* - Wake up the process after the lock is dropped. Should the process wake up
|
|
|
|
* before this wakeup (due to a timeout or a signal) it will either see
|
|
|
|
* STATE_READY and continue or acquire the lock to check the state again.
|
2005-04-17 02:20:36 +04:00
|
|
|
*
|
|
|
|
* The same algorithm is used for senders.
|
|
|
|
*/
|
|
|
|
|
2020-02-04 04:34:32 +03:00
|
|
|
static inline void __pipelined_op(struct wake_q_head *wake_q,
|
2015-05-04 17:02:46 +03:00
|
|
|
struct mqueue_inode_info *info,
|
2020-02-04 04:34:32 +03:00
|
|
|
struct ext_wait_queue *this)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry
do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.
This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:
RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343
The race occurs as:
1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.
2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.
3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)
4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.
5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)
6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.
do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.
As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.
Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-23 03:41:49 +03:00
|
|
|
struct task_struct *task;
|
|
|
|
|
2020-02-04 04:34:32 +03:00
|
|
|
list_del(&this->list);
|
ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry
do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.
This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:
RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343
The race occurs as:
1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.
2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.
3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)
4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.
5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)
6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.
do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.
As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.
Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-23 03:41:49 +03:00
|
|
|
task = get_task_struct(this->task);
|
2020-02-04 04:34:36 +03:00
|
|
|
|
|
|
|
/* see MQ_BARRIER for purpose/pairing */
|
|
|
|
smp_store_release(&this->state, STATE_READY);
|
ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry
do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.
This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:
RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343
The race occurs as:
1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.
2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.
3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)
4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.
5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)
6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.
do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.
As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.
Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-23 03:41:49 +03:00
|
|
|
wake_q_add_safe(wake_q, task);
|
2020-02-04 04:34:32 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* pipelined_send() - send a message directly to the task waiting in
|
|
|
|
* sys_mq_timedreceive() (without inserting message into a queue).
|
|
|
|
*/
|
|
|
|
static inline void pipelined_send(struct wake_q_head *wake_q,
|
|
|
|
struct mqueue_inode_info *info,
|
|
|
|
struct msg_msg *message,
|
|
|
|
struct ext_wait_queue *receiver)
|
|
|
|
{
|
|
|
|
receiver->msg = message;
|
|
|
|
__pipelined_op(wake_q, info, receiver);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
|
|
|
|
* gets its message and put to the queue (we have one free place for sure). */
|
2015-05-04 17:02:46 +03:00
|
|
|
static inline void pipelined_receive(struct wake_q_head *wake_q,
|
|
|
|
struct mqueue_inode_info *info)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
struct ext_wait_queue *sender = wq_get_first_waiter(info, SEND);
|
|
|
|
|
|
|
|
if (!sender) {
|
|
|
|
/* for poll */
|
|
|
|
wake_up_interruptible(&info->wait_q);
|
|
|
|
return;
|
|
|
|
}
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test Old New
Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s
Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s
% slowdown from mqueue
cache thrashing ~8% ~.5%
Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436
Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:35 +04:00
|
|
|
if (msg_insert(sender->msg, info))
|
|
|
|
return;
|
2015-05-04 17:02:46 +03:00
|
|
|
|
2020-02-04 04:34:32 +03:00
|
|
|
__pipelined_op(wake_q, info, sender);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
|
|
|
|
size_t msg_len, unsigned int msg_prio,
|
2017-08-03 05:51:11 +03:00
|
|
|
struct timespec64 *ts)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2012-08-28 20:52:22 +04:00
|
|
|
struct fd f;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct inode *inode;
|
|
|
|
struct ext_wait_queue wait;
|
|
|
|
struct ext_wait_queue *receiver;
|
|
|
|
struct msg_msg *msg_ptr;
|
|
|
|
struct mqueue_inode_info *info;
|
2010-04-03 00:40:20 +04:00
|
|
|
ktime_t expires, *timeout = NULL;
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
struct posix_msg_tree_node *new_leaf = NULL;
|
2012-08-28 20:52:22 +04:00
|
|
|
int ret = 0;
|
2016-11-17 19:46:38 +03:00
|
|
|
DEFINE_WAKE_Q(wake_q);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
if (unlikely(msg_prio >= (unsigned long) MQ_PRIO_MAX))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
if (ts) {
|
2017-08-03 05:51:11 +03:00
|
|
|
expires = timespec64_to_ktime(*ts);
|
2017-06-28 04:32:36 +03:00
|
|
|
timeout = &expires;
|
|
|
|
}
|
|
|
|
|
|
|
|
audit_mq_sendrecv(mqdes, msg_len, msg_prio, ts);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-08-28 20:52:22 +04:00
|
|
|
f = fdget(mqdes);
|
|
|
|
if (unlikely(!f.file)) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2013-01-24 02:07:38 +04:00
|
|
|
inode = file_inode(f.file);
|
2012-08-28 20:52:22 +04:00
|
|
|
if (unlikely(f.file->f_op != &mqueue_file_operations)) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out_fput;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
info = MQUEUE_I(inode);
|
2014-11-01 00:44:57 +03:00
|
|
|
audit_file(f.file);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-08-28 20:52:22 +04:00
|
|
|
if (unlikely(!(f.file->f_mode & FMODE_WRITE))) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out_fput;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
if (unlikely(msg_len > info->attr.mq_msgsize)) {
|
|
|
|
ret = -EMSGSIZE;
|
|
|
|
goto out_fput;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* First try to allocate memory, before doing anything with
|
|
|
|
* existing queues. */
|
|
|
|
msg_ptr = load_msg(u_msg_ptr, msg_len);
|
|
|
|
if (IS_ERR(msg_ptr)) {
|
|
|
|
ret = PTR_ERR(msg_ptr);
|
|
|
|
goto out_fput;
|
|
|
|
}
|
|
|
|
msg_ptr->m_ts = msg_len;
|
|
|
|
msg_ptr->m_type = msg_prio;
|
|
|
|
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
/*
|
|
|
|
* msg_insert really wants us to have a valid, spare node struct so
|
|
|
|
* it doesn't have to kmalloc a GFP_ATOMIC allocation, but it will
|
|
|
|
* fall back to that if necessary.
|
|
|
|
*/
|
|
|
|
if (!info->node_cache)
|
|
|
|
new_leaf = kmalloc(sizeof(*new_leaf), GFP_KERNEL);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_lock(&info->lock);
|
|
|
|
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
if (!info->node_cache && new_leaf) {
|
|
|
|
/* Save our speculative allocation into the cache */
|
|
|
|
INIT_LIST_HEAD(&new_leaf->msg_list);
|
|
|
|
info->node_cache = new_leaf;
|
|
|
|
new_leaf = NULL;
|
|
|
|
} else {
|
|
|
|
kfree(new_leaf);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
if (info->attr.mq_curmsgs == info->attr.mq_maxmsg) {
|
2012-08-28 20:52:22 +04:00
|
|
|
if (f.file->f_flags & O_NONBLOCK) {
|
2005-04-17 02:20:36 +04:00
|
|
|
ret = -EAGAIN;
|
|
|
|
} else {
|
|
|
|
wait.task = current;
|
|
|
|
wait.msg = (void *) msg_ptr;
|
2020-02-04 04:34:36 +03:00
|
|
|
|
|
|
|
/* memory barrier not required, we hold info->lock */
|
|
|
|
WRITE_ONCE(wait.state, STATE_NONE);
|
2005-04-17 02:20:36 +04:00
|
|
|
ret = wq_sleep(info, SEND, timeout, &wait);
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
/*
|
|
|
|
* wq_sleep must be called with info->lock held, and
|
|
|
|
* returns with the lock released
|
|
|
|
*/
|
|
|
|
goto out_free;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
receiver = wq_get_first_waiter(info, RECV);
|
|
|
|
if (receiver) {
|
2015-05-04 17:02:46 +03:00
|
|
|
pipelined_send(&wake_q, info, msg_ptr, receiver);
|
2005-04-17 02:20:36 +04:00
|
|
|
} else {
|
|
|
|
/* adds message to the queue */
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
ret = msg_insert(msg_ptr, info);
|
|
|
|
if (ret)
|
|
|
|
goto out_unlock;
|
2005-04-17 02:20:36 +04:00
|
|
|
__do_notify(info);
|
|
|
|
}
|
|
|
|
inode->i_atime = inode->i_mtime = inode->i_ctime =
|
2016-09-14 17:48:04 +03:00
|
|
|
current_time(inode);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
out_unlock:
|
|
|
|
spin_unlock(&info->lock);
|
2015-05-04 17:02:46 +03:00
|
|
|
wake_up_q(&wake_q);
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
out_free:
|
|
|
|
if (ret)
|
|
|
|
free_msg(msg_ptr);
|
2005-04-17 02:20:36 +04:00
|
|
|
out_fput:
|
2012-08-28 20:52:22 +04:00
|
|
|
fdput(f);
|
2005-04-17 02:20:36 +04:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
static int do_mq_timedreceive(mqd_t mqdes, char __user *u_msg_ptr,
|
|
|
|
size_t msg_len, unsigned int __user *u_msg_prio,
|
2017-08-03 05:51:11 +03:00
|
|
|
struct timespec64 *ts)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
ssize_t ret;
|
|
|
|
struct msg_msg *msg_ptr;
|
2012-08-28 20:52:22 +04:00
|
|
|
struct fd f;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct inode *inode;
|
|
|
|
struct mqueue_inode_info *info;
|
|
|
|
struct ext_wait_queue wait;
|
2010-04-03 00:40:20 +04:00
|
|
|
ktime_t expires, *timeout = NULL;
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
struct posix_msg_tree_node *new_leaf = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
if (ts) {
|
2017-08-03 05:51:11 +03:00
|
|
|
expires = timespec64_to_ktime(*ts);
|
2010-04-03 00:40:20 +04:00
|
|
|
timeout = &expires;
|
2008-12-14 11:46:48 +03:00
|
|
|
}
|
2006-05-25 01:09:55 +04:00
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
audit_mq_sendrecv(mqdes, msg_len, 0, ts);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-08-28 20:52:22 +04:00
|
|
|
f = fdget(mqdes);
|
|
|
|
if (unlikely(!f.file)) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2013-01-24 02:07:38 +04:00
|
|
|
inode = file_inode(f.file);
|
2012-08-28 20:52:22 +04:00
|
|
|
if (unlikely(f.file->f_op != &mqueue_file_operations)) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out_fput;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
info = MQUEUE_I(inode);
|
2014-11-01 00:44:57 +03:00
|
|
|
audit_file(f.file);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-08-28 20:52:22 +04:00
|
|
|
if (unlikely(!(f.file->f_mode & FMODE_READ))) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out_fput;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/* checks if buffer is big enough */
|
|
|
|
if (unlikely(msg_len < info->attr.mq_msgsize)) {
|
|
|
|
ret = -EMSGSIZE;
|
|
|
|
goto out_fput;
|
|
|
|
}
|
|
|
|
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
/*
|
|
|
|
* msg_insert really wants us to have a valid, spare node struct so
|
|
|
|
* it doesn't have to kmalloc a GFP_ATOMIC allocation, but it will
|
|
|
|
* fall back to that if necessary.
|
|
|
|
*/
|
|
|
|
if (!info->node_cache)
|
|
|
|
new_leaf = kmalloc(sizeof(*new_leaf), GFP_KERNEL);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_lock(&info->lock);
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 03:26:38 +04:00
|
|
|
|
|
|
|
if (!info->node_cache && new_leaf) {
|
|
|
|
/* Save our speculative allocation into the cache */
|
|
|
|
INIT_LIST_HEAD(&new_leaf->msg_list);
|
|
|
|
info->node_cache = new_leaf;
|
|
|
|
} else {
|
|
|
|
kfree(new_leaf);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
if (info->attr.mq_curmsgs == 0) {
|
2012-08-28 20:52:22 +04:00
|
|
|
if (f.file->f_flags & O_NONBLOCK) {
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock(&info->lock);
|
|
|
|
ret = -EAGAIN;
|
|
|
|
} else {
|
|
|
|
wait.task = current;
|
2020-02-04 04:34:36 +03:00
|
|
|
|
|
|
|
/* memory barrier not required, we hold info->lock */
|
|
|
|
WRITE_ONCE(wait.state, STATE_NONE);
|
2005-04-17 02:20:36 +04:00
|
|
|
ret = wq_sleep(info, RECV, timeout, &wait);
|
|
|
|
msg_ptr = wait.msg;
|
|
|
|
}
|
|
|
|
} else {
|
2016-11-17 19:46:38 +03:00
|
|
|
DEFINE_WAKE_Q(wake_q);
|
2015-05-04 17:02:46 +03:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
msg_ptr = msg_get(info);
|
|
|
|
|
|
|
|
inode->i_atime = inode->i_mtime = inode->i_ctime =
|
2016-09-14 17:48:04 +03:00
|
|
|
current_time(inode);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/* There is now free space in queue. */
|
2015-05-04 17:02:46 +03:00
|
|
|
pipelined_receive(&wake_q, info);
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock(&info->lock);
|
2015-05-04 17:02:46 +03:00
|
|
|
wake_up_q(&wake_q);
|
2005-04-17 02:20:36 +04:00
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
if (ret == 0) {
|
|
|
|
ret = msg_ptr->m_ts;
|
|
|
|
|
|
|
|
if ((u_msg_prio && put_user(msg_ptr->m_type, u_msg_prio)) ||
|
|
|
|
store_msg(u_msg_ptr, msg_ptr, msg_ptr->m_ts)) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
}
|
|
|
|
free_msg(msg_ptr);
|
|
|
|
}
|
|
|
|
out_fput:
|
2012-08-28 20:52:22 +04:00
|
|
|
fdput(f);
|
2005-04-17 02:20:36 +04:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
|
|
|
|
size_t, msg_len, unsigned int, msg_prio,
|
2018-04-13 14:58:00 +03:00
|
|
|
const struct __kernel_timespec __user *, u_abs_timeout)
|
2017-06-28 04:32:36 +03:00
|
|
|
{
|
2017-08-03 05:51:11 +03:00
|
|
|
struct timespec64 ts, *p = NULL;
|
2017-06-28 04:32:36 +03:00
|
|
|
if (u_abs_timeout) {
|
|
|
|
int res = prepare_timeout(u_abs_timeout, &ts);
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
p = &ts;
|
|
|
|
}
|
|
|
|
return do_mq_timedsend(mqdes, u_msg_ptr, msg_len, msg_prio, p);
|
|
|
|
}
|
|
|
|
|
|
|
|
SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr,
|
|
|
|
size_t, msg_len, unsigned int __user *, u_msg_prio,
|
2018-04-13 14:58:00 +03:00
|
|
|
const struct __kernel_timespec __user *, u_abs_timeout)
|
2017-06-28 04:32:36 +03:00
|
|
|
{
|
2017-08-03 05:51:11 +03:00
|
|
|
struct timespec64 ts, *p = NULL;
|
2017-06-28 04:32:36 +03:00
|
|
|
if (u_abs_timeout) {
|
|
|
|
int res = prepare_timeout(u_abs_timeout, &ts);
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
p = &ts;
|
|
|
|
}
|
|
|
|
return do_mq_timedreceive(mqdes, u_msg_ptr, msg_len, u_msg_prio, p);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* Notes: the case when user wants us to deregister (with NULL as pointer)
|
|
|
|
* and he isn't currently owner of notification, will be silently discarded.
|
|
|
|
* It isn't explicitly defined in the POSIX.
|
|
|
|
*/
|
2017-06-28 04:32:36 +03:00
|
|
|
static int do_mq_notify(mqd_t mqdes, const struct sigevent *notification)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2012-08-28 20:52:22 +04:00
|
|
|
int ret;
|
|
|
|
struct fd f;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct sock *sock;
|
|
|
|
struct inode *inode;
|
|
|
|
struct mqueue_inode_info *info;
|
|
|
|
struct sk_buff *nc;
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
audit_mq_notify(mqdes, notification);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2008-12-10 15:16:12 +03:00
|
|
|
nc = NULL;
|
|
|
|
sock = NULL;
|
2017-06-28 04:32:36 +03:00
|
|
|
if (notification != NULL) {
|
|
|
|
if (unlikely(notification->sigev_notify != SIGEV_NONE &&
|
|
|
|
notification->sigev_notify != SIGEV_SIGNAL &&
|
|
|
|
notification->sigev_notify != SIGEV_THREAD))
|
2005-04-17 02:20:36 +04:00
|
|
|
return -EINVAL;
|
2017-06-28 04:32:36 +03:00
|
|
|
if (notification->sigev_notify == SIGEV_SIGNAL &&
|
|
|
|
!valid_signal(notification->sigev_signo)) {
|
2005-04-17 02:20:36 +04:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2017-06-28 04:32:36 +03:00
|
|
|
if (notification->sigev_notify == SIGEV_THREAD) {
|
2007-11-07 13:42:09 +03:00
|
|
|
long timeo;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/* create the notify skb */
|
|
|
|
nc = alloc_skb(NOTIFY_COOKIE_LEN, GFP_KERNEL);
|
2019-09-26 02:48:17 +03:00
|
|
|
if (!nc)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
if (copy_from_user(nc->data,
|
2017-06-28 04:32:36 +03:00
|
|
|
notification->sigev_value.sival_ptr,
|
2005-04-17 02:20:36 +04:00
|
|
|
NOTIFY_COOKIE_LEN)) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EFAULT;
|
2019-09-26 02:48:17 +03:00
|
|
|
goto free_skb;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* TODO: add a header? */
|
|
|
|
skb_put(nc, NOTIFY_COOKIE_LEN);
|
|
|
|
/* and attach it to the socket */
|
|
|
|
retry:
|
2017-06-28 04:32:36 +03:00
|
|
|
f = fdget(notification->sigev_signo);
|
2012-08-28 20:52:22 +04:00
|
|
|
if (!f.file) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2012-08-28 20:52:22 +04:00
|
|
|
sock = netlink_getsockbyfilp(f.file);
|
|
|
|
fdput(f);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (IS_ERR(sock)) {
|
|
|
|
ret = PTR_ERR(sock);
|
2019-09-26 02:48:17 +03:00
|
|
|
goto free_skb;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2007-11-07 13:42:09 +03:00
|
|
|
timeo = MAX_SCHEDULE_TIMEOUT;
|
2008-06-05 22:23:39 +04:00
|
|
|
ret = netlink_attachskb(sock, nc, &timeo, NULL);
|
2017-07-09 23:19:55 +03:00
|
|
|
if (ret == 1) {
|
|
|
|
sock = NULL;
|
2010-02-23 10:04:26 +03:00
|
|
|
goto retry;
|
2017-07-09 23:19:55 +03:00
|
|
|
}
|
2019-09-26 02:48:17 +03:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-08-28 20:52:22 +04:00
|
|
|
f = fdget(mqdes);
|
|
|
|
if (!f.file) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2013-01-24 02:07:38 +04:00
|
|
|
inode = file_inode(f.file);
|
2012-08-28 20:52:22 +04:00
|
|
|
if (unlikely(f.file->f_op != &mqueue_file_operations)) {
|
2010-02-23 10:04:26 +03:00
|
|
|
ret = -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out_fput;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
info = MQUEUE_I(inode);
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
spin_lock(&info->lock);
|
2017-06-28 04:32:36 +03:00
|
|
|
if (notification == NULL) {
|
2006-10-02 13:17:26 +04:00
|
|
|
if (info->notify_owner == task_tgid(current)) {
|
2005-04-17 02:20:36 +04:00
|
|
|
remove_notification(info);
|
2016-09-14 17:48:04 +03:00
|
|
|
inode->i_atime = inode->i_ctime = current_time(inode);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2006-10-02 13:17:26 +04:00
|
|
|
} else if (info->notify_owner != NULL) {
|
2005-04-17 02:20:36 +04:00
|
|
|
ret = -EBUSY;
|
|
|
|
} else {
|
2017-06-28 04:32:36 +03:00
|
|
|
switch (notification->sigev_notify) {
|
2005-04-17 02:20:36 +04:00
|
|
|
case SIGEV_NONE:
|
|
|
|
info->notify.sigev_notify = SIGEV_NONE;
|
|
|
|
break;
|
|
|
|
case SIGEV_THREAD:
|
|
|
|
info->notify_sock = sock;
|
|
|
|
info->notify_cookie = nc;
|
|
|
|
sock = NULL;
|
|
|
|
nc = NULL;
|
|
|
|
info->notify.sigev_notify = SIGEV_THREAD;
|
|
|
|
break;
|
|
|
|
case SIGEV_SIGNAL:
|
2017-06-28 04:32:36 +03:00
|
|
|
info->notify.sigev_signo = notification->sigev_signo;
|
|
|
|
info->notify.sigev_value = notification->sigev_value;
|
2005-04-17 02:20:36 +04:00
|
|
|
info->notify.sigev_notify = SIGEV_SIGNAL;
|
ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
longer works if the sender doesn't have rights to send a signal.
Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
to avoid check_kill_permission().
This needs the additional notify.sigev_signo != 0 check, shouldn't we
change do_mq_notify() to deny sigev_signo == 0 ?
Test-case:
#include <signal.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/wait.h>
#include <assert.h>
static int notified;
static void sigh(int sig)
{
notified = 1;
}
int main(void)
{
signal(SIGIO, sigh);
int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
assert(fd >= 0);
struct sigevent se = {
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGIO,
};
assert(mq_notify(fd, &se) == 0);
if (!fork()) {
assert(setuid(1) == 0);
mq_send(fd, "",1,0);
return 0;
}
wait(NULL);
mq_unlink("/mq");
assert(notified);
return 0;
}
[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
Reported-by: Yoji <yoji.fujihar.min@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-08 04:35:39 +03:00
|
|
|
info->notify_self_exec_id = current->self_exec_id;
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
}
|
2006-10-02 13:17:26 +04:00
|
|
|
|
|
|
|
info->notify_owner = get_pid(task_tgid(current));
|
2011-11-17 10:57:55 +04:00
|
|
|
info->notify_user_ns = get_user_ns(current_user_ns());
|
2016-09-14 17:48:04 +03:00
|
|
|
inode->i_atime = inode->i_ctime = current_time(inode);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
out_fput:
|
2012-08-28 20:52:22 +04:00
|
|
|
fdput(f);
|
2005-04-17 02:20:36 +04:00
|
|
|
out:
|
2014-01-28 05:07:06 +04:00
|
|
|
if (sock)
|
2005-04-17 02:20:36 +04:00
|
|
|
netlink_detachskb(sock, nc);
|
2019-09-26 02:48:14 +03:00
|
|
|
else
|
2019-09-26 02:48:17 +03:00
|
|
|
free_skb:
|
2005-04-17 02:20:36 +04:00
|
|
|
dev_kfree_skb(nc);
|
2014-01-28 05:07:06 +04:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
SYSCALL_DEFINE2(mq_notify, mqd_t, mqdes,
|
|
|
|
const struct sigevent __user *, u_notification)
|
|
|
|
{
|
|
|
|
struct sigevent n, *p = NULL;
|
|
|
|
if (u_notification) {
|
|
|
|
if (copy_from_user(&n, u_notification, sizeof(struct sigevent)))
|
|
|
|
return -EFAULT;
|
|
|
|
p = &n;
|
|
|
|
}
|
|
|
|
return do_mq_notify(mqdes, p);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int do_mq_getsetattr(int mqdes, struct mq_attr *new, struct mq_attr *old)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2012-08-28 20:52:22 +04:00
|
|
|
struct fd f;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct inode *inode;
|
|
|
|
struct mqueue_inode_info *info;
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
if (new && (new->mq_flags & (~O_NONBLOCK)))
|
|
|
|
return -EINVAL;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-08-28 20:52:22 +04:00
|
|
|
f = fdget(mqdes);
|
2017-06-28 04:32:36 +03:00
|
|
|
if (!f.file)
|
|
|
|
return -EBADF;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-08-28 20:52:22 +04:00
|
|
|
if (unlikely(f.file->f_op != &mqueue_file_operations)) {
|
2017-06-28 04:32:36 +03:00
|
|
|
fdput(f);
|
|
|
|
return -EBADF;
|
2010-02-23 10:04:26 +03:00
|
|
|
}
|
2017-06-28 04:32:36 +03:00
|
|
|
|
|
|
|
inode = file_inode(f.file);
|
2005-04-17 02:20:36 +04:00
|
|
|
info = MQUEUE_I(inode);
|
|
|
|
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
if (old) {
|
|
|
|
*old = info->attr;
|
|
|
|
old->mq_flags = f.file->f_flags & O_NONBLOCK;
|
|
|
|
}
|
|
|
|
if (new) {
|
|
|
|
audit_mq_getsetattr(mqdes, new);
|
2012-08-28 20:52:22 +04:00
|
|
|
spin_lock(&f.file->f_lock);
|
2017-06-28 04:32:36 +03:00
|
|
|
if (new->mq_flags & O_NONBLOCK)
|
2012-08-28 20:52:22 +04:00
|
|
|
f.file->f_flags |= O_NONBLOCK;
|
2005-04-17 02:20:36 +04:00
|
|
|
else
|
2012-08-28 20:52:22 +04:00
|
|
|
f.file->f_flags &= ~O_NONBLOCK;
|
|
|
|
spin_unlock(&f.file->f_lock);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-09-14 17:48:04 +03:00
|
|
|
inode->i_atime = inode->i_ctime = current_time(inode);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
spin_unlock(&info->lock);
|
2017-06-28 04:32:36 +03:00
|
|
|
fdput(f);
|
|
|
|
return 0;
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
SYSCALL_DEFINE3(mq_getsetattr, mqd_t, mqdes,
|
|
|
|
const struct mq_attr __user *, u_mqstat,
|
|
|
|
struct mq_attr __user *, u_omqstat)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct mq_attr mqstat, omqstat;
|
|
|
|
struct mq_attr *new = NULL, *old = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-06-28 04:32:36 +03:00
|
|
|
if (u_mqstat) {
|
|
|
|
new = &mqstat;
|
|
|
|
if (copy_from_user(new, u_mqstat, sizeof(struct mq_attr)))
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
if (u_omqstat)
|
|
|
|
old = &omqstat;
|
|
|
|
|
|
|
|
ret = do_mq_getsetattr(mqdes, new, old);
|
|
|
|
if (ret || !old)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (copy_to_user(u_omqstat, old, sizeof(struct mq_attr)))
|
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
|
|
|
|
struct compat_mq_attr {
|
|
|
|
compat_long_t mq_flags; /* message queue flags */
|
|
|
|
compat_long_t mq_maxmsg; /* maximum number of messages */
|
|
|
|
compat_long_t mq_msgsize; /* maximum message size */
|
|
|
|
compat_long_t mq_curmsgs; /* number of messages currently queued */
|
|
|
|
compat_long_t __reserved[4]; /* ignored for input, zeroed for output */
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline int get_compat_mq_attr(struct mq_attr *attr,
|
|
|
|
const struct compat_mq_attr __user *uattr)
|
|
|
|
{
|
|
|
|
struct compat_mq_attr v;
|
|
|
|
|
|
|
|
if (copy_from_user(&v, uattr, sizeof(*uattr)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
memset(attr, 0, sizeof(*attr));
|
|
|
|
attr->mq_flags = v.mq_flags;
|
|
|
|
attr->mq_maxmsg = v.mq_maxmsg;
|
|
|
|
attr->mq_msgsize = v.mq_msgsize;
|
|
|
|
attr->mq_curmsgs = v.mq_curmsgs;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int put_compat_mq_attr(const struct mq_attr *attr,
|
|
|
|
struct compat_mq_attr __user *uattr)
|
|
|
|
{
|
|
|
|
struct compat_mq_attr v;
|
|
|
|
|
|
|
|
memset(&v, 0, sizeof(v));
|
|
|
|
v.mq_flags = attr->mq_flags;
|
|
|
|
v.mq_maxmsg = attr->mq_maxmsg;
|
|
|
|
v.mq_msgsize = attr->mq_msgsize;
|
|
|
|
v.mq_curmsgs = attr->mq_curmsgs;
|
|
|
|
if (copy_to_user(uattr, &v, sizeof(*uattr)))
|
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
COMPAT_SYSCALL_DEFINE4(mq_open, const char __user *, u_name,
|
|
|
|
int, oflag, compat_mode_t, mode,
|
|
|
|
struct compat_mq_attr __user *, u_attr)
|
|
|
|
{
|
|
|
|
struct mq_attr attr, *p = NULL;
|
|
|
|
if (u_attr && oflag & O_CREAT) {
|
|
|
|
p = &attr;
|
|
|
|
if (get_compat_mq_attr(&attr, u_attr))
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
return do_mq_open(u_name, oflag, mode, p);
|
|
|
|
}
|
|
|
|
|
2018-04-13 14:58:23 +03:00
|
|
|
COMPAT_SYSCALL_DEFINE2(mq_notify, mqd_t, mqdes,
|
|
|
|
const struct compat_sigevent __user *, u_notification)
|
|
|
|
{
|
|
|
|
struct sigevent n, *p = NULL;
|
|
|
|
if (u_notification) {
|
|
|
|
if (get_compat_sigevent(&n, u_notification))
|
|
|
|
return -EFAULT;
|
|
|
|
if (n.sigev_notify == SIGEV_THREAD)
|
|
|
|
n.sigev_value.sival_ptr = compat_ptr(n.sigev_value.sival_int);
|
|
|
|
p = &n;
|
|
|
|
}
|
|
|
|
return do_mq_notify(mqdes, p);
|
|
|
|
}
|
|
|
|
|
|
|
|
COMPAT_SYSCALL_DEFINE3(mq_getsetattr, mqd_t, mqdes,
|
|
|
|
const struct compat_mq_attr __user *, u_mqstat,
|
|
|
|
struct compat_mq_attr __user *, u_omqstat)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct mq_attr mqstat, omqstat;
|
|
|
|
struct mq_attr *new = NULL, *old = NULL;
|
|
|
|
|
|
|
|
if (u_mqstat) {
|
|
|
|
new = &mqstat;
|
|
|
|
if (get_compat_mq_attr(new, u_mqstat))
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
if (u_omqstat)
|
|
|
|
old = &omqstat;
|
|
|
|
|
|
|
|
ret = do_mq_getsetattr(mqdes, new, old);
|
|
|
|
if (ret || !old)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (put_compat_mq_attr(old, u_omqstat))
|
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifdef CONFIG_COMPAT_32BIT_TIME
|
y2038: globally rename compat_time to old_time32
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:
Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.
The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:
old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()
As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.
I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.
This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-07-13 13:52:28 +03:00
|
|
|
static int compat_prepare_timeout(const struct old_timespec32 __user *p,
|
2017-08-03 05:51:11 +03:00
|
|
|
struct timespec64 *ts)
|
2017-06-28 04:32:36 +03:00
|
|
|
{
|
y2038: globally rename compat_time to old_time32
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:
Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.
The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:
old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()
As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.
I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.
This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-07-13 13:52:28 +03:00
|
|
|
if (get_old_timespec32(ts, p))
|
2017-06-28 04:32:36 +03:00
|
|
|
return -EFAULT;
|
2017-08-03 05:51:11 +03:00
|
|
|
if (!timespec64_valid(ts))
|
2017-06-28 04:32:36 +03:00
|
|
|
return -EINVAL;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-01-07 02:33:08 +03:00
|
|
|
SYSCALL_DEFINE5(mq_timedsend_time32, mqd_t, mqdes,
|
|
|
|
const char __user *, u_msg_ptr,
|
|
|
|
unsigned int, msg_len, unsigned int, msg_prio,
|
|
|
|
const struct old_timespec32 __user *, u_abs_timeout)
|
2017-06-28 04:32:36 +03:00
|
|
|
{
|
2017-08-03 05:51:11 +03:00
|
|
|
struct timespec64 ts, *p = NULL;
|
2017-06-28 04:32:36 +03:00
|
|
|
if (u_abs_timeout) {
|
|
|
|
int res = compat_prepare_timeout(u_abs_timeout, &ts);
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
p = &ts;
|
|
|
|
}
|
|
|
|
return do_mq_timedsend(mqdes, u_msg_ptr, msg_len, msg_prio, p);
|
|
|
|
}
|
|
|
|
|
2019-01-07 02:33:08 +03:00
|
|
|
SYSCALL_DEFINE5(mq_timedreceive_time32, mqd_t, mqdes,
|
|
|
|
char __user *, u_msg_ptr,
|
|
|
|
unsigned int, msg_len, unsigned int __user *, u_msg_prio,
|
|
|
|
const struct old_timespec32 __user *, u_abs_timeout)
|
2017-06-28 04:32:36 +03:00
|
|
|
{
|
2017-08-03 05:51:11 +03:00
|
|
|
struct timespec64 ts, *p = NULL;
|
2017-06-28 04:32:36 +03:00
|
|
|
if (u_abs_timeout) {
|
|
|
|
int res = compat_prepare_timeout(u_abs_timeout, &ts);
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
p = &ts;
|
|
|
|
}
|
|
|
|
return do_mq_timedreceive(mqdes, u_msg_ptr, msg_len, u_msg_prio, p);
|
|
|
|
}
|
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2007-02-12 11:55:39 +03:00
|
|
|
static const struct inode_operations mqueue_dir_inode_operations = {
|
2005-04-17 02:20:36 +04:00
|
|
|
.lookup = simple_lookup,
|
|
|
|
.create = mqueue_create,
|
|
|
|
.unlink = mqueue_unlink,
|
|
|
|
};
|
|
|
|
|
2007-02-12 11:55:35 +03:00
|
|
|
static const struct file_operations mqueue_file_operations = {
|
2005-04-17 02:20:36 +04:00
|
|
|
.flush = mqueue_flush_file,
|
|
|
|
.poll = mqueue_poll_file,
|
|
|
|
.read = mqueue_read_file,
|
llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-08-15 20:52:59 +04:00
|
|
|
.llseek = default_llseek,
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
2009-09-22 04:01:09 +04:00
|
|
|
static const struct super_operations mqueue_super_ops = {
|
2005-04-17 02:20:36 +04:00
|
|
|
.alloc_inode = mqueue_alloc_inode,
|
2019-04-16 05:30:30 +03:00
|
|
|
.free_inode = mqueue_free_inode,
|
2010-06-06 00:29:45 +04:00
|
|
|
.evict_inode = mqueue_evict_inode,
|
2005-04-17 02:20:36 +04:00
|
|
|
.statfs = simple_statfs,
|
|
|
|
};
|
|
|
|
|
2018-11-02 02:07:25 +03:00
|
|
|
static const struct fs_context_operations mqueue_fs_context_ops = {
|
|
|
|
.free = mqueue_fs_context_free,
|
|
|
|
.get_tree = mqueue_get_tree,
|
|
|
|
};
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
static struct file_system_type mqueue_fs_type = {
|
2018-11-02 02:07:25 +03:00
|
|
|
.name = "mqueue",
|
|
|
|
.init_fs_context = mqueue_init_fs_context,
|
|
|
|
.kill_sb = kill_litter_super,
|
|
|
|
.fs_flags = FS_USERNS_MOUNT,
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
int mq_init_ns(struct ipc_namespace *ns)
|
|
|
|
{
|
2018-11-02 02:07:25 +03:00
|
|
|
struct vfsmount *m;
|
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
ns->mq_queues_count = 0;
|
|
|
|
ns->mq_queues_max = DFLT_QUEUESMAX;
|
|
|
|
ns->mq_msg_max = DFLT_MSGMAX;
|
|
|
|
ns->mq_msgsize_max = DFLT_MSGSIZEMAX;
|
2012-06-01 03:26:33 +04:00
|
|
|
ns->mq_msg_default = DFLT_MSG;
|
|
|
|
ns->mq_msgsize_default = DFLT_MSGSIZE;
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
|
2018-11-02 02:07:25 +03:00
|
|
|
m = mq_create_mount(ns);
|
|
|
|
if (IS_ERR(m))
|
|
|
|
return PTR_ERR(m);
|
|
|
|
ns->mq_mnt = m;
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
void mq_clear_sbinfo(struct ipc_namespace *ns)
|
|
|
|
{
|
2018-03-24 19:28:14 +03:00
|
|
|
ns->mq_mnt->mnt_sb->s_fs_info = NULL;
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
void mq_put_mnt(struct ipc_namespace *ns)
|
|
|
|
{
|
2018-03-24 19:28:14 +03:00
|
|
|
kern_unmount(ns->mq_mnt);
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
static int __init init_mqueue_fs(void)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
|
|
|
mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
|
|
|
|
sizeof(struct mqueue_inode_info), 0,
|
2016-01-15 02:18:21 +03:00
|
|
|
SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT, init_once);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (mqueue_inode_cachep == NULL)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2022-02-14 21:18:14 +03:00
|
|
|
if (!setup_mq_sysctls(&init_ipc_ns)) {
|
|
|
|
pr_warn("sysctl registration failed\n");
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
error = register_filesystem(&mqueue_fs_type);
|
|
|
|
if (error)
|
|
|
|
goto out_sysctl;
|
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 06:01:10 +04:00
|
|
|
spin_lock_init(&mq_lock);
|
|
|
|
|
2011-12-09 09:38:50 +04:00
|
|
|
error = mq_init_ns(&init_ipc_ns);
|
|
|
|
if (error)
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out_filesystem;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_filesystem:
|
|
|
|
unregister_filesystem(&mqueue_fs_type);
|
|
|
|
out_sysctl:
|
2006-09-27 12:49:40 +04:00
|
|
|
kmem_cache_destroy(mqueue_inode_cachep);
|
2005-04-17 02:20:36 +04:00
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2014-04-08 02:39:18 +04:00
|
|
|
device_initcall(init_mqueue_fs);
|