License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* linux/fs/pipe.c
|
|
|
|
*
|
|
|
|
* Copyright (C) 1991, 1992, 1999 Linus Torvalds
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/file.h>
|
|
|
|
#include <linux/poll.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/fs.h>
|
2010-05-20 12:43:18 +04:00
|
|
|
#include <linux/log2.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/mount.h>
|
2019-03-25 19:38:23 +03:00
|
|
|
#include <linux/pseudo_fs.h>
|
2012-03-24 02:01:50 +04:00
|
|
|
#include <linux/magic.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <linux/pipe_fs_i.h>
|
|
|
|
#include <linux/uio.h>
|
|
|
|
#include <linux/highmem.h>
|
2006-03-30 17:15:30 +04:00
|
|
|
#include <linux/pagemap.h>
|
2007-02-07 09:48:00 +03:00
|
|
|
#include <linux/audit.h>
|
2008-05-07 07:42:38 +04:00
|
|
|
#include <linux/syscalls.h>
|
2010-05-19 23:03:16 +04:00
|
|
|
#include <linux/fcntl.h>
|
pipe: account to kmemcg
Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.
This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg thus
resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.
A note regarding anon_pipe_buf_steal implementation. We allow to steal
the page if its ref count equals 1. It looks racy, but it is correct
for anonymous pipe buffer pages, because:
- We lock out all other pipe users, because ->steal is called with
pipe_lock held, so the page can't be spliced to another pipe from
under us.
- The page is not on LRU and it never was.
- Thus a parallel thread can access it only by PFN. Although this is
quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
this is not dangerous, because all such functions do is increase page
ref count, check if the page is the one they are looking for, and
decrease ref count if it isn't. Since our page is clean except for
PageKmemcg mark, which doesn't conflict with other _mapcount users,
the worst that can happen is we see page_count > 2 due to a transient
ref, in which case we false-positively abort ->steal, which is still
fine, because ->steal is not guaranteed to succeed.
Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-27 01:24:33 +03:00
|
|
|
#include <linux/memcontrol.h>
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
#include <linux/watch_queue.h>
|
2022-01-22 09:13:20 +03:00
|
|
|
#include <linux/sysctl.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-12-24 22:46:01 +03:00
|
|
|
#include <linux/uaccess.h>
|
2005-04-17 02:20:36 +04:00
|
|
|
#include <asm/ioctls.h>
|
|
|
|
|
2013-03-12 17:58:10 +04:00
|
|
|
#include "internal.h"
|
|
|
|
|
pipe: increase minimum default pipe size to 2 pages
This program always prints 4096 and hangs before the patch, and always
prints 8192 and exits successfully after:
int main()
{
int pipefd[2];
for (int i = 0; i < 1025; i++)
if (pipe(pipefd) == -1)
return 1;
size_t bufsz = fcntl(pipefd[1], F_GETPIPE_SZ);
printf("%zd\n", bufsz);
char *buf = calloc(bufsz, 1);
write(pipefd[1], buf, bufsz);
read(pipefd[0], buf, bufsz-1);
write(pipefd[1], buf, 1);
}
Note that you may need to increase your RLIMIT_NOFILE before running the
program.
Fixes: 759c01142a ("pipe: limit the per-user amount of pages allocated in pipes")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/
Link: https://lore.kernel.org/lkml/1628127094.lxxn016tj7.none@localhost/
Signed-off-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-05 17:40:47 +03:00
|
|
|
/*
|
|
|
|
* New pipe buffers will be restricted to this size while the user is exceeding
|
|
|
|
* their pipe buffer quota. The general pipe use case needs at least two
|
|
|
|
* buffers: one for data yet to be read, and one for new data. If this is less
|
|
|
|
* than two, then a write to a non-empty pipe may block even if the pipe is not
|
|
|
|
* full. This can occur with GNU make jobserver or similar uses of pipes as
|
|
|
|
* semaphores: multiple processes may be waiting to write tokens back to the
|
|
|
|
* pipe before reading tokens: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/.
|
|
|
|
*
|
|
|
|
* Users can reduce their pipe buffers with F_SETPIPE_SZ below this at their
|
|
|
|
* own risk, namely: pipe writes to non-full pipes may block until the pipe is
|
|
|
|
* emptied.
|
|
|
|
*/
|
|
|
|
#define PIPE_MIN_DEF_BUFFERS 2
|
|
|
|
|
2010-05-19 23:03:16 +04:00
|
|
|
/*
|
|
|
|
* The max size that a non-root user is allowed to grow the pipe. Can
|
2010-06-03 16:54:39 +04:00
|
|
|
* be set by root in /proc/sys/fs/pipe-max-size
|
2010-05-19 23:03:16 +04:00
|
|
|
*/
|
2022-01-22 09:13:20 +03:00
|
|
|
static unsigned int pipe_max_size = 1048576;
|
2010-06-03 16:54:39 +04:00
|
|
|
|
2016-01-18 18:36:09 +03:00
|
|
|
/* Maximum allocatable pages per user. Hard limit is unset by default, soft
|
|
|
|
* matches default values.
|
|
|
|
*/
|
2022-01-22 09:13:20 +03:00
|
|
|
static unsigned long pipe_user_pages_hard;
|
|
|
|
static unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
|
2016-01-18 18:36:09 +03:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
2019-11-15 16:30:32 +03:00
|
|
|
* We use head and tail indices that aren't masked off, except at the point of
|
|
|
|
* dereference, but rather they're allowed to wrap naturally. This means there
|
|
|
|
* isn't a dead spot in the buffer, but the ring has to be a power of two and
|
|
|
|
* <= 2^31.
|
|
|
|
* -- David Howells 2019-09-23.
|
|
|
|
*
|
2005-04-17 02:20:36 +04:00
|
|
|
* Reads with count = 0 should always return 0.
|
|
|
|
* -- Julian Bradfield 1999-06-07.
|
|
|
|
*
|
|
|
|
* FIFOs and Pipes now generate SIGIO for both readers and writers.
|
|
|
|
* -- Jeremy Elson <jelson@circlemud.org> 2001-08-16
|
|
|
|
*
|
|
|
|
* pipe_read & write cleanup
|
|
|
|
* -- Manfred Spraul <manfred@colorfullife.com> 2002-05-09
|
|
|
|
*/
|
|
|
|
|
2009-04-14 21:48:41 +04:00
|
|
|
static void pipe_lock_nested(struct pipe_inode_info *pipe, int subclass)
|
|
|
|
{
|
2013-03-21 19:01:38 +04:00
|
|
|
if (pipe->files)
|
2013-03-21 10:32:24 +04:00
|
|
|
mutex_lock_nested(&pipe->mutex, subclass);
|
2009-04-14 21:48:41 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
void pipe_lock(struct pipe_inode_info *pipe)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* pipe_lock() nests non-pipe inode locks (for writing to a file)
|
|
|
|
*/
|
|
|
|
pipe_lock_nested(pipe, I_MUTEX_PARENT);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(pipe_lock);
|
|
|
|
|
|
|
|
void pipe_unlock(struct pipe_inode_info *pipe)
|
|
|
|
{
|
2013-03-21 19:01:38 +04:00
|
|
|
if (pipe->files)
|
2013-03-21 10:32:24 +04:00
|
|
|
mutex_unlock(&pipe->mutex);
|
2009-04-14 21:48:41 +04:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(pipe_unlock);
|
|
|
|
|
2013-03-21 20:24:01 +04:00
|
|
|
static inline void __pipe_lock(struct pipe_inode_info *pipe)
|
|
|
|
{
|
|
|
|
mutex_lock_nested(&pipe->mutex, I_MUTEX_PARENT);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void __pipe_unlock(struct pipe_inode_info *pipe)
|
|
|
|
{
|
|
|
|
mutex_unlock(&pipe->mutex);
|
|
|
|
}
|
|
|
|
|
2009-04-14 21:48:41 +04:00
|
|
|
void pipe_double_lock(struct pipe_inode_info *pipe1,
|
|
|
|
struct pipe_inode_info *pipe2)
|
|
|
|
{
|
|
|
|
BUG_ON(pipe1 == pipe2);
|
|
|
|
|
|
|
|
if (pipe1 < pipe2) {
|
|
|
|
pipe_lock_nested(pipe1, I_MUTEX_PARENT);
|
|
|
|
pipe_lock_nested(pipe2, I_MUTEX_CHILD);
|
|
|
|
} else {
|
2009-07-21 12:09:23 +04:00
|
|
|
pipe_lock_nested(pipe2, I_MUTEX_PARENT);
|
|
|
|
pipe_lock_nested(pipe1, I_MUTEX_CHILD);
|
2009-04-14 21:48:41 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-04-11 15:57:45 +04:00
|
|
|
static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
|
|
|
|
struct pipe_buffer *buf)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
struct page *page = buf->page;
|
|
|
|
|
2006-03-30 17:15:30 +04:00
|
|
|
/*
|
|
|
|
* If nobody else uses this page, and we don't already have a
|
|
|
|
* temporary page, let's keep track of it as a one-deep
|
2006-04-11 15:57:45 +04:00
|
|
|
* allocation cache. (Otherwise just release our reference to it)
|
2006-03-30 17:15:30 +04:00
|
|
|
*/
|
2006-04-11 15:57:45 +04:00
|
|
|
if (page_count(page) == 1 && !pipe->tmp_page)
|
2006-04-11 15:53:33 +04:00
|
|
|
pipe->tmp_page = page;
|
2006-04-11 15:57:45 +04:00
|
|
|
else
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
|
|
|
put_page(page);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2020-05-20 18:58:16 +03:00
|
|
|
static bool anon_pipe_buf_try_steal(struct pipe_inode_info *pipe,
|
|
|
|
struct pipe_buffer *buf)
|
pipe: account to kmemcg
Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.
This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg thus
resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.
A note regarding anon_pipe_buf_steal implementation. We allow to steal
the page if its ref count equals 1. It looks racy, but it is correct
for anonymous pipe buffer pages, because:
- We lock out all other pipe users, because ->steal is called with
pipe_lock held, so the page can't be spliced to another pipe from
under us.
- The page is not on LRU and it never was.
- Thus a parallel thread can access it only by PFN. Although this is
quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
this is not dangerous, because all such functions do is increase page
ref count, check if the page is the one they are looking for, and
decrease ref count if it isn't. Since our page is clean except for
PageKmemcg mark, which doesn't conflict with other _mapcount users,
the worst that can happen is we see page_count > 2 due to a transient
ref, in which case we false-positively abort ->steal, which is still
fine, because ->steal is not guaranteed to succeed.
Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-27 01:24:33 +03:00
|
|
|
{
|
|
|
|
struct page *page = buf->page;
|
|
|
|
|
2020-05-20 18:58:16 +03:00
|
|
|
if (page_count(page) != 1)
|
|
|
|
return false;
|
|
|
|
memcg_kmem_uncharge_page(page, 0);
|
|
|
|
__SetPageLocked(page);
|
|
|
|
return true;
|
pipe: account to kmemcg
Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.
This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg thus
resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.
A note regarding anon_pipe_buf_steal implementation. We allow to steal
the page if its ref count equals 1. It looks racy, but it is correct
for anonymous pipe buffer pages, because:
- We lock out all other pipe users, because ->steal is called with
pipe_lock held, so the page can't be spliced to another pipe from
under us.
- The page is not on LRU and it never was.
- Thus a parallel thread can access it only by PFN. Although this is
quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
this is not dangerous, because all such functions do is increase page
ref count, check if the page is the one they are looking for, and
decrease ref count if it isn't. Since our page is clean except for
PageKmemcg mark, which doesn't conflict with other _mapcount users,
the worst that can happen is we see page_count > 2 due to a transient
ref, in which case we false-positively abort ->steal, which is still
fine, because ->steal is not guaranteed to succeed.
Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-27 01:24:33 +03:00
|
|
|
}
|
|
|
|
|
2007-06-12 22:51:32 +04:00
|
|
|
/**
|
2020-05-20 18:58:16 +03:00
|
|
|
* generic_pipe_buf_try_steal - attempt to take ownership of a &pipe_buffer
|
2007-06-12 22:51:32 +04:00
|
|
|
* @pipe: the pipe that the buffer belongs to
|
|
|
|
* @buf: the buffer to attempt to steal
|
|
|
|
*
|
|
|
|
* Description:
|
2008-02-14 02:03:22 +03:00
|
|
|
* This function attempts to steal the &struct page attached to
|
2007-06-12 22:51:32 +04:00
|
|
|
* @buf. If successful, this function returns 0 and returns with
|
|
|
|
* the page locked. The caller may then reuse the page for whatever
|
2008-02-14 02:03:22 +03:00
|
|
|
* he wishes; the typical use is insertion into a different file
|
2007-06-12 22:51:32 +04:00
|
|
|
* page cache.
|
|
|
|
*/
|
2020-05-20 18:58:16 +03:00
|
|
|
bool generic_pipe_buf_try_steal(struct pipe_inode_info *pipe,
|
|
|
|
struct pipe_buffer *buf)
|
2006-03-30 17:16:46 +04:00
|
|
|
{
|
2006-04-30 18:36:32 +04:00
|
|
|
struct page *page = buf->page;
|
|
|
|
|
2007-06-12 22:51:32 +04:00
|
|
|
/*
|
|
|
|
* A reference of one is golden, that means that the owner of this
|
|
|
|
* page is the only one holding a reference to it. lock the page
|
|
|
|
* and return OK.
|
|
|
|
*/
|
2006-04-30 18:36:32 +04:00
|
|
|
if (page_count(page) == 1) {
|
|
|
|
lock_page(page);
|
2020-05-20 18:58:16 +03:00
|
|
|
return true;
|
2006-04-30 18:36:32 +04:00
|
|
|
}
|
2020-05-20 18:58:16 +03:00
|
|
|
return false;
|
2006-03-30 17:16:46 +04:00
|
|
|
}
|
2020-05-20 18:58:16 +03:00
|
|
|
EXPORT_SYMBOL(generic_pipe_buf_try_steal);
|
2006-03-30 17:16:46 +04:00
|
|
|
|
2007-06-12 22:51:32 +04:00
|
|
|
/**
|
2008-02-14 02:03:22 +03:00
|
|
|
* generic_pipe_buf_get - get a reference to a &struct pipe_buffer
|
2007-06-12 22:51:32 +04:00
|
|
|
* @pipe: the pipe that the buffer belongs to
|
|
|
|
* @buf: the buffer to get a reference to
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* This function grabs an extra reference to @buf. It's used in
|
2021-02-24 23:00:48 +03:00
|
|
|
* the tee() system call, when we duplicate the buffers in one
|
2007-06-12 22:51:32 +04:00
|
|
|
* pipe into another.
|
|
|
|
*/
|
2019-04-06 00:02:10 +03:00
|
|
|
bool generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
|
2006-04-11 17:51:17 +04:00
|
|
|
{
|
Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"
This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e.
That commit was completely broken, and I should have caught on to it
earlier. But happily, the kernel test robot noticed the breakage fairly
quickly.
The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".
In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.
So all the commentary about how
"try_get_page() has fallen a little behind in terms of maintenance,
try_get_compound_head() handles speculative page references more
thoroughly"
was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.
So there's no lack of maintainance - there are fundamentally different
semantics.
A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()". It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".
The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().
Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-07 21:03:45 +03:00
|
|
|
return try_get_page(buf->page);
|
2006-04-11 17:51:17 +04:00
|
|
|
}
|
2010-05-26 10:44:22 +04:00
|
|
|
EXPORT_SYMBOL(generic_pipe_buf_get);
|
2006-04-11 17:51:17 +04:00
|
|
|
|
2009-05-07 17:37:36 +04:00
|
|
|
/**
|
|
|
|
* generic_pipe_buf_release - put a reference to a &struct pipe_buffer
|
|
|
|
* @pipe: the pipe that the buffer belongs to
|
|
|
|
* @buf: the buffer to put a reference to
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* This function releases a reference to @buf.
|
|
|
|
*/
|
|
|
|
void generic_pipe_buf_release(struct pipe_inode_info *pipe,
|
|
|
|
struct pipe_buffer *buf)
|
|
|
|
{
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
|
|
|
put_page(buf->page);
|
2009-05-07 17:37:36 +04:00
|
|
|
}
|
2010-05-26 10:44:22 +04:00
|
|
|
EXPORT_SYMBOL(generic_pipe_buf_release);
|
2009-05-07 17:37:36 +04:00
|
|
|
|
2006-12-13 11:34:04 +03:00
|
|
|
static const struct pipe_buf_operations anon_pipe_buf_ops = {
|
2020-05-20 18:58:16 +03:00
|
|
|
.release = anon_pipe_buf_release,
|
|
|
|
.try_steal = anon_pipe_buf_try_steal,
|
|
|
|
.get = generic_pipe_buf_get,
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
2019-12-08 00:53:09 +03:00
|
|
|
/* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
|
|
|
|
static inline bool pipe_readable(const struct pipe_inode_info *pipe)
|
|
|
|
{
|
|
|
|
unsigned int head = READ_ONCE(pipe->head);
|
|
|
|
unsigned int tail = READ_ONCE(pipe->tail);
|
|
|
|
unsigned int writers = READ_ONCE(pipe->writers);
|
|
|
|
|
|
|
|
return !pipe_empty(head, tail) || !writers;
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
static ssize_t
|
2014-04-03 03:56:54 +04:00
|
|
|
pipe_read(struct kiocb *iocb, struct iov_iter *to)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2014-04-03 03:56:54 +04:00
|
|
|
size_t total_len = iov_iter_count(to);
|
2006-10-01 10:28:47 +04:00
|
|
|
struct file *filp = iocb->ki_filp;
|
2013-03-21 19:16:56 +04:00
|
|
|
struct pipe_inode_info *pipe = filp->private_data;
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
bool was_full, wake_next_reader = false;
|
2005-04-17 02:20:36 +04:00
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
/* Null read succeeds. */
|
|
|
|
if (unlikely(total_len == 0))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = 0;
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_lock(pipe);
|
2019-12-07 23:54:26 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We only wake up writers if the pipe was full when we started
|
|
|
|
* reading in order to avoid unnecessary wakeups.
|
|
|
|
*
|
|
|
|
* But when we do wake up writers, we do so using a sync wakeup
|
|
|
|
* (WF_SYNC), because we want them to get going and generate more
|
|
|
|
* data for us.
|
|
|
|
*/
|
|
|
|
was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
|
2005-04-17 02:20:36 +04:00
|
|
|
for (;;) {
|
2022-03-11 16:24:36 +03:00
|
|
|
/* Read ->head with a barrier vs post_one_notification() */
|
|
|
|
unsigned int head = smp_load_acquire(&pipe->head);
|
2019-11-15 16:30:32 +03:00
|
|
|
unsigned int tail = pipe->tail;
|
|
|
|
unsigned int mask = pipe->ring_size - 1;
|
|
|
|
|
2020-01-14 20:07:12 +03:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
if (pipe->note_loss) {
|
|
|
|
struct watch_notification n;
|
|
|
|
|
|
|
|
if (total_len < 8) {
|
|
|
|
if (ret == 0)
|
|
|
|
ret = -ENOBUFS;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
n.type = WATCH_TYPE_META;
|
|
|
|
n.subtype = WATCH_META_LOSS_NOTIFICATION;
|
|
|
|
n.info = watch_sizeof(n);
|
|
|
|
if (copy_to_iter(&n, sizeof(n), to) != sizeof(n)) {
|
|
|
|
if (ret == 0)
|
|
|
|
ret = -EFAULT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ret += sizeof(n);
|
|
|
|
total_len -= sizeof(n);
|
|
|
|
pipe->note_loss = false;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2019-11-15 16:30:32 +03:00
|
|
|
if (!pipe_empty(head, tail)) {
|
|
|
|
struct pipe_buffer *buf = &pipe->bufs[tail & mask];
|
2005-04-17 02:20:36 +04:00
|
|
|
size_t chars = buf->len;
|
2014-02-04 04:11:42 +04:00
|
|
|
size_t written;
|
|
|
|
int error;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2020-01-14 20:07:11 +03:00
|
|
|
if (chars > total_len) {
|
|
|
|
if (buf->flags & PIPE_BUF_FLAG_WHOLE) {
|
|
|
|
if (ret == 0)
|
|
|
|
ret = -ENOBUFS;
|
|
|
|
break;
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
chars = total_len;
|
2020-01-14 20:07:11 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-09-27 11:45:12 +03:00
|
|
|
error = pipe_buf_confirm(pipe, buf);
|
2006-05-01 21:59:03 +04:00
|
|
|
if (error) {
|
2006-03-30 17:15:30 +04:00
|
|
|
if (!ret)
|
2010-10-21 16:56:00 +04:00
|
|
|
ret = error;
|
2006-03-30 17:15:30 +04:00
|
|
|
break;
|
|
|
|
}
|
2006-05-01 21:59:03 +04:00
|
|
|
|
2014-04-03 03:56:54 +04:00
|
|
|
written = copy_page_to_iter(buf->page, buf->offset, chars, to);
|
2014-02-04 04:11:42 +04:00
|
|
|
if (unlikely(written < chars)) {
|
2006-04-11 15:57:45 +04:00
|
|
|
if (!ret)
|
2014-02-04 04:11:42 +04:00
|
|
|
ret = -EFAULT;
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
ret += chars;
|
|
|
|
buf->offset += chars;
|
|
|
|
buf->len -= chars;
|
pipes: add a "packetized pipe" mode for writing
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.
When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own. The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).
End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time. You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.
NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops. Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.
The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes). But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-30 00:12:42 +04:00
|
|
|
|
|
|
|
/* Was it a packet buffer? Clean up and exit */
|
|
|
|
if (buf->flags & PIPE_BUF_FLAG_PACKET) {
|
|
|
|
total_len = chars;
|
|
|
|
buf->len = 0;
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
if (!buf->len) {
|
2016-09-27 11:45:12 +03:00
|
|
|
pipe_buf_release(pipe, buf);
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
spin_lock_irq(&pipe->rd_wait.lock);
|
2020-01-14 20:07:12 +03:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
if (buf->flags & PIPE_BUF_FLAG_LOSS)
|
|
|
|
pipe->note_loss = true;
|
|
|
|
#endif
|
2019-11-15 16:30:32 +03:00
|
|
|
tail++;
|
|
|
|
pipe->tail = tail;
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
spin_unlock_irq(&pipe->rd_wait.lock);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
total_len -= chars;
|
|
|
|
if (!total_len)
|
|
|
|
break; /* common path: read succeeded */
|
2019-11-15 16:30:32 +03:00
|
|
|
if (!pipe_empty(head, tail)) /* More to do? */
|
|
|
|
continue;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2019-11-15 16:30:32 +03:00
|
|
|
|
2006-04-11 15:53:33 +04:00
|
|
|
if (!pipe->writers)
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
pipe: remove 'waiting_writers' merging logic
This code is ancient, and goes back to when we only had a single page
for the pipe buffers. The exact history is hidden in the mists of time
(ie "before git", and in fact predates the BK repository too).
At that long-ago point in time, it actually helped to try to merge big
back-and-forth pipe reads and writes, and not limit pipe reads to the
single pipe buffer in length just because that was all we had at a time.
However, since then we've expanded the pipe buffers to multiple pages,
and this logic really doesn't seem to make sense. And a lot of it is
somewhat questionable (ie "hmm, the user asked for a non-blocking read,
but we see that there's a writer pending, so let's wait anyway to get
the extra data that the writer will have").
But more importantly, it makes the "go to sleep" logic much less
obvious, and considering the wakeup issues we've had, I want to make for
less of those kinds of things.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-08 00:21:01 +03:00
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
if (filp->f_flags & O_NONBLOCK) {
|
|
|
|
ret = -EAGAIN;
|
|
|
|
break;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2019-12-08 00:53:09 +03:00
|
|
|
__pipe_unlock(pipe);
|
2019-12-11 22:46:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We only get here if we didn't actually read anything.
|
|
|
|
*
|
|
|
|
* However, we could have seen (and removed) a zero-sized
|
|
|
|
* pipe buffer, and might have made space in the buffers
|
|
|
|
* that way.
|
|
|
|
*
|
|
|
|
* You can't make zero-sized pipe buffers by doing an empty
|
|
|
|
* write (not even in packet mode), but they can happen if
|
|
|
|
* the writer gets an EFAULT when trying to fill a buffer
|
|
|
|
* that already got allocated and inserted in the buffer
|
|
|
|
* array.
|
|
|
|
*
|
|
|
|
* So we still need to wake up any pending writers in the
|
|
|
|
* _very_ unlikely case that the pipe was full, but we got
|
|
|
|
* no data.
|
|
|
|
*/
|
pipe: do FASYNC notifications for every pipe IO, not just state changes
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-24 20:39:25 +03:00
|
|
|
if (unlikely(was_full))
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
|
pipe: do FASYNC notifications for every pipe IO, not just state changes
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-24 20:39:25 +03:00
|
|
|
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
|
2019-12-11 22:46:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* But because we didn't read anything, at this point we can
|
|
|
|
* just return directly with -ERESTARTSYS if we're interrupted,
|
|
|
|
* since we've done any required wakeups and there's no need
|
|
|
|
* to mark anything accessed. And we've dropped the lock.
|
|
|
|
*/
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
|
2019-12-11 22:46:19 +03:00
|
|
|
return -ERESTARTSYS;
|
|
|
|
|
2019-12-08 00:53:09 +03:00
|
|
|
__pipe_lock(pipe);
|
2019-12-07 23:54:26 +03:00
|
|
|
was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
wake_next_reader = true;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
if (pipe_empty(pipe->head, pipe->tail))
|
|
|
|
wake_next_reader = false;
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_unlock(pipe);
|
2006-04-11 15:57:45 +04:00
|
|
|
|
pipe: do FASYNC notifications for every pipe IO, not just state changes
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-24 20:39:25 +03:00
|
|
|
if (was_full)
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
|
|
|
|
if (wake_next_reader)
|
|
|
|
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
|
pipe: do FASYNC notifications for every pipe IO, not just state changes
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-24 20:39:25 +03:00
|
|
|
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (ret > 0)
|
|
|
|
file_accessed(filp);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
pipes: add a "packetized pipe" mode for writing
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.
When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own. The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).
End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time. You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.
NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops. Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.
The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes). But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-30 00:12:42 +04:00
|
|
|
static inline int is_packetized(struct file *file)
|
|
|
|
{
|
|
|
|
return (file->f_flags & O_DIRECT) != 0;
|
|
|
|
}
|
|
|
|
|
2019-12-08 00:53:09 +03:00
|
|
|
/* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
|
|
|
|
static inline bool pipe_writable(const struct pipe_inode_info *pipe)
|
|
|
|
{
|
|
|
|
unsigned int head = READ_ONCE(pipe->head);
|
|
|
|
unsigned int tail = READ_ONCE(pipe->tail);
|
|
|
|
unsigned int max_usage = READ_ONCE(pipe->max_usage);
|
|
|
|
|
|
|
|
return !pipe_full(head, tail, max_usage) ||
|
|
|
|
!READ_ONCE(pipe->readers);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
static ssize_t
|
2014-04-03 23:05:18 +04:00
|
|
|
pipe_write(struct kiocb *iocb, struct iov_iter *from)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2006-10-01 10:28:47 +04:00
|
|
|
struct file *filp = iocb->ki_filp;
|
2013-03-21 19:16:56 +04:00
|
|
|
struct pipe_inode_info *pipe = filp->private_data;
|
2019-12-06 01:30:37 +03:00
|
|
|
unsigned int head;
|
2014-04-03 23:05:18 +04:00
|
|
|
ssize_t ret = 0;
|
|
|
|
size_t total_len = iov_iter_count(from);
|
2005-04-17 02:20:36 +04:00
|
|
|
ssize_t chars;
|
2019-12-07 23:14:28 +03:00
|
|
|
bool was_empty = false;
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
bool wake_next_writer = false;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/* Null write succeeds. */
|
|
|
|
if (unlikely(total_len == 0))
|
|
|
|
return 0;
|
|
|
|
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_lock(pipe);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2006-04-11 15:53:33 +04:00
|
|
|
if (!pipe->readers) {
|
2005-04-17 02:20:36 +04:00
|
|
|
send_sig(SIGPIPE, current, 0);
|
|
|
|
ret = -EPIPE;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
if (pipe->watch_queue) {
|
|
|
|
ret = -EXDEV;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2019-12-07 23:14:28 +03:00
|
|
|
/*
|
|
|
|
* If it wasn't empty we try to merge new data into
|
|
|
|
* the last buffer.
|
|
|
|
*
|
|
|
|
* That naturally merges small writes, but it also
|
pipe: make pipe writes always wake up readers
Since commit 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.
In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already. Doing
extraneous wakeups will only cause potential thundering herd problems.
However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" be to "any new write will
trigger it". Even if there was no edge in sight.
Quoting Sandeep Patil:
"The commit 1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup
logic') changed pipe write logic to wakeup readers only if the pipe
was empty at the time of write. However, there are libraries that
relied upon the older behavior for notification scheme similar to
what's described in [1]
One such library 'realm-core'[2] is used by numerous Android
applications. The library uses a similar notification mechanism as GNU
Make but it never drains the pipe until it is full. When Android moved
to v5.10 kernel, all applications using this library stopped working.
The library has since been fixed[3] but it will be a while before all
applications incorporate the updated library"
Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong. Also note the original report [4] by
Michal Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.
So add the extraneous wakeup, to approximate the old behavior.
[ I say "approximate", because the exact old behavior was to do a wakeup
not for each write(), but for each pipe buffer chunk that was filled
in. The behavior introduced by this change is not that - this is just
"every write will cause a wakeup, whether necessary or not", which
seems to be sufficient for the broken library use. ]
It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.
See commit f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".
Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
Link: https://github.com/realm/realm-core [2]
Link: https://github.com/realm/realm-core/issues/4666 [3]
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-31 01:42:34 +03:00
|
|
|
* page-aligns the rest of the writes for large writes
|
2019-12-07 23:14:28 +03:00
|
|
|
* spanning multiple pages.
|
|
|
|
*/
|
2019-11-15 16:30:32 +03:00
|
|
|
head = pipe->head;
|
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-05 20:04:43 +03:00
|
|
|
was_empty = pipe_empty(head, pipe->tail);
|
2019-12-07 23:14:28 +03:00
|
|
|
chars = total_len & (PAGE_SIZE-1);
|
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-05 20:04:43 +03:00
|
|
|
if (chars && !was_empty) {
|
2019-12-06 01:30:37 +03:00
|
|
|
unsigned int mask = pipe->ring_size - 1;
|
2019-11-15 16:30:32 +03:00
|
|
|
struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
|
2005-04-17 02:20:36 +04:00
|
|
|
int offset = buf->offset + buf->len;
|
2006-04-11 15:57:45 +04:00
|
|
|
|
2020-05-20 18:58:12 +03:00
|
|
|
if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
|
|
|
|
offset + chars <= PAGE_SIZE) {
|
2016-09-27 11:45:12 +03:00
|
|
|
ret = pipe_buf_confirm(pipe, buf);
|
2015-10-18 00:26:09 +03:00
|
|
|
if (ret)
|
2006-03-30 17:15:30 +04:00
|
|
|
goto out;
|
2006-05-01 21:59:03 +04:00
|
|
|
|
2014-04-03 23:05:18 +04:00
|
|
|
ret = copy_page_from_iter(buf->page, offset, chars, from);
|
|
|
|
if (unlikely(ret < chars)) {
|
2015-10-18 00:26:09 +03:00
|
|
|
ret = -EFAULT;
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out;
|
2006-05-01 22:02:05 +04:00
|
|
|
}
|
2019-12-07 23:14:28 +03:00
|
|
|
|
2015-10-18 00:26:09 +03:00
|
|
|
buf->len += ret;
|
2014-04-03 23:05:18 +04:00
|
|
|
if (!iov_iter_count(from))
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
for (;;) {
|
2006-04-11 15:53:33 +04:00
|
|
|
if (!pipe->readers) {
|
2005-04-17 02:20:36 +04:00
|
|
|
send_sig(SIGPIPE, current, 0);
|
2006-04-11 15:57:45 +04:00
|
|
|
if (!ret)
|
|
|
|
ret = -EPIPE;
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
}
|
2019-11-15 16:30:32 +03:00
|
|
|
|
2019-09-20 18:32:19 +03:00
|
|
|
head = pipe->head;
|
2019-12-06 01:30:37 +03:00
|
|
|
if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
|
|
|
|
unsigned int mask = pipe->ring_size - 1;
|
2019-11-15 16:30:32 +03:00
|
|
|
struct pipe_buffer *buf = &pipe->bufs[head & mask];
|
2006-04-11 15:53:33 +04:00
|
|
|
struct page *page = pipe->tmp_page;
|
2014-04-03 23:05:18 +04:00
|
|
|
int copied;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
if (!page) {
|
pipe: account to kmemcg
Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.
This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg thus
resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.
A note regarding anon_pipe_buf_steal implementation. We allow to steal
the page if its ref count equals 1. It looks racy, but it is correct
for anonymous pipe buffer pages, because:
- We lock out all other pipe users, because ->steal is called with
pipe_lock held, so the page can't be spliced to another pipe from
under us.
- The page is not on LRU and it never was.
- Thus a parallel thread can access it only by PFN. Although this is
quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
this is not dangerous, because all such functions do is increase page
ref count, check if the page is the one they are looking for, and
decrease ref count if it isn't. Since our page is clean except for
PageKmemcg mark, which doesn't conflict with other _mapcount users,
the worst that can happen is we see page_count > 2 due to a transient
ref, in which case we false-positively abort ->steal, which is still
fine, because ->steal is not guaranteed to succeed.
Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-27 01:24:33 +03:00
|
|
|
page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (unlikely(!page)) {
|
|
|
|
ret = ret ? : -ENOMEM;
|
|
|
|
break;
|
|
|
|
}
|
2006-04-11 15:53:33 +04:00
|
|
|
pipe->tmp_page = page;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2019-09-20 18:32:19 +03:00
|
|
|
|
|
|
|
/* Allocate a slot in the ring in advance and attach an
|
|
|
|
* empty buffer. If we fault or otherwise fail to use
|
|
|
|
* it, either the reader will consume it or it'll still
|
|
|
|
* be there for the next write.
|
|
|
|
*/
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
spin_lock_irq(&pipe->rd_wait.lock);
|
2019-09-20 18:32:19 +03:00
|
|
|
|
|
|
|
head = pipe->head;
|
2019-12-06 01:30:37 +03:00
|
|
|
if (pipe_full(head, pipe->tail, pipe->max_usage)) {
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
spin_unlock_irq(&pipe->rd_wait.lock);
|
2019-10-07 18:30:51 +03:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2019-09-20 18:32:19 +03:00
|
|
|
pipe->head = head + 1;
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
spin_unlock_irq(&pipe->rd_wait.lock);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/* Insert it into the buffer array */
|
2019-09-20 18:32:19 +03:00
|
|
|
buf = &pipe->bufs[head & mask];
|
2005-04-17 02:20:36 +04:00
|
|
|
buf->page = page;
|
|
|
|
buf->ops = &anon_pipe_buf_ops;
|
|
|
|
buf->offset = 0;
|
2019-09-20 18:32:19 +03:00
|
|
|
buf->len = 0;
|
2020-05-20 18:58:12 +03:00
|
|
|
if (is_packetized(filp))
|
pipes: add a "packetized pipe" mode for writing
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.
When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own. The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).
End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time. You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.
NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops. Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.
The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes). But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-30 00:12:42 +04:00
|
|
|
buf->flags = PIPE_BUF_FLAG_PACKET;
|
2020-05-20 18:58:12 +03:00
|
|
|
else
|
|
|
|
buf->flags = PIPE_BUF_FLAG_CAN_MERGE;
|
2006-04-11 15:53:33 +04:00
|
|
|
pipe->tmp_page = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2019-09-20 18:32:19 +03:00
|
|
|
copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
|
|
|
|
if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
|
|
|
|
if (!ret)
|
|
|
|
ret = -EFAULT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ret += copied;
|
|
|
|
buf->offset = 0;
|
|
|
|
buf->len = copied;
|
|
|
|
|
2014-04-03 23:05:18 +04:00
|
|
|
if (!iov_iter_count(from))
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
}
|
2019-11-15 16:30:32 +03:00
|
|
|
|
2019-12-06 01:30:37 +03:00
|
|
|
if (!pipe_full(head, pipe->tail, pipe->max_usage))
|
2005-04-17 02:20:36 +04:00
|
|
|
continue;
|
2019-11-15 16:30:32 +03:00
|
|
|
|
|
|
|
/* Wait for buffer space to become available. */
|
2005-04-17 02:20:36 +04:00
|
|
|
if (filp->f_flags & O_NONBLOCK) {
|
2006-04-11 15:57:45 +04:00
|
|
|
if (!ret)
|
|
|
|
ret = -EAGAIN;
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (signal_pending(current)) {
|
2006-04-11 15:57:45 +04:00
|
|
|
if (!ret)
|
|
|
|
ret = -ERESTARTSYS;
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
}
|
2019-12-07 23:14:28 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We're going to release the pipe lock and wait for more
|
|
|
|
* space. We wake up any readers if necessary, and then
|
|
|
|
* after waiting we need to re-check whether the pipe
|
|
|
|
* become empty while we dropped the lock.
|
|
|
|
*/
|
2019-12-08 00:53:09 +03:00
|
|
|
__pipe_unlock(pipe);
|
pipe: do FASYNC notifications for every pipe IO, not just state changes
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-24 20:39:25 +03:00
|
|
|
if (was_empty)
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
|
pipe: do FASYNC notifications for every pipe IO, not just state changes
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-24 20:39:25 +03:00
|
|
|
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
|
2019-12-08 00:53:09 +03:00
|
|
|
__pipe_lock(pipe);
|
2019-12-22 15:33:24 +03:00
|
|
|
was_empty = pipe_empty(pipe->head, pipe->tail);
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
wake_next_writer = true;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
out:
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
|
|
|
|
wake_next_writer = false;
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_unlock(pipe);
|
2019-12-07 23:14:28 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we do do a wakeup event, we do a 'sync' wakeup, because we
|
|
|
|
* want the reader to start processing things asap, rather than
|
|
|
|
* leave the data pending.
|
|
|
|
*
|
|
|
|
* This is particularly important for small writes, because of
|
|
|
|
* how (for example) the GNU make jobserver uses small writes to
|
|
|
|
* wake up pending jobs
|
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-05 20:04:43 +03:00
|
|
|
*
|
|
|
|
* Epoll nonsensically wants a wakeup whether the pipe
|
|
|
|
* was already empty or not.
|
2019-12-07 23:14:28 +03:00
|
|
|
*/
|
pipe: do FASYNC notifications for every pipe IO, not just state changes
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-24 20:39:25 +03:00
|
|
|
if (was_empty || pipe->poll_usage)
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
|
pipe: do FASYNC notifications for every pipe IO, not just state changes
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-24 20:39:25 +03:00
|
|
|
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
if (wake_next_writer)
|
|
|
|
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
|
2014-01-24 03:55:21 +04:00
|
|
|
if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
|
2012-03-26 17:59:21 +04:00
|
|
|
int err = file_update_time(filp);
|
|
|
|
if (err)
|
|
|
|
ret = err;
|
2014-01-24 03:55:21 +04:00
|
|
|
sb_end_write(file_inode(filp)->i_sb);
|
2012-03-26 17:59:21 +04:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-02-08 15:21:23 +03:00
|
|
|
static long pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2013-03-21 19:16:56 +04:00
|
|
|
struct pipe_inode_info *pipe = filp->private_data;
|
2022-03-24 02:06:20 +03:00
|
|
|
unsigned int count, head, tail, mask;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
switch (cmd) {
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
case FIONREAD:
|
|
|
|
__pipe_lock(pipe);
|
|
|
|
count = 0;
|
|
|
|
head = pipe->head;
|
|
|
|
tail = pipe->tail;
|
|
|
|
mask = pipe->ring_size - 1;
|
2019-11-15 16:30:32 +03:00
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
while (tail != head) {
|
|
|
|
count += pipe->bufs[tail & mask].len;
|
|
|
|
tail++;
|
|
|
|
}
|
|
|
|
__pipe_unlock(pipe);
|
2006-04-11 15:53:33 +04:00
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
return put_user(count, (int __user *)arg);
|
2006-04-11 15:53:33 +04:00
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
case IOC_WATCH_QUEUE_SET_SIZE: {
|
|
|
|
int ret;
|
|
|
|
__pipe_lock(pipe);
|
|
|
|
ret = watch_queue_set_size(pipe, arg);
|
|
|
|
__pipe_unlock(pipe);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
case IOC_WATCH_QUEUE_SET_FILTER:
|
|
|
|
return watch_queue_set_filter(
|
|
|
|
pipe, (struct watch_notification_filter __user *)arg);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
default:
|
|
|
|
return -ENOIOCTLCMD;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-12-31 18:42:12 +03:00
|
|
|
/* No kernel lock held - fine */
|
2018-06-28 19:43:44 +03:00
|
|
|
static __poll_t
|
|
|
|
pipe_poll(struct file *filp, poll_table *wait)
|
2017-12-31 18:42:12 +03:00
|
|
|
{
|
2018-06-28 19:43:44 +03:00
|
|
|
__poll_t mask;
|
2017-12-31 18:42:12 +03:00
|
|
|
struct pipe_inode_info *pipe = filp->private_data;
|
2019-12-07 21:41:17 +03:00
|
|
|
unsigned int head, tail;
|
2018-06-28 19:43:44 +03:00
|
|
|
|
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-05 20:04:43 +03:00
|
|
|
/* Epoll has some historical nasty semantics, this enables them */
|
2022-04-30 00:38:01 +03:00
|
|
|
WRITE_ONCE(pipe->poll_usage, true);
|
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-05 20:04:43 +03:00
|
|
|
|
2019-12-07 21:41:17 +03:00
|
|
|
/*
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
* Reading pipe state only -- no need for acquiring the semaphore.
|
2019-12-07 21:41:17 +03:00
|
|
|
*
|
|
|
|
* But because this is racy, the code has to add the
|
|
|
|
* entry to the poll table _first_ ..
|
|
|
|
*/
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
if (filp->f_mode & FMODE_READ)
|
|
|
|
poll_wait(filp, &pipe->rd_wait, wait);
|
|
|
|
if (filp->f_mode & FMODE_WRITE)
|
|
|
|
poll_wait(filp, &pipe->wr_wait, wait);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2019-12-07 21:41:17 +03:00
|
|
|
/*
|
|
|
|
* .. and only then can you do the racy tests. That way,
|
|
|
|
* if something changes and you got it wrong, the poll
|
|
|
|
* table entry will wake you up and fix it.
|
|
|
|
*/
|
|
|
|
head = READ_ONCE(pipe->head);
|
|
|
|
tail = READ_ONCE(pipe->tail);
|
|
|
|
|
2018-06-28 19:43:44 +03:00
|
|
|
mask = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
if (filp->f_mode & FMODE_READ) {
|
2019-11-15 16:30:32 +03:00
|
|
|
if (!pipe_empty(head, tail))
|
|
|
|
mask |= EPOLLIN | EPOLLRDNORM;
|
2006-04-11 15:53:33 +04:00
|
|
|
if (!pipe->writers && filp->f_version != pipe->w_counter)
|
2018-02-12 01:34:03 +03:00
|
|
|
mask |= EPOLLHUP;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (filp->f_mode & FMODE_WRITE) {
|
2019-10-16 18:47:32 +03:00
|
|
|
if (!pipe_full(head, tail, pipe->max_usage))
|
2019-11-15 16:30:32 +03:00
|
|
|
mask |= EPOLLOUT | EPOLLWRNORM;
|
2005-09-07 02:17:48 +04:00
|
|
|
/*
|
2018-02-12 01:34:03 +03:00
|
|
|
* Most Unices do not set EPOLLERR for FIFOs but on Linux they
|
2005-09-07 02:17:48 +04:00
|
|
|
* behave exactly like pipes for poll().
|
|
|
|
*/
|
2006-04-11 15:53:33 +04:00
|
|
|
if (!pipe->readers)
|
2018-02-12 01:34:03 +03:00
|
|
|
mask |= EPOLLERR;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
return mask;
|
|
|
|
}
|
|
|
|
|
vfs: fix subtle use-after-free of pipe_inode_info
The pipe code was trying (and failing) to be very careful about freeing
the pipe info only after the last access, with a pattern like:
spin_lock(&inode->i_lock);
if (!--pipe->files) {
inode->i_pipe = NULL;
kill = 1;
}
spin_unlock(&inode->i_lock);
__pipe_unlock(pipe);
if (kill)
free_pipe_info(pipe);
where the final freeing is done last.
HOWEVER. The above is actually broken, because while the freeing is
done at the end, if we have two racing processes releasing the pipe
inode info, the one that *doesn't* free it will decrement the ->files
count, and unlock the inode i_lock, but then still use the
"pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".
This is *very* hard to trigger in practice, since the race window is
very small, and adding debug options seems to just hide it by slowing
things down.
Simon originally reported this way back in July as an Oops in
kmem_cache_allocate due to a single bit corruption (due to the final
"spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
allocation that had re-used the free'd pipe-info), it's taken this long
to figure out.
Since the 'pipe->files' accesses aren't even protected by the pipe lock
(we very much use the inode lock for that), the simple solution is to
just drop the pipe lock early. And since there were two users of this
pattern, create a helper function for it.
Introduced commit ba5bb147330a ("pipe: take allocation and freeing of
pipe_inode_info out of ->i_mutex").
Reported-by: Simon Kirby <sim@hostway.ca>
Reported-by: Ian Applegate <ia@cloudflare.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # v3.10+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-02 21:44:51 +04:00
|
|
|
static void put_pipe_info(struct inode *inode, struct pipe_inode_info *pipe)
|
|
|
|
{
|
|
|
|
int kill = 0;
|
|
|
|
|
|
|
|
spin_lock(&inode->i_lock);
|
|
|
|
if (!--pipe->files) {
|
|
|
|
inode->i_pipe = NULL;
|
|
|
|
kill = 1;
|
|
|
|
}
|
|
|
|
spin_unlock(&inode->i_lock);
|
|
|
|
|
|
|
|
if (kill)
|
|
|
|
free_pipe_info(pipe);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
static int
|
2013-03-12 17:58:10 +04:00
|
|
|
pipe_release(struct inode *inode, struct file *file)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
vfs: fix subtle use-after-free of pipe_inode_info
The pipe code was trying (and failing) to be very careful about freeing
the pipe info only after the last access, with a pattern like:
spin_lock(&inode->i_lock);
if (!--pipe->files) {
inode->i_pipe = NULL;
kill = 1;
}
spin_unlock(&inode->i_lock);
__pipe_unlock(pipe);
if (kill)
free_pipe_info(pipe);
where the final freeing is done last.
HOWEVER. The above is actually broken, because while the freeing is
done at the end, if we have two racing processes releasing the pipe
inode info, the one that *doesn't* free it will decrement the ->files
count, and unlock the inode i_lock, but then still use the
"pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".
This is *very* hard to trigger in practice, since the race window is
very small, and adding debug options seems to just hide it by slowing
things down.
Simon originally reported this way back in July as an Oops in
kmem_cache_allocate due to a single bit corruption (due to the final
"spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
allocation that had re-used the free'd pipe-info), it's taken this long
to figure out.
Since the 'pipe->files' accesses aren't even protected by the pipe lock
(we very much use the inode lock for that), the simple solution is to
just drop the pipe lock early. And since there were two users of this
pattern, create a helper function for it.
Introduced commit ba5bb147330a ("pipe: take allocation and freeing of
pipe_inode_info out of ->i_mutex").
Reported-by: Simon Kirby <sim@hostway.ca>
Reported-by: Ian Applegate <ia@cloudflare.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # v3.10+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-02 21:44:51 +04:00
|
|
|
struct pipe_inode_info *pipe = file->private_data;
|
2006-04-11 15:53:33 +04:00
|
|
|
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_lock(pipe);
|
2013-03-12 17:58:10 +04:00
|
|
|
if (file->f_mode & FMODE_READ)
|
|
|
|
pipe->readers--;
|
|
|
|
if (file->f_mode & FMODE_WRITE)
|
|
|
|
pipe->writers--;
|
2006-04-11 15:57:45 +04:00
|
|
|
|
pipe: make sure to wake up everybody when the last reader/writer closes
Andrei Vagin reported that commit 0ddad21d3e99 ("pipe: use exclusive
waits when reading or writing") broke one of the CRIU tests. He even
has a trivial reproducer:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main()
{
int p[2];
pid_t p1, p2;
int status;
if (pipe(p) == -1)
return 1;
p1 = fork();
if (p1 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
p2 = fork();
if (p2 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
sleep(1);
close(p[1]);
wait(&status);
wait(&status);
return 0;
}
and the problem - once he points it out - is obvious. We use these nice
exclusive waits, but when the last writer goes away, it then needs to
wake up _every_ reader (and conversely, the last reader disappearing
needs to wake every writer, of course).
In fact, when going through this, we had several small oddities around
how to wake things. We did in fact wake every reader when we changed
the size of the pipe buffers. But that's entirely pointless, since that
just acts as a possible source of new space - no new data to read.
And when we change the size of the buffer, we don't need to wake all
writers even when we add space - that case acts just as if somebody made
space by reading, and any writer that finds itself not filling it up
entirely will wake the next one.
On the other hand, on the exit path, we tried to limit the wakeups with
the proper poll keys etc, which is entirely pointless, because at that
point we obviously need to wake up everybody. So don't do that: just
wake up everybody - but only do that if the counts changed to zero.
So fix those non-IO wakeups to be more proper: space change doesn't add
any new data, but it might make room for writers, so it wakes up a
writer. And the actual changes to reader/writer counts should wake up
everybody, since everybody is affected (ie readers will all see EOF if
the writers have gone away, and writers will all get EPIPE if all
readers have gone away).
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Reported-and-tested-by: Andrei Vagin <avagin@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-18 21:12:58 +03:00
|
|
|
/* Was that the last reader or writer, but not the other side? */
|
|
|
|
if (!pipe->readers != !pipe->writers) {
|
|
|
|
wake_up_interruptible_all(&pipe->rd_wait);
|
|
|
|
wake_up_interruptible_all(&pipe->wr_wait);
|
2006-04-11 15:53:33 +04:00
|
|
|
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
|
|
|
|
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_unlock(pipe);
|
2013-03-21 10:21:19 +04:00
|
|
|
|
vfs: fix subtle use-after-free of pipe_inode_info
The pipe code was trying (and failing) to be very careful about freeing
the pipe info only after the last access, with a pattern like:
spin_lock(&inode->i_lock);
if (!--pipe->files) {
inode->i_pipe = NULL;
kill = 1;
}
spin_unlock(&inode->i_lock);
__pipe_unlock(pipe);
if (kill)
free_pipe_info(pipe);
where the final freeing is done last.
HOWEVER. The above is actually broken, because while the freeing is
done at the end, if we have two racing processes releasing the pipe
inode info, the one that *doesn't* free it will decrement the ->files
count, and unlock the inode i_lock, but then still use the
"pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".
This is *very* hard to trigger in practice, since the race window is
very small, and adding debug options seems to just hide it by slowing
things down.
Simon originally reported this way back in July as an Oops in
kmem_cache_allocate due to a single bit corruption (due to the final
"spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
allocation that had re-used the free'd pipe-info), it's taken this long
to figure out.
Since the 'pipe->files' accesses aren't even protected by the pipe lock
(we very much use the inode lock for that), the simple solution is to
just drop the pipe lock early. And since there were two users of this
pattern, create a helper function for it.
Introduced commit ba5bb147330a ("pipe: take allocation and freeing of
pipe_inode_info out of ->i_mutex").
Reported-by: Simon Kirby <sim@hostway.ca>
Reported-by: Ian Applegate <ia@cloudflare.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # v3.10+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-02 21:44:51 +04:00
|
|
|
put_pipe_info(inode, pipe);
|
2005-04-17 02:20:36 +04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2013-03-12 17:58:10 +04:00
|
|
|
pipe_fasync(int fd, struct file *filp, int on)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2013-03-21 19:16:56 +04:00
|
|
|
struct pipe_inode_info *pipe = filp->private_data;
|
2013-03-12 17:58:10 +04:00
|
|
|
int retval = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_lock(pipe);
|
2013-03-12 17:58:10 +04:00
|
|
|
if (filp->f_mode & FMODE_READ)
|
|
|
|
retval = fasync_helper(fd, filp, on, &pipe->fasync_readers);
|
|
|
|
if ((filp->f_mode & FMODE_WRITE) && retval >= 0) {
|
2006-04-11 15:57:45 +04:00
|
|
|
retval = fasync_helper(fd, filp, on, &pipe->fasync_writers);
|
2013-03-12 17:58:10 +04:00
|
|
|
if (retval < 0 && (filp->f_mode & FMODE_READ))
|
|
|
|
/* this can happen only if on == T */
|
2009-03-13 00:31:28 +03:00
|
|
|
fasync_helper(-1, filp, 0, &pipe->fasync_readers);
|
|
|
|
}
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_unlock(pipe);
|
2009-02-02 00:52:56 +03:00
|
|
|
return retval;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
unsigned long account_pipe_buffers(struct user_struct *user,
|
|
|
|
unsigned long old, unsigned long new)
|
2016-01-18 18:36:09 +03:00
|
|
|
{
|
2016-10-11 23:53:40 +03:00
|
|
|
return atomic_long_add_return(new - old, &user->pipe_bufs);
|
2016-01-18 18:36:09 +03:00
|
|
|
}
|
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
bool too_many_pipe_buffers_soft(unsigned long user_bufs)
|
2016-01-18 18:36:09 +03:00
|
|
|
{
|
2018-02-07 02:42:08 +03:00
|
|
|
unsigned long soft_limit = READ_ONCE(pipe_user_pages_soft);
|
|
|
|
|
|
|
|
return soft_limit && user_bufs > soft_limit;
|
2016-01-18 18:36:09 +03:00
|
|
|
}
|
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
bool too_many_pipe_buffers_hard(unsigned long user_bufs)
|
2016-01-18 18:36:09 +03:00
|
|
|
{
|
2018-02-07 02:42:08 +03:00
|
|
|
unsigned long hard_limit = READ_ONCE(pipe_user_pages_hard);
|
|
|
|
|
|
|
|
return hard_limit && user_bufs > hard_limit;
|
2016-01-18 18:36:09 +03:00
|
|
|
}
|
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
bool pipe_is_unprivileged_user(void)
|
2018-02-07 02:41:53 +03:00
|
|
|
{
|
|
|
|
return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
|
|
|
|
}
|
|
|
|
|
2013-03-21 19:04:15 +04:00
|
|
|
struct pipe_inode_info *alloc_pipe_info(void)
|
2006-04-10 17:18:35 +04:00
|
|
|
{
|
2006-04-11 15:53:33 +04:00
|
|
|
struct pipe_inode_info *pipe;
|
2016-10-11 23:53:34 +03:00
|
|
|
unsigned long pipe_bufs = PIPE_DEF_BUFFERS;
|
|
|
|
struct user_struct *user = get_current_user();
|
2016-10-11 23:53:40 +03:00
|
|
|
unsigned long user_bufs;
|
2018-02-07 02:42:08 +03:00
|
|
|
unsigned int max_size = READ_ONCE(pipe_max_size);
|
2006-04-10 17:18:35 +04:00
|
|
|
|
pipe: account to kmemcg
Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.
This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg thus
resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.
A note regarding anon_pipe_buf_steal implementation. We allow to steal
the page if its ref count equals 1. It looks racy, but it is correct
for anonymous pipe buffer pages, because:
- We lock out all other pipe users, because ->steal is called with
pipe_lock held, so the page can't be spliced to another pipe from
under us.
- The page is not on LRU and it never was.
- Thus a parallel thread can access it only by PFN. Although this is
quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
this is not dangerous, because all such functions do is increase page
ref count, check if the page is the one they are looking for, and
decrease ref count if it isn't. Since our page is clean except for
PageKmemcg mark, which doesn't conflict with other _mapcount users,
the worst that can happen is we see page_count > 2 due to a transient
ref, in which case we false-positively abort ->steal, which is still
fine, because ->steal is not guaranteed to succeed.
Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-27 01:24:33 +03:00
|
|
|
pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT);
|
2016-10-11 23:53:34 +03:00
|
|
|
if (pipe == NULL)
|
|
|
|
goto out_free_uid;
|
|
|
|
|
2018-02-07 02:42:08 +03:00
|
|
|
if (pipe_bufs * PAGE_SIZE > max_size && !capable(CAP_SYS_RESOURCE))
|
|
|
|
pipe_bufs = max_size >> PAGE_SHIFT;
|
2016-10-11 23:53:43 +03:00
|
|
|
|
2016-10-11 23:53:40 +03:00
|
|
|
user_bufs = account_pipe_buffers(user, 0, pipe_bufs);
|
2016-10-11 23:53:37 +03:00
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
if (too_many_pipe_buffers_soft(user_bufs) && pipe_is_unprivileged_user()) {
|
pipe: increase minimum default pipe size to 2 pages
This program always prints 4096 and hangs before the patch, and always
prints 8192 and exits successfully after:
int main()
{
int pipefd[2];
for (int i = 0; i < 1025; i++)
if (pipe(pipefd) == -1)
return 1;
size_t bufsz = fcntl(pipefd[1], F_GETPIPE_SZ);
printf("%zd\n", bufsz);
char *buf = calloc(bufsz, 1);
write(pipefd[1], buf, bufsz);
read(pipefd[0], buf, bufsz-1);
write(pipefd[1], buf, 1);
}
Note that you may need to increase your RLIMIT_NOFILE before running the
program.
Fixes: 759c01142a ("pipe: limit the per-user amount of pages allocated in pipes")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/
Link: https://lore.kernel.org/lkml/1628127094.lxxn016tj7.none@localhost/
Signed-off-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-05 17:40:47 +03:00
|
|
|
user_bufs = account_pipe_buffers(user, pipe_bufs, PIPE_MIN_DEF_BUFFERS);
|
|
|
|
pipe_bufs = PIPE_MIN_DEF_BUFFERS;
|
2016-10-11 23:53:34 +03:00
|
|
|
}
|
2016-01-18 18:36:09 +03:00
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
if (too_many_pipe_buffers_hard(user_bufs) && pipe_is_unprivileged_user())
|
2016-10-11 23:53:37 +03:00
|
|
|
goto out_revert_acct;
|
|
|
|
|
2022-04-20 22:07:53 +03:00
|
|
|
pipe->bufs = kcalloc(pipe_bufs, sizeof(struct pipe_buffer),
|
2016-10-11 23:53:37 +03:00
|
|
|
GFP_KERNEL_ACCOUNT);
|
|
|
|
|
2016-10-11 23:53:34 +03:00
|
|
|
if (pipe->bufs) {
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
init_waitqueue_head(&pipe->rd_wait);
|
|
|
|
init_waitqueue_head(&pipe->wr_wait);
|
2016-10-11 23:53:34 +03:00
|
|
|
pipe->r_counter = pipe->w_counter = 1;
|
2019-10-16 18:47:32 +03:00
|
|
|
pipe->max_usage = pipe_bufs;
|
2019-11-15 16:30:32 +03:00
|
|
|
pipe->ring_size = pipe_bufs;
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
pipe->nr_accounted = pipe_bufs;
|
2016-10-11 23:53:34 +03:00
|
|
|
pipe->user = user;
|
|
|
|
mutex_init(&pipe->mutex);
|
|
|
|
return pipe;
|
2006-04-10 17:18:35 +04:00
|
|
|
}
|
|
|
|
|
2016-10-11 23:53:37 +03:00
|
|
|
out_revert_acct:
|
2016-10-11 23:53:40 +03:00
|
|
|
(void) account_pipe_buffers(user, pipe_bufs, 0);
|
2016-10-11 23:53:34 +03:00
|
|
|
kfree(pipe);
|
|
|
|
out_free_uid:
|
|
|
|
free_uid(user);
|
2010-05-20 12:43:18 +04:00
|
|
|
return NULL;
|
2006-04-10 17:18:35 +04:00
|
|
|
}
|
|
|
|
|
2013-03-21 19:06:46 +04:00
|
|
|
void free_pipe_info(struct pipe_inode_info *pipe)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2022-03-24 02:06:20 +03:00
|
|
|
unsigned int i;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
2022-03-11 16:23:38 +03:00
|
|
|
if (pipe->watch_queue)
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
watch_queue_clear(pipe->watch_queue);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
(void) account_pipe_buffers(pipe->user, pipe->nr_accounted, 0);
|
2016-01-18 18:36:09 +03:00
|
|
|
free_uid(pipe->user);
|
2019-11-15 16:30:32 +03:00
|
|
|
for (i = 0; i < pipe->ring_size; i++) {
|
2006-04-11 15:53:33 +04:00
|
|
|
struct pipe_buffer *buf = pipe->bufs + i;
|
2005-04-17 02:20:36 +04:00
|
|
|
if (buf->ops)
|
2016-09-27 11:45:12 +03:00
|
|
|
pipe_buf_release(pipe, buf);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2022-03-11 16:23:38 +03:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
if (pipe->watch_queue)
|
|
|
|
put_watch_queue(pipe->watch_queue);
|
|
|
|
#endif
|
2006-04-11 15:53:33 +04:00
|
|
|
if (pipe->tmp_page)
|
|
|
|
__free_page(pipe->tmp_page);
|
2022-04-20 22:07:53 +03:00
|
|
|
kfree(pipe->bufs);
|
2006-04-11 15:53:33 +04:00
|
|
|
kfree(pipe);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2006-03-26 13:37:24 +04:00
|
|
|
static struct vfsmount *pipe_mnt __read_mostly;
|
2006-04-11 15:57:45 +04:00
|
|
|
|
2007-05-08 11:26:18 +04:00
|
|
|
/*
|
|
|
|
* pipefs_dname() is called from d_path().
|
|
|
|
*/
|
|
|
|
static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
|
|
|
|
{
|
2022-01-30 23:03:49 +03:00
|
|
|
return dynamic_dname(buffer, buflen, "pipe:[%lu]",
|
2015-03-18 01:26:12 +03:00
|
|
|
d_inode(dentry)->i_ino);
|
2007-05-08 11:26:18 +04:00
|
|
|
}
|
|
|
|
|
2009-02-20 09:02:22 +03:00
|
|
|
static const struct dentry_operations pipefs_dentry_operations = {
|
2007-05-08 11:26:18 +04:00
|
|
|
.d_dname = pipefs_dname,
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
|
|
|
static struct inode * get_pipe_inode(void)
|
|
|
|
{
|
2011-07-26 13:36:34 +04:00
|
|
|
struct inode *inode = new_inode_pseudo(pipe_mnt->mnt_sb);
|
2006-04-11 15:53:33 +04:00
|
|
|
struct pipe_inode_info *pipe;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
if (!inode)
|
|
|
|
goto fail_inode;
|
|
|
|
|
2010-10-23 19:19:54 +04:00
|
|
|
inode->i_ino = get_next_ino();
|
|
|
|
|
2013-03-21 19:04:15 +04:00
|
|
|
pipe = alloc_pipe_info();
|
2006-04-11 15:53:33 +04:00
|
|
|
if (!pipe)
|
2005-04-17 02:20:36 +04:00
|
|
|
goto fail_iput;
|
2006-04-10 17:18:35 +04:00
|
|
|
|
2013-03-21 10:21:19 +04:00
|
|
|
inode->i_pipe = pipe;
|
|
|
|
pipe->files = 2;
|
2006-04-11 15:53:33 +04:00
|
|
|
pipe->readers = pipe->writers = 1;
|
2013-03-12 17:58:10 +04:00
|
|
|
inode->i_fop = &pipefifo_fops;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Mark the inode dirty from the very beginning,
|
|
|
|
* that way it will never be moved to the dirty
|
|
|
|
* list because "mark_inode_dirty()" will think
|
|
|
|
* that it already _is_ on the dirty list.
|
|
|
|
*/
|
|
|
|
inode->i_state = I_DIRTY;
|
|
|
|
inode->i_mode = S_IFIFO | S_IRUSR | S_IWUSR;
|
2008-11-14 02:39:05 +03:00
|
|
|
inode->i_uid = current_fsuid();
|
|
|
|
inode->i_gid = current_fsgid();
|
2016-09-14 17:48:04 +03:00
|
|
|
inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
|
2006-04-11 15:53:33 +04:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
return inode;
|
|
|
|
|
|
|
|
fail_iput:
|
|
|
|
iput(inode);
|
2006-04-11 15:57:45 +04:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
fail_inode:
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2012-07-21 15:33:25 +04:00
|
|
|
int create_pipe_files(struct file **res, int flags)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2012-07-21 15:33:25 +04:00
|
|
|
struct inode *inode = get_pipe_inode();
|
2006-10-01 10:29:26 +04:00
|
|
|
struct file *f;
|
2020-10-01 15:50:55 +03:00
|
|
|
int error;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
if (!inode)
|
2012-07-21 15:33:25 +04:00
|
|
|
return -ENFILE;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
if (flags & O_NOTIFICATION_PIPE) {
|
2020-10-01 15:50:55 +03:00
|
|
|
error = watch_queue_init(inode->i_pipe);
|
|
|
|
if (error) {
|
|
|
|
free_pipe_info(inode->i_pipe);
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
iput(inode);
|
2020-10-01 15:50:55 +03:00
|
|
|
return error;
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-06-09 17:05:18 +03:00
|
|
|
f = alloc_file_pseudo(inode, pipe_mnt, "",
|
|
|
|
O_WRONLY | (flags & (O_NONBLOCK | O_DIRECT)),
|
|
|
|
&pipefifo_fops);
|
fs/pipe.c: preserve alloc_file() error code
If sys_pipe() was unable to allocate a 'struct file', it always failed
with ENFILE, which means "The number of simultaneously open files in the
system would exceed a system-imposed limit." However, alloc_file()
actually returns an ERR_PTR value and might fail with other error codes.
Currently, in addition to ENFILE, it can fail with ENOMEM, potentially
when there are few open files in the system. Update sys_pipe() to
preserve this error code.
In a prior submission of a similar patch (1) some concern was raised
about introducing a new error code for sys_pipe(). However, for most
system calls, programs cannot assume that new error codes will never be
introduced. In addition, ENOMEM was, in fact, already a possible error
code for sys_pipe(), in the case where the file descriptor table could
not be expanded due to insufficient memory.
(1) http://comments.gmane.org/gmane.linux.kernel/1357942
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-10-18 00:26:08 +03:00
|
|
|
if (IS_ERR(f)) {
|
2018-06-09 17:05:18 +03:00
|
|
|
free_pipe_info(inode->i_pipe);
|
|
|
|
iput(inode);
|
|
|
|
return PTR_ERR(f);
|
fs/pipe.c: preserve alloc_file() error code
If sys_pipe() was unable to allocate a 'struct file', it always failed
with ENFILE, which means "The number of simultaneously open files in the
system would exceed a system-imposed limit." However, alloc_file()
actually returns an ERR_PTR value and might fail with other error codes.
Currently, in addition to ENFILE, it can fail with ENOMEM, potentially
when there are few open files in the system. Update sys_pipe() to
preserve this error code.
In a prior submission of a similar patch (1) some concern was raised
about introducing a new error code for sys_pipe(). However, for most
system calls, programs cannot assume that new error codes will never be
introduced. In addition, ENOMEM was, in fact, already a possible error
code for sys_pipe(), in the case where the file descriptor table could
not be expanded due to insufficient memory.
(1) http://comments.gmane.org/gmane.linux.kernel/1357942
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-10-18 00:26:08 +03:00
|
|
|
}
|
2006-04-11 15:57:45 +04:00
|
|
|
|
2013-03-21 19:16:56 +04:00
|
|
|
f->private_data = inode->i_pipe;
|
2006-10-01 10:29:26 +04:00
|
|
|
|
2018-06-17 21:15:10 +03:00
|
|
|
res[0] = alloc_file_clone(f, O_RDONLY | (flags & O_NONBLOCK),
|
|
|
|
&pipefifo_fops);
|
fs/pipe.c: preserve alloc_file() error code
If sys_pipe() was unable to allocate a 'struct file', it always failed
with ENFILE, which means "The number of simultaneously open files in the
system would exceed a system-imposed limit." However, alloc_file()
actually returns an ERR_PTR value and might fail with other error codes.
Currently, in addition to ENFILE, it can fail with ENOMEM, potentially
when there are few open files in the system. Update sys_pipe() to
preserve this error code.
In a prior submission of a similar patch (1) some concern was raised
about introducing a new error code for sys_pipe(). However, for most
system calls, programs cannot assume that new error codes will never be
introduced. In addition, ENOMEM was, in fact, already a possible error
code for sys_pipe(), in the case where the file descriptor table could
not be expanded due to insufficient memory.
(1) http://comments.gmane.org/gmane.linux.kernel/1357942
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-10-18 00:26:08 +03:00
|
|
|
if (IS_ERR(res[0])) {
|
2018-07-09 09:29:58 +03:00
|
|
|
put_pipe_info(inode, inode->i_pipe);
|
|
|
|
fput(f);
|
|
|
|
return PTR_ERR(res[0]);
|
fs/pipe.c: preserve alloc_file() error code
If sys_pipe() was unable to allocate a 'struct file', it always failed
with ENFILE, which means "The number of simultaneously open files in the
system would exceed a system-imposed limit." However, alloc_file()
actually returns an ERR_PTR value and might fail with other error codes.
Currently, in addition to ENFILE, it can fail with ENOMEM, potentially
when there are few open files in the system. Update sys_pipe() to
preserve this error code.
In a prior submission of a similar patch (1) some concern was raised
about introducing a new error code for sys_pipe(). However, for most
system calls, programs cannot assume that new error codes will never be
introduced. In addition, ENOMEM was, in fact, already a possible error
code for sys_pipe(), in the case where the file descriptor table could
not be expanded due to insufficient memory.
(1) http://comments.gmane.org/gmane.linux.kernel/1357942
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-10-18 00:26:08 +03:00
|
|
|
}
|
2013-03-21 19:16:56 +04:00
|
|
|
res[0]->private_data = inode->i_pipe;
|
2012-07-21 15:33:25 +04:00
|
|
|
res[1] = f;
|
vfs: mark pipes and sockets as stream-like file descriptors
In commit 3975b097e577 ("convert stream-like files -> stream_open, even
if they use noop_llseek") Kirill used a coccinelle script to change
"nonseekable_open()" to "stream_open()", which changed the trivial cases
of stream-like file descriptors to the new model with FMODE_STREAM.
However, the two big cases - sockets and pipes - don't actually have
that trivial pattern at all, and were thus never converted to
FMODE_STREAM even though it makes lots of sense to do so.
That's particularly true when looking forward to the next change:
getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
decide whether f_pos updates are needed or not. And if they are, we'll
always do them atomically.
This came up because KCSAN (correctly) noted that the non-locked f_pos
updates are data races: they are clearly benign for the case where we
don't care, but it would be good to just not have that issue exist at
all.
Note that the reason we used FMODE_ATOMIC_POS originally is that only
doing it for the minimal required case is "safer" in that it's possible
that the f_pos locking can cause unnecessary serialization across the
whole write() call. And in the worst case, that kind of serialization
can cause deadlock issues: think writers that need readers to empty the
state using the same file descriptor.
[ Note that the locking is per-file descriptor - because it protects
"f_pos", which is obviously per-file descriptor - so it only affects
cases where you literally use the same file descriptor to both read
and write.
So a regular pipe that has separate reading and writing file
descriptors doesn't really have this situation even though it's the
obvious case of "reader empties what a bit writer concurrently fills"
But we want to make pipes as being stream-line anyway, because we
don't want the unnecessary overhead of locking, and because a named
pipe can be (ab-)used by reading and writing to the same file
descriptor. ]
There are likely a lot of other cases that might want FMODE_STREAM, and
looking for ".llseek = no_llseek" users and other cases that don't have
an lseek file operation at all and making them use "stream_open()" might
be a good idea. But pipes and sockets are likely to be the two main
cases.
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-17 22:20:48 +03:00
|
|
|
stream_open(inode, res[0]);
|
|
|
|
stream_open(inode, res[1]);
|
2012-07-21 15:33:25 +04:00
|
|
|
return 0;
|
2006-10-01 10:29:26 +04:00
|
|
|
}
|
|
|
|
|
2012-08-19 20:17:29 +04:00
|
|
|
static int __do_pipe_flags(int *fd, struct file **files, int flags)
|
2006-10-01 10:29:26 +04:00
|
|
|
{
|
|
|
|
int error;
|
|
|
|
int fdw, fdr;
|
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_NOTIFICATION_PIPE))
|
2008-07-24 08:29:30 +04:00
|
|
|
return -EINVAL;
|
|
|
|
|
2012-07-21 15:33:25 +04:00
|
|
|
error = create_pipe_files(files, flags);
|
|
|
|
if (error)
|
|
|
|
return error;
|
2006-10-01 10:29:26 +04:00
|
|
|
|
2008-07-24 08:29:30 +04:00
|
|
|
error = get_unused_fd_flags(flags);
|
2006-10-01 10:29:26 +04:00
|
|
|
if (error < 0)
|
|
|
|
goto err_read_pipe;
|
|
|
|
fdr = error;
|
|
|
|
|
2008-07-24 08:29:30 +04:00
|
|
|
error = get_unused_fd_flags(flags);
|
2006-10-01 10:29:26 +04:00
|
|
|
if (error < 0)
|
|
|
|
goto err_fdr;
|
|
|
|
fdw = error;
|
|
|
|
|
2008-12-14 12:57:47 +03:00
|
|
|
audit_fd_pair(fdr, fdw);
|
2006-10-01 10:29:26 +04:00
|
|
|
fd[0] = fdr;
|
|
|
|
fd[1] = fdw;
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_fdr:
|
|
|
|
put_unused_fd(fdr);
|
|
|
|
err_read_pipe:
|
2012-07-21 15:33:25 +04:00
|
|
|
fput(files[0]);
|
|
|
|
fput(files[1]);
|
2006-10-01 10:29:26 +04:00
|
|
|
return error;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2012-08-19 20:17:29 +04:00
|
|
|
int do_pipe_flags(int *fd, int flags)
|
|
|
|
{
|
|
|
|
struct file *files[2];
|
|
|
|
int error = __do_pipe_flags(fd, files, flags);
|
|
|
|
if (!error) {
|
|
|
|
fd_install(fd[0], files[0]);
|
|
|
|
fd_install(fd[1], files[1]);
|
|
|
|
}
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2008-05-03 23:10:37 +04:00
|
|
|
/*
|
|
|
|
* sys_pipe() is the normal C calling standard for creating
|
|
|
|
* a pipe. It's not the way Unix traditionally does this, though.
|
|
|
|
*/
|
2018-03-11 13:34:28 +03:00
|
|
|
static int do_pipe2(int __user *fildes, int flags)
|
2008-05-03 23:10:37 +04:00
|
|
|
{
|
2012-08-19 20:17:29 +04:00
|
|
|
struct file *files[2];
|
2008-05-03 23:10:37 +04:00
|
|
|
int fd[2];
|
|
|
|
int error;
|
|
|
|
|
2012-08-19 20:17:29 +04:00
|
|
|
error = __do_pipe_flags(fd, files, flags);
|
2008-05-03 23:10:37 +04:00
|
|
|
if (!error) {
|
2012-08-19 20:17:29 +04:00
|
|
|
if (unlikely(copy_to_user(fildes, fd, sizeof(fd)))) {
|
|
|
|
fput(files[0]);
|
|
|
|
fput(files[1]);
|
|
|
|
put_unused_fd(fd[0]);
|
|
|
|
put_unused_fd(fd[1]);
|
2008-05-03 23:10:37 +04:00
|
|
|
error = -EFAULT;
|
2012-08-19 20:17:29 +04:00
|
|
|
} else {
|
|
|
|
fd_install(fd[0], files[0]);
|
|
|
|
fd_install(fd[1], files[1]);
|
2008-05-07 07:42:38 +04:00
|
|
|
}
|
2008-05-03 23:10:37 +04:00
|
|
|
}
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2018-03-11 13:34:28 +03:00
|
|
|
SYSCALL_DEFINE2(pipe2, int __user *, fildes, int, flags)
|
|
|
|
{
|
|
|
|
return do_pipe2(fildes, flags);
|
|
|
|
}
|
|
|
|
|
2009-01-14 16:14:35 +03:00
|
|
|
SYSCALL_DEFINE1(pipe, int __user *, fildes)
|
2008-07-24 08:29:30 +04:00
|
|
|
{
|
2018-03-11 13:34:28 +03:00
|
|
|
return do_pipe2(fildes, 0);
|
2008-07-24 08:29:30 +04:00
|
|
|
}
|
|
|
|
|
pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock. So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.
Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.
However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.
It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:
"I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null. This works by creating a pipe, the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.
The box that was hung had the write side stuck here [pipe_write] and
the read side stuck here [splice_from_pipe_next -> pipe_wait].
[ more details about pipe_wait() scenario ]
The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes. In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule"
Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't. And the whole "wait for any pipe state change" model really
isn't very good to begin with.
So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-02 05:14:36 +03:00
|
|
|
/*
|
|
|
|
* This is the stupid "wait for pipe to be readable or writable"
|
|
|
|
* model.
|
|
|
|
*
|
|
|
|
* See pipe_read/write() for the proper kind of exclusive wait,
|
|
|
|
* but that requires that we wake up any other readers/writers
|
|
|
|
* if we then do not end up reading everything (ie the whole
|
|
|
|
* "wake_next_reader/writer" logic in pipe_read/write()).
|
|
|
|
*/
|
|
|
|
void pipe_wait_readable(struct pipe_inode_info *pipe)
|
|
|
|
{
|
|
|
|
pipe_unlock(pipe);
|
|
|
|
wait_event_interruptible(pipe->rd_wait, pipe_readable(pipe));
|
|
|
|
pipe_lock(pipe);
|
|
|
|
}
|
|
|
|
|
|
|
|
void pipe_wait_writable(struct pipe_inode_info *pipe)
|
|
|
|
{
|
|
|
|
pipe_unlock(pipe);
|
|
|
|
wait_event_interruptible(pipe->wr_wait, pipe_writable(pipe));
|
|
|
|
pipe_lock(pipe);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This depends on both the wait (here) and the wakeup (wake_up_partner)
|
|
|
|
* holding the pipe lock, so "*cnt" is stable and we know a wakeup cannot
|
|
|
|
* race with the count check and waitqueue prep.
|
|
|
|
*
|
|
|
|
* Normally in order to avoid races, you'd do the prepare_to_wait() first,
|
|
|
|
* then check the condition you're waiting for, and only then sleep. But
|
|
|
|
* because of the pipe lock, we can check the condition before being on
|
|
|
|
* the wait queue.
|
|
|
|
*
|
|
|
|
* We use the 'rd_wait' waitqueue for pipe partner waiting.
|
|
|
|
*/
|
2013-03-21 10:07:59 +04:00
|
|
|
static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
|
2013-03-12 17:46:27 +04:00
|
|
|
{
|
pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock. So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.
Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.
However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.
It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:
"I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null. This works by creating a pipe, the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.
The box that was hung had the write side stuck here [pipe_write] and
the read side stuck here [splice_from_pipe_next -> pipe_wait].
[ more details about pipe_wait() scenario ]
The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes. In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule"
Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't. And the whole "wait for any pipe state change" model really
isn't very good to begin with.
So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-02 05:14:36 +03:00
|
|
|
DEFINE_WAIT(rdwait);
|
2019-11-15 16:30:32 +03:00
|
|
|
int cur = *cnt;
|
2013-03-12 17:46:27 +04:00
|
|
|
|
|
|
|
while (cur == *cnt) {
|
pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock. So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.
Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.
However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.
It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:
"I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null. This works by creating a pipe, the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.
The box that was hung had the write side stuck here [pipe_write] and
the read side stuck here [splice_from_pipe_next -> pipe_wait].
[ more details about pipe_wait() scenario ]
The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes. In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule"
Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't. And the whole "wait for any pipe state change" model really
isn't very good to begin with.
So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-02 05:14:36 +03:00
|
|
|
prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
|
|
|
|
pipe_unlock(pipe);
|
|
|
|
schedule();
|
|
|
|
finish_wait(&pipe->rd_wait, &rdwait);
|
|
|
|
pipe_lock(pipe);
|
2013-03-12 17:46:27 +04:00
|
|
|
if (signal_pending(current))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return cur == *cnt ? -ERESTARTSYS : 0;
|
|
|
|
}
|
|
|
|
|
2013-03-21 10:07:59 +04:00
|
|
|
static void wake_up_partner(struct pipe_inode_info *pipe)
|
2013-03-12 17:46:27 +04:00
|
|
|
{
|
pipe: make sure to wake up everybody when the last reader/writer closes
Andrei Vagin reported that commit 0ddad21d3e99 ("pipe: use exclusive
waits when reading or writing") broke one of the CRIU tests. He even
has a trivial reproducer:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main()
{
int p[2];
pid_t p1, p2;
int status;
if (pipe(p) == -1)
return 1;
p1 = fork();
if (p1 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
p2 = fork();
if (p2 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
sleep(1);
close(p[1]);
wait(&status);
wait(&status);
return 0;
}
and the problem - once he points it out - is obvious. We use these nice
exclusive waits, but when the last writer goes away, it then needs to
wake up _every_ reader (and conversely, the last reader disappearing
needs to wake every writer, of course).
In fact, when going through this, we had several small oddities around
how to wake things. We did in fact wake every reader when we changed
the size of the pipe buffers. But that's entirely pointless, since that
just acts as a possible source of new space - no new data to read.
And when we change the size of the buffer, we don't need to wake all
writers even when we add space - that case acts just as if somebody made
space by reading, and any writer that finds itself not filling it up
entirely will wake the next one.
On the other hand, on the exit path, we tried to limit the wakeups with
the proper poll keys etc, which is entirely pointless, because at that
point we obviously need to wake up everybody. So don't do that: just
wake up everybody - but only do that if the counts changed to zero.
So fix those non-IO wakeups to be more proper: space change doesn't add
any new data, but it might make room for writers, so it wakes up a
writer. And the actual changes to reader/writer counts should wake up
everybody, since everybody is affected (ie readers will all see EOF if
the writers have gone away, and writers will all get EPIPE if all
readers have gone away).
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Reported-and-tested-by: Andrei Vagin <avagin@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-18 21:12:58 +03:00
|
|
|
wake_up_interruptible_all(&pipe->rd_wait);
|
2013-03-12 17:46:27 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static int fifo_open(struct inode *inode, struct file *filp)
|
|
|
|
{
|
|
|
|
struct pipe_inode_info *pipe;
|
2013-03-12 17:58:10 +04:00
|
|
|
bool is_pipe = inode->i_sb->s_magic == PIPEFS_MAGIC;
|
2013-03-12 17:46:27 +04:00
|
|
|
int ret;
|
|
|
|
|
2013-03-21 10:21:19 +04:00
|
|
|
filp->f_version = 0;
|
|
|
|
|
|
|
|
spin_lock(&inode->i_lock);
|
|
|
|
if (inode->i_pipe) {
|
|
|
|
pipe = inode->i_pipe;
|
|
|
|
pipe->files++;
|
|
|
|
spin_unlock(&inode->i_lock);
|
|
|
|
} else {
|
|
|
|
spin_unlock(&inode->i_lock);
|
2013-03-21 19:04:15 +04:00
|
|
|
pipe = alloc_pipe_info();
|
2013-03-12 17:46:27 +04:00
|
|
|
if (!pipe)
|
2013-03-21 10:21:19 +04:00
|
|
|
return -ENOMEM;
|
|
|
|
pipe->files = 1;
|
|
|
|
spin_lock(&inode->i_lock);
|
|
|
|
if (unlikely(inode->i_pipe)) {
|
|
|
|
inode->i_pipe->files++;
|
|
|
|
spin_unlock(&inode->i_lock);
|
2013-03-21 19:06:46 +04:00
|
|
|
free_pipe_info(pipe);
|
2013-03-21 10:21:19 +04:00
|
|
|
pipe = inode->i_pipe;
|
|
|
|
} else {
|
|
|
|
inode->i_pipe = pipe;
|
|
|
|
spin_unlock(&inode->i_lock);
|
|
|
|
}
|
2013-03-12 17:46:27 +04:00
|
|
|
}
|
2013-03-21 19:16:56 +04:00
|
|
|
filp->private_data = pipe;
|
2013-03-21 10:21:19 +04:00
|
|
|
/* OK, we have a pipe and it's pinned down */
|
|
|
|
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_lock(pipe);
|
2013-03-12 17:46:27 +04:00
|
|
|
|
|
|
|
/* We can only do regular read/write on fifos */
|
vfs: mark pipes and sockets as stream-like file descriptors
In commit 3975b097e577 ("convert stream-like files -> stream_open, even
if they use noop_llseek") Kirill used a coccinelle script to change
"nonseekable_open()" to "stream_open()", which changed the trivial cases
of stream-like file descriptors to the new model with FMODE_STREAM.
However, the two big cases - sockets and pipes - don't actually have
that trivial pattern at all, and were thus never converted to
FMODE_STREAM even though it makes lots of sense to do so.
That's particularly true when looking forward to the next change:
getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
decide whether f_pos updates are needed or not. And if they are, we'll
always do them atomically.
This came up because KCSAN (correctly) noted that the non-locked f_pos
updates are data races: they are clearly benign for the case where we
don't care, but it would be good to just not have that issue exist at
all.
Note that the reason we used FMODE_ATOMIC_POS originally is that only
doing it for the minimal required case is "safer" in that it's possible
that the f_pos locking can cause unnecessary serialization across the
whole write() call. And in the worst case, that kind of serialization
can cause deadlock issues: think writers that need readers to empty the
state using the same file descriptor.
[ Note that the locking is per-file descriptor - because it protects
"f_pos", which is obviously per-file descriptor - so it only affects
cases where you literally use the same file descriptor to both read
and write.
So a regular pipe that has separate reading and writing file
descriptors doesn't really have this situation even though it's the
obvious case of "reader empties what a bit writer concurrently fills"
But we want to make pipes as being stream-line anyway, because we
don't want the unnecessary overhead of locking, and because a named
pipe can be (ab-)used by reading and writing to the same file
descriptor. ]
There are likely a lot of other cases that might want FMODE_STREAM, and
looking for ".llseek = no_llseek" users and other cases that don't have
an lseek file operation at all and making them use "stream_open()" might
be a good idea. But pipes and sockets are likely to be the two main
cases.
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-17 22:20:48 +03:00
|
|
|
stream_open(inode, filp);
|
2013-03-12 17:46:27 +04:00
|
|
|
|
vfs: mark pipes and sockets as stream-like file descriptors
In commit 3975b097e577 ("convert stream-like files -> stream_open, even
if they use noop_llseek") Kirill used a coccinelle script to change
"nonseekable_open()" to "stream_open()", which changed the trivial cases
of stream-like file descriptors to the new model with FMODE_STREAM.
However, the two big cases - sockets and pipes - don't actually have
that trivial pattern at all, and were thus never converted to
FMODE_STREAM even though it makes lots of sense to do so.
That's particularly true when looking forward to the next change:
getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
decide whether f_pos updates are needed or not. And if they are, we'll
always do them atomically.
This came up because KCSAN (correctly) noted that the non-locked f_pos
updates are data races: they are clearly benign for the case where we
don't care, but it would be good to just not have that issue exist at
all.
Note that the reason we used FMODE_ATOMIC_POS originally is that only
doing it for the minimal required case is "safer" in that it's possible
that the f_pos locking can cause unnecessary serialization across the
whole write() call. And in the worst case, that kind of serialization
can cause deadlock issues: think writers that need readers to empty the
state using the same file descriptor.
[ Note that the locking is per-file descriptor - because it protects
"f_pos", which is obviously per-file descriptor - so it only affects
cases where you literally use the same file descriptor to both read
and write.
So a regular pipe that has separate reading and writing file
descriptors doesn't really have this situation even though it's the
obvious case of "reader empties what a bit writer concurrently fills"
But we want to make pipes as being stream-line anyway, because we
don't want the unnecessary overhead of locking, and because a named
pipe can be (ab-)used by reading and writing to the same file
descriptor. ]
There are likely a lot of other cases that might want FMODE_STREAM, and
looking for ".llseek = no_llseek" users and other cases that don't have
an lseek file operation at all and making them use "stream_open()" might
be a good idea. But pipes and sockets are likely to be the two main
cases.
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-17 22:20:48 +03:00
|
|
|
switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
|
2013-03-12 17:46:27 +04:00
|
|
|
case FMODE_READ:
|
|
|
|
/*
|
|
|
|
* O_RDONLY
|
|
|
|
* POSIX.1 says that O_NONBLOCK means return with the FIFO
|
|
|
|
* opened, even when there is no process writing the FIFO.
|
|
|
|
*/
|
|
|
|
pipe->r_counter++;
|
|
|
|
if (pipe->readers++ == 0)
|
2013-03-21 10:07:59 +04:00
|
|
|
wake_up_partner(pipe);
|
2013-03-12 17:46:27 +04:00
|
|
|
|
2013-03-12 17:58:10 +04:00
|
|
|
if (!is_pipe && !pipe->writers) {
|
2013-03-12 17:46:27 +04:00
|
|
|
if ((filp->f_flags & O_NONBLOCK)) {
|
2018-02-12 01:34:03 +03:00
|
|
|
/* suppress EPOLLHUP until we have
|
2013-03-12 17:46:27 +04:00
|
|
|
* seen a writer */
|
|
|
|
filp->f_version = pipe->w_counter;
|
|
|
|
} else {
|
2013-03-21 10:07:59 +04:00
|
|
|
if (wait_for_partner(pipe, &pipe->w_counter))
|
2013-03-12 17:46:27 +04:00
|
|
|
goto err_rd;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
break;
|
2019-11-15 16:30:32 +03:00
|
|
|
|
2013-03-12 17:46:27 +04:00
|
|
|
case FMODE_WRITE:
|
|
|
|
/*
|
|
|
|
* O_WRONLY
|
|
|
|
* POSIX.1 says that O_NONBLOCK means return -1 with
|
|
|
|
* errno=ENXIO when there is no process reading the FIFO.
|
|
|
|
*/
|
|
|
|
ret = -ENXIO;
|
2013-03-12 17:58:10 +04:00
|
|
|
if (!is_pipe && (filp->f_flags & O_NONBLOCK) && !pipe->readers)
|
2013-03-12 17:46:27 +04:00
|
|
|
goto err;
|
|
|
|
|
|
|
|
pipe->w_counter++;
|
|
|
|
if (!pipe->writers++)
|
2013-03-21 10:07:59 +04:00
|
|
|
wake_up_partner(pipe);
|
2013-03-12 17:46:27 +04:00
|
|
|
|
2013-03-12 17:58:10 +04:00
|
|
|
if (!is_pipe && !pipe->readers) {
|
2013-03-21 10:07:59 +04:00
|
|
|
if (wait_for_partner(pipe, &pipe->r_counter))
|
2013-03-12 17:46:27 +04:00
|
|
|
goto err_wr;
|
|
|
|
}
|
|
|
|
break;
|
2019-11-15 16:30:32 +03:00
|
|
|
|
2013-03-12 17:46:27 +04:00
|
|
|
case FMODE_READ | FMODE_WRITE:
|
|
|
|
/*
|
|
|
|
* O_RDWR
|
|
|
|
* POSIX.1 leaves this case "undefined" when O_NONBLOCK is set.
|
|
|
|
* This implementation will NEVER block on a O_RDWR open, since
|
|
|
|
* the process can at least talk to itself.
|
|
|
|
*/
|
|
|
|
|
|
|
|
pipe->readers++;
|
|
|
|
pipe->writers++;
|
|
|
|
pipe->r_counter++;
|
|
|
|
pipe->w_counter++;
|
|
|
|
if (pipe->readers == 1 || pipe->writers == 1)
|
2013-03-21 10:07:59 +04:00
|
|
|
wake_up_partner(pipe);
|
2013-03-12 17:46:27 +04:00
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Ok! */
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_unlock(pipe);
|
2013-03-12 17:46:27 +04:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_rd:
|
|
|
|
if (!--pipe->readers)
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-09 20:48:27 +03:00
|
|
|
wake_up_interruptible(&pipe->wr_wait);
|
2013-03-12 17:46:27 +04:00
|
|
|
ret = -ERESTARTSYS;
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
err_wr:
|
|
|
|
if (!--pipe->writers)
|
pipe: make sure to wake up everybody when the last reader/writer closes
Andrei Vagin reported that commit 0ddad21d3e99 ("pipe: use exclusive
waits when reading or writing") broke one of the CRIU tests. He even
has a trivial reproducer:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main()
{
int p[2];
pid_t p1, p2;
int status;
if (pipe(p) == -1)
return 1;
p1 = fork();
if (p1 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
p2 = fork();
if (p2 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
sleep(1);
close(p[1]);
wait(&status);
wait(&status);
return 0;
}
and the problem - once he points it out - is obvious. We use these nice
exclusive waits, but when the last writer goes away, it then needs to
wake up _every_ reader (and conversely, the last reader disappearing
needs to wake every writer, of course).
In fact, when going through this, we had several small oddities around
how to wake things. We did in fact wake every reader when we changed
the size of the pipe buffers. But that's entirely pointless, since that
just acts as a possible source of new space - no new data to read.
And when we change the size of the buffer, we don't need to wake all
writers even when we add space - that case acts just as if somebody made
space by reading, and any writer that finds itself not filling it up
entirely will wake the next one.
On the other hand, on the exit path, we tried to limit the wakeups with
the proper poll keys etc, which is entirely pointless, because at that
point we obviously need to wake up everybody. So don't do that: just
wake up everybody - but only do that if the counts changed to zero.
So fix those non-IO wakeups to be more proper: space change doesn't add
any new data, but it might make room for writers, so it wakes up a
writer. And the actual changes to reader/writer counts should wake up
everybody, since everybody is affected (ie readers will all see EOF if
the writers have gone away, and writers will all get EPIPE if all
readers have gone away).
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Reported-and-tested-by: Andrei Vagin <avagin@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-18 21:12:58 +03:00
|
|
|
wake_up_interruptible_all(&pipe->rd_wait);
|
2013-03-12 17:46:27 +04:00
|
|
|
ret = -ERESTARTSYS;
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
err:
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_unlock(pipe);
|
vfs: fix subtle use-after-free of pipe_inode_info
The pipe code was trying (and failing) to be very careful about freeing
the pipe info only after the last access, with a pattern like:
spin_lock(&inode->i_lock);
if (!--pipe->files) {
inode->i_pipe = NULL;
kill = 1;
}
spin_unlock(&inode->i_lock);
__pipe_unlock(pipe);
if (kill)
free_pipe_info(pipe);
where the final freeing is done last.
HOWEVER. The above is actually broken, because while the freeing is
done at the end, if we have two racing processes releasing the pipe
inode info, the one that *doesn't* free it will decrement the ->files
count, and unlock the inode i_lock, but then still use the
"pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".
This is *very* hard to trigger in practice, since the race window is
very small, and adding debug options seems to just hide it by slowing
things down.
Simon originally reported this way back in July as an Oops in
kmem_cache_allocate due to a single bit corruption (due to the final
"spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
allocation that had re-used the free'd pipe-info), it's taken this long
to figure out.
Since the 'pipe->files' accesses aren't even protected by the pipe lock
(we very much use the inode lock for that), the simple solution is to
just drop the pipe lock early. And since there were two users of this
pattern, create a helper function for it.
Introduced commit ba5bb147330a ("pipe: take allocation and freeing of
pipe_inode_info out of ->i_mutex").
Reported-by: Simon Kirby <sim@hostway.ca>
Reported-by: Ian Applegate <ia@cloudflare.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # v3.10+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-02 21:44:51 +04:00
|
|
|
|
|
|
|
put_pipe_info(inode, pipe);
|
2013-03-12 17:46:27 +04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-03-12 17:58:10 +04:00
|
|
|
const struct file_operations pipefifo_fops = {
|
|
|
|
.open = fifo_open,
|
|
|
|
.llseek = no_llseek,
|
2014-04-03 03:56:54 +04:00
|
|
|
.read_iter = pipe_read,
|
2014-04-03 23:05:18 +04:00
|
|
|
.write_iter = pipe_write,
|
2018-06-28 19:43:44 +03:00
|
|
|
.poll = pipe_poll,
|
2013-03-12 17:58:10 +04:00
|
|
|
.unlocked_ioctl = pipe_ioctl,
|
|
|
|
.release = pipe_release,
|
|
|
|
.fasync = pipe_fasync,
|
2021-01-25 12:16:15 +03:00
|
|
|
.splice_write = iter_file_splice_write,
|
2013-03-12 17:46:27 +04:00
|
|
|
};
|
|
|
|
|
pipe: relocate round_pipe_size() above pipe_set_size()
Patch series "pipe: fix limit handling", v2.
When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
users are exceeding limits on memory consumption.
While documenting and testing the operation of these limits I noticed
that, as currently implemented, these checks have a number of problems:
(1) When increasing the pipe capacity, the checks against the limits
in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
existing consumption, and exclude the memory required for the
increased pipe capacity. The new increase in pipe capacity can then
push the total memory used by the user for pipes (possibly far) over
a limit. This can also trigger the problem described next.
(2) The limit checks are performed even when the new pipe capacity
is less than the existing pipe capacity. This can lead to problems
if a user sets a large pipe capacity, and then the limits are
lowered, with the result that the user will no longer be able to
decrease the pipe capacity.
(3) As currently implemented, accounting and checking against the
limits is done as follows:
(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.
This is racey. Multiple processes may pass point (a) simultaneously,
and then allocate pipe buffers that are accounted for only in step
(c). The race means that the user's pipe buffer allocation could be
pushed over the limit (by an arbitrary amount, depending on how
unlucky we were in the race). [Thanks to Vegard Nossum for spotting
this point, which I had missed.]
This patch series addresses these three problems.
This patch (of 8):
This is a minor preparatory patch. After subsequent patches,
round_pipe_size() will be called from pipe_set_size(), so place
round_pipe_size() above pipe_set_size().
Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 23:53:22 +03:00
|
|
|
/*
|
|
|
|
* Currently we rely on the pipe array holding a power-of-2 number
|
2017-11-18 02:29:21 +03:00
|
|
|
* of pages. Returns 0 on error.
|
pipe: relocate round_pipe_size() above pipe_set_size()
Patch series "pipe: fix limit handling", v2.
When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
users are exceeding limits on memory consumption.
While documenting and testing the operation of these limits I noticed
that, as currently implemented, these checks have a number of problems:
(1) When increasing the pipe capacity, the checks against the limits
in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
existing consumption, and exclude the memory required for the
increased pipe capacity. The new increase in pipe capacity can then
push the total memory used by the user for pipes (possibly far) over
a limit. This can also trigger the problem described next.
(2) The limit checks are performed even when the new pipe capacity
is less than the existing pipe capacity. This can lead to problems
if a user sets a large pipe capacity, and then the limits are
lowered, with the result that the user will no longer be able to
decrease the pipe capacity.
(3) As currently implemented, accounting and checking against the
limits is done as follows:
(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.
This is racey. Multiple processes may pass point (a) simultaneously,
and then allocate pipe buffers that are accounted for only in step
(c). The race means that the user's pipe buffer allocation could be
pushed over the limit (by an arbitrary amount, depending on how
unlucky we were in the race). [Thanks to Vegard Nossum for spotting
this point, which I had missed.]
This patch series addresses these three problems.
This patch (of 8):
This is a minor preparatory patch. After subsequent patches,
round_pipe_size() will be called from pipe_set_size(), so place
round_pipe_size() above pipe_set_size().
Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 23:53:22 +03:00
|
|
|
*/
|
2018-02-07 02:42:00 +03:00
|
|
|
unsigned int round_pipe_size(unsigned long size)
|
pipe: relocate round_pipe_size() above pipe_set_size()
Patch series "pipe: fix limit handling", v2.
When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
users are exceeding limits on memory consumption.
While documenting and testing the operation of these limits I noticed
that, as currently implemented, these checks have a number of problems:
(1) When increasing the pipe capacity, the checks against the limits
in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
existing consumption, and exclude the memory required for the
increased pipe capacity. The new increase in pipe capacity can then
push the total memory used by the user for pipes (possibly far) over
a limit. This can also trigger the problem described next.
(2) The limit checks are performed even when the new pipe capacity
is less than the existing pipe capacity. This can lead to problems
if a user sets a large pipe capacity, and then the limits are
lowered, with the result that the user will no longer be able to
decrease the pipe capacity.
(3) As currently implemented, accounting and checking against the
limits is done as follows:
(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.
This is racey. Multiple processes may pass point (a) simultaneously,
and then allocate pipe buffers that are accounted for only in step
(c). The race means that the user's pipe buffer allocation could be
pushed over the limit (by an arbitrary amount, depending on how
unlucky we were in the race). [Thanks to Vegard Nossum for spotting
this point, which I had missed.]
This patch series addresses these three problems.
This patch (of 8):
This is a minor preparatory patch. After subsequent patches,
round_pipe_size() will be called from pipe_set_size(), so place
round_pipe_size() above pipe_set_size().
Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 23:53:22 +03:00
|
|
|
{
|
2018-02-07 02:42:05 +03:00
|
|
|
if (size > (1U << 31))
|
2018-02-07 02:42:00 +03:00
|
|
|
return 0;
|
|
|
|
|
2018-02-07 02:41:45 +03:00
|
|
|
/* Minimum pipe size, as required by POSIX */
|
|
|
|
if (size < PAGE_SIZE)
|
2018-02-07 02:42:05 +03:00
|
|
|
return PAGE_SIZE;
|
2017-11-18 02:29:21 +03:00
|
|
|
|
2018-02-07 02:42:05 +03:00
|
|
|
return roundup_pow_of_two(size);
|
pipe: relocate round_pipe_size() above pipe_set_size()
Patch series "pipe: fix limit handling", v2.
When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
users are exceeding limits on memory consumption.
While documenting and testing the operation of these limits I noticed
that, as currently implemented, these checks have a number of problems:
(1) When increasing the pipe capacity, the checks against the limits
in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
existing consumption, and exclude the memory required for the
increased pipe capacity. The new increase in pipe capacity can then
push the total memory used by the user for pipes (possibly far) over
a limit. This can also trigger the problem described next.
(2) The limit checks are performed even when the new pipe capacity
is less than the existing pipe capacity. This can lead to problems
if a user sets a large pipe capacity, and then the limits are
lowered, with the result that the user will no longer be able to
decrease the pipe capacity.
(3) As currently implemented, accounting and checking against the
limits is done as follows:
(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.
This is racey. Multiple processes may pass point (a) simultaneously,
and then allocate pipe buffers that are accounted for only in step
(c). The race means that the user's pipe buffer allocation could be
pushed over the limit (by an arbitrary amount, depending on how
unlucky we were in the race). [Thanks to Vegard Nossum for spotting
this point, which I had missed.]
This patch series addresses these three problems.
This patch (of 8):
This is a minor preparatory patch. After subsequent patches,
round_pipe_size() will be called from pipe_set_size(), so place
round_pipe_size() above pipe_set_size().
Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 23:53:22 +03:00
|
|
|
}
|
|
|
|
|
2010-05-20 12:43:18 +04:00
|
|
|
/*
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
* Resize the pipe ring to a number of slots.
|
2022-05-26 09:34:52 +03:00
|
|
|
*
|
|
|
|
* Note the pipe can be reduced in capacity, but only if the current
|
|
|
|
* occupancy doesn't exceed nr_slots; if it does, EBUSY will be
|
|
|
|
* returned instead.
|
2010-05-20 12:43:18 +04:00
|
|
|
*/
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots)
|
2010-05-20 12:43:18 +04:00
|
|
|
{
|
|
|
|
struct pipe_buffer *bufs;
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
unsigned int head, tail, mask, n;
|
2010-05-20 12:43:18 +04:00
|
|
|
|
2022-04-20 22:07:53 +03:00
|
|
|
bufs = kcalloc(nr_slots, sizeof(*bufs),
|
|
|
|
GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
if (unlikely(!bufs))
|
|
|
|
return -ENOMEM;
|
2010-05-20 12:43:18 +04:00
|
|
|
|
2022-05-26 09:34:52 +03:00
|
|
|
spin_lock_irq(&pipe->rd_wait.lock);
|
|
|
|
mask = pipe->ring_size - 1;
|
|
|
|
head = pipe->head;
|
|
|
|
tail = pipe->tail;
|
|
|
|
|
|
|
|
n = pipe_occupancy(head, tail);
|
|
|
|
if (nr_slots < n) {
|
|
|
|
spin_unlock_irq(&pipe->rd_wait.lock);
|
|
|
|
kfree(bufs);
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
2010-05-20 12:43:18 +04:00
|
|
|
/*
|
|
|
|
* The pipe array wraps around, so just start the new one at zero
|
2019-11-15 16:30:32 +03:00
|
|
|
* and adjust the indices.
|
2010-05-20 12:43:18 +04:00
|
|
|
*/
|
2019-11-15 16:30:32 +03:00
|
|
|
if (n > 0) {
|
|
|
|
unsigned int h = head & mask;
|
|
|
|
unsigned int t = tail & mask;
|
|
|
|
if (h > t) {
|
|
|
|
memcpy(bufs, pipe->bufs + t,
|
|
|
|
n * sizeof(struct pipe_buffer));
|
|
|
|
} else {
|
|
|
|
unsigned int tsize = pipe->ring_size - t;
|
|
|
|
if (h > 0)
|
|
|
|
memcpy(bufs + tsize, pipe->bufs,
|
|
|
|
h * sizeof(struct pipe_buffer));
|
|
|
|
memcpy(bufs, pipe->bufs + t,
|
|
|
|
tsize * sizeof(struct pipe_buffer));
|
|
|
|
}
|
2010-05-20 12:43:18 +04:00
|
|
|
}
|
|
|
|
|
2019-11-15 16:30:32 +03:00
|
|
|
head = n;
|
|
|
|
tail = 0;
|
|
|
|
|
2022-04-20 22:07:53 +03:00
|
|
|
kfree(pipe->bufs);
|
2010-05-20 12:43:18 +04:00
|
|
|
pipe->bufs = bufs;
|
2019-11-15 16:30:32 +03:00
|
|
|
pipe->ring_size = nr_slots;
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
if (pipe->max_usage > nr_slots)
|
|
|
|
pipe->max_usage = nr_slots;
|
2019-11-15 16:30:32 +03:00
|
|
|
pipe->tail = tail;
|
|
|
|
pipe->head = head;
|
pipe: make sure to wake up everybody when the last reader/writer closes
Andrei Vagin reported that commit 0ddad21d3e99 ("pipe: use exclusive
waits when reading or writing") broke one of the CRIU tests. He even
has a trivial reproducer:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main()
{
int p[2];
pid_t p1, p2;
int status;
if (pipe(p) == -1)
return 1;
p1 = fork();
if (p1 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
p2 = fork();
if (p2 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
sleep(1);
close(p[1]);
wait(&status);
wait(&status);
return 0;
}
and the problem - once he points it out - is obvious. We use these nice
exclusive waits, but when the last writer goes away, it then needs to
wake up _every_ reader (and conversely, the last reader disappearing
needs to wake every writer, of course).
In fact, when going through this, we had several small oddities around
how to wake things. We did in fact wake every reader when we changed
the size of the pipe buffers. But that's entirely pointless, since that
just acts as a possible source of new space - no new data to read.
And when we change the size of the buffer, we don't need to wake all
writers even when we add space - that case acts just as if somebody made
space by reading, and any writer that finds itself not filling it up
entirely will wake the next one.
On the other hand, on the exit path, we tried to limit the wakeups with
the proper poll keys etc, which is entirely pointless, because at that
point we obviously need to wake up everybody. So don't do that: just
wake up everybody - but only do that if the counts changed to zero.
So fix those non-IO wakeups to be more proper: space change doesn't add
any new data, but it might make room for writers, so it wakes up a
writer. And the actual changes to reader/writer counts should wake up
everybody, since everybody is affected (ie readers will all see EOF if
the writers have gone away, and writers will all get EPIPE if all
readers have gone away).
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Reported-and-tested-by: Andrei Vagin <avagin@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-18 21:12:58 +03:00
|
|
|
|
2022-05-26 09:34:52 +03:00
|
|
|
spin_unlock_irq(&pipe->rd_wait.lock);
|
|
|
|
|
pipe: make sure to wake up everybody when the last reader/writer closes
Andrei Vagin reported that commit 0ddad21d3e99 ("pipe: use exclusive
waits when reading or writing") broke one of the CRIU tests. He even
has a trivial reproducer:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main()
{
int p[2];
pid_t p1, p2;
int status;
if (pipe(p) == -1)
return 1;
p1 = fork();
if (p1 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
p2 = fork();
if (p2 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
sleep(1);
close(p[1]);
wait(&status);
wait(&status);
return 0;
}
and the problem - once he points it out - is obvious. We use these nice
exclusive waits, but when the last writer goes away, it then needs to
wake up _every_ reader (and conversely, the last reader disappearing
needs to wake every writer, of course).
In fact, when going through this, we had several small oddities around
how to wake things. We did in fact wake every reader when we changed
the size of the pipe buffers. But that's entirely pointless, since that
just acts as a possible source of new space - no new data to read.
And when we change the size of the buffer, we don't need to wake all
writers even when we add space - that case acts just as if somebody made
space by reading, and any writer that finds itself not filling it up
entirely will wake the next one.
On the other hand, on the exit path, we tried to limit the wakeups with
the proper poll keys etc, which is entirely pointless, because at that
point we obviously need to wake up everybody. So don't do that: just
wake up everybody - but only do that if the counts changed to zero.
So fix those non-IO wakeups to be more proper: space change doesn't add
any new data, but it might make room for writers, so it wakes up a
writer. And the actual changes to reader/writer counts should wake up
everybody, since everybody is affected (ie readers will all see EOF if
the writers have gone away, and writers will all get EPIPE if all
readers have gone away).
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Reported-and-tested-by: Andrei Vagin <avagin@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-18 21:12:58 +03:00
|
|
|
/* This might have made more room for writers */
|
|
|
|
wake_up_interruptible(&pipe->wr_wait);
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate a new array of pipe buffers and copy the info over. Returns the
|
|
|
|
* pipe size if successful, or return -ERROR on error.
|
|
|
|
*/
|
|
|
|
static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
|
|
|
|
{
|
|
|
|
unsigned long user_bufs;
|
|
|
|
unsigned int nr_slots, size;
|
|
|
|
long ret = 0;
|
|
|
|
|
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
if (pipe->watch_queue)
|
|
|
|
return -EBUSY;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
size = round_pipe_size(arg);
|
|
|
|
nr_slots = size >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
if (!nr_slots)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If trying to increase the pipe capacity, check that an
|
|
|
|
* unprivileged user is not trying to exceed various limits
|
|
|
|
* (soft limit check here, hard limit check just below).
|
|
|
|
* Decreasing the pipe capacity is always permitted, even
|
|
|
|
* if the user is currently over a limit.
|
|
|
|
*/
|
|
|
|
if (nr_slots > pipe->max_usage &&
|
|
|
|
size > pipe_max_size && !capable(CAP_SYS_RESOURCE))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
user_bufs = account_pipe_buffers(pipe->user, pipe->nr_accounted, nr_slots);
|
|
|
|
|
|
|
|
if (nr_slots > pipe->max_usage &&
|
|
|
|
(too_many_pipe_buffers_hard(user_bufs) ||
|
|
|
|
too_many_pipe_buffers_soft(user_bufs)) &&
|
|
|
|
pipe_is_unprivileged_user()) {
|
|
|
|
ret = -EPERM;
|
|
|
|
goto out_revert_acct;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = pipe_resize_ring(pipe, nr_slots);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_revert_acct;
|
|
|
|
|
|
|
|
pipe->max_usage = nr_slots;
|
|
|
|
pipe->nr_accounted = nr_slots;
|
2019-10-16 18:47:32 +03:00
|
|
|
return pipe->max_usage * PAGE_SIZE;
|
pipe: fix limit checking in pipe_set_size()
The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ))
has the following problems:
(1) When increasing the pipe capacity, the checks against the limits in
/proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing
consumption, and exclude the memory required for the increased pipe
capacity. The new increase in pipe capacity can then push the total
memory used by the user for pipes (possibly far) over a limit. This
can also trigger the problem described next.
(2) The limit checks are performed even when the new pipe capacity is
less than the existing pipe capacity. This can lead to problems if a
user sets a large pipe capacity, and then the limits are lowered,
with the result that the user will no longer be able to decrease the
pipe capacity.
(3) As currently implemented, accounting and checking against the
limits is done as follows:
(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.
This is racey. Multiple processes may pass point (a)
simultaneously, and then allocate pipe buffers that are accounted
for only in step (c). The race means that the user's pipe buffer
allocation could be pushed over the limit (by an arbitrary amount,
depending on how unlucky we were in the race). [Thanks to Vegard
Nossum for spotting this point, which I had missed.]
This patch addresses the above problems as follows:
* Perform checks against the limits only when increasing a pipe's
capacity; an unprivileged user can always decrease a pipe's capacity.
* Alter the checks against limits to include the memory required for
the new pipe capacity.
* Re-order the accounting step so that it precedes the buffer
allocation. If the accounting step determines that a limit has
been reached, revert the accounting and cause the operation to fail.
The program below can be used to demonstrate problems 1 and 2, and the
effect of the fix. The program takes one or more command-line arguments.
The first argument specifies the number of pipes that the program should
create. The remaining arguments are, alternately, pipe capacities that
should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in
seconds) between the fcntl() operations. (The sleep intervals allow the
possibility to change the limits between fcntl() operations.)
Problem 1
=========
Using the test program on an unpatched kernel, we first set some
limits:
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB
Then show that we can set a pipe with capacity (100MB) that is
over the hard limit
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
F_SETPIPE_SZ returned 134217728
Now set the capacity to 100MB twice. The second call fails (which is
probably surprising to most users, since it seems like a no-op):
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
F_SETPIPE_SZ returned 134217728
Loop 2: set pipe capacity to 100000000 bytes
Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted
With a patched kernel, setting a capacity over the limit fails at the
first attempt:
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted
There is a small chance that the change to fix this problem could
break user-space, since there are cases where fcntl(F_SETPIPE_SZ)
calls that previously succeeded might fail. However, the chances are
small, since (a) the pipe-user-pages-{soft,hard} limits are new (in
4.5), and the default soft/hard limits are high/unlimited. Therefore,
it seems warranted to make these limits operate more precisely (and
behave more like what users probably expect).
Problem 2
=========
Running the test program on an unpatched kernel, we first set some limits:
# getconf PAGESIZE
4096
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB
Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe,
first setting a pipe capacity (10MB), sleeping for a few seconds,
during which time the hard limit is lowered, and then set pipe
capacity to a smaller amount (5MB):
# sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
[1] 748
# Initial pipe capacity: 65536
Loop 1: set pipe capacity to 10000000 bytes
F_SETPIPE_SZ returned 16777216
Sleeping 15 seconds
# echo 1000 > /proc/sys/fs/pipe-user-pages-hard # 4.096 MB
# Loop 2: set pipe capacity to 5000000 bytes
Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted
In this case, the user should be able to lower the limit.
With a kernel that has the patch below, the second fcntl()
succeeds:
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard
# sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
[1] 3215
# Initial pipe capacity: 65536
# Loop 1: set pipe capacity to 10000000 bytes
F_SETPIPE_SZ returned 16777216
Sleeping 15 seconds
# echo 1000 > /proc/sys/fs/pipe-user-pages-hard
# Loop 2: set pipe capacity to 5000000 bytes
F_SETPIPE_SZ returned 8388608
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
/* test_F_SETPIPE_SZ.c
(C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later
Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity
and interactions with limits defined by /proc/sys/fs/pipe-* files.
*/
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
int (*pfd)[2];
int npipes;
int pcap, rcap;
int j, p, s, stime, loop;
if (argc < 2) {
fprintf(stderr, "Usage: %s num-pipes "
"[pipe-capacity sleep-time]...\n", argv[0]);
exit(EXIT_FAILURE);
}
npipes = atoi(argv[1]);
pfd = calloc(npipes, sizeof (int [2]));
if (pfd == NULL) {
perror("calloc");
exit(EXIT_FAILURE);
}
for (j = 0; j < npipes; j++) {
if (pipe(pfd[j]) == -1) {
fprintf(stderr, "Loop %d: pipe() failed: ", j);
perror("pipe");
exit(EXIT_FAILURE);
}
}
printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ));
for (j = 2; j < argc; j += 2 ) {
loop = j / 2;
pcap = atoi(argv[j]);
printf(" Loop %d: set pipe capacity to %d bytes\n", loop, pcap);
for (p = 0; p < npipes; p++) {
s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap);
if (s == -1) {
fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
"failed: ", loop, p);
perror("fcntl");
exit(EXIT_FAILURE);
}
if (p == 0) {
printf(" F_SETPIPE_SZ returned %d\n", s);
rcap = s;
} else {
if (s != rcap) {
fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
"unexpected return: %d\n", loop, p, s);
exit(EXIT_FAILURE);
}
}
stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0;
if (stime > 0) {
printf(" Sleeping %d seconds\n", stime);
sleep(stime);
}
}
}
exit(EXIT_SUCCESS);
}
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
Patch history:
v2
* Switch order of test in 'if' statement to avoid function call
(to capability()) in normal path. [This is a fix to a preexisting
wart in the code. Thanks to Willy Tarreau]
* Perform (size > pipe_max_size) check before calling
account_pipe_buffers(). [Thanks to Vegard Nossum]
Quoting Vegard:
The potential problem happens if the user passes a very large number
which will overflow pipe->user->pipe_bufs.
On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX
then round_pipe_size() returns INT_MAX. Although it's true that the
accounting is done in terms of pages and not bytes, so you'd need on
the order of (1 << 13) = 8192 processes hitting the limit at the same
time in order to make it overflow, which seems a bit unlikely.
(See https://lkml.org/lkml/2016/8/12/215 for another discussion on the
limit checking)
Link: http://lkml.kernel.org/r/1e464945-536b-2420-798b-e77b9c7e8593@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 23:53:31 +03:00
|
|
|
|
|
|
|
out_revert_acct:
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
(void) account_pipe_buffers(pipe->user, nr_slots, pipe->nr_accounted);
|
pipe: fix limit checking in pipe_set_size()
The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ))
has the following problems:
(1) When increasing the pipe capacity, the checks against the limits in
/proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing
consumption, and exclude the memory required for the increased pipe
capacity. The new increase in pipe capacity can then push the total
memory used by the user for pipes (possibly far) over a limit. This
can also trigger the problem described next.
(2) The limit checks are performed even when the new pipe capacity is
less than the existing pipe capacity. This can lead to problems if a
user sets a large pipe capacity, and then the limits are lowered,
with the result that the user will no longer be able to decrease the
pipe capacity.
(3) As currently implemented, accounting and checking against the
limits is done as follows:
(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.
This is racey. Multiple processes may pass point (a)
simultaneously, and then allocate pipe buffers that are accounted
for only in step (c). The race means that the user's pipe buffer
allocation could be pushed over the limit (by an arbitrary amount,
depending on how unlucky we were in the race). [Thanks to Vegard
Nossum for spotting this point, which I had missed.]
This patch addresses the above problems as follows:
* Perform checks against the limits only when increasing a pipe's
capacity; an unprivileged user can always decrease a pipe's capacity.
* Alter the checks against limits to include the memory required for
the new pipe capacity.
* Re-order the accounting step so that it precedes the buffer
allocation. If the accounting step determines that a limit has
been reached, revert the accounting and cause the operation to fail.
The program below can be used to demonstrate problems 1 and 2, and the
effect of the fix. The program takes one or more command-line arguments.
The first argument specifies the number of pipes that the program should
create. The remaining arguments are, alternately, pipe capacities that
should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in
seconds) between the fcntl() operations. (The sleep intervals allow the
possibility to change the limits between fcntl() operations.)
Problem 1
=========
Using the test program on an unpatched kernel, we first set some
limits:
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB
Then show that we can set a pipe with capacity (100MB) that is
over the hard limit
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
F_SETPIPE_SZ returned 134217728
Now set the capacity to 100MB twice. The second call fails (which is
probably surprising to most users, since it seems like a no-op):
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
F_SETPIPE_SZ returned 134217728
Loop 2: set pipe capacity to 100000000 bytes
Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted
With a patched kernel, setting a capacity over the limit fails at the
first attempt:
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted
There is a small chance that the change to fix this problem could
break user-space, since there are cases where fcntl(F_SETPIPE_SZ)
calls that previously succeeded might fail. However, the chances are
small, since (a) the pipe-user-pages-{soft,hard} limits are new (in
4.5), and the default soft/hard limits are high/unlimited. Therefore,
it seems warranted to make these limits operate more precisely (and
behave more like what users probably expect).
Problem 2
=========
Running the test program on an unpatched kernel, we first set some limits:
# getconf PAGESIZE
4096
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB
Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe,
first setting a pipe capacity (10MB), sleeping for a few seconds,
during which time the hard limit is lowered, and then set pipe
capacity to a smaller amount (5MB):
# sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
[1] 748
# Initial pipe capacity: 65536
Loop 1: set pipe capacity to 10000000 bytes
F_SETPIPE_SZ returned 16777216
Sleeping 15 seconds
# echo 1000 > /proc/sys/fs/pipe-user-pages-hard # 4.096 MB
# Loop 2: set pipe capacity to 5000000 bytes
Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted
In this case, the user should be able to lower the limit.
With a kernel that has the patch below, the second fcntl()
succeeds:
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard
# sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
[1] 3215
# Initial pipe capacity: 65536
# Loop 1: set pipe capacity to 10000000 bytes
F_SETPIPE_SZ returned 16777216
Sleeping 15 seconds
# echo 1000 > /proc/sys/fs/pipe-user-pages-hard
# Loop 2: set pipe capacity to 5000000 bytes
F_SETPIPE_SZ returned 8388608
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
/* test_F_SETPIPE_SZ.c
(C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later
Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity
and interactions with limits defined by /proc/sys/fs/pipe-* files.
*/
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
int (*pfd)[2];
int npipes;
int pcap, rcap;
int j, p, s, stime, loop;
if (argc < 2) {
fprintf(stderr, "Usage: %s num-pipes "
"[pipe-capacity sleep-time]...\n", argv[0]);
exit(EXIT_FAILURE);
}
npipes = atoi(argv[1]);
pfd = calloc(npipes, sizeof (int [2]));
if (pfd == NULL) {
perror("calloc");
exit(EXIT_FAILURE);
}
for (j = 0; j < npipes; j++) {
if (pipe(pfd[j]) == -1) {
fprintf(stderr, "Loop %d: pipe() failed: ", j);
perror("pipe");
exit(EXIT_FAILURE);
}
}
printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ));
for (j = 2; j < argc; j += 2 ) {
loop = j / 2;
pcap = atoi(argv[j]);
printf(" Loop %d: set pipe capacity to %d bytes\n", loop, pcap);
for (p = 0; p < npipes; p++) {
s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap);
if (s == -1) {
fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
"failed: ", loop, p);
perror("fcntl");
exit(EXIT_FAILURE);
}
if (p == 0) {
printf(" F_SETPIPE_SZ returned %d\n", s);
rcap = s;
} else {
if (s != rcap) {
fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
"unexpected return: %d\n", loop, p, s);
exit(EXIT_FAILURE);
}
}
stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0;
if (stime > 0) {
printf(" Sleeping %d seconds\n", stime);
sleep(stime);
}
}
}
exit(EXIT_SUCCESS);
}
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
Patch history:
v2
* Switch order of test in 'if' statement to avoid function call
(to capability()) in normal path. [This is a fix to a preexisting
wart in the code. Thanks to Willy Tarreau]
* Perform (size > pipe_max_size) check before calling
account_pipe_buffers(). [Thanks to Vegard Nossum]
Quoting Vegard:
The potential problem happens if the user passes a very large number
which will overflow pipe->user->pipe_bufs.
On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX
then round_pipe_size() returns INT_MAX. Although it's true that the
accounting is done in terms of pages and not bytes, so you'd need on
the order of (1 << 13) = 8192 processes hitting the limit at the same
time in order to make it overflow, which seems a bit unlikely.
(See https://lkml.org/lkml/2016/8/12/215 for another discussion on the
limit checking)
Link: http://lkml.kernel.org/r/1e464945-536b-2420-798b-e77b9c7e8593@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 23:53:31 +03:00
|
|
|
return ret;
|
2010-05-20 12:43:18 +04:00
|
|
|
}
|
|
|
|
|
2010-11-29 03:27:19 +03:00
|
|
|
/*
|
2020-11-23 15:38:40 +03:00
|
|
|
* Note that i_pipe and i_cdev share the same location, so checking ->i_pipe is
|
|
|
|
* not enough to verify that this is a pipe.
|
2010-11-29 03:27:19 +03:00
|
|
|
*/
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
struct pipe_inode_info *get_pipe_info(struct file *file, bool for_splice)
|
2010-11-29 03:27:19 +03:00
|
|
|
{
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
struct pipe_inode_info *pipe = file->private_data;
|
|
|
|
|
|
|
|
if (file->f_op != &pipefifo_fops || !pipe)
|
|
|
|
return NULL;
|
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
if (for_splice && pipe->watch_queue)
|
|
|
|
return NULL;
|
|
|
|
#endif
|
|
|
|
return pipe;
|
2010-11-29 03:27:19 +03:00
|
|
|
}
|
|
|
|
|
2010-05-20 12:43:18 +04:00
|
|
|
long pipe_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
|
|
|
|
{
|
|
|
|
struct pipe_inode_info *pipe;
|
|
|
|
long ret;
|
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-14 20:07:11 +03:00
|
|
|
pipe = get_pipe_info(file, false);
|
2010-05-20 12:43:18 +04:00
|
|
|
if (!pipe)
|
|
|
|
return -EBADF;
|
|
|
|
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_lock(pipe);
|
2010-05-20 12:43:18 +04:00
|
|
|
|
|
|
|
switch (cmd) {
|
2016-10-11 23:53:25 +03:00
|
|
|
case F_SETPIPE_SZ:
|
|
|
|
ret = pipe_set_size(pipe, arg);
|
2010-05-20 12:43:18 +04:00
|
|
|
break;
|
|
|
|
case F_GETPIPE_SZ:
|
2019-10-16 18:47:32 +03:00
|
|
|
ret = pipe->max_usage * PAGE_SIZE;
|
2010-05-20 12:43:18 +04:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
ret = -EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2013-03-21 20:24:01 +04:00
|
|
|
__pipe_unlock(pipe);
|
2010-05-20 12:43:18 +04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-01-07 09:49:50 +03:00
|
|
|
static const struct super_operations pipefs_ops = {
|
|
|
|
.destroy_inode = free_inode_nonrcu,
|
2011-11-01 04:10:04 +04:00
|
|
|
.statfs = simple_statfs,
|
2011-01-07 09:49:50 +03:00
|
|
|
};
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* pipefs should _never_ be mounted by userland - too much of security hassle,
|
|
|
|
* no real gain from having the whole whorehouse mounted. So we don't need
|
|
|
|
* any operations on the root directory. However, we need a non-trivial
|
|
|
|
* d_name - pipe: will go nicely and kill the special-casing in procfs.
|
|
|
|
*/
|
2019-03-25 19:38:23 +03:00
|
|
|
|
|
|
|
static int pipefs_init_fs_context(struct fs_context *fc)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2019-03-25 19:38:23 +03:00
|
|
|
struct pseudo_fs_context *ctx = init_pseudo(fc, PIPEFS_MAGIC);
|
|
|
|
if (!ctx)
|
|
|
|
return -ENOMEM;
|
|
|
|
ctx->ops = &pipefs_ops;
|
|
|
|
ctx->dops = &pipefs_dentry_operations;
|
|
|
|
return 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct file_system_type pipe_fs_type = {
|
|
|
|
.name = "pipefs",
|
2019-03-25 19:38:23 +03:00
|
|
|
.init_fs_context = pipefs_init_fs_context,
|
2005-04-17 02:20:36 +04:00
|
|
|
.kill_sb = kill_anon_super,
|
|
|
|
};
|
|
|
|
|
2022-01-22 09:13:20 +03:00
|
|
|
#ifdef CONFIG_SYSCTL
|
|
|
|
static int do_proc_dopipe_max_size_conv(unsigned long *lvalp,
|
|
|
|
unsigned int *valp,
|
|
|
|
int write, void *data)
|
|
|
|
{
|
|
|
|
if (write) {
|
|
|
|
unsigned int val;
|
|
|
|
|
|
|
|
val = round_pipe_size(*lvalp);
|
|
|
|
if (val == 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
*valp = val;
|
|
|
|
} else {
|
|
|
|
unsigned int val = *valp;
|
|
|
|
*lvalp = (unsigned long) val;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int proc_dopipe_max_size(struct ctl_table *table, int write,
|
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return do_proc_douintvec(table, write, buffer, lenp, ppos,
|
|
|
|
do_proc_dopipe_max_size_conv, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct ctl_table fs_pipe_sysctls[] = {
|
|
|
|
{
|
|
|
|
.procname = "pipe-max-size",
|
|
|
|
.data = &pipe_max_size,
|
|
|
|
.maxlen = sizeof(pipe_max_size),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dopipe_max_size,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "pipe-user-pages-hard",
|
|
|
|
.data = &pipe_user_pages_hard,
|
|
|
|
.maxlen = sizeof(pipe_user_pages_hard),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "pipe-user-pages-soft",
|
|
|
|
.data = &pipe_user_pages_soft,
|
|
|
|
.maxlen = sizeof(pipe_user_pages_soft),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_doulongvec_minmax,
|
|
|
|
},
|
|
|
|
{ }
|
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
static int __init init_pipe_fs(void)
|
|
|
|
{
|
|
|
|
int err = register_filesystem(&pipe_fs_type);
|
2006-04-11 15:57:45 +04:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
if (!err) {
|
|
|
|
pipe_mnt = kern_mount(&pipe_fs_type);
|
|
|
|
if (IS_ERR(pipe_mnt)) {
|
|
|
|
err = PTR_ERR(pipe_mnt);
|
|
|
|
unregister_filesystem(&pipe_fs_type);
|
|
|
|
}
|
|
|
|
}
|
2022-01-22 09:13:20 +03:00
|
|
|
#ifdef CONFIG_SYSCTL
|
|
|
|
register_sysctl_init("fs", fs_pipe_sysctls);
|
|
|
|
#endif
|
2005-04-17 02:20:36 +04:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
fs_initcall(init_pipe_fs);
|