Граф коммитов

473 Коммитов

Автор SHA1 Сообщение Дата
Jann Horn 4d45e75a99 mm: remove the now-unnecessary mmget_still_valid() hack
The preceding patches have ensured that core dumping properly takes the
mmap_lock.  Thanks to that, we can now remove mmget_still_valid() and all
its users.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:22 -07:00
Linus Torvalds 612e7a4c16 kernel-clone-v5.9
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXz5bNAAKCRCRxhvAZXjc
 opfjAP9R/J72yxdd2CLGNZ96hyiRX1NgFDOVUhscOvujYJf8ZwD+OoLmKMvAyFW6
 hnMhT1n9Q+aq194hyzChOLQaBTejBQ8=
 =4WCX
 -----END PGP SIGNATURE-----

Merge tag 'kernel-clone-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull kernel_clone() updates from Christian Brauner:
 "During the v5.9 merge window we reworked the process creation
  codepaths across multiple architectures. After this work we were only
  left with the _do_fork() helper based on the struct kernel_clone_args
  calling convention. As was pointed out _do_fork() isn't valid
  kernelese especially for a helper that isn't just static.

  This series removes the _do_fork() helper and introduces the new
  kernel_clone() helper. The process creation cleanup didn't change the
  name to something more reasonable mainly because _do_fork() was used
  in quite a few places. So sending this as a separate series seemed the
  better strategy.

  I originally intended to send this early in the v5.9 development cycle
  after the merge window had closed but given that this was touching
  quite a few places I decided to defer this until the v5.10 merge
  window"

* tag 'kernel-clone-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  sched: remove _do_fork()
  tracing: switch to kernel_clone()
  kgdbts: switch to kernel_clone()
  kprobes: switch to kernel_clone()
  x86: switch to kernel_clone()
  sparc: switch to kernel_clone()
  nios2: switch to kernel_clone()
  m68k: switch to kernel_clone()
  ia64: switch to kernel_clone()
  h8300: switch to kernel_clone()
  fork: introduce kernel_clone()
2020-10-14 14:32:52 -07:00
Suren Baghdasaryan 67197a4f28 mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary
Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm.  This is done for any task with more that one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.

Android updates oom_score_adj whenever a tasks changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important.  Such operation can happen frequently.  We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression.  Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.

Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes.  To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex.  Its scope is changed to global.

The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
the case of vfork().  To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare.  Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.

With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.

[surenb@google.com: v3]
  Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

Fixes: 44a70adec9 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray <timmurray@google.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: Adrian Reber <areber@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
Debugged-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:35 -07:00
Peter Oskolkov 2a36ab717e rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
This patchset is based on Google-internal RSEQ work done by Paul
Turner and Andrew Hunter.

When working with per-CPU RSEQ-based memory allocations, it is
sometimes important to make sure that a global memory location is no
longer accessed from RSEQ critical sections. For example, there can be
two per-CPU lists, one is "active" and accessed per-CPU, while another
one is inactive and worked on asynchronously "off CPU" (e.g.  garbage
collection is performed). Then at some point the two lists are
swapped, and a fast RCU-like mechanism is required to make sure that
the previously active list is no longer accessed.

This patch introduces such a mechanism: in short, membarrier() syscall
issues an IPI to a CPU, restarting a potentially active RSEQ critical
section on the CPU.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com
2020-09-25 14:23:27 +02:00
Valentin Schneider 4fc472f121 sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
SD_DEGENERATE_GROUPS_MASK is only useful for sched/topology.c, but still
gets defined for anyone who imports topology.h, leading to a flurry of
unused variable warnings.

Move it out of the header and place it next to the SD degeneration
functions in sched/topology.c.

Fixes: 4ee4ea443a ("sched/topology: Introduce SD metaflag for flags needing > 1 groups")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200825133216.9163-2-valentin.schneider@arm.com
2020-08-26 12:41:59 +02:00
Valentin Schneider 8fca9494d4 sched/topology: Move sd_flag_debug out of linux/sched/topology.h
Defining an array in a header imported all over the place clearly is a daft
idea, that still didn't stop me from doing it.

Leave a declaration of sd_flag_debug in topology.h and move its definition
to sched/debug.c.

Fixes: b6e862f386 ("sched/topology: Define and assign sched_domain flag metadata")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200825133216.9163-1-valentin.schneider@arm.com
2020-08-26 12:41:59 +02:00
Christian Brauner 06fe456349
sched: remove _do_fork()
Now that all callers of _do_fork() have been switched to kernel_clone() remove
the _do_fork() helper.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200819104655.436656-12-christian.brauner@ubuntu.com
2020-08-20 13:12:59 +02:00
Christian Brauner cad6967ac1
fork: introduce kernel_clone()
The old _do_fork() helper doesn't follow naming conventions of in-kernel
helpers for syscalls. The process creation cleanup in [1] didn't change the
name to something more reasonable mainly because _do_fork() was used in quite a
few places. So sending this as a separate series seemed the better strategy.

This commit does two things:
1. renames _do_fork() to kernel_clone() but keeps _do_fork() as a simple static
   inline wrapper around kernel_clone().
2. Changes the return type from long to pid_t. This aligns kernel_thread() and
   kernel_clone(). Also, the return value from kernel_clone that is surfaced in
   fork(), vfork(), clone(), and clone3() is taken from pid_vrn() which returns
   a pid_t too.

Follow-up patches will switch each caller of _do_fork() and each place where it
is referenced over to kernel_clone(). After all these changes are done, we can
remove _do_fork() completely and will only be left with kernel_clone().

[1]: 9ba27414f2 ("Merge tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux")

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200819104655.436656-2-christian.brauner@ubuntu.com
2020-08-20 13:12:57 +02:00
Valentin Schneider 5f4a1c4ea4 sched/topology: Mark SD_NUMA as SDF_NEEDS_GROUPS
There would be no point in preserving a sched_domain with a single group
just because it has this flag set. Add it to SD_DEGENERATE_GROUPS_MASK.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-17-valentin.schneider@arm.com
2020-08-19 10:49:50 +02:00
Valentin Schneider 3551e954f5 sched/topology: Mark SD_OVERLAP as SDF_NEEDS_GROUPS
A sched_domain can only have overlapping sched_groups if it has more than
one group.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-16-valentin.schneider@arm.com
2020-08-19 10:49:50 +02:00
Valentin Schneider 33199b0143 sched/topology: Mark SD_ASYM_PACKING as SDF_NEEDS_GROUPS
Being a load-balancing flag, it requires 2+ groups to have any effect.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-15-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
Valentin Schneider bdb7c802cc sched/topology: Mark SD_SERIALIZE as SDF_NEEDS_GROUPS
There would be no point in preserving a sched_domain with a single group
just because it has this flag set. Add it to SD_DEGENERATE_GROUPS_MASK.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-14-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
Valentin Schneider 94b858fea1 sched/topology: Mark SD_BALANCE_WAKE as SDF_NEEDS_GROUPS
Even if no mainline topology uses this flag, it is a load balancing flag
just like SD_BALANCE_FORK and requires 2+ groups to have any effect.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-13-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
Valentin Schneider 3a6712c768 sched/topology: Mark SD_PREFER_SIBLING as SDF_NEEDS_GROUPS
SD_PREFER_SIBLING is currently considered in sd_parent_degenerate() but not
in sd_degenerate(). It too hinges on load balancing, and thus won't have
any effect when set on a domain with a single group. Add it to
SD_DEGENERATE_GROUPS_MASK.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-12-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
Valentin Schneider c200191d4c sched/topology: Propagate SD_ASYM_CPUCAPACITY upwards
We currently set this flag *only* on domains whose topology level exactly
match the level where we detect asymmetry (as returned by
asym_cpu_capacity_level()). This is rather problematic.

Say there are two clusters in the system, one with a lone big CPU and the
other with a mix of big and LITTLE CPUs (as is allowed by DynamIQ):

  DIE [                ]
  MC  [             ][ ]
       0   1   2   3  4
       L   L   B   B  B

asym_cpu_capacity_level() will figure out that the MC level is the one
where all CPUs can see a CPU of max capacity, and we will thus set
SD_ASYM_CPUCAPACITY at MC level for all CPUs.

That lone big CPU will degenerate its MC domain, since it would be alone in
there, and will end up with just a DIE domain. Since the flag was only set
at MC, this CPU ends up not seeing any SD with the flag set, which is
broken.

Rather than clearing dflags at every topology level, clear it before
entering the topology level loop. This will properly propagate upwards
flags that are set starting from a certain level.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-11-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
Valentin Schneider 4ee4ea443a sched/topology: Introduce SD metaflag for flags needing > 1 groups
In preparation of cleaning up the sd_degenerate*() functions, mark flags
used in sd_degenerate() with the new SDF_NEEDS_GROUPS flag. With this,
build a compile-time mask of those SD flags.

Note that sd_parent_degenerate() uses an extra flag in its mask,
SD_PREFER_SIBLING, which remains singled out for now.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-8-valentin.schneider@arm.com
2020-08-19 10:49:48 +02:00
Valentin Schneider b6e862f386 sched/topology: Define and assign sched_domain flag metadata
There are some expectations regarding how sched domain flags should be laid
out, but none of them are checked or asserted in
sched_domain_debug_one(). After staring at said flags for a while, I've
come to realize there's two repeating patterns:

- Shared with children: those flags are set from the base CPU domain
  upwards. Any domain that has it set will have it set in its children. It
  hints at "some property holds true / some behaviour is enabled until this
  level".

- Shared with parents: those flags are set from the topmost domain
  downwards. Any domain that has it set will have it set in its parents. It
  hints at "some property isn't visible / some behaviour is disabled until
  this level".

There are two outliers that (currently) do not map to either of these:

o SD_PREFER_SIBLING, which is cleared below levels with
  SD_ASYM_CPUCAPACITY. The change was introduced by commit:

    9c63e84db2 ("sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains")

  as it could break misfit migration on some systems. In light of this, we
  might want to change it back to make it fit one of the two categories and
  fix the issue another way.

o SD_ASYM_CPUCAPACITY, which gets set on a single level and isn't
  propagated up nor down. From a topology description point of view, it
  really wants to be SDF_SHARED_PARENT; this will be rectified in a later
  patch.

Tweak the sched_domain flag declaration to assign each flag an expected
layout, and include the rationale for each flag "meta type" assignment as a
comment. Consolidate the flag metadata into an array; the index of a flag's
metadata can easily be found with log2(flag), IOW __ffs(flag).

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-5-valentin.schneider@arm.com
2020-08-19 10:49:48 +02:00
Valentin Schneider d54a9658a7 sched/topology: Split out SD_* flags declaration to its own file
To associate the SD flags with some metadata, we need some more structure
in the way they are declared.

Rather than shove that in a free-standing macro list, move the declaration
in a separate file that can be re-imported with different SD_FLAG
definitions. This is inspired by what is done with the syscall
table (see uapi/asm/unistd.h and sys_call_table).

The value assigned to a given SD flag now depends on the order it appears
in sd_flags.h. No change in functionality.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-4-valentin.schneider@arm.com
2020-08-19 10:49:47 +02:00
Valentin Schneider cfe7ddcbd7 ARM, sched/topology: Remove SD_SHARE_POWERDOMAIN
This flag was introduced in 2014 by commit:

  d77b3ed5c9 ("sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain")

but AFAIA it was never leveraged by the scheduler. The closest thing I can
think of is EAS caring about frequency domains, and it does that by
leveraging performance domains.

Remove the flag. No change in functionality is expected.

Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-2-valentin.schneider@arm.com
2020-08-19 10:49:47 +02:00
David Howells 29e44f4535 watch_queue: Limit the number of watches a user can hold
Impose a limit on the number of watches that a user can hold so that
they can't use this mechanism to fill up all the available memory.

This is done by putting a counter in user_struct that's incremented when
a watch is allocated and decreased when it is released.  If the number
exceeds the RLIMIT_NOFILE limit, the watch is rejected with EAGAIN.

This can be tested by the following means:

 (1) Create a watch queue and attach it to fd 5 in the program given - in
     this case, bash:

	keyctl watch_session /tmp/nlog /tmp/gclog 5 bash

 (2) In the shell, set the maximum number of files to, say, 99:

	ulimit -n 99

 (3) Add 200 keyrings:

	for ((i=0; i<200; i++)); do keyctl newring a$i @s || break; done

 (4) Try to watch all of the keyrings:

	for ((i=0; i<200; i++)); do echo $i; keyctl watch_add 5 %:a$i || break; done

     This should fail when the number of watches belonging to the user hits
     99.

 (5) Remove all the keyrings and all of those watches should go away:

	for ((i=0; i<200; i++)); do keyctl unlink %:a$i; done

 (6) Kill off the watch queue by exiting the shell spawned by
     watch_session.

Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-17 09:39:18 -07:00
Christoph Hellwig 8043fc147a kernel: add a kernel_wait helper
Add a helper that waits for a pid and stores the status in the passed in
kernel pointer.  Use it to fix the usage of kernel_wait4 in
call_usermodehelper_exec_sync that only happens to work due to the
implicit set_fs(KERNEL_DS) for kernel threads.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Link: http://lkml.kernel.org/r/20200721130449.5008-1-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:59 -07:00
Waiman Long af161bee93 include/linux/sched/mm.h: optimize current_gfp_context()
The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
flags into the corresponding GFP_* flags for memory allocation.  In that
function, current->flags is accessed 3 times.  That may lead to duplicated
access of the same memory location.

This is not usually a problem with minimal debug config options on as the
compiler can optimize away the duplicated memory accesses.  With most of
the debug config options on, however, that may not be the case.  For
example, the x86-64 object size of the __need_fs_reclaim() in a debug
kernel that calls current_gfp_context() was 309 bytes.  With this patch
applied, the object size is reduced to 202 bytes.  This is a saving of 107
bytes and will probably be slightly faster too.

Use READ_ONCE() to access current->flags to prevent the compiler from
possibly accessing current->flags multiple times.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michel Lespinasse <walken@google.com>
Link: http://lkml.kernel.org/r/20200618212936.9776-1-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:57 -07:00
Joonsoo Kim 8510e69c8e mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
Currently, memalloc_nocma_{save/restore} API that prevents CMA area
in page allocation is implemented by using current_gfp_context(). However,
there are two problems of this implementation.

First, this doesn't work for allocation fastpath. In the fastpath,
original gfp_mask is used since current_gfp_context() is introduced in
order to control reclaim and it is on slowpath. So, CMA area can be
allocated through the allocation fastpath even if
memalloc_nocma_{save/restore} APIs are used. Currently, there is just
one user for these APIs and it has a fallback method to prevent actual
problem.
Second, clearing __GFP_MOVABLE in current_gfp_context() has a side effect
to exclude the memory on the ZONE_MOVABLE for allocation target.

To fix these problems, this patch changes the implementation to exclude
CMA area in page allocation. Main point of this change is using the
alloc_flags. alloc_flags is mainly used to control allocation so it fits
for excluding CMA area in allocation.

Fixes: d7fefcc8de (mm/cma: add PF flag to force non cma alloc)
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:29 -07:00
Linus Torvalds 9ba27414f2 fork-v5.9
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXyge/QAKCRCRxhvAZXjc
 oildAQCCWpnTeXm6hrIE3VZ36X5npFtbaEthdBVAUJM7mo0FYwEA8+Wbnubg6jCw
 mztkXCnTfU7tApUdhKtQzcpEws45/Qk=
 =REE/
 -----END PGP SIGNATURE-----

Merge tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fork cleanups from Christian Brauner:
 "This is cleanup series from when we reworked a chunk of the process
  creation paths in the kernel and switched to struct
  {kernel_}clone_args.

  High-level this does two main things:

   - Remove the double export of both do_fork() and _do_fork() where
     do_fork() used the incosistent legacy clone calling convention.

     Now we only export _do_fork() which is based on struct
     kernel_clone_args.

   - Remove the copy_thread_tls()/copy_thread() split making the
     architecture specific HAVE_COYP_THREAD_TLS config option obsolete.

  This switches all remaining architectures to select
  HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling
  convention. The current split makes the process creation codepaths
  more convoluted than they need to be. Each architecture has their own
  copy_thread() function unless it selects HAVE_COPY_THREAD_TLS then it
  has a copy_thread_tls() function.

  The split is not needed anymore nowadays, all architectures support
  CLONE_SETTLS but quite a few of them never bothered to select
  HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread()
  and use the old calling convention. Removing this split cleans up the
  process creation codepaths and paves the way for implementing clone3()
  on such architectures since it requires the copy_thread_tls() calling
  convention.

  After having made each architectures support copy_thread_tls() this
  series simply renames that function back to copy_thread(). It also
  switches all architectures that call do_fork() directly over to
  _do_fork() and the struct kernel_clone_args calling convention. This
  is a corollary of switching the architectures that did not yet support
  it over to copy_thread_tls() since do_fork() is conditional on not
  supporting copy_thread_tls() (Mostly because it lacks a separate
  argument for tls which is trivial to fix but there's no need for this
  function to exist.).

  The do_fork() removal is in itself already useful as it allows to to
  remove the export of both do_fork() and _do_fork() we currently have
  in favor of only _do_fork(). This has already been discussed back when
  we added clone3(). The legacy clone() calling convention is - as is
  probably well-known - somewhat odd:

    #
    # ABI hall of shame
    #
    config CLONE_BACKWARDS
    config CLONE_BACKWARDS2
    config CLONE_BACKWARDS3

  that is aggravated by the fact that some architectures such as sparc
  follow the CLONE_BACKWARDSx calling convention but don't really select
  the corresponding config option since they call do_fork() directly.

  So do_fork() enforces a somewhat arbitrary calling convention in the
  first place that doesn't really help the individual architectures that
  deviate from it. They can thus simply be switched to _do_fork()
  enforcing a single calling convention. (I really hope that any new
  architectures will __not__ try to implement their own calling
  conventions...)

  Most architectures already have made a similar switch (m68k comes to
  mind).

  Overall this removes more code than it adds even with a good portion
  of added comments. It simplifies a chunk of arch specific assembly
  either by moving the code into C or by simply rewriting the assembly.

  Architectures that have been touched in non-trivial ways have all been
  actually boot and stress tested: sparc and ia64 have been tested with
  Debian 9 images. They are the two architectures which have been
  touched the most. All non-trivial changes to architectures have seen
  acks from the relevant maintainers. nios2 with a custom built
  buildroot image. h8300 I couldn't get something bootable to test on
  but the changes have been fairly automatic and I'm sure we'll hear
  people yell if I broke something there.

  All other architectures that have been touched in trivial ways have
  been compile tested for each single patch of the series via git rebase
  -x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested
  even though they have just been trivially touched (removal of the
  HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are
  basically "core architectures" and since it is trivial to get your
  hands on a useable image"

* tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: rename copy_thread_tls() back to copy_thread()
  arch: remove HAVE_COPY_THREAD_TLS
  unicore: switch to copy_thread_tls()
  sh: switch to copy_thread_tls()
  nds32: switch to copy_thread_tls()
  microblaze: switch to copy_thread_tls()
  hexagon: switch to copy_thread_tls()
  c6x: switch to copy_thread_tls()
  alpha: switch to copy_thread_tls()
  fork: remove do_fork()
  h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  sparc: unconditionally enable HAVE_COPY_THREAD_TLS
  sparc: share process creation helpers between sparc and sparc64
  sparc64: enable HAVE_COPY_THREAD_TLS
  fork: fold legacy_clone_args_valid() into _do_fork()
2020-08-04 14:47:45 -07:00
Linus Torvalds 3950e97543 Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "During the development of v5.7 I ran into bugs and quality of
  implementation issues related to exec that could not be easily fixed
  because of the way exec is implemented. So I have been diggin into
  exec and cleaning up what I can.

  This cycle I have been looking at different ideas and different
  implementations to see what is possible to improve exec, and cleaning
  the way exec interfaces with in kernel users. Only cleaning up the
  interfaces of exec with rest of the kernel has managed to stabalize
  and make it through review in time for v5.9-rc1 resulting in 2 sets of
  changes this cycle.

   - Implement kernel_execve

   - Make the user mode driver code a better citizen

  With kernel_execve the code size got a little larger as the copying of
  parameters from userspace and copying of parameters from userspace is
  now separate. The good news is kernel threads no longer need to play
  games with set_fs to use exec. Which when combined with the rest of
  Christophs set_fs changes should security bugs with set_fs much more
  difficult"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
  exec: Implement kernel_execve
  exec: Factor bprm_stack_limits out of prepare_arg_pages
  exec: Factor bprm_execve out of do_execve_common
  exec: Move bprm_mm_init into alloc_bprm
  exec: Move initialization of bprm->filename into alloc_bprm
  exec: Factor out alloc_bprm
  exec: Remove unnecessary spaces from binfmts.h
  umd: Stop using split_argv
  umd: Remove exit_umh
  bpfilter: Take advantage of the facilities of struct pid
  exit: Factor thread_group_exited out of pidfd_poll
  umd: Track user space drivers with struct pid
  bpfilter: Move bpfilter_umh back into init data
  exec: Remove do_execve_file
  umh: Stop calling do_execve_file
  umd: Transform fork_usermode_blob into fork_usermode_driver
  umd: Rename umd_info.cmdline umd_info.driver_name
  umd: For clarity rename umh_info umd_info
  umh: Separate the user mode driver and the user mode helper support
  umh: Remove call_usermodehelper_setup_file.
  ...
2020-08-04 14:27:25 -07:00
Linus Torvalds e4cbce4d13 The main changes in this cycle were:
- Improve uclamp performance by using a static key for the fast path
 
  - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
    better power efficiency of RT tasks on battery powered devices.
    (The default is to maximize performance & reduce RT latencies.)
 
  - Improve utime and stime tracking accuracy, which had a fixed boundary
    of error, which created larger and larger relative errors as the values
    become larger. This is now replaced with more precise arithmetics,
    using the new mul_u64_u64_div_u64() helper in math64.h.
 
  - Improve the deadline scheduler, such as making it capacity aware
 
  - Improve frequency-invariant scheduling
 
  - Misc cleanups in energy/power aware scheduling
 
  - Add sched_update_nr_running tracepoint to track changes to nr_running
 
  - Documentation additions and updates
 
  - Misc cleanups and smaller fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8oJDURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ixLg//bqWzFlfWirvngTgDxDnplwUTyKXmMCcq
 R1IYhlyK2O5FxvhbRmdmW11W3yzyTPvgCs6Q/70negGaPNe2w1OxfxiK9NMKz5eu
 M1LoXas7pL5g7Pr/ZxxHk/8VqJLV4t9MkodiiInmV6lTaznT3sU6a/kpYQjJyFnG
 Tuu9jd6JhdRKmePDJnNmUBoGQ7JiOQDcX4HtkcQ3OA+An3624tmJzbW1yts+uj7J
 ZWo2EY60RfbA9MxQXGPOaR/nAjngWs4Q6tddAh10mftsPq1gR2iFUKju1d31MQt/
 RHLdiqJf+AyUC4popKG7a+7ilCKMBwPociSreTJNPyEUQ1X4AM3vUVk4yjUoiDph
 k2WdsCF8/JRdhXg0NnrpPUqOaAbQj53EeXnitEb92E7WyTZgLOvAtpV//xZo6utp
 2QHerfrQ9SoGQjz/ho78za5vQtV1x25yDhd+X4XV4QEhIy85G9/2JCpC/Kc/TXLf
 OO7A4X69XztKTEJhP60g8ldCPUe4N2vbh1vKY6oAD8AFQVVNZ6n7375/Qa//b0/k
 ++hcYkPc2EK97/aBFdvzDgqb7aUo7Mtn2ibke16sQU4szulaoRuAHQG4jdGKMwbD
 dk2VBoxyxeYFXWHsNneSe87+ha3sd0dSN0ul1EB/SlFrVELMvy634YXnMYGW8ima
 PzyPB0ezpuA=
 =PbO7
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Improve uclamp performance by using a static key for the fast path

 - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
   better power efficiency of RT tasks on battery powered devices.
   (The default is to maximize performance & reduce RT latencies.)

 - Improve utime and stime tracking accuracy, which had a fixed boundary
   of error, which created larger and larger relative errors as the
   values become larger. This is now replaced with more precise
   arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h.

 - Improve the deadline scheduler, such as making it capacity aware

 - Improve frequency-invariant scheduling

 - Misc cleanups in energy/power aware scheduling

 - Add sched_update_nr_running tracepoint to track changes to nr_running

 - Documentation additions and updates

 - Misc cleanups and smaller fixes

* tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
  sched/doc: Document capacity aware scheduling
  sched: Document arch_scale_*_capacity()
  arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
  Documentation/sysctl: Document uclamp sysctl knobs
  sched/uclamp: Add a new sysctl to control RT default boost value
  sched/uclamp: Fix a deadlock when enabling uclamp static key
  sched: Remove duplicated tick_nohz_full_enabled() check
  sched: Fix a typo in a comment
  sched/uclamp: Remove unnecessary mutex_init()
  arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
  sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
  arch_topology, sched/core: Cleanup thermal pressure definition
  trace/events/sched.h: fix duplicated word
  linux/sched/mm.h: drop duplicated words in comments
  smp: Fix a potential usage of stale nr_cpus
  sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
  sched: nohz: stop passing around unused "ticks" parameter.
  sched: Better document ttwu()
  sched: Add a tracepoint to track rq->nr_running
  ...
2020-08-03 14:58:38 -07:00
Valentin Schneider f4470cdf10 sched: Document arch_scale_*_capacity()
Rather that hide their purpose in some dark, damp corner of Documentation/,
add some documentation to the default implementations.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200731192016.7484-2-valentin.schneider@arm.com
2020-08-01 09:19:43 +02:00
Qais Yousef 13685c4a08 sched/uclamp: Add a new sysctl to control RT default boost value
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d99997 ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.

To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
2020-07-29 13:51:47 +02:00
Pavel Begunkov dd6f843a9f tasks: add put_task_struct_many()
put_task_struct_many() is as put_task_struct() but puts several
references at once. Useful to batching it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Valentin Schneider 25980c7a79 arch_topology, sched/core: Cleanup thermal pressure definition
The following commit:

  14533a16c4 ("thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code")

moved the definition of arch_set_thermal_pressure() to sched/core.c, but
kept its declaration in linux/arch_topology.h. When building e.g. an x86
kernel with CONFIG_SCHED_THERMAL_PRESSURE=y, cpufreq_cooling.c ends up
getting the declaration of arch_set_thermal_pressure() from
include/linux/arch_topology.h, which is somewhat awkward.

On top of this, sched/core.c unconditionally defines
o The thermal_pressure percpu variable
o arch_set_thermal_pressure()

while arch_scale_thermal_pressure() does nothing unless redefined by the
architecture.

arch_*() functions are meant to be defined by architectures, so revert the
aforementioned commit and re-implement it in a way that keeps
arch_set_thermal_pressure() architecture-definable, and doesn't define the
thermal pressure percpu variable for kernels that don't need
it (CONFIG_SCHED_THERMAL_PRESSURE=n).

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200712165917.9168-2-valentin.schneider@arm.com
2020-07-22 10:22:05 +02:00
Randy Dunlap e0078e2eb8 linux/sched/mm.h: drop duplicated words in comments
Drop doubled words "to" and "that".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/927ea8d8-3f6c-9b65-4c2b-63ab4bd59ef1@infradead.org
2020-07-22 10:22:05 +02:00
Paul Gortmaker 46132e3ac5 sched: nohz: stop passing around unused "ticks" parameter.
The "ticks" parameter was added in commit 0f004f5a69 ("sched: Cure more
NO_HZ load average woes") since calc_global_nohz() was called and needed
the "ticks" argument.

But in commit c308b56b53 ("sched: Fix nohz load accounting -- again!")
it became unused as the function calc_global_nohz() dropped using "ticks".

Fixes: c308b56b53 ("sched: Fix nohz load accounting -- again!")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593628458-32290-1-git-send-email-paul.gortmaker@windriver.com
2020-07-22 10:22:04 +02:00
Peter Zijlstra faa2fd7cba Merge branch 'sched/urgent' 2020-07-08 11:38:59 +02:00
Eric W. Biederman 38fd525a4c exit: Factor thread_group_exited out of pidfd_poll
Create an independent helper thread_group_exited which returns true
when all threads have passed exit_notify in do_exit.  AKA all of the
threads are at least zombies and might be dead or completely gone.

Create this helper by taking the logic out of pidfd_poll where it is
already tested, and adding a READ_ONCE on the read of
task->exit_state.

I will be changing the user mode driver code to use this same logic
to know when a user mode driver needs to be restarted.

Place the new helper thread_group_exited in kernel/exit.c and
EXPORT it so it can be used by modules.

Link: https://lkml.kernel.org/r/20200702164140.4468-13-ebiederm@xmission.com
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-07 11:58:17 -05:00
Christian Brauner 714acdbd1c
arch: rename copy_thread_tls() back to copy_thread()
Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
back simply copy_thread(). It's a simpler name, and doesn't imply that only
tls is copied here. This finishes an outstanding chunk of internal process
creation work since we've added clone3().

Cc: linux-arch@vger.kernel.org
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>A
Acked-by: Stafford Horne <shorne@gmail.com>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>A
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-04 23:41:37 +02:00
Christian Brauner 140c8180eb
arch: remove HAVE_COPY_THREAD_TLS
All architectures support copy_thread_tls() now, so remove the legacy
copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone
uses the same process creation calling convention based on
copy_thread_tls() and struct kernel_clone_args. This will make it easier to
maintain the core process creation code under kernel/, simplifies the
callpaths and makes the identical for all architectures.

Cc: linux-arch@vger.kernel.org
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-04 23:41:37 +02:00
Christian Brauner ff2a91127b
fork: remove do_fork()
Now that all architectures have been switched to use _do_fork() and the new
struct kernel_clone_args calling convention we can remove the legacy
do_fork() helper completely. The calling convention used to be brittle and
do_fork() didn't buy us anything. The only calling convention accepted
should be based on struct kernel_clone_args going forward. It's cleaner and
uniform.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-04 23:41:36 +02:00
Oleg Nesterov e91b481623 task_work: teach task_work_add() to do signal_wake_up()
So that the target task will exit the wait_event_interruptible-like
loop and call task_work_run() asap.

The patch turns "bool notify" into 0,TWA_RESUME,TWA_SIGNAL enum, the
new TWA_SIGNAL flag implies signal_wake_up().  However, it needs to
avoid the race with recalc_sigpending(), so the patch also adds the
new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK.

TODO: once this patch is merged we need to change all current users
of task_work_add(notify = true) to use TWA_RESUME.

Cc: stable@vger.kernel.org # v5.7
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 12:18:08 -06:00
Sebastian Andrzej Siewior 01e377c539 sched/core: Remove mmdrop() definition
Commit
   bf2c59fce4 ("sched/core: Fix illegal RCU from offline CPUs")

introduced a definition for mmdrop() but a a few lines above there is
already mmdrop() defined as static inline.

Remove the newly introduced mmdrop() definition.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200618190810.790211-1-bigeasy@linutronix.de
2020-06-25 13:45:45 +02:00
Christian Brauner 3af8588c77
fork: fold legacy_clone_args_valid() into _do_fork()
This separate helper only existed to guarantee the mutual exclusivity of
CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone since CLONE_PIDFD
abuses the parent_tid field to return the pidfd. But we can actually handle
this uniformely thus removing the helper. For legacy clone we can detect
that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID
because they will share the same memory which is invalid and for clone3()
setting the separate pidfd and parent_tid fields to the same memory is
bogus as well. So fold that helper directly into _do_fork() by detecting
this case.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: x86@kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-22 14:38:38 +02:00
Peter Zijlstra b4098bfc5e sched/deadline: Impose global limits on sched_attr::sched_period
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726161357.397880775@infradead.org
2020-06-15 14:10:04 +02:00
Marcelo Tosatti 9cc5b86568 isolcpus: Affine unbound kernel threads to housekeeping cpus
This is a kernel enhancement that configures the cpu affinity of kernel
threads via kernel boot option nohz_full=.

When this option is specified, the cpumask is immediately applied upon
kthread launch. This does not affect kernel threads that specify cpu
and node.

This allows CPU isolation (that is not allowing certain threads
to execute on certain CPUs) without using the isolcpus=domain parameter,
making it possible to enable load balancing on such CPUs
during runtime (see kernel-parameters.txt).

Note-1: this is based off on Wind River's patch at
https://github.com/starlingx-staging/stx-integ/blob/master/kernel/kernel-std/centos/patches/affine-compute-kernel-threads.patch

Difference being that this patch is limited to modifying kernel thread
cpumask. Behaviour of other threads can be controlled via cgroups or
sched_setaffinity.

Note-2: Wind River's patch was based off Christoph Lameter's patch at
https://lwn.net/Articles/565932/ with the only difference being
the kernel parameter changed from kthread to kthread_cpus.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200527142909.23372-3-frederic@kernel.org
2020-06-15 14:10:03 +02:00
Michel Lespinasse c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Dmitry Safonov 9cb8f069de kernel: rename show_stack_loglvl() => show_stack()
Now the last users of show_stack() got converted to use an explicit log
level, show_stack_loglvl() can drop it's redundant suffix and become once
again well known show_stack().

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-51-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Dmitry Safonov ab34b46d1a sysrq: use show_stack_loglvl()
Show the stack trace on a CPU with the same log level as "CPU%d" header.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Link: http://lkml.kernel.org/r/20200418201944.482088-45-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:12 -07:00
Guilherme G. Piccoli 0ec9dc9bcb kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected
Commit 401c636a0e ("kernel/hung_task.c: show all hung tasks before
panic") introduced a change in that we started to show all CPUs
backtraces when a hung task is detected _and_ the sysctl/kernel
parameter "hung_task_panic" is set.  The idea is good, because usually
when observing deadlocks (that may lead to hung tasks), the culprit is
another task holding a lock and not necessarily the task detected as
hung.

The problem with this approach is that dumping backtraces is a slightly
expensive task, specially printing that on console (and specially in
many CPU machines, as servers commonly found nowadays).  So, users that
plan to collect a kdump to investigate the hung tasks and narrow down
the deadlock definitely don't need the CPUs backtrace on dmesg/console,
which will delay the panic and pollute the log (crash tool would easily
grab all CPUs traces with 'bt -a' command).

Also, there's the reciprocal scenario: some users may be interested in
seeing the CPUs backtraces but not have the system panic when a hung
task is detected.  The current approach hence is almost as embedding a
policy in the kernel, by forcing the CPUs backtraces' dump (only) on
hung_task_panic.

This patch decouples the panic event on hung task from the CPUs
backtraces dump, by creating (and documenting) a new sysctl called
"hung_task_all_cpu_backtrace", analog to the approach taken on soft/hard
lockups, that have both a panic and an "all_cpu_backtrace" sysctl to
allow individual control.  The new mechanism for dumping the CPUs
backtraces on hung task detection respects "hung_task_warnings" by not
dumping the traces in case there's no warnings left.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200327223646.20779-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Linus Torvalds 9ff7258575 Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc updates from Eric Biederman:
 "This has four sets of changes:

   - modernize proc to support multiple private instances

   - ensure we see the exit of each process tid exactly

   - remove has_group_leader_pid

   - use pids not tasks in posix-cpu-timers lookup

  Alexey updated proc so each mount of proc uses a new superblock. This
  allows people to actually use mount options with proc with no fear of
  messing up another mount of proc. Given the kernel's internal mounts
  of proc for things like uml this was a real problem, and resulted in
  Android's hidepid mount options being ignored and introducing security
  issues.

  The rest of the changes are small cleanups and fixes that came out of
  my work to allow this change to proc. In essence it is swapping the
  pids in de_thread during exec which removes a special case the code
  had to handle. Then updating the code to stop handling that special
  case"

* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: proc_pid_ns takes super_block as an argument
  remove the no longer needed pid_alive() check in __task_pid_nr_ns()
  posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
  posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
  posix-cpu-timers: Extend rcu_read_lock removing task_struct references
  signal: Remove has_group_leader_pid
  exec: Remove BUG_ON(has_group_leader_pid)
  posix-cpu-timer:  Unify the now redundant code in lookup_task
  posix-cpu-timer: Tidy up group_leader logic in lookup_task
  proc: Ensure we see the exit of each process tid exactly once
  rculist: Add hlists_swap_heads_rcu
  proc: Use PIDTYPE_TGID in next_tgid
  Use proc_pid_ns() to get pid_namespace from the proc superblock
  proc: use named enums for better readability
  proc: use human-readable values for hidepid
  docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
  proc: add option to mount only a pids subset
  proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
  proc: allow to mount many instances of proc in one pid namespace
  proc: rename struct proc_fs_info to proc_fs_opts
2020-06-04 13:54:34 -07:00
Linus Torvalds cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Peter Zijlstra bf2c59fce4 sched/core: Fix illegal RCU from offline CPUs
In the CPU-offline process, it calls mmdrop() after idle entry and the
subsequent call to cpuhp_report_idle_dead(). Once execution passes the
call to rcu_report_dead(), RCU is ignoring the CPU, which results in
lockdep complaining when mmdrop() uses RCU from either memcg or
debugobjects below.

Fix it by cleaning up the active_mm state from BP instead. Every arch
which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
from AP. The only exception is parisc because it switches them to
&init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
but the patch will still work there because it calls mmgrab(&init_mm) in
smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().

  WARNING: suspicious RCU usage
  -----------------------------
  kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!

  other info that might help us debug this:

  RCU used illegally from offline CPU!
  Call Trace:
   dump_stack+0xf4/0x164 (unreliable)
   lockdep_rcu_suspicious+0x140/0x164
   get_work_pool+0x110/0x150
   __queue_work+0x1bc/0xca0
   queue_work_on+0x114/0x120
   css_release+0x9c/0xc0
   percpu_ref_put_many+0x204/0x230
   free_pcp_prepare+0x264/0x570
   free_unref_page+0x38/0xf0
   __mmdrop+0x21c/0x2c0
   idle_task_exit+0x170/0x1b0
   pnv_smp_cpu_kill_self+0x38/0x2e0
   cpu_die+0x48/0x64
   arch_cpu_idle_dead+0x30/0x50
   do_idle+0x2f4/0x470
   cpu_startup_entry+0x38/0x40
   start_secondary+0x7a8/0xa80
   start_secondary_resume+0x10/0x14

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
2020-04-30 20:14:41 +02:00
Valentin Schneider 36c5bdc438 sched/topology: Kill SD_LOAD_BALANCE
That flag is set unconditionally in sd_init(), and no one checks for it
anymore. Remove it.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200415210512.805-5-valentin.schneider@arm.com
2020-04-30 20:14:39 +02:00
Eric W. Biederman bbd40fc481 signal: Remove has_group_leader_pid
After the introduction of exchange_tids has_group_leader_pid is
equivalent to thread_group_leader.  After the last couple of cleanups
has_group_leader_pid has no more callers.

So remove the now unused and redundant has_group_leader_pid.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-28 16:54:25 -05:00
Daniel Borkmann 0b54142e4b Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
2020-04-28 21:23:38 +02:00
Christoph Hellwig 32927393dc sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:40 -04:00
Gustavo A. R. Silva fe946db6ca sched: topology.h: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2020-04-18 15:44:56 -05:00
Linus Torvalds d883600523 Merge branch 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Christian extended clone3 so that processes can be spawned into
   cgroups directly.

   This is not only neat in terms of semantics but also avoids grabbing
   the global cgroup_threadgroup_rwsem for migration.

 - Daniel added !root xattr support to cgroupfs.

   Userland already uses xattrs on cgroupfs for bookkeeping. This will
   allow delegated cgroups to support such usages.

 - Prateek tried to make cpuset hotplug handling synchronous but that
   led to possible deadlock scenarios. Reverted.

 - Other minor changes including release_agent_path handling cleanup.

* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1: Document the cpuset_v2_mode mount option
  Revert "cpuset: Make cpuset hotplug synchronous"
  cgroupfs: Support user xattrs
  kernfs: Add option to enable user xattrs
  kernfs: Add removed_size out param for simple_xattr_set
  kernfs: kvmalloc xattr value instead of kmalloc
  cgroup: Restructure release_agent_path handling
  selftests/cgroup: add tests for cloning into cgroups
  clone3: allow spawning processes into cgroups
  cgroup: add cgroup_may_write() helper
  cgroup: refactor fork helpers
  cgroup: add cgroup_get_from_file() helper
  cgroup: unify attach permission checking
  cpuset: Make cpuset hotplug synchronous
  cgroup.c: Use built-in RCU list checking
  kselftest/cgroup: add cgroup destruction test
  cgroup: Clean up css_set task traversal
2020-04-03 11:30:20 -07:00
Linus Torvalds 6cad420cc6 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A large amount of MM, plenty more to come.

  Subsystems affected by this patch series:
   - tools
   - kthread
   - kbuild
   - scripts
   - ocfs2
   - vfs
   - mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
         sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
         hugetlbfs, hugetlb"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
  include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
  mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
  selftests/vm: fix map_hugetlb length used for testing read and write
  mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
  mm/hugetlb.c: clean code by removing unnecessary initialization
  hugetlb_cgroup: add hugetlb_cgroup reservation docs
  hugetlb_cgroup: add hugetlb_cgroup reservation tests
  hugetlb: support file_region coalescing again
  hugetlb_cgroup: support noreserve mappings
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb: disable region_add file_region coalescing
  hugetlb_cgroup: add reservation accounting for private mappings
  mm/hugetlb_cgroup: fix hugetlb_cgroup migration
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add hugetlb_cgroup reservation counter
  hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  mm/memblock.c: remove redundant assignment to variable max_addr
  mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
  mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
  ...
2020-04-02 13:55:34 -07:00
Linus Torvalds d987ca1c6b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exec/proc updates from Eric Biederman:
 "This contains two significant pieces of work: the work to sort out
  proc_flush_task, and the work to solve a deadlock between strace and
  exec.

  Fixing proc_flush_task so that it no longer requires a persistent
  mount makes improvements to proc possible. The removal of the
  persistent mount solves an old regression that that caused the hidepid
  mount option to only work on remount not on mount. The regression was
  found and reported by the Android folks. This further allows Alexey
  Gladkov's work making proc mount options specific to an individual
  mount of proc to move forward.

  The work on exec starts solving a long standing issue with exec that
  it takes mutexes of blocking userspace applications, which makes exec
  extremely deadlock prone. For the moment this adds a second mutex with
  a narrower scope that handles all of the easy cases. Which makes the
  tricky cases easy to spot. With a little luck the code to solve those
  deadlocks will be ready by next merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
  signal: Extend exec_id to 64bits
  pidfd: Use new infrastructure to fix deadlocks in execve
  perf: Use new infrastructure to fix deadlocks in execve
  proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  proc: Use new infrastructure to fix deadlocks in execve
  kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  kernel: doc: remove outdated comment cred.c
  mm: docs: Fix a comment in process_vm_rw_core
  selftests/ptrace: add test cases for dead-locks
  exec: Fix a deadlock in strace
  exec: Add exec_update_mutex to replace cred_guard_mutex
  exec: Move exec_mmap right after de_thread in flush_old_exec
  exec: Move cleanup of posix timers on exec out of de_thread
  exec: Factor unshare_sighand out of de_thread and call it separately
  exec: Only compute current once in flush_old_exec
  pid: Improve the comment about waiting in zap_pid_ns_processes
  proc: Remove the now unnecessary internal mount of proc
  uml: Create a private mount of proc for mconsole
  uml: Don't consult current to find the proc_mnt in mconsole_proc
  proc: Use a list of inodes to flush from proc
  ...
2020-04-02 11:22:17 -07:00
Peter Xu 8b9a65fd28 mm: return faster for non-fatal signals in user mode faults
The idea comes from the upstream discussion between Linus and Andrea:

  https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/

A summary to the issue: there was a special path in handle_userfault() in
the past that we'll return a VM_FAULT_NOPAGE when we detected non-fatal
signals when waiting for userfault handling.  We did that by reacquiring
the mmap_sem before returning.  However that brings a risk in that the
vmas might have changed when we retake the mmap_sem and even we could be
holding an invalid vma structure.

This patch is a preparation of removing that special path by allowing the
page fault to return even faster if we were interrupted by a non-fatal
signal during a user-mode page fault handling routine.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160230.9598-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:29 -07:00
Peter Xu 4ef873226c mm: introduce fault_signal_pending()
For most architectures, we've got a quick path to detect fatal signal
after a handle_mm_fault().  Introduce a helper for that quick path.

It cleans the current codes a bit so we don't need to duplicate the same
check across archs.  More importantly, this will be an unified place that
we handle the signal immediately right after an interrupted page fault, so
it'll be much easier for us if we want to change the behavior of handling
signals later on for all the archs.

Note that currently only part of the archs are using this new helper,
because some archs have their own way to handle signals.  In the follow up
patches, we'll try to apply this helper to all the rest of archs.

Another note is that the "regs" parameter in the new helper is not used
yet.  It'll be used very soon.  Now we kept it in this patch only to avoid
touching all the archs again in the follow up patches.

[peterx@redhat.com: fix sparse warnings]
  Link: http://lkml.kernel.org/r/20200311145921.GD479302@xz-x1
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220155353.8676-4-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:29 -07:00
Eric W. Biederman eea9673250 exec: Add exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly
indefinite waits for userspace.  The possible indefinite waits for
userspace that I have identified are: The cred_guard_mutex is held in
PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
cred_guard_mutex is held over "get_user(futex_offset, ...")  in
exit_robust_list.  The cred_guard_mutex held over copy_strings.

The functions get_user and put_user can trigger a page fault which can
potentially wait indefinitely in the case of userfaultfd or if
userspace implements part of the page fault path.

In any of those cases the userspace process that the kernel is waiting
for might make a different system call that winds up taking the
cred_guard_mutex and result in deadlock.

Holding a mutex over any of those possibly indefinite waits for
userspace does not appear necessary.  Add exec_update_mutex that will
just cover updating the process during exec where the permissions and
the objects pointed to by the task struct may be out of sync.

The plan is to switch the users of cred_guard_mutex to
exec_update_mutex one by one.  This lets us move forward while still
being careful and not introducing any regressions.

Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:03:36 -05:00
Thara Gopinath 36a0df85d2 sched/topology: Add callback to read per CPU thermal pressure
Introduce the arch_scale_thermal_pressure() callback to retrieve per CPU thermal
pressure.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-3-thara.gopinath@linaro.org
2020-03-06 12:57:17 +01:00
Linus Torvalds ef78e5b7de Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes all over the place:

   - Fix NUMA over-balancing between lightly loaded nodes. This is
     fallout of the big load-balancer rewrite.

   - Fix the NOHZ remote loadavg update logic, which fixes anomalies
     like reported 150 loadavg on mostly idle CPUs.

   - Fix XFS performance/scalability

   - Fix throttled groups unbound task-execution bug

   - Fix PSI procfs boundary condition

   - Fix the cpu.uclamp.{min,max} cgroup configuration write checks

   - Fix DocBook annotations

   - Fix RCU annotations

   - Fix overly CPU-intensive housekeeper CPU logic loop on large CPU
     counts"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix kernel-doc warning in attach_entity_load_avg()
  sched/core: Annotate curr pointer in rq with __rcu
  sched/psi: Fix OOB write when writing 0 bytes to PSI files
  sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
  sched/fair: Prevent unlimited runtime on throttled group
  sched/nohz: Optimize get_nohz_timer_target()
  sched/uclamp: Reject negative values in cpu_uclamp_write()
  sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
  timers/nohz: Update NOHZ load in remote tick
  sched/core: Don't skip remote tick for idle CPUs
2020-02-15 12:51:22 -08:00
Christian Brauner ef2c41cf38 clone3: allow spawning processes into cgroups
This adds support for creating a process in a different cgroup than its
parent. Callers can limit and account processes and threads right from
the moment they are spawned:
- A service manager can directly spawn new services into dedicated
  cgroups.
- A process can be directly created in a frozen cgroup and will be
  frozen as well.
- The initial accounting jitter experienced by process supervisors and
  daemons is eliminated with this.
- Threaded applications or even thread implementations can choose to
  create a specific cgroup layout where each thread is spawned
  directly into a dedicated cgroup.

This feature is limited to the unified hierarchy. Callers need to pass
a directory file descriptor for the target cgroup. The caller can
choose to pass an O_PATH file descriptor. All usual migration
restrictions apply, i.e. there can be no processes in inner nodes. In
general, creating a process directly in a target cgroup adheres to all
migration restrictions.

One of the biggest advantages of this feature is that CLONE_INTO_GROUP does
not need to grab the write side of the cgroup cgroup_threadgroup_rwsem.
This global lock makes moving tasks/threads around super expensive. With
clone3() this lock is avoided.

Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:57:51 -05:00
Peter Zijlstra (Intel) ebc0f83c78 timers/nohz: Update NOHZ load in remote tick
The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz.  This can be particularly noticeable if full nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time.

Use the remote tick to ensure that full nohz cpus report their deltas
within a reasonable time.

[ swood: Added changelog and removed recheck of stopped tick. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
2020-01-28 21:36:44 +01:00
Linus Torvalds c677124e63 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "These were the main changes in this cycle:

   - More -rt motivated separation of CONFIG_PREEMPT and
     CONFIG_PREEMPTION.

   - Add more low level scheduling topology sanity checks and warnings
     to filter out nonsensical topologies that break scheduling.

   - Extend uclamp constraints to influence wakeup CPU placement

   - Make the RT scheduler more aware of asymmetric topologies and CPU
     capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y

   - Make idle CPU selection more consistent

   - Various fixes, smaller cleanups, updates and enhancements - please
     see the git log for details"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  sched/fair: Define sched_idle_cpu() only for SMP configurations
  sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  idle: fix spelling mistake "iterrupts" -> "interrupts"
  sched/fair: Remove redundant call to cpufreq_update_util()
  sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
  sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
  sched/fair: calculate delta runnable load only when it's needed
  sched/cputime: move rq parameter in irqtime_account_process_tick
  stop_machine: Make stop_cpus() static
  sched/debug: Reset watchdog on all CPUs while processing sysrq-t
  sched/core: Fix size of rq::uclamp initialization
  sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
  sched/fair: Load balance aggressively for SCHED_IDLE CPUs
  sched/fair : Improve update_sd_pick_busiest for spare capacity case
  watchdog: Remove soft_lockup_hrtimer_cnt and related code
  sched/rt: Make RT capacity-aware
  sched/fair: Make EAS wakeup placement consider uclamp restrictions
  sched/fair: Make task_fits_capacity() consider uclamp restrictions
  sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
  sched/uclamp: Make uclamp util helpers use and return UL values
  ...
2020-01-28 10:07:09 -08:00
Ming Lei 11ea68f553 genirq, sched/isolation: Isolate from handling managed interrupts
The affinity of managed interrupts is completely handled in the kernel and
cannot be changed via the /proc/irq/* interfaces from user space. As the
kernel tries to spread out interrupts evenly accross CPUs on x86 to prevent
vector exhaustion, it can happen that a managed interrupt whose affinity
mask contains both isolated and housekeeping CPUs is routed to an isolated
CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts
on the isolated CPU.

Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding
logic in the interrupt affinity selection code.

The subparameter indicates to the interrupt affinity selection logic that
it should try to avoid the above scenario.

This isolation is best effort and only effective if the automatically
assigned interrupt mask of a device queue contains isolated and
housekeeping CPUs. If housekeeping CPUs are online then such interrupts are
directed to the housekeeping CPU so that IO submitted on the housekeeping
CPU cannot disturb the isolated CPU.

If a queue's affinity mask contains only isolated CPUs then this parameter
has no effect on the interrupt routing decision, though interrupts are only
happening when tasks running on those isolated CPUs submit IO. IO submitted
on housekeeping CPUs has no influence on those queues.

If the affinity mask contains both housekeeping and isolated CPUs, but none
of the contained housekeeping CPUs is online, then the interrupt is also
routed to an isolated CPU. Interrupts are only delivered when one of the
isolated CPUs in the affinity mask submits IO. If one of the contained
housekeeping CPUs comes online, the CPU hotplug logic migrates the
interrupt automatically back to the upcoming housekeeping CPU. Depending on
the type of interrupt controller, this can require that at least one
interrupt is delivered to the isolated CPU in order to complete the
migration.

[ tglx: Removed unused parameter, added and edited comments/documentation
  	and rephrased the changelog so it contains more details. ]

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
2020-01-22 16:29:49 +01:00
Vincent Guittot a4f9a0e51b sched/fair: Remove redundant call to cpufreq_update_util()
With commit

  bef69dd878 ("sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()")

update_load_avg() has become the central point for calling cpufreq
(not including the update of blocked load). This change helps to
simplify further the number of calls to cpufreq_update_util() and to
remove last redundant ones. With update_load_avg(), we are now sure
that cpufreq_update_util() will be called after every task attachment
to a cfs_rq and especially after propagating this event down to the
util_avg of the root cfs_rq, which is the level that is used by
cpufreq governors like schedutil to set the frequency of a CPU.

The SCHED_CPUFREQ_MIGRATION flag forces an early call to cpufreq when
the migration happens in a cgroup whereas util_avg of root cfs_rq is
not yet updated and this call is duplicated with the one that happens
immediately after when the migration event reaches the root cfs_rq.
The dedicated flag SCHED_CPUFREQ_MIGRATION is now useless and can be
removed. The interface of attach_entity_load_avg() can also be
simplified accordingly.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/1579083620-24943-1-git-send-email-vincent.guittot@linaro.org
2020-01-17 10:19:22 +01:00
Rafael J. Wysocki 85572c2c4a cpufreq: Avoid leaving stale IRQ work items during CPU offline
The scheduler code calling cpufreq_update_util() may run during CPU
offline on the target CPU after the IRQ work lists have been flushed
for it, so the target CPU should be prevented from running code that
may queue up an IRQ work item on it at that point.

Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
is set for at least one cpufreq policy in the system, because that
allows the CPU going offline to run the utilization update callback
of the cpufreq governor on behalf of another (online) CPU in some
cases.

If that happens, the cpufreq governor callback may queue up an IRQ
work on the CPU running it, which is going offline, and the IRQ work
may not be flushed after that point.  Moreover, that IRQ work cannot
be flushed until the "offlining" CPU goes back online, so if any
other CPU calls irq_work_sync() to wait for the completion of that
IRQ work, it will have to wait until the "offlining" CPU is back
online and that may not happen forever.  In particular, a system-wide
deadlock may occur during CPU online as a result of that.

The failing scenario is as follows.  CPU0 is the boot CPU, so it
creates a cpufreq policy and becomes the "leader" of it
(policy->cpu).  It cannot go offline, because it is the boot CPU.
Next, other CPUs join the cpufreq policy as they go online and they
leave it when they go offline.  The last CPU to go offline, say CPU3,
may queue up an IRQ work while running the governor callback on
behalf of CPU0 after leaving the cpufreq policy because of the
dvfs_possible_from_any_cpu effect described above.  Then, CPU0 is
the only online CPU in the system and the stale IRQ work is still
queued on CPU3.  When, say, CPU1 goes back online, it will run
irq_work_sync() to wait for that IRQ work to complete and so it
will wait for CPU3 to go back online (which may never happen even
in principle), but (worse yet) CPU0 is waiting for CPU1 at that
point too and a system-wide deadlock occurs.

To address this problem notice that CPUs which cannot run cpufreq
utilization update code for themselves (for example, because they
have left the cpufreq policies that they belonged to), should also
be prevented from running that code on behalf of the other CPUs that
belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
in that case the cpufreq_update_util_data pointer of the CPU running
the code must not be NULL as well as for the CPU which is the target
of the cpufreq utilization update in progress.

Accordingly, change cpufreq_this_cpu_can_update() into a regular
function in kernel/sched/cpufreq.c (instead of a static inline in a
header file) and make it check the cpufreq_update_util_data pointer
of the local CPU if dvfs_possible_from_any_cpu is set for the target
cpufreq policy.

Also update the schedutil governor to do the
cpufreq_this_cpu_can_update() check in the non-fast-switch
case too to avoid the stale IRQ work issues.

Fixes: 99d14d0e16 ("cpufreq: Process remote callbacks from any CPU if the platform permits")
Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
Reported-by: Anson Huang <anson.huang@nxp.com>
Tested-by: Anson Huang <anson.huang@nxp.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-12-12 17:59:43 +01:00
Linus Torvalds 168829ad09 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - A comprehensive rewrite of the robust/PI futex code's exit handling
     to fix various exit races. (Thomas Gleixner et al)

   - Rework the generic REFCOUNT_FULL implementation using
     atomic_fetch_* operations so that the performance impact of the
     cmpxchg() loops is mitigated for common refcount operations.

     With these performance improvements the generic implementation of
     refcount_t should be good enough for everybody - and this got
     confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
     REFCOUNT_FULL entirely, leaving the generic implementation enabled
     unconditionally. (Will Deacon)

   - Other misc changes, fixes, cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  lkdtm: Remove references to CONFIG_REFCOUNT_FULL
  locking/refcount: Remove unused 'refcount_error_report()' function
  locking/refcount: Consolidate implementations of refcount_t
  locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
  locking/refcount: Move saturation warnings out of line
  locking/refcount: Improve performance of generic REFCOUNT_FULL code
  locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
  locking/refcount: Remove unused refcount_*_checked() variants
  locking/refcount: Ensure integer operands are treated as signed
  locking/refcount: Define constants for saturation and max refcount values
  futex: Prevent exit livelock
  futex: Provide distinct return value when owner is exiting
  futex: Add mutex around futex exit
  futex: Provide state handling for exec() as well
  futex: Sanitize exit state handling
  futex: Mark the begin of futex exit explicitly
  futex: Set task::futex_state to DEAD right after handling futex exit
  futex: Split futex_mm_release() for exit/exec
  exit/exec: Seperate mm_release()
  futex: Replace PF_EXITPIDONE with a state
  ...
2019-11-26 16:02:40 -08:00
Thomas Gleixner 4610ba7ad8 exit/exec: Seperate mm_release()
mm_release() contains the futex exit handling. mm_release() is called from
do_exit()->exit_mm() and from exec()->exec_mm().

In the exit_mm() case PF_EXITING and the futex state is updated. In the
exec_mm() case these states are not touched.

As the futex exit code needs further protections against exit races, this
needs to be split into two functions.

Preparatory only, no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
2019-11-20 09:40:08 +01:00
Adrian Reber 49cb2fc42c fork: extend clone3() to support setting a PID
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-11-15 23:49:22 +01:00
Linus Torvalds 9c5efe9ae7 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:

 - Apply a number of membarrier related fixes and cleanups, which fixes
   a use-after-free race in the membarrier code

 - Introduce proper RCU protection for tasks on the runqueue - to get
   rid of the subtle task_rcu_dereference() interface that was easy to
   get wrong

 - Misc fixes, but also an EAS speedup

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Avoid redundant EAS calculation
  sched/core: Remove double update_max_interval() call on CPU startup
  sched/core: Fix preempt_schedule() interrupt return comment
  sched/fair: Fix -Wunused-but-set-variable warnings
  sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
  sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
  sched/membarrier: Skip IPIs when mm->mm_users == 1
  selftests, sched/membarrier: Add multi-threaded test
  sched/membarrier: Fix p->mm->membarrier_state racy load
  sched/membarrier: Call sync_core only before usermode for same mm
  sched/membarrier: Remove redundant check
  sched/membarrier: Fix private expedited registration check
  tasks, sched/core: RCUify the assignment of rq->curr
  tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
  tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
  tasks: Add a count of task RCU users
  sched/core: Convert vcpu_is_preempted() from macro to an inline function
  sched/fair: Remove unused cfs_rq_clock_task() function
2019-09-28 12:39:07 -07:00
Mathieu Desnoyers 227a4aadc7 sched/membarrier: Fix p->mm->membarrier_state racy load
The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clear the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Mathieu Desnoyers 2840cf02fa sched/membarrier: Call sync_core only before usermode for same mm
When the prev and next task's mm change, switch_mm() provides the core
serializing guarantees before returning to usermode. The only case
where an explicit core serialization is needed is when the scheduler
keeps the same mm for prev and next.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-4-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Eric W. Biederman 154abafc68 tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch().

In particular now that there tasks exiting the runqueue exprience
a RCU grace period none of the work performed by task_rcu_dereference()
excpet the rcu_dereference() is necessary so replace task_rcu_dereference()
with rcu_dereference().

Remove the code in rcuwait_wait_event() that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience a RCU grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up() as it is no longer relevant.

Ref: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf86 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Eric W. Biederman 3fbd7ee285 tasks: Add a count of task RCU users
Add a count of the number of RCU users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference(), and just use rcu_dereference().

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Linus Torvalds 7f2444d38f Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading of for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...
2019-09-17 12:35:15 -07:00
Thomas Gleixner 244d49e306 posix-cpu-timers: Move state tracking to struct posix_cputimers
Put it where it belongs and clean up the ifdeffery in fork completely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821192922.743229404@linutronix.de
2019-08-28 11:50:42 +02:00
Thomas Gleixner b7be4ef136 posix-cpu-timers: Switch thread group sampling to array
That allows more simplifications in various places.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.988426956@linutronix.de
2019-08-28 11:50:39 +02:00
Thomas Gleixner 11b8462f7e posix-cpu-timers: Provide array based access to expiry cache
Using struct task_cputime for the expiry cache is a pretty odd choice and
comes with magic defines to rename the fields for usage in the expiry
cache.

struct task_cputime is basically a u64 array with 3 members, but it has
distinct members.

The expiry cache content is different than the content of task_cputime
because

  expiry[PROF]  = task_cputime.stime + task_cputime.utime
  expiry[VIRT]  = task_cputime.utime
  expiry[SCHED] = task_cputime.sum_exec_runtime

So there is no direct mapping between task_cputime and the expiry cache and
the #define based remapping is just a horrible hack.

Having the expiry cache array based allows further simplification of the
expiry code.

To avoid an all in one cleanup which is hard to review add a temporary
anonymous union into struct task_cputime which allows array based access to
it. That requires to reorder the members. Add a build time sanity check to
validate that the members are at the same place.

The union and the build time checks will be removed after conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.105793824@linutronix.de
2019-08-28 11:50:35 +02:00
Thomas Gleixner 3a245c0f11 posix-cpu-timers: Move expiry cache into struct posix_cputimers
The expiry cache belongs into the posix_cputimers container where the other
cpu timers information is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de
2019-08-28 11:50:35 +02:00
Thomas Gleixner 9eacb5c7e6 sched: Move struct task_cputime to types.h
For upcoming posix-timer changes to avoid include recursion hell.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821192920.909530418@linutronix.de
2019-08-28 11:50:34 +02:00
Thomas Gleixner 2b69942f90 posix-cpu-timers: Create a container struct
Per task/process data of posix CPU timers is all over the place which
makes the code hard to follow and requires ifdeffery.

Create a container to hold all this information in one place, so data is
consolidated and the ifdeffery can be confined to the posix timer header
file and removed from places like fork.

As a first step, move the cpu_timers list head array into the new struct
and clean up the initializers and simplify fork. The remaining #ifdef in
fork will be removed later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.819418976@linutronix.de
2019-08-28 11:50:33 +02:00
Thomas Gleixner c506bef424 posix-cpu-timers: Rename thread_group_cputimer() and make it static
thread_group_cputimer() is a complete misnomer. The function does two things:

 - For arming process wide timers it makes sure that the atomic time
   storage is up to date. If no cpu timer is armed yet, then the atomic
   time storage is not updated by the scheduler for performance reasons.

   In that case a full summing up of all threads needs to be done and the
   update needs to be enabled.

- Samples the current time into the caller supplied storage.

Rename it to thread_group_start_cputime(), make it static and fixup the
callsite.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.869350319@linutronix.de
2019-08-28 11:50:27 +02:00
Thomas Gleixner 19298fbf45 posix-cpu-timers: Provide quick sample function for itimer
get_itimer() needs a sample of the current thread group cputime. It invokes
thread_group_cputimer() - which is a misnomer. That function also starts
eventually the group cputime accouting which is bogus because the
accounting is already active when a timer is armed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.599658199@linutronix.de
2019-08-28 11:50:26 +02:00
Mathieu Poirier f9a25f776d cpusets: Rebuild root domain deadline accounting information
When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[ Various additional modifications. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:01 +02:00
Mathieu Poirier c22645f4c8 sched/topology: Add partition_sched_domains_locked()
Introduce the partition_sched_domains_locked() function by taking
the mutex locking code out of the original function.  That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

No change of functionality is introduced by this patch.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-2-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:57 +02:00
Matthew Wilcox (Oracle) 7b3c92b85a sched/core: Convert get_task_struct() to return the task
Returning the pointer that was passed in allows us to write
slightly more idiomatic code.  Convert a few users.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:54 +02:00
Jann Horn 16d51a590a sched/fair: Don't free p->numa_faults with concurrent readers
When going through execve(), zero out the NUMA fault statistics instead of
freeing them.

During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.

Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:37:04 +02:00
Linus Torvalds 07ab9d5bc5 Mostly bugfixes, but also:
- s390 support for KVM selftests
 - LAPIC timer offloading to housekeeping CPUs
 - Extend an s390 optimization for overcommitted hosts to all architectures
 - Debugging cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdMr1FAAoJEL/70l94x66DvIkH/iVuUX9jO1NoQ7qhxeo04MnT
 GP9mX3XnWoI/iN0zAIRfQSP2/9a6+KblgdiziABhju58j5dCfAZGb5793TQppweb
 3ubl11vy7YkzaXJ0b35K7CFhOU9oSlHHGyi5Uh+yyje5qWNxwmHpizxjynbFTKb6
 +/S7O2Ua1VrAVvx0i0IRtwanIK/jF4dStVButgVaVdUva3zLaQmeI71iaJl9ddXY
 bh50xoYua5Ek6+ENi+nwCNVy4OF152AwDbXlxrU0QbeA1B888Qio7nIqb3bwwPpZ
 /8wMVvPzQgL7RmgtY5E5Z4cCYuu7mK8wgGxhuk3oszlVwZJ5rmnaYwGEl4x1s7o=
 =giag
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "Mostly bugfixes, but also:

   - s390 support for KVM selftests

   - LAPIC timer offloading to housekeeping CPUs

   - Extend an s390 optimization for overcommitted hosts to all
     architectures

   - Debugging cleanups and improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: x86: Add fixed counters to PMU filter
  KVM: nVMX: do not use dangling shadow VMCS after guest reset
  KVM: VMX: dump VMCS on failed entry
  KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
  KVM: s390: Use kvm_vcpu_wake_up in kvm_s390_vcpu_wakeup
  KVM: Boost vCPUs that are delivering interrupts
  KVM: selftests: Remove superfluous define from vmx.c
  KVM: SVM: Fix detection of AMD Errata 1096
  KVM: LAPIC: Inject timer interrupt via posted interrupt
  KVM: LAPIC: Make lapic timer unpinned
  KVM: x86/vPMU: reset pmc->counter to 0 for pmu fixed_counters
  KVM: nVMX: Ignore segment base for VMX memory operand when segment not FS or GS
  kvm: x86: ioapic and apic debug macros cleanup
  kvm: x86: some tsc debug cleanup
  kvm: vmx: fix coccinelle warnings
  x86: kvm: avoid constant-conversion warning
  x86: kvm: avoid -Wsometimes-uninitized warning
  KVM: x86: expose AVX512_BF16 feature to guest
  KVM: selftests: enable pgste option for the linker on s390
  KVM: selftests: Move kvm_create_max_vcpus test to generic code
  ...
2019-07-20 10:20:27 -07:00
Wanpeng Li 0c5f81dad4 KVM: LAPIC: Inject timer interrupt via posted interrupt
Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside.  There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits.  This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.

In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.

The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially.  Without patch

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )

While with patch:

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-20 09:00:40 +02:00
Linus Torvalds 57a8ec387e Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "VM:
   - z3fold fixes and enhancements by Henry Burns and Vitaly Wool

   - more accurate reclaimed slab caches calculations by Yafang Shao

   - fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
     Christoph Hellwig

   - !CONFIG_MMU fixes by Christoph Hellwig

   - new novmcoredd parameter to omit device dumps from vmcore, by
     Kairui Song

   - new test_meminit module for testing heap and pagealloc
     initialization, by Alexander Potapenko

   - ioremap improvements for huge mappings, by Anshuman Khandual

   - generalize kprobe page fault handling, by Anshuman Khandual

   - device-dax hotplug fixes and improvements, by Pavel Tatashin

   - enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V

   - add pte_devmap() support for arm64, by Robin Murphy

   - unify locked_vm accounting with a helper, by Daniel Jordan

   - several misc fixes

  core/lib:
   - new typeof_member() macro including some users, by Alexey Dobriyan

   - make BIT() and GENMASK() available in asm, by Masahiro Yamada

   - changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
     code generation, by Alexey Dobriyan

   - rbtree code size optimizations, by Michel Lespinasse

   - convert struct pid count to refcount_t, by Joel Fernandes

  get_maintainer.pl:
   - add --no-moderated switch to skip moderated ML's, by Joe Perches

  misc:
   - ptrace PTRACE_GET_SYSCALL_INFO interface

   - coda updates

   - gdb scripts, various"

[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
  fs/select.c: use struct_size() in kmalloc()
  mm: add account_locked_vm utility function
  arm64: mm: implement pte_devmap support
  mm: introduce ARCH_HAS_PTE_DEVMAP
  mm: clean up is_device_*_page() definitions
  mm/mmap: move common defines to mman-common.h
  mm: move MAP_SYNC to asm-generic/mman-common.h
  device-dax: "Hotremove" persistent memory that is used like normal RAM
  mm/hotplug: make remove_memory() interface usable
  device-dax: fix memory and resource leak if hotplug fails
  include/linux/lz4.h: fix spelling and copy-paste errors in documentation
  ipc/mqueue.c: only perform resource calculation if user valid
  include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
  scripts/gdb: add helpers to find and list devices
  scripts/gdb: add lx-genpd-summary command
  drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
  kernel/pid.c: convert struct pid count to refcount_t
  drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
  select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
  select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
  ...
2019-07-17 08:58:04 -07:00
Oleg Nesterov b772434be0 signal: simplify set_user_sigmask/restore_user_sigmask
task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
syscall paths.  This means that set_user_sigmask() can save ->blocked in
->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
was modified.

This way the callers do not need 2 sigset_t's passed to set/restore and
restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Alexey Dobriyan e2d9018e81 signal: reorder struct sighand_struct
struct sighand_struct::siglock field is the most used field by far, put
it first so that is can be accessed without IMM8 or IMM32 encoding on
x86_64.

Space savings (on trimmed down VM test config):

add/remove: 0/0 grow/shrink: 8/68 up/down: 49/-1147 (-1098)
Function                                     old     new   delta
complete_signal                              512     533     +21
do_signalfd4                                 335     346     +11
__cleanup_sighand                             39      43      +4
unhandled_signal                              49      52      +3
prepare_signal                               692     695      +3
ignore_signals                                37      40      +3
__tty_check_change.part                      248     251      +3
ksys_unshare                                 780     781      +1
sighand_ctor                                  33      29      -4
ptrace_trap_notify                            60      56      -4
sigqueue_free                                 98      91      -7
run_posix_cpu_timers                        1389    1382      -7
proc_pid_status                             2448    2441      -7
proc_pid_limits                              344     337      -7
posix_cpu_timer_rearm                        222     215      -7
posix_cpu_timer_get                          249     242      -7
kill_pid_info_as_cred                        243     236      -7
freeze_task                                  197     190      -7
flush_old_exec                              1873    1866      -7
do_task_stat                                3363    3356      -7
do_send_sig_info                              98      91      -7
do_group_exit                                147     140      -7
init_sighand                                2088    2080      -8
do_notify_parent_cldstop                     399     391      -8
signalfd_cleanup                              50      41      -9
do_notify_parent                             557     545     -12
__send_signal                               1029    1017     -12
ptrace_stop                                  590     577     -13
get_signal                                  1576    1563     -13
__lock_task_sighand                          112      99     -13
zap_pid_ns_processes                         391     377     -14
update_rlimit_cpu                             78      64     -14
tty_signal_session_leader                    413     399     -14
tty_open_proc_set_tty                        149     135     -14
tty_jobctrl_ioctl                            936     922     -14
set_cpu_itimer                               339     325     -14
ptrace_resume                                226     212     -14
ptrace_notify                                110      96     -14
proc_clear_tty                                81      67     -14
posix_cpu_timer_del                          229     215     -14
kernel_sigaction                             156     142     -14
getrusage                                    977     963     -14
get_current_tty                               98      84     -14
force_sigsegv                                 89      75     -14
force_sig_info                               205     191     -14
flush_signals                                 83      69     -14
flush_itimer_signals                          85      71     -14
do_timer_create                             1120    1106     -14
do_sigpending                                 88      74     -14
do_signal_stop                               537     523     -14
cgroup_init_fs_context                       644     630     -14
call_usermodehelper_exec_async               402     388     -14
calculate_sigpending                          58      44     -14
__x64_sys_timer_delete                       248     234     -14
__set_current_blocked                         80      66     -14
__ptrace_unlink                              310     296     -14
__ptrace_detach.part                         187     173     -14
send_sigqueue                                362     347     -15
get_cpu_itimer                               214     199     -15
signalfd_poll                                175     159     -16
dequeue_signal                               340     323     -17
do_getitimer                                 192     174     -18
release_task.part                           1060    1040     -20
ptrace_peek_siginfo                          408     387     -21
posix_cpu_timer_set                          827     806     -21
exit_signals                                 437     416     -21
do_sigaction                                 541     520     -21
do_setitimer                                 485     464     -21
disassociate_ctty.part                       545     517     -28
__x64_sys_rt_sigtimedwait                    721     679     -42
__x64_sys_ptrace                            1319    1277     -42
ptrace_request                              1828    1782     -46
signalfd_read                                507     459     -48
wait_consider_task                          2027    1971     -56
do_coredump                                 3672    3616     -56
copy_process.part                           6936    6871     -65

Link: http://lkml.kernel.org/r/20190503192800.GA18004@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Dmitry V. Levin 028b6e8a89
clone: fix CLONE_PIDFD support
The introduction of clone3 syscall accidentally broke CLONE_PIDFD
support in traditional clone syscall on compat x86 and those
architectures that use do_fork to implement clone syscall.

This bug was found by strace test suite.

Link: https://strace.io/logs/strace/2019-07-12
Fixes: 7f192e3cd3 ("fork: add clone3")
Bisected-and-tested-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Link: https://lore.kernel.org/r/20190714162047.GB10389@altlinux.org
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-14 20:36:12 +02:00
Linus Torvalds 8f6ccf6159 clone3-v5.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSMhhgAKCRCRxhvAZXjc
 or7kAP9VzDcQaK/WoDd2ezh2C7Wh5hNy9z/qJVCa6Tb+N+g1UgEAxbhFUg55uGOA
 JNf7fGar5JF5hBMIXR+NqOi1/sb4swg=
 =ELWo
 -----END PGP SIGNATURE-----

Merge tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull clone3 system call from Christian Brauner:
 "This adds the clone3 syscall which is an extensible successor to clone
  after we snagged the last flag with CLONE_PIDFD during the 5.2 merge
  window for clone(). It cleanly supports all of the flags from clone()
  and thus all legacy workloads.

  There are few user visible differences between clone3 and clone.
  First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse
  this flag. Second, the CSIGNAL flag is deprecated and will cause
  EINVAL to be reported. It is superseeded by a dedicated "exit_signal"
  argument in struct clone_args thus freeing up even more flags. And
  third, clone3 gives CLONE_PIDFD a dedicated return argument in struct
  clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr
  argument.

  The clone3 uapi is designed to be easy to handle on 32- and 64 bit:

    /* uapi */
    struct clone_args {
            __aligned_u64 flags;
            __aligned_u64 pidfd;
            __aligned_u64 child_tid;
            __aligned_u64 parent_tid;
            __aligned_u64 exit_signal;
            __aligned_u64 stack;
            __aligned_u64 stack_size;
            __aligned_u64 tls;
    };

  and a separate kernel struct is used that uses proper kernel typing:

    /* kernel internal */
    struct kernel_clone_args {
            u64 flags;
            int __user *pidfd;
            int __user *child_tid;
            int __user *parent_tid;
            int exit_signal;
            unsigned long stack;
            unsigned long stack_size;
            unsigned long tls;
    };

  The system call comes with a size argument which enables the kernel to
  detect what version of clone_args userspace is passing in. clone3
  validates that any additional bytes a given kernel does not know about
  are set to zero and that the size never exceeds a page.

  A nice feature is that this patchset allowed us to cleanup and
  simplify various core kernel codepaths in kernel/fork.c by making the
  internal _do_fork() function take struct kernel_clone_args even for
  legacy clone().

  This patch also unblocks the time namespace patchset which wants to
  introduce a new CLONE_TIMENS flag.

  Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and
  xtensa. These were the architectures that did not require special
  massaging.

  Other architectures treat fork-like system calls individually and
  after some back and forth neither Arnd nor I felt confident that we
  dared to add clone3 unconditionally to all architectures. We agreed to
  leave this up to individual architecture maintainers. This is why
  there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3
  which any architecture can set once it has implemented support for
  clone3. The patch also adds a cond_syscall(clone3) for architectures
  such as nios2 or h8300 that generate their syscall table by simply
  including asm-generic/unistd.h. The hope is to get rid of
  __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon"

* tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: handle arches who do not yet define clone3
  arch: wire-up clone3() syscall
  fork: add clone3
2019-07-11 10:09:44 -07:00
Linus Torvalds 5ad18b2e60 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman:
 "A source of error over the years has been that force_sig has taken a
  task parameter when it is only safe to use force_sig with the current
  task.

  The force_sig function is built for delivering synchronous signals
  such as SIGSEGV where the userspace application caused a synchronous
  fault (such as a page fault) and the kernel responded with a signal.

  Because the name force_sig does not make this clear, and because the
  force_sig takes a task parameter the function force_sig has been
  abused for sending other kinds of signals over the years. Slowly those
  have been fixed when the oopses have been tracked down.

  This set of changes fixes the remaining abusers of force_sig and
  carefully rips out the task parameter from force_sig and friends
  making this kind of error almost impossible in the future"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
  signal: Remove the signal number and task parameters from force_sig_info
  signal: Factor force_sig_info_to_task out of force_sig_info
  signal: Generate the siginfo in force_sig
  signal: Move the computation of force into send_signal and correct it.
  signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
  signal: Remove the task parameter from force_sig_fault
  signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
  signal: Explicitly call force_sig_fault on current
  signal/unicore32: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from ptrace_break
  signal/nds32: Remove tsk parameter from send_sigtrap
  signal/riscv: Remove tsk parameter from do_trap
  signal/sh: Remove tsk parameter from force_sig_info_fault
  signal/um: Remove task parameter from send_sigtrap
  signal/x86: Remove task parameter from send_sigtrap
  signal: Remove task parameter from force_sig_mceerr
  signal: Remove task parameter from force_sig
  signal: Remove task parameter from force_sigsegv
  ...
2019-07-08 21:48:15 -07:00
Linus Torvalds c84ca912b0 Keyrings namespacing
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRU89Pu3V2unywtrAQIdBBAAmMBsrfv+LUN4Vru/D6KdUO4zdYGcNK6m
 S56bcNfP6oIDEj6HrNNnzKkWIZpdZ61Odv1zle96+v4WZ/6rnLCTpcsdaFNTzaoO
 YT2jk7jplss0ImrMv1DSoykGqO3f0ThMIpGCxHKZADGSu0HMbjSEh+zLPV4BaMtT
 BVuF7P3eZtDRLdDtMtYcgvf5UlbdoBEY8w1FUjReQx8hKGxVopGmCo5vAeiY8W9S
 ybFSZhPS5ka33ynVrLJH2dqDo5A8pDhY8I4bdlcxmNtRhnPCYZnuvTqeAzyUKKdI
 YN9zJeDu1yHs9mi8dp45NPJiKy6xLzWmUwqH8AvR8MWEkrwzqbzNZCEHZ41j74hO
 YZWI0JXi72cboszFvOwqJERvITKxrQQyVQLPRQE2vVbG0bIZPl8i7oslFVhitsl+
 evWqHb4lXY91rI9cC6JIXR1OiUjp68zXPv7DAnxv08O+PGcioU1IeOvPivx8QSx4
 5aUeCkYIIAti/GISzv7xvcYh8mfO76kBjZSB35fX+R9DkeQpxsHmmpWe+UCykzWn
 EwhHQn86+VeBFP6RAXp8CgNCLbrwkEhjzXQl/70s1eYbwvK81VcpDAQ6+cjpf4Hb
 QUmrUJ9iE0wCNl7oqvJZoJvWVGlArvPmzpkTJk3N070X2R0T7x1WCsMlPDMJGhQ2
 fVHvA3QdgWs=
 =Push
 -----END PGP SIGNATURE-----

Merge tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyring namespacing from David Howells:
 "These patches help make keys and keyrings more namespace aware.

  Firstly some miscellaneous patches to make the process easier:

   - Simplify key index_key handling so that the word-sized chunks
     assoc_array requires don't have to be shifted about, making it
     easier to add more bits into the key.

   - Cache the hash value in the key so that we don't have to calculate
     on every key we examine during a search (it involves a bunch of
     multiplications).

   - Allow keying_search() to search non-recursively.

  Then the main patches:

   - Make it so that keyring names are per-user_namespace from the point
     of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
     accessible cross-user_namespace.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

   - Move the user and user-session keyrings to the user_namespace
     rather than the user_struct. This prevents them propagating
     directly across user_namespaces boundaries (ie. the KEY_SPEC_*
     flags will only pick from the current user_namespace).

   - Make it possible to include the target namespace in which the key
     shall operate in the index_key. This will allow the possibility of
     multiple keys with the same description, but different target
     domains to be held in the same keyring.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

   - Make it so that keys are implicitly invalidated by removal of a
     domain tag, causing them to be garbage collected.

   - Institute a network namespace domain tag that allows keys to be
     differentiated by the network namespace in which they operate. New
     keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
     the network domain in force when they are created.

   - Make it so that the desired network namespace can be handed down
     into the request_key() mechanism. This allows AFS, NFS, etc. to
     request keys specific to the network namespace of the superblock.

     This also means that the keys in the DNS record cache are
     thenceforth namespaced, provided network filesystems pass the
     appropriate network namespace down into dns_query().

     For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
     cache keyrings, such as idmapper keyrings, also need to set the
     domain tag - for which they need access to the network namespace of
     the superblock"

* tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Pass the network namespace into request_key mechanism
  keys: Network namespace domain tag
  keys: Garbage collect keys for which the domain has been removed
  keys: Include target namespace in match criteria
  keys: Move the user and user-session keyrings to the user_namespace
  keys: Namespace keyring names
  keys: Add a 'recurse' flag for keyring searches
  keys: Cache the hash value to avoid lots of recalculation
  keys: Simplify key description management
2019-07-08 19:36:47 -07:00
Linus Torvalds dad1c12ed8 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Remove the unused per rq load array and all its infrastructure, by
   Dietmar Eggemann.

 - Add utilization clamping support by Patrick Bellasi. This is a
   refinement of the energy aware scheduling framework with support for
   boosting of interactive and capping of background workloads: to make
   sure critical GUI threads get maximum frequency ASAP, and to make
   sure background processing doesn't unnecessarily move to cpufreq
   governor to higher frequencies and less energy efficient CPU modes.

 - Add the bare minimum of tracepoints required for LISA EAS regression
   testing, by Qais Yousef - which allows automated testing of various
   power management features, including energy aware scheduling.

 - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
   kernel used to modify the scheduler's CPU affinity logic such as
   migrate_disable() - introduce the task->cpus_ptr value instead of
   taking the address of &task->cpus_allowed directly - by Sebastian
   Andrzej Siewior.

 - Misc optimizations, fixes, cleanups and small enhancements - see the
   Git log for details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched/uclamp: Add uclamp support to energy_compute()
  sched/uclamp: Add uclamp_util_with()
  sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
  sched/uclamp: Set default clamps for RT tasks
  sched/uclamp: Reset uclamp values on RESET_ON_FORK
  sched/uclamp: Extend sched_setattr() to support utilization clamping
  sched/core: Allow sched_setattr() to use the current policy
  sched/uclamp: Add system default clamps
  sched/uclamp: Enforce last task's UCLAMP_MAX
  sched/uclamp: Add bucket local max tracking
  sched/uclamp: Add CPU's clamp buckets refcounting
  sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
  sched/debug: Export the newly added tracepoints
  sched/debug: Add sched_overutilized tracepoint
  sched/debug: Add new tracepoint to track PELT at se level
  sched/debug: Add new tracepoints to track PELT at rq level
  sched/debug: Add a new sched_trace_*() helper functions
  sched/autogroup: Make autogroup_path() always available
  sched/wait: Deduplicate code with do-while
  sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
  ...
2019-07-08 16:39:53 -07:00
David Howells 0f44e4d976 keys: Move the user and user-session keyrings to the user_namespace
Move the user and user-session keyrings to the user_namespace struct rather
than pinning them from the user_struct struct.  This prevents these
keyrings from propagating across user-namespaces boundaries with regard to
the KEY_SPEC_* flags, thereby making them more useful in a containerised
environment.

The issue is that a single user_struct may be represent UIDs in several
different namespaces.

The way the patch does this is by attaching a 'register keyring' in each
user_namespace and then sticking the user and user-session keyrings into
that.  It can then be searched to retrieve them.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jann Horn <jannh@google.com>
2019-06-26 21:02:32 +01:00
Patrick Bellasi e8f14172c6 sched/uclamp: Add system default clamps
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.

Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.

Add a privileged interface to define a system default configuration via:

  /proc/sys/kernel/sched_uclamp_util_{min,max}

which works as an unconditional clamp range restriction for all tasks.

With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.

Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.

The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.

An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:45 +02:00
Patrick Bellasi 69842cba9a sched/uclamp: Add CPU's clamp buckets refcounting
Utilization clamping allows to clamp the CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE tasks enqueued on
that CPU and refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. If the set of active clamp buckets
changes for a CPU a new "aggregated" clamp value is computed for that
CPU. This is because each clamp bucket enforces a different utilization
clamp value.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).

A task has:
   task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

A runqueue has:
   rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many RUNNABLE tasks on that CPU refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).
It also has a:
   rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
needed to find a new MAX aggregated clamp value for a clamp_id. This
operation is required only when it's dequeued the last task of a clamp
bucket tracking the current MAX aggregated clamp value. In this case,
the CPU is either entering IDLE or going to schedule a less boosted or
more clamped task.
The expected number of different clamp values configured at build time
is small enough to fit the full unordered array into a single cache
line, for configurations of up to 7 buckets.

Add to struct rq the basic data structures required to refcount the
number of RUNNABLE tasks for each clamp bucket. Add also the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:44 +02:00
Vincent Guittot 8ec59c0f5f sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
The 'struct sched_domain *sd' parameter to arch_scale_cpu_capacity() is
unused since commit:

  765d0af19f ("sched/topology: Remove the ::smt_gain field from 'struct sched_domain'")

Remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: gregkh@linuxfoundation.org
Cc: linux@armlinux.org.uk
Cc: quentin.perret@arm.com
Cc: rafael@kernel.org
Link: https://lkml.kernel.org/r/1560783617-5827-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:39 +02:00
Waiman Long 00f3c5a3df locking/rwsem: Always release wait_lock before waking up tasks
With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time especially when a large number of readers are to be woken up.

Remediate this situation by releasing the wait_lock before waking
up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
function is also modified to read the rwsem count directly to avoid
stale count value.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-9-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:00 +02:00
Ingo Molnar 23da766ab1 Linux 5.2-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Gj1MeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGctkH/0At3+SQPY2JJSy8
 i6+TDeytFx9OggeGLPHChRfehkAlvMb/kd34QHnuEvDqUuCAMU6HZQJFKoK9mvFI
 sDJVayPGDSqpm+iv8qLpMBPShiCXYVnGZeVfOdv36jUswL0k6wHV1pz4avFkDeZa
 1F4pmI6O2XRkNTYQawbUaFkAngWUCBG9ECLnHJnuIY6ohShBvjI4+E2JUaht+8gO
 M2h2b9ieddWmjxV3LTKgsK1v+347RljxdZTWnJ62SCDSEVZvsgSA9W2wnebVhBkJ
 drSmrFLxNiM+W45mkbUFmQixRSmjv++oRR096fxAnodBxMw0TDxE1RiMQWE6rVvG
 N6MC6xA=
 =+B0P
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc5' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:12:27 +02:00
Andrea Arcangeli 59ea6d06cf coredump: fix race condition between collapse_huge_page() and core dumping
When fixing the race conditions between the coredump and the mmap_sem
holders outside the context of the process, we focused on
mmget_not_zero()/get_task_mm() callers in 04f5866e41 ("coredump: fix
race condition between mmget_not_zero()/get_task_mm() and core
dumping"), but those aren't the only cases where the mmap_sem can be
taken outside of the context of the process as Michal Hocko noticed
while backporting that commit to older -stable kernels.

If mmgrab() is called in the context of the process, but then the
mm_count reference is transferred outside the context of the process,
that can also be a problem if the mmap_sem has to be taken for writing
through that mm_count reference.

khugepaged registration calls mmgrab() in the context of the process,
but the mmap_sem for writing is taken later in the context of the
khugepaged kernel thread.

collapse_huge_page() after taking the mmap_sem for writing doesn't
modify any vma, so it's not obvious that it could cause a problem to the
coredump, but it happens to modify the pmd in a way that breaks an
invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
needs the mmap_sem for writing just to block concurrent page faults that
call pmd_trans_huge_lock().

Specifically the invariant that "!pmd_trans_huge()" cannot become a
"pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.

The coredump will call __get_user_pages() without mmap_sem for reading,
which eventually can invoke a lockless page fault which will need a
functional pmd_trans_huge_lock().

So collapse_huge_page() needs to use mmget_still_valid() to check it's
not running concurrently with the coredump...  as long as the coredump
can invoke page faults without holding the mmap_sem for reading.

This has "Fixes: khugepaged" to facilitate backporting, but in my view
it's more a bug in the coredump code that will eventually have to be
rewritten to stop invoking page faults without the mmap_sem for reading.
So the long term plan is still to drop all mmget_still_valid().

Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
Fixes: ba76149f47 ("thp: khugepaged")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Christian Brauner 7f192e3cd3
fork: add clone3
This adds the clone3 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.

The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
  in clone() into a dedicated struct clone_args
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
  - use a separate kernel internal struct kernel_clone_args that is
    properly typed according to current kernel conventions in fork.c and is
    different from  the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
  syscalls such as fork(), vfork(), clone(), and clone3() behave identical
  (Arnd suggested, that we can probably also port do_fork() itself in a
   separate patchset.)
- ease of transition for userspace from clone() to clone3()
  This very much means that we do *not* remove functionality that userspace
  currently relies on as the latter is a good way of creating a syscall
  that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible

In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:

/* uapi */
struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

/* kernel internal */
struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
};

long sys_clone3(struct clone_args __user *uargs, size_t size)

clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
  It is superseeded by a dedicated "exit_signal" argument in struct
  clone_args freeing up space for additional flags.
  This is based on a suggestion from Andrei and Linus (cf. [9] and [10])

/* References */
[1]: b3e5838252
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-06-09 09:29:28 +02:00
Dietmar Eggemann 0e1fef63d9 sched/core: Remove sd->*_idx
The sched domain per rq load index files also disappear from the
/proc/sys/kernel/sched_domain/cpuX/domainY directories.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-6-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:40 +02:00
Dietmar Eggemann 5e83eafbfd sched/fair: Remove the rq->cpu_load[] update code
With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx]
any more.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:38 +02:00
Eric W. Biederman a89e9b8abf signal: Remove the signal number and task parameters from force_sig_info
force_sig_info always delivers to the current task and the signal
parameter always matches info.si_signo.  So remove those parameters to
make it a simpler less error prone interface, and to make it clear
that none of the callers are doing anything clever.

This guarantees that force_sig_info will not grow any new buggy
callers that attempt to call force_sig on a non-current task, or that
pass an signal number that does not match info.si_signo.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:44 -05:00
Eric W. Biederman 2e1661d267 signal: Remove the task parameter from force_sig_fault
As synchronous exceptions really only make sense against the current
task (otherwise how are you synchronous) remove the task parameter
from from force_sig_fault to make it explicit that is what is going
on.

The two known exceptions that deliver a synchronous exception to a
stopped ptraced task have already been changed to
force_sig_fault_to_task.

The callers have been changed with the following emacs regular expression
(with obvious variations on the architectures that take more arguments)
to avoid typos:

force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
->
force_sig_fault(\1,\2,\3)

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:43 -05:00
Eric W. Biederman 91ca180dbd signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
In preparation for removing the task parameter from force_sig_fault
introduce force_sig_fault_to_task and use it for the two cases where
it matters.

On mips force_fcr31_sig calls force_sig_fault and is called on either
the current task, or a task that is suspended and is being switched to
by the scheduler.  This is safe because the task being switched to by
the scheduler is guaranteed to be suspended.  This ensures that
task->sighand is stable while the signal is delivered to it.

On parisc user_enable_single_step calls force_sig_fault and is in turn
called by ptrace_request.  The function ptrace_request always calls
user_enable_single_step on a child that is stopped for tracing.  The
child being traced and not reaped ensures that child->sighand is not
NULL, and that the child will not change child->sighand.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:43 -05:00
Eric W. Biederman f8eac9011b signal: Remove task parameter from force_sig_mceerr
All of the callers pass current into force_sig_mceer so remove the
task parameter to make this obvious.

This also makes it clear that force_sig_mceerr passes current
into force_sig_info.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Eric W. Biederman 3cf5d076fb signal: Remove task parameter from force_sig
All of the remaining callers pass current into force_sig so
remove the task parameter to make this obvious and to make
misuse more difficult in the future.

This also makes it clear force_sig passes current into force_sig_info.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Eric W. Biederman cb44c9a0ab signal: Remove task parameter from force_sigsegv
The function force_sigsegv is always called on the current task
so passing in current is redundant and not passing in current
makes this fact obvious.

This also makes it clear force_sigsegv always calls force_sig
on the current task.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Eric W. Biederman 70f1b0d34b signal/usb: Replace kill_pid_info_as_cred with kill_pid_usb_asyncio
The usb support for asyncio encoded one of it's values in the wrong
field.  It should have used si_value but instead used si_addr which is
not present in the _rt union member of struct siginfo.

The practical result of this is that on a 64bit big endian kernel
when delivering a signal to a 32bit process the si_addr field
is set to NULL, instead of the expected pointer value.

This issue can not be fixed in copy_siginfo_to_user32 as the usb
usage of the the _sigfault (aka si_addr) member of the siginfo
union when SI_ASYNCIO is set is incompatible with the POSIX and
glibc usage of the _rt member of the siginfo union.

Therefore replace kill_pid_info_as_cred with kill_pid_usb_asyncio a
dedicated function for this one specific case.  There are no other
users of kill_pid_info_as_cred so this specialization should have no
impact on the amount of code in the kernel.  Have kill_pid_usb_asyncio
take instead of a siginfo_t which is difficult and error prone, 3
arguments, a signal number, an errno value, and an address enconded as
a sigval_t.  The encoding of the address as a sigval_t allows the
code that reads the userspace request for a signal to handle this
compat issue along with all of the other compat issues.

Add BUILD_BUG_ONs in kernel/signal.c to ensure that we can now place
the pointer value at the in si_pid (instead of si_addr).  That is the
code now verifies that si_pid and si_addr always occur at the same
location.  Further the code veries that for native structures a value
placed in si_pid and spilling into si_uid will appear in userspace in
si_addr (on a byte by byte copy of siginfo or a field by field copy of
siginfo).  The code also verifies that for a 64bit kernel and a 32bit
userspace the 32bit pointer will fit in si_pid.

I have used the usbsig.c program below written by Alan Stern and
slightly tweaked by me to run on a big endian machine to verify the
issue exists (on sparc64) and to confirm the patch below fixes the issue.

 /* usbsig.c -- test USB async signal delivery */

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <string.h>
 #include <sys/ioctl.h>
 #include <unistd.h>
 #include <endian.h>
 #include <linux/usb/ch9.h>
 #include <linux/usbdevice_fs.h>

 static struct usbdevfs_urb urb;
 static struct usbdevfs_disconnectsignal ds;
 static volatile sig_atomic_t done = 0;

 void urb_handler(int sig, siginfo_t *info , void *ucontext)
 {
 	printf("Got signal %d, signo %d errno %d code %d addr: %p urb: %p\n",
 	       sig, info->si_signo, info->si_errno, info->si_code,
 	       info->si_addr, &urb);

 	printf("%s\n", (info->si_addr == &urb) ? "Good" : "Bad");
 }

 void ds_handler(int sig, siginfo_t *info , void *ucontext)
 {
 	printf("Got signal %d, signo %d errno %d code %d addr: %p ds: %p\n",
 	       sig, info->si_signo, info->si_errno, info->si_code,
 	       info->si_addr, &ds);

 	printf("%s\n", (info->si_addr == &ds) ? "Good" : "Bad");
 	done = 1;
 }

 int main(int argc, char **argv)
 {
 	char *devfilename;
 	int fd;
 	int rc;
 	struct sigaction act;
 	struct usb_ctrlrequest *req;
 	void *ptr;
 	char buf[80];

 	if (argc != 2) {
 		fprintf(stderr, "Usage: usbsig device-file-name\n");
 		return 1;
 	}

 	devfilename = argv[1];
 	fd = open(devfilename, O_RDWR);
 	if (fd == -1) {
 		perror("Error opening device file");
 		return 1;
 	}

 	act.sa_sigaction = urb_handler;
 	sigemptyset(&act.sa_mask);
 	act.sa_flags = SA_SIGINFO;

 	rc = sigaction(SIGUSR1, &act, NULL);
 	if (rc == -1) {
 		perror("Error in sigaction");
 		return 1;
 	}

 	act.sa_sigaction = ds_handler;
 	sigemptyset(&act.sa_mask);
 	act.sa_flags = SA_SIGINFO;

 	rc = sigaction(SIGUSR2, &act, NULL);
 	if (rc == -1) {
 		perror("Error in sigaction");
 		return 1;
 	}

 	memset(&urb, 0, sizeof(urb));
 	urb.type = USBDEVFS_URB_TYPE_CONTROL;
 	urb.endpoint = USB_DIR_IN | 0;
 	urb.buffer = buf;
 	urb.buffer_length = sizeof(buf);
 	urb.signr = SIGUSR1;

 	req = (struct usb_ctrlrequest *) buf;
 	req->bRequestType = USB_DIR_IN | USB_TYPE_STANDARD | USB_RECIP_DEVICE;
 	req->bRequest = USB_REQ_GET_DESCRIPTOR;
 	req->wValue = htole16(USB_DT_DEVICE << 8);
 	req->wIndex = htole16(0);
 	req->wLength = htole16(sizeof(buf) - sizeof(*req));

 	rc = ioctl(fd, USBDEVFS_SUBMITURB, &urb);
 	if (rc == -1) {
 		perror("Error in SUBMITURB ioctl");
 		return 1;
 	}

 	rc = ioctl(fd, USBDEVFS_REAPURB, &ptr);
 	if (rc == -1) {
 		perror("Error in REAPURB ioctl");
 		return 1;
 	}

 	memset(&ds, 0, sizeof(ds));
 	ds.signr = SIGUSR2;
 	ds.context = &ds;
 	rc = ioctl(fd, USBDEVFS_DISCSIGNAL, &ds);
 	if (rc == -1) {
 		perror("Error in DISCSIGNAL ioctl");
 		return 1;
 	}

 	printf("Waiting for usb disconnect\n");
 	while (!done) {
 		sleep(1);
 	}

 	close(fd);
 	return 0;
 }

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-usb@vger.kernel.org
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.com>
Fixes: v2.3.39
Cc: stable@vger.kernel.org
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-22 16:53:48 -05:00
Andrei Vagin 9e9291c71e include/linux/sched/signal.h: replace `tsk' with `task'
This file uses "task" 85 times and "tsk" 25 times.  It is better to be
consistent.

Link: http://lkml.kernel.org/r/20181129180547.15976-1-avagin@gmail.com
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:52 -07:00
Linus Torvalds abde77eb5c Merge branch 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "This includes Roman's cgroup2 freezer implementation.

  It's a separate machanism from cgroup1 freezer. Instead of blocking
  user tasks in arbitrary uninterruptible sleeps, the new implementation
  extends jobctl stop - frozen tasks are trapped in jobctl stop until
  thawed and can be killed and ptraced. Lots of thanks to Oleg for
  sheperding the effort.

  Other than that, there are a few trivial changes"

* 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: never call do_group_exit() with task->frozen bit set
  kernel: cgroup: fix misuse of %x
  cgroup: get rid of cgroup_freezer_frozen_exit()
  cgroup: prevent spurious transition into non-frozen state
  cgroup: Remove unused cgrp variable
  cgroup: document cgroup v2 freezer interface
  cgroup: add tracing points for cgroup v2 freezer
  cgroup: make TRACE_CGROUP_PATH irq-safe
  kselftests: cgroup: add freezer controller self-tests
  kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
  cgroup: cgroup v2 freezer
  cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
  cgroup: implement __cgroup_task_count() helper
  cgroup: rename freezer.c into legacy_freezer.c
  cgroup: remove extra cgroup_migrate_finish() call
2019-05-09 13:52:12 -07:00
Linus Torvalds 78ee8b1b9b Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Just a few bugfixes and documentation updates"

* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  seccomp: fix up grammar in comment
  Revert "security: inode: fix a missing check for securityfs_create_file"
  Yama: mark function as static
  security: inode: fix a missing check for securityfs_create_file
  keys: safe concurrent user->{session,uid}_keyring access
  security: don't use RCU accessors for cred->session_keyring
  Yama: mark local symbols as static
  LSM: lsm_hooks.h: fix documentation format
  LSM: fix documentation for the shm_* hooks
  LSM: fix documentation for the sem_* hooks
  LSM: fix documentation for the msg_queue_* hooks
  LSM: fix documentation for the audit_* hooks
  LSM: fix documentation for the path_chmod hook
  LSM: fix documentation for the socket_getpeersec_dgram hook
  LSM: fix documentation for the task_setscheduler hook
  LSM: fix documentation for the socket_post_create hook
  LSM: fix documentation for the syslog hook
  LSM: fix documentation for sb_copy_data hook
2019-05-07 08:39:54 -07:00
Linus Torvalds 0bc40e549a Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "The changes in here are:

   - text_poke() fixes and an extensive set of executability lockdowns,
     to (hopefully) eliminate the last residual circumstances under
     which we are using W|X mappings even temporarily on x86 kernels.
     This required a broad range of surgery in text patching facilities,
     module loading, trampoline handling and other bits.

   - tweak page fault messages to be more informative and more
     structured.

   - remove DISCONTIGMEM support on x86-32 and make SPARSEMEM the
     default.

   - reduce KASLR granularity on 5-level paging kernels from 512 GB to
     1 GB.

   - misc other changes and updates"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/mm: Initialize PGD cache during mm initialization
  x86/alternatives: Add comment about module removal races
  x86/kprobes: Use vmalloc special flag
  x86/ftrace: Use vmalloc special flag
  bpf: Use vmalloc special flag
  modules: Use vmalloc special flag
  mm/vmalloc: Add flag for freeing of special permsissions
  mm/hibernation: Make hibernation handle unmapped pages
  x86/mm/cpa: Add set_direct_map_*() functions
  x86/alternatives: Remove the return value of text_poke_*()
  x86/jump-label: Remove support for custom text poker
  x86/modules: Avoid breaking W^X while loading modules
  x86/kprobes: Set instruction page as executable
  x86/ftrace: Set trampoline pages as executable
  x86/kgdb: Avoid redundant comparison of patched code
  x86/alternatives: Use temporary mm for text poking
  x86/alternatives: Initialize temporary mm for patching
  fork: Provide a function for copying init_mm
  uprobes: Initialize uprobes earlier
  x86/mm: Save debug registers when loading a temporary mm
  ...
2019-05-06 16:13:31 -07:00
Ingo Molnar 176d2323c7 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03 12:52:45 +02:00
Nadav Amit 13585fa066 fork: Provide a function for copying init_mm
Provide a function for copying init_mm. This function will be later used
for setting a temporary mm.

Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-6-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 12:37:51 +02:00
Roman Gushchin 76f969e894 cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.

This patch implements freezer for cgroup v2.

Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.

This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.

Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.

* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.

There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 11:26:48 -07:00
Andrea Arcangeli 04f5866e41 coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it.  Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier.  For example in Hugh's post from Jul 2017:

  https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

  "Not strictly relevant here, but a related note: I was very surprised
   to discover, only quite recently, how handle_mm_fault() may be called
   without down_read(mmap_sem) - when core dumping. That seems a
   misguided optimization to me, which would also be nice to correct"

In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.

Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.

Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.

For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs.  Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.

Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.

In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.

Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.

Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19 09:46:05 -07:00
Jann Horn 0b9dc6c9f0 keys: safe concurrent user->{session,uid}_keyring access
The current code can perform concurrent updates and reads on
user->session_keyring and user->uid_keyring. Add a comment to
struct user_struct to document the nontrivial locking semantics, and use
READ_ONCE() for unlocked readers and smp_store_release() for writers to
prevent memory ordering issues.

Fixes: 69664cf16a ("keys: don't generate user and user session keyrings unless they're accessed")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
2019-04-10 10:29:50 -07:00
Joel Fernandes (Google) 994aeb7a93 sched_domain: Annotate RCU pointers properly
The scheduler uses RCU API in various places to access sched_domain
pointers. These cause sparse errors as below.

Many new errors show up because of an annotation check I added to
rcu_assign_pointer(). Let us annotate the pointers correctly which also
will help sparse catch any potential future bugs.

This fixes the following sparse errors:

  rt.c:1681:9: error: incompatible types in comparison expression
  deadline.c:1904:9: error: incompatible types in comparison expression
  core.c:519:9: error: incompatible types in comparison expression
  core.c:1634:17: error: incompatible types in comparison expression
  fair.c:6193:14: error: incompatible types in comparison expression
  fair.c:9883:22: error: incompatible types in comparison expression
  fair.c:9897:9: error: incompatible types in comparison expression
  sched.h:1287:9: error: incompatible types in comparison expression
  topology.c:612:9: error: incompatible types in comparison expression
  topology.c:615:9: error: incompatible types in comparison expression
  sched.h:1300:9: error: incompatible types in comparison expression
  topology.c:618:9: error: incompatible types in comparison expression
  sched.h:1287:9: error: incompatible types in comparison expression
  topology.c:621:9: error: incompatible types in comparison expression
  sched.h:1300:9: error: incompatible types in comparison expression
  topology.c:624:9: error: incompatible types in comparison expression
  topology.c:671:9: error: incompatible types in comparison expression
  stats.c:45:17: error: incompatible types in comparison expression
  fair.c:5998:15: error: incompatible types in comparison expression
  fair.c:5989:15: error: incompatible types in comparison expression
  fair.c:5998:15: error: incompatible types in comparison expression
  fair.c:5989:15: error: incompatible types in comparison expression
  fair.c:6120:19: error: incompatible types in comparison expression
  fair.c:6506:14: error: incompatible types in comparison expression
  fair.c:6515:14: error: incompatible types in comparison expression
  fair.c:6623:9: error: incompatible types in comparison expression
  fair.c:5970:17: error: incompatible types in comparison expression
  fair.c:8642:21: error: incompatible types in comparison expression
  fair.c:9253:9: error: incompatible types in comparison expression
  fair.c:9331:9: error: incompatible types in comparison expression
  fair.c:9519:15: error: incompatible types in comparison expression
  fair.c:9533:14: error: incompatible types in comparison expression
  fair.c:9542:14: error: incompatible types in comparison expression
  fair.c:9567:14: error: incompatible types in comparison expression
  fair.c:9597:14: error: incompatible types in comparison expression
  fair.c:9421:16: error: incompatible types in comparison expression
  fair.c:9421:16: error: incompatible types in comparison expression

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ From an RCU perspective. ]
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Cc: kernel-hardening@lists.openwall.com
Cc: kernel-team@android.com
Link: https://lkml.kernel.org/r/20190321003426.160260-3-joel@joelfernandes.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 12:34:31 +02:00
Andrei Vagin fcfc2aa018 ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK
There are a few system calls (pselect, ppoll, etc) which replace a task
sigmask while they are running in a kernel-space

When a task calls one of these syscalls, the kernel saves a current
sigmask in task->saved_sigmask and sets a syscall sigmask.

On syscall-exit-stop, ptrace traps a task before restoring the
saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and
PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by
saved_sigmask, when the task returns to user-space.

This patch fixes this problem.  PTRACE_GETSIGMASK returns saved_sigmask
if it's set.  PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag.

Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com
Fixes: 29000caecb ("ptrace: add ability to get/set signal-blocked mask")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Linus Torvalds 38e7571c07 io_uring-2019-03-06
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyAJvAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphb+EACFaKI2HIdjExQ5T7Cxebzwky+Qiro3FV55
 ziW00FZrkJ5g0h4ItBzh/5SDlcNQYZDMlA3s4xzWIMadWl5PjMPq1uJul0cITbSl
 WIJO5hpgNMXeUEhvcXUl6+f/WzpgYUxN40uW8N5V7EKlooaFVfudDqJGlvEv+UgB
 g8NWQYThSG+/e7r9OGwK0xDRVKfpjxVvmqmnDH3DrxKaDgSOwTf4xn1u41wKwfQ3
 3uPfQ+GBeTqt4a2AhOi7K6KQFNnj5Jz5CXYMiOZI2JGtLPcL6dmyBVD7K0a0HUr+
 rs4ghNdd1+puvPGNK4TX8qV0uiNrMctoRNVA/JDd1ZTYEKTmNLxeFf+olfYHlwuK
 K5FRs60/lgNzNkzcUpFvJHitPwYtxYJdB36PyswE1FZP1YviEeVoKNt9W8aIhEoA
 549uj90brfA74eCINGhq98pJqj9CNyCPw3bfi76f5Ej2utwYDb9S5Cp2gfSa853X
 qc/qNda9efEq7ikwCbPzhekRMXZo6TSXtaSmC2C+Vs5+mD1Scc4kdAvdCKGQrtr9
 aoy0iQMYO2NDZ/G5fppvXtMVuEPAZWbsGftyOe15IlMysjRze2ycJV8cFahKEVM9
 uBeXLyH1pqGU/j7ABP4+XRZ/sbHJTwjKJbnXhTgBsdU8XO/CR3U+kRQFTsidKMfH
 Wlo3uH2h2A==
 =p78E
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block

Pull io_uring IO interface from Jens Axboe:
 "Second attempt at adding the io_uring interface.

  Since the first one, we've added basic unit testing of the three
  system calls, that resides in liburing like the other unit tests that
  we have so far. It'll take a while to get full coverage of it, but
  we're working towards it. I've also added two basic test programs to
  tools/io_uring. One uses the raw interface and has support for all the
  various features that io_uring supports outside of standard IO, like
  fixed files, fixed IO buffers, and polled IO. The other uses the
  liburing API, and is a simplified version of cp(1).

  This adds support for a new IO interface, io_uring.

  io_uring allows an application to communicate with the kernel through
  two rings, the submission queue (SQ) and completion queue (CQ) ring.
  This allows for very efficient handling of IOs, see the v5 posting for
  some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

  Outside of just efficiency, the interface is also flexible and
  extendable, and allows for future use cases like the upcoming NVMe
  key-value store API, networked IO, and so on. It also supports async
  buffered IO, something that we've always failed to support in the
  kernel.

  Outside of basic IO features, it supports async polled IO as well.
  This particular feature has already been tested at Facebook months ago
  for flash storage boxes, with 25-33% improvements. It makes polled IO
  actually useful for real world use cases, where even basic flash sees
  a nice win in terms of efficiency, latency, and performance. These
  boxes were IOPS bound before, now they are not.

  This series adds three new system calls. One for setting up an
  io_uring instance (io_uring_setup(2)), one for submitting/completing
  IO (io_uring_enter(2)), and one for aux functions like registrating
  file sets, buffers, etc (io_uring_register(2)). Through the help of
  Arnd, I've coordinated the syscall numbers so merge on that front
  should be painless.

  Jon did a writeup of the interface a while back, which (except for
  minor details that have been tweaked) is still accurate. Find that
  here:

    https://lwn.net/Articles/776703/

  Huge thanks to Al Viro for helping getting the reference cycle code
  correct, and to Jann Horn for his extensive reviews focused on both
  security and bugs in general.

  There's a userspace library that provides basic functionality for
  applications that don't need or want to care about how to fiddle with
  the rings directly. It has helpers to allow applications to easily set
  up an io_uring instance, and submit/complete IO through it without
  knowing about the intricacies of the rings. It also includes man pages
  (thanks to Jeff Moyer), and will continue to grow support helper
  functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

  Fio has full support for the raw interface, both in the form of an IO
  engine (io_uring), but also with a small test application (t/io_uring)
  that can exercise and benchmark the interface"

* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
  io_uring: add a few test tools
  io_uring: allow workqueue item to handle multiple buffered requests
  io_uring: add support for IORING_OP_POLL
  io_uring: add io_kiocb ref count
  io_uring: add submission polling
  io_uring: add file set registration
  net: split out functions related to registering inflight socket files
  io_uring: add support for pre-mapped user IO buffers
  block: implement bio helper to add iter bvec pages to bio
  io_uring: batch io_kiocb allocation
  io_uring: use fget/fput_many() for file references
  fs: add fget_many() and fput_many()
  io_uring: support for IO polling
  io_uring: add fsync support
  Add io_uring IO interface
2019-03-08 14:48:40 -08:00
Linus Torvalds 8dcd175bc3 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits)
  tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
  proc: more robust bulk read test
  proc: test /proc/*/maps, smaps, smaps_rollup, statm
  proc: use seq_puts() everywhere
  proc: read kernel cpu stat pointer once
  proc: remove unused argument in proc_pid_lookup()
  fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
  fs/proc/self.c: code cleanup for proc_setup_self()
  proc: return exit code 4 for skipped tests
  mm,mremap: bail out earlier in mremap_to under map pressure
  mm/sparse: fix a bad comparison
  mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  writeback: fix inode cgroup switching comment
  mm/huge_memory.c: fix "orig_pud" set but not used
  mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
  mm/memcontrol.c: fix bad line in comment
  mm/cma.c: cma_declare_contiguous: correct err handling
  mm/page_ext.c: fix an imbalance with kmemleak
  mm/compaction: pass pgdat to too_many_isolated() instead of zone
  mm: remove zone_lru_lock() function, access ->lru_lock directly
  ...
2019-03-06 10:31:36 -08:00
Linus Torvalds 45802da05e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - refcount conversions

   - Solve the rq->leaf_cfs_rq_list can of worms for real.

   - improve power-aware scheduling

   - add sysctl knob for Energy Aware Scheduling

   - documentation updates

   - misc other changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  kthread: Do not use TIMER_IRQSAFE
  kthread: Convert worker lock to raw spinlock
  sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
  sched/fair: Remove unused 'sd' parameter from select_idle_smt()
  sched/wait: Use freezable_schedule() when possible
  sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
  sched/fair: Explain LLC nohz kick condition
  sched/fair: Simplify nohz_balancer_kick()
  sched/topology: Fix percpu data types in struct sd_data & struct s_data
  sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
  sched/fair: Fix O(nr_cgroups) in the load balancing path
  sched/fair: Optimize update_blocked_averages()
  sched/fair: Fix insertion in rq->leaf_cfs_rq_list
  sched/fair: Add tmp_alone_branch assertion
  sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
  sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
  sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
  sched/fair: Update scale invariance of PELT
  sched/fair: Move the rq_of() helper function
  sched/core: Convert task_struct.stack_refcount to refcount_t
  ...
2019-03-06 08:14:05 -08:00
Aneesh Kumar K.V d7fefcc8de mm/cma: add PF flag to force non cma alloc
Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA
region", v8.

ppc64 uses the CMA area for the allocation of guest page table (hash
page table).  We won't be able to start guest if we fail to allocate
hash page table.  We have observed hash table allocation failure because
we failed to migrate pages out of CMA region because they were pinned.
This happen when we are using VFIO.  VFIO on ppc64 pins the entire guest
RAM.  If the guest RAM pages get allocated out of CMA region, we won't
be able to migrate those pages.  The pages are also pinned for the
lifetime of the guest.

Currently we support migration of non-compound pages.  With THP and with
the addition of hugetlb migration we can end up allocating compound
pages from CMA region.  This patch series add support for migrating
compound pages.

This patch (of 4):

Add PF_MEMALLOC_NOCMA which make sure any allocation in that context is
marked non-movable and hence cannot be satisfied by CMA region.

This is useful with get_user_pages_longterm where we want to take a page
pin by migrating pages from CMA region.  Marking the section
PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration
later.

Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Jens Axboe 2b188cc1bb Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up an io_uring instance for doing async IO. On success,
	returns a file descriptor that the application can mmap to
	gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available. It's valid to set IORING_ENTER_GETEVENTS
	and 'min_complete' == 0 at the same time, this allows the
	kernel to return already completed events without waiting
	for them. This is useful only for polling, as for IRQ
	driven IO, the application can just check the CQ ring
	without entering the kernel.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28 08:24:23 -07:00
Luc Van Oostenryck 99687cdbb3 sched/topology: Fix percpu data types in struct sd_data & struct s_data
The percpu members of struct sd_data and s_data are declared as:

	struct ... ** __percpu member;

So their type is:

	__percpu pointer to pointer to struct ...

But looking at how they're used, their type should be:

	pointer to __percpu pointer to struct ...

and they should thus be declared as:

	struct ... * __percpu *member;

So fix the placement of '__percpu' in the definition of these
structures.

This addresses a bunch of Sparse's warnings like:

	warning: incorrect type in initializer (different address spaces)
	  expected void const [noderef] <asn:3> *__vpp_verify
	  got struct sched_domain **

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11 08:02:15 +01:00
Ingo Molnar c9ba7560c5 Linux 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
 yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
 A6Ktjw==
 =scY4
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc6' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11 08:01:50 +01:00
Davidlohr Bueso 07879c6a37 sched/wake_q: Reduce reference counting for special users
Some users, specifically futexes and rwsems, required fixes
that allowed the callers to be safe when wakeups occur before
they are expected by wake_up_q(). Such scenarios also play
games and rely on reference counting, and until now were
pivoting on wake_q doing it. With the wake_q_add() call being
moved down, this can no longer be the case. As such we end up
with a a double task refcounting overhead; and these callers
care enough about this (being rather core-ish).

This patch introduces a wake_q_add_safe() call that serves
for callers that have already done refcounting and therefore the
task is 'safe' from wake_q point of view (int that it requires
reference throughout the entire queue/>wakeup cycle). In the one
case it has internal reference counting, in the other case it
consumes the reference counting.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Xie Yongji <xieyongji@baidu.com>
Cc: Yongji Xie <elohimes@gmail.com>
Cc: andrea.parri@amarulasolutions.com
Cc: lilin24@baidu.com
Cc: liuqi16@baidu.com
Cc: nixun@baidu.com
Cc: yuanlinsi01@baidu.com
Cc: zhangyu31@baidu.com
Link: https://lkml.kernel.org/r/20181218195352.7orq3upiwfdbrdne@linux-r8p5
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 09:03:28 +01:00
Elena Reshetova f0b89d3958 sched/core: Convert task_struct.stack_refcount to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable task_struct.stack_refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

** Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.

The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.

Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.

Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the task_struct.stack_refcount it might make a difference
in following places:

 - try_get_task_stack(): increment in refcount_inc_not_zero() only
   guarantees control dependency on success vs. fully ordered
   atomic counterpart
 - put_task_stack(): decrement in refcount_dec_and_test() only
   provides RELEASE ordering and control dependency on success
   vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:53:56 +01:00
Elena Reshetova ec1d281923 sched/core: Convert task_struct.usage to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable task_struct.usage is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

** Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.

The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.

Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.

Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the task_struct.usage it might make a difference
in following places:

 - put_task_struct(): decrement in refcount_dec_and_test() only
   provides RELEASE ordering and control dependency on success
   vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:53:55 +01:00
Elena Reshetova 60d4de3ff7 sched/core: Convert signal_struct.sigcnt to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable signal_struct.sigcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

** Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.

The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.

Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.

Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the signal_struct.sigcnt it might make a difference
in following places:

 - put_signal_struct(): decrement in refcount_dec_and_test() only
   provides RELEASE ordering and control dependency on success
   vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1547814450-18902-3-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:53:53 +01:00
Elena Reshetova d036bda7d0 sched/core: Convert sighand_struct.count to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable sighand_struct.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

** Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.

The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.

Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.

Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the sighand_struct.count it might make a difference
in following places:

 - __cleanup_sighand: decrement in refcount_dec_and_test() only
   provides RELEASE ordering and control dependency on success
   vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:53:52 +01:00
Tetsuo Handa 9bcdeb51bd oom, oom_reaper: do not enqueue same task twice
Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite the
number of tasks in that memcg is not 0.  It turned out that there is a
bug in wake_oom_reaper() which allows enqueuing same task twice which
makes impossible to decrease the number of tasks in that memcg due to a
refcount leak.

This bug existed since the OOM reaper became invokable from
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,

  T1@P1     |T2@P1     |T3@P1     |OOM reaper
  ----------+----------+----------+------------
                                   # Processing an OOM victim in a different memcg domain.
                        try_charge()
                          mem_cgroup_out_of_memory()
                            mutex_lock(&oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(&oom_lock)
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(&oom_lock)
                            out_of_memory()
                              oom_kill_process(P1)
                                do_send_sig_info(SIGKILL, @P1)
                                mark_oom_victim(T1@P1)
                                wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                            mutex_unlock(&oom_lock)
                 out_of_memory()
                   mark_oom_victim(T2@P1)
                   wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                 mutex_unlock(&oom_lock)
      out_of_memory()
        mark_oom_victim(T1@P1)
        wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
      mutex_unlock(&oom_lock)
                                   # Completed processing an OOM victim in a different memcg domain.
                                   spin_lock(&oom_reaper_lock)
                                   # T1P1 is dequeued.
                                   spin_unlock(&oom_reaper_lock)

but memcg's group oom killing made it easier to trigger this bug by
calling wake_oom_reaper() on the same task from one out_of_memory()
request.

Fix this bug using an approach used by commit 855b018325 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
side effect of this patch, this patch also avoids enqueuing multiple
threads sharing memory via task_will_free_mem(current) path.

Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
Fixes: af8e15cc85 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: Jay Kamat <jgkamat@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Quentin Perret 8d5d0cfb63 sched/topology: Introduce a sysctl for Energy Aware Scheduling
In its current state, Energy Aware Scheduling (EAS) starts automatically
on asymmetric platforms having an Energy Model (EM). However, there are
users who want to have an EM (for thermal management for example), but
don't want EAS with it.

In order to let users disable EAS explicitly, introduce a new sysctl
called 'sched_energy_aware'. It is enabled by default so that EAS can
start automatically on platforms where it makes sense. Flipping it to 0
rebuilds the scheduling domains and disables EAS.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-11-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-27 12:29:37 +01:00
Peter Zijlstra e6018c0f5c sched/wake_q: Document wake_q_add()
The only guarantee provided by wake_q_add() is that a wakeup will
happen after it, it does _NOT_ guarantee the wakeup will be delayed
until the matching wake_up_q().

If wake_q_add() fails the cmpxchg() a concurrent wakeup is pending and
that can happen at any time after the cmpxchg(). This means we should
not rely on the wakeup happening at wake_q_up(), but should be ready
for wake_q_add() to issue the wakeup.

The delay; if provided (most likely); should only result in more efficient
behaviour.

Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21 11:15:36 +01:00
Yi Wang fb5bf31722 fork: fix some -Wmissing-prototypes warnings
We get a warning when building kernel with W=1:

  kernel/fork.c:167:13: warning: no previous prototype for `arch_release_thread_stack' [-Wmissing-prototypes]
  kernel/fork.c:779:13: warning: no previous prototype for `fork_init' [-Wmissing-prototypes]

Add the missing declaration in head file to fix this.

Also, remove arch_release_thread_stack() completely because no arch
seems to implement it since bb9d81264 (arch: remove tile port).

Link: http://lkml.kernel.org/r/1542170087-23645-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:47 -08:00
Quentin Perret 938e5e4b0d sched/cpufreq: Prepare schedutil for Energy Aware Scheduling
Schedutil requests frequency by aggregating utilization signals from
the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of
them. Since Energy Aware Scheduling (EAS) needs to be able to predict
the frequency requests, it needs to forecast the decisions made by the
governor.

In order to prepare the introduction of EAS, introduce
schedutil_freq_util() to centralize the aforementioned signal
aggregation and make it available to both schedutil and EAS. Since
frequency selection and energy estimation still need to deal with RT and
DL signals slightly differently, schedutil_freq_util() is called with a
different 'type' parameter in those two contexts, and returns an
aggregated utilization signal accordingly. While at it, introduce the
map_util_freq() function which is designed to make schedutil's 25%
margin usable easily for both sugov and EAS.

As EAS will be able to predict schedutil's frequency requests more
accurately than any other governor by design, it'd be sensible to make
sure EAS cannot be used without schedutil. This will be done later, once
EAS has actually been introduced.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-3-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:58 +01:00
Quentin Perret 5bd0988be1 sched/topology: Relocate arch_scale_cpu_capacity() to the internal header
By default, arch_scale_cpu_capacity() is only visible from within the
kernel/sched folder. Relocate it to include/linux/sched/topology.h to
make it visible to other clients needing to know about the capacity of
CPUs, such as the Energy Model framework.

This also shrinks the <linux/sched/topology.h> public header.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-2-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:58 +01:00
Vincent Guittot 765d0af19f sched/topology: Remove the ::smt_gain field from 'struct sched_domain'
::smt_gain is used to compute the capacity of CPUs of a SMT core with the
constraint 1 < ::smt_gain < 2 in order to be able to compute number of CPUs
per core. The field has_free_capacity of struct numa_stat, which was the
last user of this computation of number of CPUs per core, has been removed
by:

  2d4056fafa ("sched/numa: Remove numa_has_capacity()")

We can now remove this constraint on core capacity and use the defautl value
SCHED_CAPACITY_SCALE for SMT CPUs. With this remove, SCHED_CAPACITY_SCALE
becomes the maximum compute capacity of CPUs on every systems. This should
help to simplify some code and remove fields like rd->max_cpu_capacity

Furthermore, arch_scale_cpu_capacity() is used with a NULL sd in several other
places in the code when it wants the capacity of a CPUs to scale
some metrics like in pelt, deadline or schedutil. In case on SMT, the value
returned is not the capacity of SMT CPUs but default SCHED_CAPACITY_SCALE.

So remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1535548752-4434-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:57 +01:00
Ingo Molnar dfcb245e28 sched: Fix various typos in comments
Go over the scheduler source code and fix common typos
in comments - and a typo in an actual variable name.

No change in functionality intended.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-03 11:55:42 +01:00
Thomas Gleixner a74cfffb03 x86/speculation: Rework SMT state change
arch_smt_update() is only called when the sysfs SMT control knob is
changed. This means that when SMT is enabled in the sysfs control knob the
system is considered to have SMT active even if all siblings are offline.

To allow finegrained control of the speculation mitigations, the actual SMT
state is more interesting than the fact that siblings could be enabled.

Rework the code, so arch_smt_update() is invoked from each individual CPU
hotplug function, and simplify the update function while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.521974984@linutronix.de
2018-11-28 11:57:07 +01:00
Thomas Gleixner 321a874a7e sched/smt: Expose sched_smt_present static key
Make the scheduler's 'sched_smt_present' static key globaly available, so
it can be used in the x86 speculation control code.

Provide a query function and a stub for the CONFIG_SMP=n case.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
2018-11-28 11:57:07 +01:00
Linus Torvalds 6ef746769e More power management updates for 4.20-rc1
- Fix build regression in the intel_pstate driver that doesn't
    build without CONFIG_ACPI after recent changes (Dominik Brodowski).
 
  - One of the heuristics in the menu cpuidle governor is based on a
    function returning 0 most of the time, so drop it and clean up
    the scheduler code related to it (Daniel Lezcano).
 
  - Prevent the arm_big_little cpufreq driver from being used on ARM64
    which is not suitable for it and drop the arm_big_little_dt driver
    that is not used any more (Sudeep Holla).
 
  - Prevent the hung task watchdog from triggering during resume from
    system-wide sleep states by disabling it before freezing tasks and
    enabling it again after they have been thawed (Vitaly Kuznetsov).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJb2BJ7AAoJEILEb/54YlRx/kwP/iD7tUUZ6mT84OI0FTbEj8A/
 fM+uHrwy25PmqyWGGtbHpaWU9OxVxUReSicsBCt+2LZmX3sFYpbSb243mv3pmxqb
 A0kLflG4lWCKJNIfa/a3OMDTUw26mxSTCidE3jJXkd8HkWrzeAWvMair+UCuzMf3
 A4Omu0IkNL8C0MKtUOb3PlUk3dnLYMxuairNhozBPhi+P+0tLW9/9XvgPJBVhnbZ
 CKn/aFsDoc08tAfxC8N32cgKwE7nbeIgTJTBFyu2lQmInsd4TTuoM50vSC5i+x88
 AmBOoH9IX0fhXJ6hgm+VMW8+x9S+H7jAVy/3C2xoUBeCclzlxX6eUCtjV5YNZqqn
 1nXQfGeAwgzX6Tyu6HjM7vjbfObk59ZwpmDRPJEUEhLDEBMS+iDStlp9zmKTedNm
 G4iSTzS6qJCNPtx4y5wkLp/FvzTofIuWqVFJSJC4+EoVKkbbw9xwaY+JKXUt1Uwx
 j+U6EtRhzL/kVX0nq+iQXXeANxCFNzI56Ov5O7mxjF1m/hDE/Gb2QEeIb6nRZC2A
 H3I2so2J3h1yTgadpGFFvJWaqfHkgcBTsm06tSgHVb86quiTANJIQ9mqfFyOzDDJ
 KaZ82MROt7UuCMI6X9n+oIBDZWLHmADge6RdHCD1wB+zrUmusCtNEHUZACXd0mPf
 s8MUK4bWVhViVXGS5bMP
 =/bnR
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These remove a questionable heuristic from the menu cpuidle governor,
  fix a recent build regression in the intel_pstate driver, clean up ARM
  big-Little support in cpufreq and fix up hung task watchdog's
  interaction with system-wide power management transitions.

  Specifics:

   - Fix build regression in the intel_pstate driver that doesn't build
     without CONFIG_ACPI after recent changes (Dominik Brodowski).

   - One of the heuristics in the menu cpuidle governor is based on a
     function returning 0 most of the time, so drop it and clean up the
     scheduler code related to it (Daniel Lezcano).

   - Prevent the arm_big_little cpufreq driver from being used on ARM64
     which is not suitable for it and drop the arm_big_little_dt driver
     that is not used any more (Sudeep Holla).

   - Prevent the hung task watchdog from triggering during resume from
     system-wide sleep states by disabling it before freezing tasks and
     enabling it again after they have been thawed (Vitaly Kuznetsov)"

* tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  kernel: hung_task.c: disable on suspend
  cpufreq: remove unused arm_big_little_dt driver
  cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64
  cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI
  cpuidle: menu: Remove get_loadavg() from the performance multiplier
  sched: Factor out nr_iowait and nr_iowait_cpu
2018-10-30 09:08:07 -07:00