[ Upstream commit 5b268d8aba ]
When calling debugfs_lookup() the result must have dput() called on it,
otherwise the memory will leak over time. To make things simpler, just
call debugfs_lookup_and_remove() instead which handles all of the logic at
once.
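For illustration, a rough before/after sketch of the converted pattern (the
entry name here is made up):

	/* Old pattern: debugfs_lookup() returns a dentry with an elevated
	 * refcount which the caller must drop with dput(); forgetting the
	 * dput() leaks it.
	 */
	struct dentry *d = debugfs_lookup("example_entry", NULL);

	debugfs_remove(d);	/* removes the entry, but the lookup ref is never dropped */

	/* New pattern: one helper that looks up, removes and releases. */
	debugfs_lookup_and_remove("example_entry", NULL);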
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230202151214.2306822-1-gregkh@linuxfoundation.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b7082cdfc4 ]
Bugs have been reported on 8-socket x86 machines in which the TSC was
wrongly disabled when the system is under heavy workload.
[ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
[ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
[ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
[ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
[ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
[ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
[ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
[ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
[ 821.067990] clocksource: Switched to clocksource hpet
This can be reproduced by running memory intensive 'stream' tests,
or some of the stress-ng subcases such as 'ioport'.
The reason for these issues is that when the system is under heavy load, the
read latency of the clocksources can be very high. Even lightweight TSC
reads can show high latencies, and latencies are much worse for external
clocksources such as HPET or the APIC PM timer. These latencies can
result in false-positive clocksource-unstable determinations.
These issues were initially reported by a customer running on a production
system, and this problem was reproduced on several generations of Xeon
servers, especially when running the stress-ng test. These Xeon servers
were not production systems, but they did have the latest steppings
and firmware.
Given that the clocksource watchdog is a continual diagnostic check with a
frequency of twice a second, there is no need to rush it when the system
is under heavy load. Therefore, when high clocksource read latencies
are detected, suspend the watchdog timer for 5 minutes.
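Schematically, the idea is roughly the following (the latency condition is
illustrative, not the actual code):

	/* If reading the watchdog or the clocksource took suspiciously long,
	 * don't judge the clocksource this time; push the next watchdog run
	 * out by five minutes instead of marking anything unstable.
	 */
	if (read_latency_too_high)		/* illustrative condition */
		mod_timer(&watchdog_timer, jiffies + 5 * 60 * HZ);
	else
		mod_timer(&watchdog_timer, jiffies + WATCHDOG_INTERVAL);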
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9f76d59173 ]
The nanosleep syscalls use the restart_block mechanism, with a quirk:
The `type` and `rmtp`/`compat_rmtp` fields are set up unconditionally on
syscall entry, while the rest of the restart_block is only set up in the
unlikely case that the syscall is actually interrupted by a signal (or
pseudo-signal) that doesn't have a signal handler.
If the restart_block was set up by a previous syscall (futex(...,
FUTEX_WAIT, ...) or poll()) and hasn't been invalidated somehow since then,
this will clobber some of the union fields used by futex_wait_restart() and
do_restart_poll().
If userspace afterwards wrongly calls the restart_syscall syscall,
futex_wait_restart()/do_restart_poll() will read struct fields that have
been clobbered.
This doesn't actually lead to anything particularly interesting because
none of the union fields contain trusted kernel data, and
futex(..., FUTEX_WAIT, ...) and poll() aren't syscalls where it makes much
sense to apply seccomp filters to their arguments.
So the current consequences are just of the "if userspace does bad stuff,
it can damage itself, and that's not a problem" flavor.
But still, it seems like a hazard for future developers, so invalidate the
restart_block when partly setting it up in the nanosleep syscalls.
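A minimal sketch of what the invalidation amounts to (assuming the usual
do_no_restart_syscall() stub; the surrounding setup is abridged):

	struct restart_block *restart = &current->restart_block;

	/* Wipe any stale function pointer from a previous syscall before the
	 * partial type/rmtp setup, so a bogus restart_syscall() cannot act on
	 * clobbered union fields.
	 */
	restart->fn = do_no_restart_syscall;

	restart->nanosleep.type = type;	/* partial setup as before */
	restart->nanosleep.rmtp = rmtp;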
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230105134403.754986-1-jannh@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit d125d1349a upstream.
syzbot reported a RCU stall which is caused by setting up an alarmtimer
with a very small interval and ignoring the signal. The reproducer arms the
alarm timer with a relative expiry of 8ns and an interval of 9ns. Not a
problem per se, but that's an issue when the signal is ignored because then
the timer is immediately rearmed because there is no way to delay that
rearming to the signal delivery path. See posix_timer_fn() and commit
58229a1899 ("posix-timers: Prevent softirq starvation by small intervals
and SIG_IGN") for details.
The reproducer does not set SIG_IGN explicitly, but it sets up the timer's
signal with SIGCONT. That has the same effect as explicitly setting
SIG_IGN for a signal as SIGCONT is ignored if there is no handler set and
the task is not ptraced.
The log clearly shows that:
[pid 5102] --- SIGCONT {si_signo=SIGCONT, si_code=SI_TIMER, si_timerid=0, si_overrun=316014, si_int=0, si_ptr=NULL} ---
It works because the tasks are traced and therefore the signal is queued so
the tracer can see it, which delays the restart of the timer to the signal
delivery path. But then the tracer is killed:
[pid 5087] kill(-5102, SIGKILL <unfinished ...>
...
./strace-static-x86_64: Process 5107 detached
and after it's gone the stall can be observed:
syzkaller login: [ 79.439102][ C0] hrtimer: interrupt took 68471 ns
[ 184.460538][ C1] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
...
[ 184.658237][ C1] rcu: Stack dump where RCU GP kthread last ran:
[ 184.664574][ C1] Sending NMI from CPU 1 to CPUs 0:
[ 184.669821][ C0] NMI backtrace for cpu 0
[ 184.669831][ C0] CPU: 0 PID: 5108 Comm: syz-executor192 Not tainted 6.2.0-rc6-next-20230203-syzkaller #0
...
[ 184.670036][ C0] Call Trace:
[ 184.670041][ C0] <IRQ>
[ 184.670045][ C0] alarmtimer_fired+0x327/0x670
posix_timer_fn() prevents that by checking whether the interval for
timers which have the signal ignored is smaller than a jiffie and
artificially delaying it by shifting the next expiry out by a jiffie. That's
accurate vs. the overrun accounting, but slightly inaccurate
vs. timer_gettime(2).
The comment in that function says what needs to be done and there was a fix
available for the regular userspace induced SIG_IGN mechanism, but that did
not work due to the implicit ignore for SIGCONT and similar signals. This
needs to be worked on, but for now the only available workaround is to do
exactly what posix_timer_fn() does:
Increase the interval of self-rearming timers, which have their signal
ignored, to at least a jiffie.
Interestingly this has been fixed before via commit ff86bf0c65
("alarmtimer: Rate limit periodic intervals") already, but that fix got
lost in a later rework.
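A sketch of that workaround in the alarmtimer rearm path (names and
placement are illustrative, not the exact hunk):

	/* Self-rearming timer whose signal is ignored: clamp the period to at
	 * least one jiffie so the rearm cannot starve the CPU.
	 */
	ktime_t interval = timr->it_interval;

	if (interval < TICK_NSEC)
		interval = TICK_NSEC;
	timr->it_overrun += alarm_forward(&timr->it.alarm.alarmtimer, now, interval);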
Reported-by: syzbot+b9564ba6e8e00694511b@syzkaller.appspotmail.com
Fixes: f2c45807d3 ("alarmtimer: Switch over to generic set/get/rearm routine")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87k00q1no2.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b8ac29b401 ]
The rng's random_init() function contributes the real time to the rng at
boot time, so that events can at least start in relation to something
particular in the real world. But this clock might not yet be set at that
point in boot, so nothing is contributed. In addition, the relation
between minor clock changes from, say, NTP, and the cycle counter is
potentially useful entropic data.
This commit addresses this by mixing in a time stamp on calls to
settimeofday and adjtimex. No entropy is credited in doing so, so it
doesn't make initialization faster, but it is still useful input to
have.
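Conceptually, the hook boils down to something like this (function name and
placement are illustrative):

	/* Called from the settimeofday/adjtimex paths: feed the new wall
	 * clock value plus a cycle counter sample into the input pool,
	 * crediting no entropy.
	 */
	static void timekeeping_add_randomness(const struct timespec64 *ts)
	{
		unsigned long entropy = random_get_entropy();

		add_device_randomness(ts, sizeof(*ts));
		add_device_randomness(&entropy, sizeof(entropy));
	}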
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 151c8e499f ]
Using msleep() is problematic because it's compared against
ratelimiter.c's ktime_get_coarse_boottime_ns(), which means on systems
with slow jiffies (such as UML's forced HZ=100), the result is
inaccurate. So switch to using schedule_hrtimeout().
However, hrtimer gives us access only to the traditional posix timers,
and none of the _COARSE variants. So now, rather than being too
imprecise like jiffies, it's too precise.
One solution would be to give it a large "range" value, but this will
still fire early on a loaded system. A better solution is to align the
timeout to the actual coarse timer, and then round up to the nearest
tick, plus change.
So add the timeout to the current coarse time, and then
schedule_hrtimeout() until the absolute computed time.
This should hopefully reduce flakes in CI as well. Note that we keep the
retry loop in case the entire function is running behind, because the
test could still be scheduled out, by either the kernel or by the
hypervisor's kernel, in which case restarting the test and hoping to not
be scheduled out still helps.
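A sketch of the resulting sleep helper (illustrative, not the exact selftest
code):

	static void coarse_sleep_ms(unsigned long ms)
	{
		/* Deadline in the same coarse boottime domain that
		 * ratelimiter.c compares against.
		 */
		ktime_t end = ktime_add_ms(ktime_get_coarse_boottime(), ms);

		/* Retry in case the task was scheduled out past the deadline. */
		while (ktime_before(ktime_get_coarse_boottime(), end)) {
			ktime_t expires = ktime_add(ktime_get(),
					ktime_sub(end, ktime_get_coarse_boottime()));

			set_current_state(TASK_UNINTERRUPTIBLE);
			schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
		}
	}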
Fixes: e7096c131e ("net: WireGuard secure network tunnel")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit d5b36a4dbd upstream.
As Chris explains, the comment above exit_itimers() is not correct; we
can race with proc_timers_seq_ops. Change exit_itimers() to clear
signal->posix_timers with ->siglock held.
Cc: <stable@vger.kernel.org>
Reported-by: chris@accessvector.net
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2390095113 upstream.
EXPORT_SYMBOL and __init is a bad combination because the .init.text
section is freed up after the initialization. Hence, modules cannot
use symbols annotated __init. The access to a freed symbol may end up
with a kernel panic.
modpost used to detect it, but it had been broken for a decade.
Commit 28438794ab ("modpost: fix section mismatch check for exported
init/exit sections") fixed it so modpost started to warn about it again, then
this showed up:
MODPOST vmlinux.symvers
WARNING: modpost: vmlinux.o(___ksymtab_gpl+tick_nohz_full_setup+0x0): Section mismatch in reference from the variable __ksymtab_tick_nohz_full_setup to the function .init.text:tick_nohz_full_setup()
The symbol tick_nohz_full_setup is exported and annotated __init
Fix this by either removing the __init annotation of tick_nohz_full_setup()
or dropping the export.
Drop the export because tick_nohz_full_setup() is only called from the
built-in code in kernel/sched/isolation.c.
Fixes: ae9e557b5b ("time: Export tick start/stop functions for rcutorture")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Backlund <tmb@tmb.nu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1366992e16 upstream.
The addition of random_get_entropy_fallback() provides access to
whichever time source has the highest frequency, which is useful for
gathering entropy on platforms without available cycle counters. It's
not necessarily as good as being able to quickly access a cycle counter
that the CPU has, but it's still something, even when it falls back to
being jiffies-based.
In the event that a given arch does not define get_cycles(), falling
back to the get_cycles() default implementation that returns 0 is really
not the best we can do. Instead, at least calling
random_get_entropy_fallback() would be preferable, because that always
needs to return _something_, even falling back to jiffies eventually.
It's not as though random_get_entropy_fallback() is super high precision
or guaranteed to be entropic, but basically anything that's not zero all
the time is better than returning zero all the time.
Finally, since random_get_entropy_fallback() is used during extremely
early boot when randomizing freelists in mm_init(), it can be called
before timekeeping has been initialized. In that case there really is
nothing we can do; jiffies hasn't even started ticking yet. So just give
up and return 0.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c54bc0fc84 upstream.
When the timer base is empty, base::next_expiry is set to base::clk +
NEXT_TIMER_MAX_DELTA and base::next_expiry_recalc is false. When no timer
is queued until jiffies reaches the base::next_expiry value, the warning in
__run_timers() for not finding any expired timer while
base::next_expiry_recalc is false triggers.
To prevent triggering the warning in this valid scenario
base::timers_pending needs to be added to the warning condition.
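In other words, the check in __run_timers() roughly becomes (sketch, not the
literal hunk):

	/* Only warn when there really are pending timers that should have
	 * been found; an empty base legitimately reaches next_expiry with
	 * next_expiry_recalc still false.
	 */
	WARN_ON_ONCE(!levels && !base->next_expiry_recalc && base->timers_pending);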
Fixes: 31cd0e119d ("timers: Recalculate next timer interrupt only when necessary")
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220405191732.7438-3-anna-maria@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 40e97e4296 upstream.
While running some testing on code that happened to allow the variable
tick_nohz_full_running to get set but with no "possible" NOHZ cores to
back up that setting, this warning triggered:
	if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
		WARN_ON(tick_nohz_full_running);
The console was overwhelmed with an endless stream of one WARN per tick
per core and there was no way to even see what was going on w/o using a
serial console to capture it and then trace it back to this.
Change it to WARN_ON_ONCE().
Fixes: 08ae95f4fd ("nohz_full: Allow the boot CPU to be nohz_full")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211206145950.10927-3-paul.gortmaker@windriver.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c86ff8c55b ]
Since commit db3a34e174 ("clocksource: Retry clock read if long delays
detected") and commit 2e27e793e2 ("clocksource: Reduce clocksource-skew
threshold"), it is found that tsc clocksource fallback to hpet can
sometimes happen on both Intel and AMD systems especially when they are
running stressful benchmarking workloads. Of the 23 systems tested with
a v5.14 kernel, 10 of them have switched to hpet clock source during
the test run.
The result of falling back to hpet is a drastic reduction of performance
when running benchmarks. For example, the fio performance tests can
drop up to 70% whereas the iperf3 performance can drop up to 80%.
4 hpet fallbacks happened during bootup. They were:
[ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
[ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
[ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
[ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
Other fallbacks happened when the systems were running stressful
benchmarks. For example:
[ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
[46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
Commit 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
changed the skew margin from 100us to 50us. I think this is too small
and can easily be exceeded when running some stressful workloads on a
thermally stressed system. So it is switched back to 100us.
Even a maximum skew margin of 100us may be too small for some systems
when booting up especially if those systems are under thermal stress. To
eliminate the case that the large skew is due to the system being too
busy slowing down the reading of both the watchdog and the clocksource,
an extra consecutive read of the watchdog clock is being done to check this.
The consecutive watchdog read delay is compared against
WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that
the system is just too busy. A warning will be printed to the console
and the clock skew check is skipped for this round.
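Schematically (the variable names are illustrative, not the actual code):

	/* Delay between two consecutive reads of the watchdog clock. If even
	 * those are far apart, the system is too busy for a meaningful skew
	 * check, so warn and skip this round instead of marking the
	 * clocksource unstable.
	 */
	if (wd_read_delay > WATCHDOG_MAX_SKEW / 2) {
		/* warn and skip the clock-skew check this time */
		continue;
	}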
Fixes: db3a34e174 ("clocksource: Retry clock read if long delays detected")
Fixes: 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 4e8c11b6b3 upstream.
Even after commit e1d7ba8735 ("time: Always make sure wall_to_monotonic
isn't positive") it is still possible to make wall_to_monotonic positive
by running the following code:
	#include <time.h>

	int main(void)
	{
		struct timespec time;

		clock_gettime(CLOCK_MONOTONIC, &time);
		time.tv_nsec = 0;
		clock_settime(CLOCK_REALTIME, &time);
		return 0;
	}
The reason is that the second parameter of timespec64_compare(), ts_delta,
may be unnormalized because the delta is calculated with an open coded
subtraction which causes the comparison of tv_sec to yield the wrong
result:
wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 }
ts_delta = { .tv_sec = -9, .tv_nsec = -900000000 }
That makes timespec64_compare() claim that wall_to_monotonic < ts_delta,
but actually the result should be wall_to_monotonic > ts_delta.
After normalization, the result of timespec64_compare() is correct because
the tv_sec comparison is no longer misleading:
wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 }
ts_delta = { .tv_sec = -10, .tv_nsec = 100000000 }
Use timespec64_sub() to ensure that ts_delta is normalized, which fixes the
issue.
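Sketch of the change in do_settimeofday64(), where xt is the current xtime
value the new time is compared against:

	/* Open coded delta, may leave ts_delta unnormalized: */
	ts_delta.tv_sec = ts->tv_sec - xt.tv_sec;
	ts_delta.tv_nsec = ts->tv_nsec - xt.tv_nsec;

	/* Normalized delta via the helper: */
	ts_delta = timespec64_sub(*ts, xt);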
Fixes: e1d7ba8735 ("time: Always make sure wall_to_monotonic isn't positive")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211213135727.1656662-1-liaoyu15@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e4779015fd upstream.
Patch series "mm/damon: Fix fake /proc/loadavg reports", v3.
This patchset fixes DAMON's fake load report issue. The first patch
makes yet another variant of usleep_range() for this fix, and the second
patch fixes the issue of DAMON by making it use the newly introduced
function.
This patch (of 2):
Some kernel threads such as DAMON could need to repeatedly sleep for
intervals at the microsecond level. However, because usleep_range() sleeps
in uninterruptible state, such threads would make /proc/loadavg report a
fake load.
To help such cases, this commit implements a variant of usleep_range()
called usleep_idle_range(). It is the same as usleep_range() but sets the
state of the current task to TASK_IDLE while sleeping.
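Sketched, assuming a shared usleep_range_state() primitive that takes the
task state to sleep in:

	void usleep_range_state(unsigned long min, unsigned long max,
				unsigned int state);

	static inline void usleep_idle_range(unsigned long min, unsigned long max)
	{
		/* TASK_IDLE: uninterruptible sleep that is not counted in the
		 * load average.
		 */
		usleep_range_state(min, max, TASK_IDLE);
	}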
Link: https://lkml.kernel.org/r/20211126145015.15862-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211126145015.15862-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: John Stultz <john.stultz@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ca7752caea upstream.
copy_process currently copies task_struct.posix_cputimers_work as-is. If a
timer interrupt arrives while handling clone and before dup_task_struct
completes then the child task will have:
1. posix_cputimers_work.scheduled = true
2. posix_cputimers_work.work queued.
copy_process clears task_struct.task_works, so (2) will have no effect and
posix_cpu_timers_work will never run (not to mention it doesn't make sense
for two tasks to share a common linked list).
Since posix_cpu_timers_work never runs, posix_cputimers_work.scheduled is
never cleared. Since scheduled is set, future timer interrupts will skip
scheduling work, with the ultimate result that the task will never receive
timer expirations.
Together, the complete flow is:
1. Task 1 calls clone(), enters kernel.
2. Timer interrupt fires, schedules task work on Task 1.
2a. task_struct.posix_cputimers_work.scheduled = true
2b. task_struct.posix_cputimers_work.work added to
task_struct.task_works.
3. dup_task_struct() copies Task 1 to Task 2.
4. copy_process() clears task_struct.task_works for Task 2.
5. Future timer interrupts on Task 2 see
task_struct.posix_cputimers_work.scheduled = true and skip scheduling
work.
Fix this by explicitly clearing the contents of task_struct.posix_cputimers_work
in copy_process(). This was never meant to be shared or inherited across
tasks in the first place.
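A sketch of the clearing helper invoked from copy_process() (details may
differ from the actual patch):

	static void clear_posix_cputimers_work(struct task_struct *p)
	{
		/* A work entry copied from the parent is meaningless here:
		 * reset the callback and drop the stale 'scheduled' flag so
		 * the child can schedule its own expiry work again.
		 */
		memset(&p->posix_cputimers_work.work, 0,
		       sizeof(p->posix_cputimers_work.work));
		init_task_work(&p->posix_cputimers_work.work,
			       posix_cpu_timers_work);
		p->posix_cputimers_work.scheduled = false;
	}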
Fixes: 1fb497dd00 ("posix-cpu-timers: Provide mechanisms to defer timer handling to task_work")
Reported-by: Rhys Hiltner <rhys@justin.tv>
Signed-off-by: Michael Pratt <mpratt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20211101210615.716522-1-mpratt@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Resetting/stopping an itimer eventually leads to it being reprogrammed
with an actual "0" value. As a result the itimer expires on the next
tick, triggering an unexpected signal.
To fix this, make sure that
struct signal_struct::it[CPUCLOCK_PROF/VIRT]::expires is set to 0 when
setitimer() passes a 0 it_value, indicating that the timer must stop.
Fixes: 406dd42bd1 ("posix-cpu-timers: Force next expiration recalc after itimer reset")
Reported-by: Victor Stinner <vstinner@redhat.com>
Reported-by: Chris Hixon <linux-kernel-bugs@hixontech.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210913145332.232023-1-frederic@kernel.org
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...
A program may create multiple interval timers using timer_create(). For
each timer the kernel preallocates a "queued real-time signal".
Consequently, the number of timers is limited by the RLIMIT_SIGPENDING
resource limit. The allocated object is quite small, ~250 bytes, but even
the default signal limits allow consuming up to 100 megabytes per user.
It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.
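The accounting itself boils down to one flag when creating the timer cache;
roughly:

	posix_timers_cache = kmem_cache_create("posix_timers_cache",
					       sizeof(struct k_itimer), 0,
					       SLAB_PANIC | SLAB_ACCOUNT,
					       NULL);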
Link: https://lkml.kernel.org/r/57795560-025c-267c-6b1a-dea852d95530@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A container admin can create new namespaces and force the kernel to
allocate up to several pages of memory for the namespaces and their
associated structures.
Net and uts namespaces have already enabled accounting for such
allocations. It makes sense to account for the remaining ones to restrict
the host's memory consumption from inside the memcg-limited container.
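Per namespace the change amounts to switching the allocation to an accounted
one; a generic sketch:

	/* Before: the namespace object is not charged to the caller's memcg. */
	ns = kzalloc(sizeof(*ns), GFP_KERNEL);

	/* After: charge the allocation to the creating cgroup. */
	ns = kzalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);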
Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The clocksource watchdog test sets a local JIFFIES_SHIFT macro and assumes
that HZ is >= 100. For smaller HZ values this shift value is too large and
causes undefined behaviour.
Move the HZ-based definitions of JIFFIES_SHIFT from kernel/time/jiffies.c
to kernel/time/tick-internal.h so the clocksource watchdog test can utilize
them, which makes it work correctly with all HZ values.
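For reference, the HZ-dependent definitions being moved look roughly like
this:

	/* Shift for the jiffies clocksource multiplier: larger tick periods
	 * (smaller HZ) need a smaller shift to avoid overflowing the mult.
	 */
	#if HZ < 34
	# define JIFFIES_SHIFT	6
	#elif HZ < 67
	# define JIFFIES_SHIFT	7
	#else
	# define JIFFIES_SHIFT	8
	#endif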
[ tglx: Resolved conflicts and massaged changelog ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20210812000133.GA402890@paulmck-ThinkPad-P17-Gen-1/
Since the recent consolidation of reprogramming functions,
hrtimer_force_reprogram() is affected by a check whether the new expiry
time is past the current expiry time.
This breaks the NOHZ logic as that relies on the fact that the tick hrtimer
is moved into the future. That means cpu_base->expires_next becomes stale
and subsequent reprogramming attempts fail as well until the situation is
cleaned up by an hrtimer interrupt.
For some yet unknown reason this leads to a complete stall, so for now
partially revert the offending commit to a known working state. The root
cause of the stall is still being investigated and will be fixed in a subsequent
commit.
Fixes: b14bca97c9 ("hrtimer: Consolidate reprogramming code")
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Link: https://lore.kernel.org/r/8735recskh.ffs@tglx
clock_was_set() can be invoked from preemptible context. Use raw_cpu_ptr()
to check whether high resolution mode is active or not. It does not matter
whether the task migrates after acquiring the pointer.
Fixes: e71a4153b7 ("hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/875ywacsmb.ffs@tglx
By unconditionally updating the offsets there are more indicators
whether the SMP function calls on clock_was_set() can be avoided:
- When the offset update already happened on the remote CPU then the
remote update attempt will yield the same sequence number and no
IPI is required.
- When the remote CPU is currently handling hrtimer_interrupt(). In
that case the remote CPU will reevaluate the timer bases before
reprogramming anyway, so nothing to do.
- After updating it can be checked whether the first expiring timer in
the affected clock bases moves before the first expiring (softirq)
timer of the CPU. If that's not the case then sending the IPI is not
required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.887322464@linutronix.de
Setting of clocks triggers an unconditional SMP function call on all online
CPUs to reprogram the clock event device.
However, only some clocks have their offsets updated and therefore
potentially require a reprogram. That's CLOCK_REALTIME and CLOCK_TAI and in
the case of resume (delayed sleep time injection) also CLOCK_BOOTTIME.
Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the affected clock bases which are
indicated by the caller in the @bases argument of clock_was_set().
If that's not the case, skip the IPI and update the offsets remotely which
ensures that any subsequently armed timers on the affected clocks are
evaluated with the correct offsets.
[ tglx: Adapted to the new bases argument, removed the softirq_active
check, added comment, fixed up stale comment ]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.787536542@linutronix.de
clock_was_set() unconditionally invokes retrigger_next_event() on all online
CPUs. This was necessary because that mechanism was also used for resume
from suspend to idle, which is no longer the case.
The bases argument allows the callers of clock_was_set() to hand in a mask
which tells clock_was_set() which of the hrtimer clock bases are affected
by the clock setting. This mask will be used in the next step to check
whether a CPU base has timers queued on a clock base affected by the event
and avoid the SMP function call if there are none.
Add a @bases argument, provide defines for the active bases masking and
fixup all callsites.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.691083465@linutronix.de
do_adjtimex() might end up scheduling a delayed clock_was_set() via
timekeeping_advance() and then invoke clock_was_set() directly which is
pointless.
Make timekeeping_advance() return whether an invocation of clock_was_set()
is required and handle it at the call sites which allows do_adjtimex() to
issue a single direct call if required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.580966888@linutronix.de
Resuming timekeeping is a clock-was-set event and uses the clock-was-set
notification mechanism. This is in the way of making the clock-was-set
update for hrtimers selective so unnecessary IPIs are avoided when a CPU
base does not have timers queued which are affected by the clock setting.
Disentangle it by invoking hrtimer_resume() on each unfreezing CPU and invoking
the new timerfd_resume() function from timekeeping_resume() which is the
only place where this is needed.
Rename hrtimer_resume() to hrtimer_resume_local() to reflect the change.
With this the clock_was_set*() functions are no longer required to IPI all
CPUs unconditionally and can get some smarts to avoid them.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.488853478@linutronix.de
When CONFIG_HIGH_RES_TIMERS is disabled, but NOHZ is enabled then
clock_was_set() is not doing anything. With HIGHRES=n the kernel relies on
the periodic tick to update the clock offsets, but when NOHZ is enabled and
active then CPUs which are in a deep idle sleep do not have a periodic tick
which means the expiry of timers affected by clock_was_set() can be
arbitrarily delayed up to the point where the CPUs are brought out of idle
again.
Make the clock_was_set() logic unconditionally available so that idle CPUs
are kicked out of idle to handle the update.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.288697903@linutronix.de
If high resolution timers are disabled, the timerfd notification about a
clock-was-set event does not happen for all cases which use
clock_was_set_delayed() because that is a NOP for HIGHRES=n, which is wrong.
Make clock_was_set_delayed() unconditionally available to fix that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.196661266@linutronix.de
This code is mostly duplicated. The redundant store in the force reprogram
case does no harm and the in hrtimer interrupt condition cannot be true for
the force reprogram invocations.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.054424875@linutronix.de
If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
the timer has to be canceled first and then added back. If the timer is the
first expiring timer then on removal the clockevent device is reprogrammed
to the next expiring timer to avoid that the pending expiry fires needlessly.
If the new expiry time ends up being the first expiry again, then the clock
event device has to be reprogrammed again.
Avoid this by checking whether the timer is the first to expire and in that
case, keep the timer on the current CPU and delay the reprogramming up to
the point where the timer has been enqueued again.
Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.de
There are several scenarios that can result in posix_cpu_timer_set()
not queueing the timer but still leaving the threadgroup cputime counter
running or keeping the tick dependency around for a random amount of time.
1) If timer_settime() is called with a 0 expiration on a timer that is
already disabled, the process wide cputime counter will be started
and won't ever get a chance to be stopped by stop_process_timers()
since no timer is actually armed to be processed.
The following snippet is enough to trigger the issue.
	void trigger_process_counter(void)
	{
		timer_t id;
		struct itimerspec val = { };

		timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
		timer_settime(id, TIMER_ABSTIME, &val, NULL);
		timer_delete(id);
	}
2) If timer_settime() is called with a 0 expiration on a timer that is
already armed, the timer is dequeued but not really disarmed. So the
process wide cputime counter and the tick dependency may still remain
around for a while.
The following code snippet keeps this overhead around for one week after
the timer deletion:
	void trigger_process_counter(void)
	{
		timer_t id;
		struct itimerspec val = { };

		val.it_value.tv_sec = 604800;
		timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
		timer_settime(id, 0, &val, NULL);
		timer_delete(id);
	}
3) If the timer was initially deactivated, this call to timer_settime()
with an early expiration may have started the process wide cputime
counter even though the timer hasn't been queued and armed because it
has fired early and inline within posix_cpu_timer_set() itself. As a
result the process wide cputime counter may never stop until a new
timer is armed in the future.
The following code snippet can reproduce this:
	void trigger_process_counter(void)
	{
		timer_t id;
		struct itimerspec val = { };

		signal(SIGALRM, SIG_IGN);
		timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
		val.it_value.tv_nsec = 1;
		timer_settime(id, TIMER_ABSTIME, &val, NULL);
	}
4) If the timer was initially armed with a former expiration value
before this call to timer_settime() and the current call sets an
early deadline that has already expired, the timer fires inline
within posix_cpu_timer_set(). In this case it must have been dequeued
before firing inline with its new expiration value, yet it hasn't
been disarmed. So the process wide cputime counter and
the tick dependency may still be around for a while even after the
timer fired.
The following code snippet can reproduce this:
	void trigger_process_counter(void)
	{
		timer_t id;
		struct itimerspec val = { };

		signal(SIGALRM, SIG_IGN);
		timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
		val.it_value.tv_sec = 100;
		timer_settime(id, TIMER_ABSTIME, &val, NULL);
		val.it_value.tv_sec = 0;
		val.it_value.tv_nsec = 1;
		timer_settime(id, TIMER_ABSTIME, &val, NULL);
	}
Fix all these issues by triggering the related base's next expiration
recalculation on the next tick. This also implies re-evaluating the need
to keep the process wide cputime counter and the tick dependency around, in
a similar fashion to disarm_timer().
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-7-frederic@kernel.org
Remove the ad-hoc timer base accessors and provide a consolidated one.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-6-frederic@kernel.org
The end of the function cannot be reached with an error in variable
ret. Unconfuse reviewers about that.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-5-frederic@kernel.org
When an itimer deactivates a previously armed expiration, it simply doesn't
do anything. As a result the process wide cputime counter keeps running and
the tick dependency stays set until it reaches the old ghost expiration
value.
This can be reproduced with the following snippet:
	void trigger_process_counter(void)
	{
		struct itimerval n = {};

		n.it_value.tv_sec = 100;
		setitimer(ITIMER_VIRTUAL, &n, NULL);
		n.it_value.tv_sec = 0;
		setitimer(ITIMER_VIRTUAL, &n, NULL);
	}
Fix this by resetting the relevant base expiration. This is similar to
disarming a timer.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-4-frederic@kernel.org
A timer deletion only dequeues the timer but it doesn't shut down
the related costly process wide cputime counter and the tick dependency.
The following code snippet keeps this overhead around for one week after
the timer deletion:
	void trigger_process_counter(void)
	{
		timer_t id;
		struct itimerspec val = { };

		val.it_value.tv_sec = 604800;
		timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
		timer_settime(id, 0, &val, NULL);
		timer_delete(id);
	}
Make sure the next target's tick recalculates the nearest expiration and
clears the process wide counter and tick dependency if necessary.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-3-frederic@kernel.org
Starting the process wide cputime counter needs to be done in the same
sighand locking sequence as actually arming the related timer, otherwise
this races against concurrent timers setting/expiring in the same
threadgroup.
Detecting that the cputime counter is started without holding the sighand
lock is a first step toward debugging such situations.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-2-frederic@kernel.org
The variable ret is being initialized with a value that is never read; it
is updated later on. The assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210721120147.109570-1-colin.king@canonical.com
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-35-bigeasy@linutronix.de
syzbot reported KCSAN data races vs. timer_base::timer_running being set to
NULL without holding base::lock in expire_timers().
This looks innocent and most reads are clearly not problematic, but
Frederic identified an issue which is:
  int data = 0;

  void timer_func(struct timer_list *t)
  {
          data = 1;
  }

  CPU 0                                             CPU 1
  ------------------------------                    --------------------------
  base = lock_timer_base(timer, &flags);            raw_spin_unlock(&base->lock);
  if (base->running_timer != timer)                 call_timer_fn(timer, fn, baseclk);
    ret = detach_if_pending(timer, base, true);     base->running_timer = NULL;
  raw_spin_unlock_irqrestore(&base->lock, flags);   raw_spin_lock(&base->lock);

  x = data;
If the timer has previously executed on CPU 1, then CPU 0 can observe
base->running_timer == NULL and return, assuming the timer has completed,
but that's not guaranteed on all architectures. The comment for
del_timer_sync() makes that guarantee. Moving the assignment under
base->lock prevents this.
For non-RT kernels it's performance-wise completely irrelevant whether the
store happens before or after taking the lock. For an RT kernel moving the
store under the lock requires an extra unlock/lock pair in the case that
there is a waiter for the timer, but that's not the end of the world.
Reported-by: syzbot+aa7c2385d46c5eba0b89@syzkaller.appspotmail.com
Reported-by: syzbot+abea4558531bae1ba9fe@syzkaller.appspotmail.com
Fixes: 030dcdd197 ("timers: Prepare support for PREEMPT_RT")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/87lfea7gw8.fsf@nanos.tec.linutronix.de
Cc: stable@vger.kernel.org
31cd0e119d ("timers: Recalculate next timer interrupt only when
necessary") subtly altered get_next_timer_interrupt()'s behaviour. The
function no longer consistently returns KTIME_MAX with no timers
pending.
In order to decide if there are any timers pending we check whether the
next expiry will happen NEXT_TIMER_MAX_DELTA jiffies from now.
Unfortunately, the next expiry time and the timer base clock are no
longer updated in unison. The former changes upon certain timer
operations (enqueue, expire, detach), whereas the latter keeps track of
jiffies as they move forward, ultimately breaking the logic above.
A simplified example:
- Upon entering get_next_timer_interrupt() with:
jiffies = 1
base->clk = 0;
base->next_expiry = NEXT_TIMER_MAX_DELTA;
'base->next_expiry == base->clk + NEXT_TIMER_MAX_DELTA', the function
returns KTIME_MAX.
- 'base->clk' is updated to the jiffies value.
- The next time we enter get_next_timer_interrupt(), taking into account
no timer operations happened:
base->clk = 1;
base->next_expiry = NEXT_TIMER_MAX_DELTA;
'base->next_expiry != base->clk + NEXT_TIMER_MAX_DELTA', the function
returns a valid expire time, which is incorrect.
This ultimately might unnecessarily rearm sched's timer on nohz_full
setups, and add latency to the system[1].
So, introduce 'base->timers_pending'[2], update it every time
'base->next_expiry' changes, and use it in get_next_timer_interrupt().
[1] See tick_nohz_stop_tick().
[2] A quick pahole check on x86_64 and arm64 shows it doesn't make
'struct timer_base' any bigger.
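With the flag in place, get_next_timer_interrupt() can key off it directly;
roughly:

	if (base->timers_pending) {
		/* If the next event is already in the past, expire now. */
		nextevt = time_before(nextevt, basej) ? basej : nextevt;
		expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
	}
	/* else: nothing queued, expires stays at KTIME_MAX */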
Fixes: 31cd0e119d ("timers: Recalculate next timer interrupt only when necessary")
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Since the process wide cputime counter is started locklessly from
posix_cpu_timer_rearm(), it can be concurrently stopped by operations
on other timers from the same thread group, such as in the following
unlucky scenario:
      CPU 0                                    CPU 1
      -----                                    -----
                                               timer_settime(TIMER B)
      posix_cpu_timer_rearm(TIMER A)
          cpu_clock_sample_group()
              (pct->timers_active already true)

                                               handle_posix_cpu_timers()
                                                   check_process_timers()
                                                       stop_process_timers()
                                                           pct->timers_active = false
      arm_timer(TIMER A)

      tick -> run_posix_cpu_timers()
          // sees !pct->timers_active, ignore
          // our TIMER A
Fix this by simply locking the process wide cputime counting start and
the timer arm in the same block.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Fixes: 60f2ceaa81 ("posix-cpu-timers: Remove unnecessary locking around cpu_clock_sample_group")
Cc: stable@vger.kernel.org
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Pull RCU updates from Paul McKenney:
- Bitmap parsing support for "all" as an alias for all bits
- Documentation updates
- Miscellaneous fixes, including some that overlap into mm and lockdep
- kvfree_rcu() updates
- mem_dump_obj() updates, with acks from one of the slab-allocator
maintainers
- RCU NOCB CPU updates, including limited deoffloading
- SRCU updates
- Tasks-RCU updates
- Torture-test updates
* 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
rcu: Add missing __releases() annotation
rcu: Remove obsolete rcu_read_unlock() deadlock commentary
rcu: Improve comments describing RCU read-side critical sections
rcu: Create an unrcu_pointer() to remove __rcu from a pointer
srcu: Early test SRCU polling start
rcu: Fix various typos in comments
rcu/nocb: Unify timers
rcu/nocb: Prepare for fine-grained deferred wakeup
rcu/nocb: Only cancel nocb timer if not polling
rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
rcu/nocb: Allow de-offloading rdp leader
rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
rcu: Don't penalize priority boosting when there is nothing to boost
rcu: Point to documentation of ordering guarantees
rcu: Make rcu_gp_cleanup() be noinline for tracing
rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
...
Merge tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Time and clocksource/clockevent related updates:
Core changes:
- Infrastructure to support per CPU "broadcast" devices for per CPU
clockevent devices which stop in deep idle states. This allows us
to utilize the more efficient architected timer on certain ARM SoCs
for normal operation instead of permanentely using the slow to
access SoC specific clockevent device.
- Print the name of the broadcast/wakeup device in /proc/timer_list
- Make the clocksource watchdog more robust against delays between
reading the current active clocksource and the watchdog
clocksource. Such delays can be caused by NMIs, SMIs and vCPU
preemption.
Handle this by reading the watchdog clocksource twice, i.e. before
and after reading the current active clocksource. In case that the
two watchdog reads show an excessive time delta, the read sequence
is repeated up to 3 times.
- Improve the debug output and add a test module for the watchdog
mechanism.
- Reimplementation of the venerable time64_to_tm() function with a
faster and significantly smaller version. Straight from the source,
i.e. the author of the related research paper contributed this!
Driver changes:
- No new drivers, not even new device tree bindings!
- Fixes, improvements and cleanups and all over the place"
* tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
time/kunit: Add missing MODULE_LICENSE()
time: Improve performance of time64_to_tm()
clockevents: Use list_move() instead of list_del()/list_add()
clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
clocksource: Provide kernel module to test clocksource watchdog
clocksource: Reduce clocksource-skew threshold
clocksource: Limit number of CPUs checked for clock synchronization
clocksource: Check per-CPU clock synchronization when marked unstable
clocksource: Retry clock read if long delays detected
clockevents: Add missing parameter documentation
clocksource/drivers/timer-ti-dm: Drop unnecessary restore
clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
clocksource/drivers/arm_global_timer: Remove duplicated argument in arm_global_timer
clocksource/drivers/arm_global_timer: Make symbol 'gt_clk_rate_change_nb' static
arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
clocksource/drivers/arm_global_timer: Implement rate compensation whenever source clock changes
clocksource/drivers/ingenic: Rename unreasonable array names
clocksource/drivers/timer-ti-dm: Save and restore timer TIOCP_CFG
clocksource/drivers/mediatek: Ack and disable interrupts on suspend
clocksource/drivers/samsung_pwm: Constify source IO memory
...
Merge tag 'timers-nohz-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timers/nohz updates from Ingo Molnar:
- Micro-optimize tick_nohz_full_cpu()
- Optimize idle exit tick restarts to be less eager
- Optimize tick_nohz_dep_set_task() to only wake up a single CPU.
This reduces IPIs and interruptions on nohz_full CPUs.
- Optimize tick_nohz_dep_set_signal() in a similar fashion.
- Skip IPIs in tick_nohz_kick_task() when trying to kick a
non-running task.
- Micro-optimize tick_nohz_task_switch() IRQ flags handling to
reduce context switching costs.
- Misc cleanups and fixes
* tag 'timers-nohz-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Add myself as context tracking maintainer
tick/nohz: Call tick_nohz_task_switch() with interrupts disabled
tick/nohz: Kick only _queued_ task whose tick dependency is updated
tick/nohz: Change signal tick dependency to wake up CPUs of member tasks
tick/nohz: Only wake up a single target cpu when kicking a task
tick/nohz: Update nohz_full Kconfig help
tick/nohz: Update idle_exittime on actual idle exit
tick/nohz: Remove superflous check for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
tick/nohz: Conditionally restart tick on idle exit
tick/nohz: Evaluate the CPU expression after the static key
The current implementation of time64_to_tm() contains unnecessary loops,
branches and look-up tables. The new one uses an arithmetic-based algorithm
that appeared in [1] and is approximately 3x faster (YMMV).
The drawback is that the new code isn't intuitive and contains many 'magic
numbers' (not unusual for this type of algorithm). However, [1] justifies
all those numbers and, given this function's history, the code is unlikely
to need much maintenance, if any at all.
Add a KUnit test for it which checks every day in a 160,000-year interval
centered at 1970-01-01 against the expected result.
[1] Neri, Schneider, "Euclidean Affine Functions and Applications to
Calendar Algorithms". https://arxiv.org/abs/2102.06959
Signed-off-by: Cassio Neri <cassio.neri@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210622213616.313046-1-cassio.neri@gmail.com
Currently when an unstable clocksource is detected, the raw counters of
that clocksource and watchdog will be printed, which can only be understood
after some math calculation.
So print the delta in nanoseconds as well to make it easier for humans to
check the results.
[ paulmck: Fix typo. ]
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210527190124.440372-6-paulmck@kernel.org