The rng's random_init() function contributes the real time to the rng at
boot time, so that events can at least start in relation to something
particular in the real world. But this clock might not yet be set that
point in boot, so nothing is contributed. In addition, the relation
between minor clock changes from, say, NTP, and the cycle counter is
potentially useful entropic data.
This commit addresses this by mixing in a time stamp on calls to
settimeofday and adjtimex. No entropy is credited in doing so, so it
doesn't make initialization faster, but it is still useful input to
have.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmKKpM8ACgkQSfxwEqXe
A6726w/+OJimGd4arvpSmdn+vxepSyDLgKfwM0x5zprRVd16xg8CjJr4eMonTesq
YvtJRqpetb53MB+sMhutlvQqQzrjtf2MBkgPwF4I2gUrk7vLD45Q+AGdGhi/rUwz
wHGA7xg1FHLHia2M/9idSqi8QlZmUP4u4l5ZnMyTUHiwvRD6XOrWKfqvUSawNzyh
hCWlTUxDrjizsW5YpsJX/MkRadSC8loJEk5ByZebow6nRPfurJvqfrcOMgHyNrbY
pOZ/CGPxcetMqotL2TuuJt5wKmenqYhIWGAp3YM2SWWgU2ueBZekW8AYeMfgUcvh
LWV93RpSuAnE5wsdjIULvjFnEDJBf8ihfMnMrd9G5QjQu44tuKWfY2MghLSpYzaR
V6UFbRmhrqhqiStHQXOvk1oqxtpbHlc9zzJLmvPmDJcbvzXQ9Opk5GVXAmdtnHnj
M/ty3wGWxucY6mHqT8MkCShSSslbgEtc1pEIWHdrUgnaiSVoCVBEO+9LqLbjvOTm
XA/6YtoiCE5FasK51pir1zVb2GORQn0v8HnuAOsusD/iPAlRQ/G5jZkaXbwRQI6j
atYL1svqvSKn5POnzqAlMUXfMUr19K5xqJdp7i6qmlO1Vq6Z+tWbCQgD1JV+Wjkb
CMyvXomFCFu4aYKGRE2SBRnWLRghG3kYHqEQ15yTPMQerxbUDNg=
=SUr3
-----END PGP SIGNATURE-----
Merge tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"These updates continue to refine the work began in 5.17 and 5.18 of
modernizing the RNG's crypto and streamlining and documenting its
code.
New for 5.19, the updates aim to improve entropy collection methods
and make some initial decisions regarding the "premature next" problem
and our threat model. The cloc utility now reports that random.c is
931 lines of code and 466 lines of comments, not that basic metrics
like that mean all that much, but at the very least it tells you that
this is very much a manageable driver now.
Here's a summary of the various updates:
- The random_get_entropy() function now always returns something at
least minimally useful. This is the primary entropy source in most
collectors, which in the best case expands to something like RDTSC,
but prior to this change, in the worst case it would just return 0,
contributing nothing. For 5.19, additional architectures are wired
up, and architectures that are entirely missing a cycle counter now
have a generic fallback path, which uses the highest resolution
clock available from the timekeeping subsystem.
Some of those clocks can actually be quite good, despite the CPU
not having a cycle counter of its own, and going off-core for a
stamp is generally thought to increase jitter, something positive
from the perspective of entropy gathering. Done very early on in
the development cycle, this has been sitting in next getting some
testing for a while now and has relevant acks from the archs, so it
should be pretty well tested and fine, but is nonetheless the thing
I'll be keeping my eye on most closely.
- Of particular note with the random_get_entropy() improvements is
MIPS, which, on CPUs that lack the c0 count register, will now
combine the high-speed but short-cycle c0 random register with the
lower-speed but long-cycle generic fallback path.
- With random_get_entropy() now always returning something useful,
the interrupt handler now collects entropy in a consistent
construction.
- Rather than comparing two samples of random_get_entropy() for the
jitter dance, the algorithm now tests many samples, and uses the
amount of differing ones to determine whether or not jitter entropy
is usable and how laborious it must be. The problem with comparing
only two samples was that if the cycle counter was extremely slow,
but just so happened to be on the cusp of a change, the slowness
wouldn't be detected. Taking many samples fixes that to some
degree.
This, combined with the other improvements to random_get_entropy(),
should make future unification of /dev/random and /dev/urandom
maybe more possible. At the very least, were we to attempt it again
today (we're not), it wouldn't break any of Guenter's test rigs
that broke when we tried it with 5.18. So, not today, but perhaps
down the road, that's something we can revisit.
- We attempt to reseed the RNG immediately upon waking up from system
suspend or hibernation, making use of the various timestamps about
suspend time and such available, as well as the usual inputs such
as RDRAND when available.
- Batched randomness now falls back to ordinary randomness before the
RNG is initialized. This provides more consistent guarantees to the
types of random numbers being returned by the various accessors.
- The "pre-init injection" code is now gone for good. I suspect you
in particular will be happy to read that, as I recall you
expressing your distaste for it a few months ago. Instead, to avoid
a "premature first" issue, while still allowing for maximal amount
of entropy availability during system boot, the first 128 bits of
estimated entropy are used immediately as it arrives, with the next
128 bits being buffered. And, as before, after the RNG has been
fully initialized, it winds up reseeding anyway a few seconds later
in most cases. This resulted in a pretty big simplification of the
initialization code and let us remove various ad-hoc mechanisms
like the ugly crng_pre_init_inject().
- The RNG no longer pretends to handle the "premature next" security
model, something that various academics and other RNG designs have
tried to care about in the past. After an interesting mailing list
thread, these issues are thought to be a) mainly academic and not
practical at all, and b) actively harming the real security of the
RNG by delaying new entropy additions after a potential compromise,
making a potentially bad situation even worse. As well, in the
first place, our RNG never even properly handled the premature next
issue, so removing an incomplete solution to a fake problem was
particularly nice.
This allowed for numerous other simplifications in the code, which
is a lot cleaner as a consequence. If you didn't see it before,
https://lore.kernel.org/lkml/YmlMGx6+uigkGiZ0@zx2c4.com/ may be a
thread worth skimming through.
- While the interrupt handler received a separate code path years ago
that avoids locks by using per-cpu data structures and a faster
mixing algorithm, in order to reduce interrupt latency, input and
disk events that are triggered in hardirq handlers were still
hitting locks and more expensive algorithms. Those are now
redirected to use the faster per-cpu data structures.
- Rather than having the fake-crypto almost-siphash-based random32
implementation be used right and left, and in many places where
cryptographically secure randomness is desirable, the batched
entropy code is now fast enough to replace that.
- As usual, numerous code quality and documentation cleanups. For
example, the initialization state machine now uses enum symbolic
constants instead of just hard coding numbers everywhere.
- Since the RNG initializes once, and then is always initialized
thereafter, a pretty heavy amount of code used during that
initialization is never used again. It is now completely cordoned
off using static branches and it winds up in the .text.unlikely
section so that it doesn't reduce cache compactness after the RNG
is ready.
- A variety of functions meant for waiting on the RNG to be
initialized were only used by vsprintf, and in not a particularly
optimal way. Replacing that usage with a more ordinary setup made
it possible to remove those functions.
- A cleanup of how we warn userspace about the use of uninitialized
/dev/urandom and uninitialized get_random_bytes() usage.
Interestingly, with the change you merged for 5.18 that attempts to
use jitter (but does not block if it can't), the majority of users
should never see those warnings for /dev/urandom at all now, and
the one for in-kernel usage is mainly a debug thing.
- The file_operations struct for /dev/[u]random now implements
.read_iter and .write_iter instead of .read and .write, allowing it
to also implement .splice_read and .splice_write, which makes
splice(2) work again after it was broken here (and in many other
places in the tree) during the set_fs() removal. This was a bit of
a last minute arrival from Jens that hasn't had as much time to
bake, so I'll be keeping my eye on this as well, but it seems
fairly ordinary. Unfortunately, read_iter() is around 3% slower
than read() in my tests, which I'm not thrilled about. But Jens and
Al, spurred by this observation, seem to be making progress in
removing the bottlenecks on the iter paths in the VFS layer in
general, which should remove the performance gap for all drivers.
- Assorted other bug fixes, cleanups, and optimizations.
- A small SipHash cleanup"
* tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (49 commits)
random: check for signals after page of pool writes
random: wire up fops->splice_{read,write}_iter()
random: convert to using fops->write_iter()
random: convert to using fops->read_iter()
random: unify batched entropy implementations
random: move randomize_page() into mm where it belongs
random: remove mostly unused async readiness notifier
random: remove get_random_bytes_arch() and add rng_has_arch_random()
random: move initialization functions out of hot pages
random: make consistent use of buf and len
random: use proper return types on get_random_{int,long}_wait()
random: remove extern from functions in header
random: use static branch for crng_ready()
random: credit architectural init the exact amount
random: handle latent entropy and command line from random_init()
random: use proper jiffies comparison macro
random: remove ratelimiting for in-kernel unseeded randomness
random: move initialization out of reseeding hot path
random: avoid initializing twice in credit race
random: use symbolic constants for crng_init states
...
- Expose CLOCK_TAI to instrumentation to aid with TSN debugging.
- Ensure that the clockevent is stopped when there is no timer armed to
avoid pointless wakeups.
- Make the sched clock frequency handling and rounding consistent.
- Provide a better debugobject hint for delayed works. The timer callback
is always the same, which makes it difficult to identify the underlying
work. Use the work function as a hint instead.
- Move the timer specific sysctl code into the timer subsystem.
- The usual set of improvements and cleanups
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKLPHMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZBoEACIURtS8w9PFZ6q/2mFq0pTYi/uI/HQ
vqbB6gCbrjfL6QwInd7jxDc/UoqEOllG9pTaGdWx/0Gi9syDosEbeop7cvvt2xi+
pReoEN1kVI3JAVrQFIAuGw4EMuzYB8PfuZkm1PdozcCP9qkgDmtippVxe05sFQ+/
RPdA29vE3g63eXkSFBhEID23pQR8yKLbqVq6KcH87OipZedL+2fry3yB+/9sLuuU
/PFLbI6B9f43S2sfo6szzpFkpd6tJlBlu02IrB6gh4IxKrslmZb5onpvcf6iT+19
rFh5A15GFWoZUC8EjH1sBpATq3wA/jfGEOPWgy07N5SmobtJvWSM5yvT+gC3qXqm
C/bjyjqXzLKftG7KIXo/hWewtsjdovMbdfcMBsGiatytNBZfI1GR/4Pq60/qpTHZ
qJo35trOUcP6o1njphwONy3lisq78S7xaozpWO1hIMTcAqGgBkm/lOieGMM4hGnE
Ps0Im3ZsOXNGllulN+3h+UHstM5/y6f/vzBsw7pfIG66i6KqebAiNjbMfHCr22sX
7UavNCoFggUQgZVgUYX/AscdW4/Dwx6R5YUqj1EBqztknd70Ac4TqjaIz4Xa6ZER
z+eQSSt5XqqV2eKWA4FsQYmCIc+BvQ4apSA6+whz9vmsvCYtB7zzSfeh+xkgcl1/
Cc0N6G5+L9v0Gw==
=De28
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer and timekeeping updates from Thomas Gleixner:
- Expose CLOCK_TAI to instrumentation to aid with TSN debugging.
- Ensure that the clockevent is stopped when there is no timer armed to
avoid pointless wakeups.
- Make the sched clock frequency handling and rounding consistent.
- Provide a better debugobject hint for delayed works. The timer
callback is always the same, which makes it difficult to identify the
underlying work. Use the work function as a hint instead.
- Move the timer specific sysctl code into the timer subsystem.
- The usual set of improvements and cleanups
* tag 'timers-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Provide a better debugobjects hint for delayed works
time/sched_clock: Fix formatting of frequency reporting code
time/sched_clock: Use Hz as the unit for clock rate reporting below 4kHz
time/sched_clock: Round the frequency reported to nearest rather than down
timekeeping: Consolidate fast timekeeper
timekeeping: Annotate ktime_get_boot_fast_ns() with data_race()
timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped
timekeeping: Introduce fast accessor to clock tai
tracing/timer: Add missing argument documentation of trace points
clocksource: Replace cpumask_weight() with cpumask_empty()
timers: Move timer sysctl into the timer code
clockevents: Use dedicated list iterator variable
timers: Simplify calc_index()
timers: Initialize base::next_expiry_recalc in timers_prepare_cpu()
The addition of random_get_entropy_fallback() provides access to
whichever time source has the highest frequency, which is useful for
gathering entropy on platforms without available cycle counters. It's
not necessarily as good as being able to quickly access a cycle counter
that the CPU has, but it's still something, even when it falls back to
being jiffies-based.
In the event that a given arch does not define get_cycles(), falling
back to the get_cycles() default implementation that returns 0 is really
not the best we can do. Instead, at least calling
random_get_entropy_fallback() would be preferable, because that always
needs to return _something_, even falling back to jiffies eventually.
It's not as though random_get_entropy_fallback() is super high precision
or guaranteed to be entropic, but basically anything that's not zero all
the time is better than returning zero all the time.
Finally, since random_get_entropy_fallback() is used during extremely
early boot when randomizing freelists in mm_init(), it can be called
before timekeeping has been initialized. In that case there really is
nothing we can do; jiffies hasn't even started ticking yet. So just give
up and return 0.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Accessing timekeeper::offset_boot in ktime_get_boot_fast_ns() is an
intended data race as the reader side cannot synchronize with a writer and
there is no space in struct tk_read_base of the NMI safe timekeeper.
Mark it so.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220415091920.956045162@linutronix.de
Introduce fast/NMI safe accessor to clock tai for tracing. The Linux kernel
tracing infrastructure has support for using different clocks to generate
timestamps for trace events. Especially in TSN networks it's useful to have TAI
as trace clock, because the application scheduling is done in accordance to the
network time, which is based on TAI. With a tai trace_clock in place, it becomes
very convenient to correlate network activity with Linux kernel application
traces.
Use the same implementation as ktime_get_boot_fast_ns() does by reading the
monotonic time and adding the TAI offset. The same limitations as for the fast
boot implementation apply. The TAI offset may change at run time e.g., by
setting the time or using adjtimex() with an offset. However, these kind of
offset changes are rare events. Nevertheless, the user has to be aware and deal
with it in post processing.
An alternative approach would be to use the same implementation as
ktime_get_real_fast_ns() does. However, this requires to add an additional u64
member to the tk_read_base struct. This struct together with a seqcount is
designed to fit into a single cache line on 64 bit architectures. Adding a new
member would violate this constraint.
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220414091805.89667-2-kurt@linutronix.de
Even after commit e1d7ba8735 ("time: Always make sure wall_to_monotonic
isn't positive") it is still possible to make wall_to_monotonic positive
by running the following code:
int main(void)
{
struct timespec time;
clock_gettime(CLOCK_MONOTONIC, &time);
time.tv_nsec = 0;
clock_settime(CLOCK_REALTIME, &time);
return 0;
}
The reason is that the second parameter of timespec64_compare(), ts_delta,
may be unnormalized because the delta is calculated with an open coded
substraction which causes the comparison of tv_sec to yield the wrong
result:
wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 }
ts_delta = { .tv_sec = -9, .tv_nsec = -900000000 }
That makes timespec64_compare() claim that wall_to_monotonic < ts_delta,
but actually the result should be wall_to_monotonic > ts_delta.
After normalization, the result of timespec64_compare() is correct because
the tv_sec comparison is not longer misleading:
wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 }
ts_delta = { .tv_sec = -10, .tv_nsec = 100000000 }
Use timespec64_sub() to ensure that ts_delta is normalized, which fixes the
issue.
Fixes: e1d7ba8735 ("time: Always make sure wall_to_monotonic isn't positive")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211213135727.1656662-1-liaoyu15@huawei.com
clock_was_set() unconditionaly invokes retrigger_next_event() on all online
CPUs. This was necessary because that mechanism was also used for resume
from suspend to idle which is not longer the case.
The bases arguments allows the callers of clock_was_set() to hand in a mask
which tells clock_was_set() which of the hrtimer clock bases are affected
by the clock setting. This mask will be used in the next step to check
whether a CPU base has timers queued on a clock base affected by the event
and avoid the SMP function call if there are none.
Add a @bases argument, provide defines for the active bases masking and
fixup all callsites.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.691083465@linutronix.de
do_adjtimex() might end up scheduling a delayed clock_was_set() via
timekeeping_advance() and then invoke clock_was_set() directly which is
pointless.
Make timekeeping_advance() return whether an invocation of clock_was_set()
is required and handle it at the call sites which allows do_adjtimex() to
issue a single direct call if required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.580966888@linutronix.de
Resuming timekeeping is a clock-was-set event and uses the clock-was-set
notification mechanism. This is in the way of making the clock-was-set
update for hrtimers selective so unnecessary IPIs are avoided when a CPU
base does not have timers queued which are affected by the clock setting.
Distangle it by invoking hrtimer_resume() on each unfreezing CPU and invoke
the new timerfd_resume() function from timekeeping_resume() which is the
only place where this is needed.
Rename hrtimer_resume() to hrtimer_resume_local() to reflect the change.
With this the clock_was_set*() functions are not longer required to IPI all
CPUs unconditionally and can get some smarts to avoid them.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.488853478@linutronix.de
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation,
zap under read lock, enable/disably dirty page logging under
read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing
the architecture-specific code
- Some selftests improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
+2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
=AXUi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This is a large update by KVM standards, including AMD PSP (Platform
Security Processor, aka "AMD Secure Technology") and ARM CoreSight
(debug and trace) changes.
ARM:
- CoreSight: Add support for ETE and TRBE
- Stage-2 isolation for the host kernel when running in protected
mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- AMD PSP driver changes
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation, zap under
read lock, enable/disably dirty page logging under read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing the
architecture-specific code
- a handful of "Get rid of oprofile leftovers" patches
- Some selftests improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
KVM: selftests: Speed up set_memory_region_test
selftests: kvm: Fix the check of return value
KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
KVM: SVM: Skip SEV cache flush if no ASIDs have been used
KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
KVM: SVM: Drop redundant svm_sev_enabled() helper
KVM: SVM: Move SEV VMCB tracking allocation to sev.c
KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
KVM: SVM: Unconditionally invoke sev_hardware_teardown()
KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
KVM: SVM: Move SEV module params/variables to sev.c
KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
KVM: SVM: Zero out the VMCB array used to track SEV ASID association
x86/sev: Drop redundant and potentially misleading 'sev_enabled'
KVM: x86: Move reverse CPUID helpers to separate header file
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
...
System time snapshots are not conveying information about the current
clocksource which was used, but callers like the PTP KVM guest
implementation have the requirement to evaluate the clocksource type to
select the appropriate mechanism.
Introduce a clocksource id field in struct clocksource which is by default
set to CSID_GENERIC (0). Clocksource implementations can set that field to
a value which allows to identify the clocksource.
Store the clocksource id of the current clocksource in the
system_time_snapshot so callers can evaluate which clocksource was used to
take the snapshot and act accordingly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20201209060932.212364-5-jianyong.wu@arm.com
The struct clocksource callbacks enable() and disable() are described as a
way to allow clock sources to enter a power save mode. See commit
4614e6adaf ("clocksource: add enable() and disable() callbacks")
But using runtime PM from these callbacks triggers a cyclic lockdep warning when
switching clock source using change_clocksource().
# echo e60f0000.timer > /sys/devices/system/clocksource/clocksource0/current_clocksource
======================================================
WARNING: possible circular locking dependency detected
------------------------------------------------------
migration/0/11 is trying to acquire lock:
ffff0000403ed220 (&dev->power.lock){-...}-{2:2}, at: __pm_runtime_resume+0x40/0x74
but task is already holding lock:
ffff8000113c8f88 (tk_core.seq.seqcount){----}-{0:0}, at: multi_cpu_stop+0xa4/0x190
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (tk_core.seq.seqcount){----}-{0:0}:
ktime_get+0x28/0xa0
hrtimer_start_range_ns+0x210/0x2dc
generic_sched_clock_init+0x70/0x88
sched_clock_init+0x40/0x64
start_kernel+0x494/0x524
-> #1 (hrtimer_bases.lock){-.-.}-{2:2}:
hrtimer_start_range_ns+0x68/0x2dc
rpm_suspend+0x308/0x5dc
rpm_idle+0xc4/0x2a4
pm_runtime_work+0x98/0xc0
process_one_work+0x294/0x6f0
worker_thread+0x70/0x45c
kthread+0x154/0x160
ret_from_fork+0x10/0x20
-> #0 (&dev->power.lock){-...}-{2:2}:
_raw_spin_lock_irqsave+0x7c/0xc4
__pm_runtime_resume+0x40/0x74
sh_cmt_start+0x1c4/0x260
sh_cmt_clocksource_enable+0x28/0x50
change_clocksource+0x9c/0x160
multi_cpu_stop+0xa4/0x190
cpu_stopper_thread+0x90/0x154
smpboot_thread_fn+0x244/0x270
kthread+0x154/0x160
ret_from_fork+0x10/0x20
other info that might help us debug this:
Chain exists of:
&dev->power.lock --> hrtimer_bases.lock --> tk_core.seq.seqcount
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(tk_core.seq.seqcount);
lock(hrtimer_bases.lock);
lock(tk_core.seq.seqcount);
lock(&dev->power.lock);
*** DEADLOCK ***
2 locks held by migration/0/11:
#0: ffff8000113c9278 (timekeeper_lock){-.-.}-{2:2}, at: change_clocksource+0x2c/0x160
#1: ffff8000113c8f88 (tk_core.seq.seqcount){----}-{0:0}, at: multi_cpu_stop+0xa4/0x190
Rework change_clocksource() so it enables the new clocksource and disables
the old clocksource outside of the timekeeper_lock and seqcount write held
region. There is no requirement that these callbacks are invoked from the
lock held region.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://lore.kernel.org/r/20210211134318.323910-1-niklas.soderlund+renesas@ragnatech.se
This cleans up two ancient timer features that were never completed in
the past, CONFIG_GENERIC_CLOCKEVENTS and CONFIG_ARCH_USES_GETTIMEOFFSET.
There was only one user left for the ARCH_USES_GETTIMEOFFSET variant
of clocksource implementations, the ARM EBSA110 platform. Rather than
changing to use modern timekeeping, we remove the platform entirely as
Russell no longer uses his machine and nobody else seems to have one
any more.
The conditional code for using arch_gettimeoffset() is removed as
a result.
For CONFIG_GENERIC_CLOCKEVENTS, there are still a couple of platforms
not using clockevent drivers: parisc, ia64, most of m68k, and one
Arm platform. These all do timer ticks slighly differently, and this
gets cleaned up to the point they at least all call the same helper
function. Instead of most platforms using 'select GENERIC_CLOCKEVENTS'
in Kconfig, the polarity is now reversed, with the few remaining ones
selecting LEGACY_TIMER_TICK instead.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAl/Y1v8ACgkQmmx57+YA
GNmCvQ/9EDlgCt92r8SB+LGafDtgB8TUQZeIrs9S2mByzdxwnw0lxObIXFCnhQgh
RpG3dR+ONRDnC5eI149B377JOEFMZWe2+BtYHUHkFARtUEWatslQcz7yAGvVRK/l
TS/qReb6piKltlzuanF1bMZbjy2OhlaDRcm+OlC3y5mALR33M4emb+rJ6cSdfk3K
v1iZhrxtfQT77ztesh/oPkPiyQ6kNcz7SfpyYOb6f5VLlml2BZ7YwBSVyGY7urHk
RL3XqOUP4KKlMEAI8w0E2nvft6Fk+luziBhrMYWK0GvbmI1OESENuX/c6tgT2OQ1
DRaVHvcPG/EAY8adOKxxVyHhEJDSoz5GJV/EtjlOegsJk6RomczR1uuiT3Kvm7Ah
PktMKv4xQht1E15KPSKbOvNIEP18w2s5z6gw+jVDv8pw42pVEQManm1D+BICqrhl
fcpw6T1drf9UxAjwX4+zXtmNs+a+mqiFG8puU4VVgT4GpQ8umHvunXz2WUjZO0jc
3m8ErJHBvtJwW5TOHGyXnjl9SkwPzHOfF6IcXTYWEDU4/gQIK9TwUvCjLc0lE27t
FMCV2ds7/K1CXwRgpa5IrefSkb8yOXSbRZ56NqqF7Ekxw4J5bYRSaY7jb+qD/e+3
5O1y+iPxFrpH+16hSahvzrtcdFNbLQvBBuRtEQOYuHLt2UJrNoU=
=QpNs
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-timers-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cross-architecture timer cleanup from Arnd Bergmann:
"This cleans up two ancient timer features that were never completed in
the past, CONFIG_GENERIC_CLOCKEVENTS and CONFIG_ARCH_USES_GETTIMEOFFSET.
There was only one user left for the ARCH_USES_GETTIMEOFFSET variant
of clocksource implementations, the ARM EBSA110 platform. Rather than
changing to use modern timekeeping, we remove the platform entirely as
Russell no longer uses his machine and nobody else seems to have one
any more.
The conditional code for using arch_gettimeoffset() is removed as a
result.
For CONFIG_GENERIC_CLOCKEVENTS, there are still a couple of platforms
not using clockevent drivers: parisc, ia64, most of m68k, and one Arm
platform. These all do timer ticks slighly differently, and this gets
cleaned up to the point they at least all call the same helper
function.
Instead of most platforms using 'select GENERIC_CLOCKEVENTS' in
Kconfig, the polarity is now reversed, with the few remaining ones
selecting LEGACY_TIMER_TICK instead"
* tag 'asm-generic-timers-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
timekeeping: default GENERIC_CLOCKEVENTS to enabled
timekeeping: remove xtime_update
m68k: remove timer_interrupt() function
m68k: change remaining timers to legacy_timer_tick
m68k: m68328: use legacy_timer_tick()
m68k: sun3/sun3c: use legacy_timer_tick
m68k: split heartbeat out of timer function
m68k: coldfire: use legacy_timer_tick()
parisc: use legacy_timer_tick
ARM: rpc: use legacy_timer_tick
ia64: convert to legacy_timer_tick
timekeeping: add CONFIG_LEGACY_TIMER_TICK
timekeeping: remove arch_gettimeoffset
net: remove am79c961a driver
ARM: remove ebsa110 platform
The kernel-doc parser complains:
kernel/time/timekeeping.c:1543: warning: Function parameter or member
'ts' not described in 'read_persistent_clock64'
kernel/time/timekeeping.c:764: warning: Function parameter or member
'tk' not described in 'timekeeping_forward_now'
kernel/time/timekeeping.c:1331: warning: Function parameter or member
'ts' not described in 'timekeeping_inject_offset'
kernel/time/timekeeping.c:1331: warning: Excess function parameter 'tv'
description in 'timekeeping_inject_offset'
Add the missing parameter documentations and rename the 'tv' parameter of
timekeeping_inject_offset() to 'ts' so it matches the implemention.
[ tglx: Reworded a few docs and massaged changelog ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-5-git-send-email-alex.shi@linux.alibaba.com
Address the following kernel-doc markup warnings:
kernel/time/timekeeping.c:1563: warning: Function parameter or member
'wall_time' not described in 'read_persistent_wall_and_boot_offset'
kernel/time/timekeeping.c:1563: warning: Function parameter or member
'boot_offset' not described in 'read_persistent_wall_and_boot_offset'
The parameters are described but miss the leading '@' and the colon after
the parameter names.
[ tglx: Massaged changelog ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-6-git-send-email-alex.shi@linux.alibaba.com
The kernel-doc parser complains about:
kernel/time/timekeeping.c:651: warning: Function parameter or member
'nb' not described in 'pvclock_gtod_register_notifier'
kernel/time/timekeeping.c:670: warning: Function parameter or member
'nb' not described in 'pvclock_gtod_unregister_notifier'
Add the missing parameter explanations.
[ tglx: Massaged changelog ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-3-git-send-email-alex.shi@linux.alibaba.com
Alex reported the following warning:
kernel/time/timekeeping.c:464: warning: Function parameter or member
'tkf' not described in '__ktime_get_fast_ns'
which is not entirely correct because the documented function is
ktime_get_mono_fast_ns() which does not have a parameter, but the
kernel-doc parser looks at the function declaration which follows the
comment and complains about the missing parameter documentation.
Aside of that the documentation for the rest of the NMI safe accessors is
either incomplete or missing.
- Move the function documentation to the right place
- Fixup the references and inconsistencies
- Add the missing documentation for ktime_get_raw_fast_ns()
Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Address the following warning:
kernel/time/timekeeping.c:415: warning: Function parameter or member
'tkf' not described in 'update_fast_timekeeper'
[ tglx: Remove the bogus ktime_get_mono_fast_ns() part ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-2-git-send-email-alex.shi@linux.alibaba.com
Various static functions in the timekeeping code have function comments
which pretend to be kernel-doc, but are incomplete and trigger parser
warnings.
As these functions are local to the timekeeping core code there is no need
to expose them via kernel-doc.
Remove the double star kernel-doc marker and remove excess newlines.
[ tglx: Massaged changelog and removed excess newlines ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-4-git-send-email-alex.shi@linux.alibaba.com
There are no more users of xtime_update aside from legacy_timer_tick(),
so fold it into that function and remove the declaration.
update_process_times() is now only called inside of the kernel/time/
code, so the declaration can be moved there.
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
With Arm EBSA110 gone, nothing uses it any more, so the corresponding
code and the Kconfig option can be removed.
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
- Add deadlock detection for recursive read-locks. The rationale is outlined
in:
224ec489d3cd: ("lockdep/Documention: Recursive read lock detection reasoning")
The main deadlock pattern we want to detect is:
TASK A: TASK B:
read_lock(X);
write_lock(X);
read_lock_2(X);
- Add "latch sequence counters" (seqcount_latch_t):
A sequence counter variant where the counter even/odd value is used to
switch between two copies of protected data. This allows the read path,
typically NMIs, to safely interrupt the write side critical section.
We utilize this new variant for sched-clock, and to make x86 TSC handling safer.
- Other seqlock cleanups, fixes and enhancements
- KCSAN updates
- LKMM updates
- Misc updates, cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+EX6QRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g3gxAAkg+Jy/tcdRxlxlEDOQPFy1mBqvFmulNA
pGFPkB6dzqmAWF/NfOZSl4g/h/mqGYsq2V+PfK5E8Sq8DQ/yCmnLhjgVOHNUUliv
x0WWfOysNgJdtdf69NLYJufIQhxhyI0dwFHHoHIsCdGdGqjh2DVevQFPFTBjdpOc
BUZYo+u3gCaCdB6A2nmlcWYbEw8eVEHgv3qLG6dq46J0KJOV0HfliqJoU3EZqH+s
977LvEIo+THfuYWMo/Jepwngbi0y36KeeukOAdwm9fK196htBHIUR+YPPrAe+FWD
z+UXP5IS5XIw9V1sGLmUaC2m+6gpdW19jKBtlzPkxHXmJmsgiZdLLeytEh3WYey7
nzfH+9Jd4NyyZKucLssYkOjf6P5BxGKCyJ9LXb7vlSthIhiDdFNx47oKtW4hxjOY
jubsI3BP5c3G1sIBIjTS53XmOhJg+Z52FxTpQ33JswXn1wGidcHZiuNHZuU5q28p
+tn8rGb2NGJFb4Sw/Vp0yTcqIpEXf+vweiQoaxm6tc9BWzcVzZntGnh0i3gFotx/
VgKafN4+pgXgo6bwHbN2WBK2FGyvcXFaptfaOMZL48En82hJ1DI6EnBEYN+vuERQ
JcCXg+iHeeVbxoou7q8NJxITkBmEL5xNBIugXRRqNSP3fXLxKjFuPYqT84/e7yZi
elGTReYcq6g=
=Iq51
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"These are the locking updates for v5.10:
- Add deadlock detection for recursive read-locks.
The rationale is outlined in commit 224ec489d3 ("lockdep/
Documention: Recursive read lock detection reasoning")
The main deadlock pattern we want to detect is:
TASK A: TASK B:
read_lock(X);
write_lock(X);
read_lock_2(X);
- Add "latch sequence counters" (seqcount_latch_t):
A sequence counter variant where the counter even/odd value is used
to switch between two copies of protected data. This allows the
read path, typically NMIs, to safely interrupt the write side
critical section.
We utilize this new variant for sched-clock, and to make x86 TSC
handling safer.
- Other seqlock cleanups, fixes and enhancements
- KCSAN updates
- LKMM updates
- Misc updates, cleanups and fixes"
* tag 'locking-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"
lockdep: Fix lockdep recursion
lockdep: Fix usage_traceoverflow
locking/atomics: Check atomic-arch-fallback.h too
locking/seqlock: Tweak DEFINE_SEQLOCK() kernel doc
lockdep: Optimize the memory usage of circular queue
seqlock: Unbreak lockdep
seqlock: PREEMPT_RT: Do not starve seqlock_t writers
seqlock: seqcount_LOCKNAME_t: Introduce PREEMPT_RT support
seqlock: seqcount_t: Implement all read APIs as statement expressions
seqlock: Use unique prefix for seqcount_t property accessors
seqlock: seqcount_LOCKNAME_t: Standardize naming convention
seqlock: seqcount latch APIs: Only allow seqcount_latch_t
rbtree_latch: Use seqcount_latch_t
x86/tsc: Use seqcount_latch_t
timekeeping: Use seqcount_latch_t
time/sched_clock: Use seqcount_latch_t
seqlock: Introduce seqcount_latch_t
mm/swap: Do not abuse the seqcount_t latching API
time/sched_clock: Use raw_read_seqcount_latch() during suspend
...
Latch sequence counters are a multiversion concurrency control mechanism
where the seqcount_t counter even/odd value is used to switch between
two data storage copies. This allows the seqcount_t read path to safely
interrupt its write side critical section (e.g. from NMIs).
Initially, latch sequence counters were implemented as a single write
function, raw_write_seqcount_latch(), above plain seqcount_t. The read
path was expected to use plain seqcount_t raw_read_seqcount().
A specialized read function was later added, raw_read_seqcount_latch(),
and became the standardized way for latch read paths. Having unique read
and write APIs meant that latch sequence counters are basically a data
type of their own -- just inappropriately overloading plain seqcount_t.
The seqcount_latch_t data type was thus introduced at seqlock.h.
Use that new data type instead of seqcount_raw_spinlock_t. This ensures
that only latch-safe APIs are to be used with the sequence counter.
Note that the use of seqcount_raw_spinlock_t was not very useful in the
first place. Only the "raw_" subset of seqcount_t APIs were used at
timekeeping.c. This subset was created for contexts where lockdep cannot
be used. seqcount_LOCKTYPE_t's raison d'être -- verifying that the
seqcount_t writer serialization lock is held -- cannot thus be done.
References: 0c3351d451 ("seqlock: Use raw_ prefix instead of _no_lockdep")
References: 55f3560df9 ("seqlock: Extend seqcount API with associated locks")
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200827114044.11173-6-a.darwish@linutronix.de
printk wants to store various timestamps (MONOTONIC, REALTIME, BOOTTIME) to
make correlation of dmesg from several systems easier.
Provide an interface to retrieve all three timestamps in one go.
There are some caveats:
1) Boot time and late sleep time injection
Boot time is a racy access on 32bit systems if the sleep time injection
happens late during resume and not in timekeeping_resume(). That could be
avoided by expanding struct tk_read_base with boot offset for 32bit and
adding more overhead to the update. As this is a hard to observe once per
resume event which can be filtered with reasonable effort using the
accurate mono/real timestamps, it's probably not worth the trouble.
Aside of that it might be possible on 32 and 64 bit to observe the
following when the sleep time injection happens late:
CPU 0 CPU 1
timekeeping_resume()
ktime_get_fast_timestamps()
mono, real = __ktime_get_real_fast()
inject_sleep_time()
update boot offset
boot = mono + bootoffset;
That means that boot time already has the sleep time adjustment, but
real time does not. On the next readout both are in sync again.
Preventing this for 64bit is not really feasible without destroying the
careful cache layout of the timekeeper because the sequence count and
struct tk_read_base would then need two cache lines instead of one.
2) Suspend/resume timestamps
Access to the time keeper clock source is disabled accross the innermost
steps of suspend/resume. The accessors still work, but the timestamps
are frozen until time keeping is resumed which happens very early.
For regular suspend/resume there is no observable difference vs. sched
clock, but it might affect some of the nasty low level debug printks.
OTOH, access to sched clock is not guaranteed accross suspend/resume on
all systems either so it depends on the hardware in use.
If that turns out to be a real problem then this could be mitigated by
using sched clock in a similar way as during early boot. But it's not as
trivial as on early boot because it needs some careful protection
against the clock monotonic timestamp jumping backwards on resume.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200814115512.159981360@linutronix.de
During early boot the NMI safe timekeeper returns 0 until the first
clocksource becomes available.
This prevents it from being used for printk or other facilities which today
use sched clock. sched clock can be available way before timekeeping is
initialized.
The obvious workaround for this is to utilize the early sched clock in the
default dummy clock read function until a clocksource becomes available.
After switching to the clocksource clock MONOTONIC and BOOTTIME will not
jump because the timekeeping_init() bases clock MONOTONIC on sched clock
and the offset between clock MONOTONIC and BOOTTIME is zero during boot.
Clock REALTIME cannot provide useful timestamps during early boot up to
the point where a persistent clock becomes available, which is either in
timekeeping_init() or later when the RTC driver which might depend on I2C
or other subsystems is initialized.
There is a minor difference to sched_clock() vs. suspend/resume. As the
timekeeper clock source might not be accessible during suspend, after
timekeeping_suspend() timestamps freeze up to the point where
timekeeping_resume() is invoked. OTOH this is true for some sched clock
implementations as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200814115512.041422402@linutronix.de
- Preparatory work to allow S390 to switch over to the generic VDSO
implementation.
S390 requires that the VDSO data pointer is handed in to the counter
read function when time namespace support is enabled. Adding the pointer
is a NOOP for all other architectures because the compiler is supposed
to optimize that out when it is unused in the architecture specific
inline. The change also solved a similar problem for MIPS which
fortunately has time namespaces not yet enabled.
S390 needs to update clock related VDSO data independent of the
timekeeping updates. This was solved so far with yet another sequence
counter in the S390 implementation. A better solution is to utilize the
already existing VDSO sequence count for this. The core code now exposes
helper functions which allow to serialize against the timekeeper code
and against concurrent readers.
S390 needs extra data for their clock readout function. The initial
common VDSO data structure did not provide a way to add that. It now has
an embedded architecture specific struct embedded which defaults to an
empty struct.
Doing this now avoids tree dependencies and conflicts post rc1 and
allows all other architectures which work on generic VDSO support to
work from a common upstream base.
- A trivial comment fix.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl82tGYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRkKD/9YEYlYPQ4omRNVNIJRnalBH6OB/GOk
jTJ4RCvNP2ew6XtgEz5Yg1VqxrmJP4MLNCnMr7mQulfezUmslK0uJMlqZC4dgYth
PUhliLyFi5PK+CKaY+2NFlZMAoE53YlJ2FVPq114FUW4ASVbucDPXpmhO22cc2Iu
0RD3z9/+vQmA8lUqI6wPIFTC+euN+2kbkeZjt7BlkBAdiRBga5UnarFzetq0nWyc
kcprQ2qZfGLYzRY6dRuvNLz27Ta7SAlVGOGUDpWr9MISLDFQzHwhVATDNFW3hLGT
Fr5xNqStUVxxTzYkfCj/Podez0aR3por8bm9SoWxZn7oeLdLgTsDwn2pY0J0PjyB
wWz9lmqT1vzrHEfQH1YhHvycowl6azue9rT2ERWwZTdbADEwu6Zr8ufv2XHcMu0J
dyzSYa81cQrTeAwwdNjODs+QCTX+0G6u86AU2Xg+YgqkAywcAMvzcff/9D62hfv2
5BSz+0OeitQCnSvHILUPw4XT/2rNZfhlcmc4tkzoBFewzDsMEqWT19p+GgqcRNiU
5Jl4kGnaeHjP0e5Vn/ZJurKaF3YEJwgjkohDORloaqo0AXiYo1ANhDlKvSRu5hnU
GDIWOVu8ATXwkjMFcLQz7O5/J1MqJCkleIjSCDjLDhhMbLY/nR9L3QS9jbqiVVRN
nTZlSMF6HeQmew==
=y8Z5
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping updates from Thomas Gleixner:
"A set of timekeeping/VDSO updates:
- Preparatory work to allow S390 to switch over to the generic VDSO
implementation.
S390 requires that the VDSO data pointer is handed in to the
counter read function when time namespace support is enabled.
Adding the pointer is a NOOP for all other architectures because
the compiler is supposed to optimize that out when it is unused in
the architecture specific inline. The change also solved a similar
problem for MIPS which fortunately has time namespaces not yet
enabled.
S390 needs to update clock related VDSO data independent of the
timekeeping updates. This was solved so far with yet another
sequence counter in the S390 implementation. A better solution is
to utilize the already existing VDSO sequence count for this. The
core code now exposes helper functions which allow to serialize
against the timekeeper code and against concurrent readers.
S390 needs extra data for their clock readout function. The initial
common VDSO data structure did not provide a way to add that. It
now has an embedded architecture specific struct embedded which
defaults to an empty struct.
Doing this now avoids tree dependencies and conflicts post rc1 and
allows all other architectures which work on generic VDSO support
to work from a common upstream base.
- A trivial comment fix"
* tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Delete repeated words in comments
lib/vdso: Allow to add architecture-specific vdso data
timekeeping/vsyscall: Provide vdso_update_begin/end()
vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()
- Untangle the header spaghetti which causes build failures in various
situations caused by the lockdep additions to seqcount to validate that
the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict per
CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot
validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored and
write_seqcount_begin() has a lockdep assertion to validate that the
lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API is
unchanged and determines the type at compile time with the help of
_Generic which is possible now that the minimal GCC version has been
moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs which
have been addressed already independent of this.
While generaly useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if the
writers are serialized by an associated lock, which leads to the well
known reader preempts writer livelock. RT prevents this by storing the
associated lock pointer independent of lockdep in the seqcount and
changing the reader side to block on the lock when a reader detects
that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and initializers.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO
QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC
KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0
eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv
shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs
n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs
F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa
DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2
VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl
AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59
f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul
81ppn2KlvzUK8g==
=7Gj+
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.
While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and
initializers"
* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
x86/headers: Remove APIC headers from <asm/smp.h>
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...
Drop repeated words in kernel/time/. {when, one, into}
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Link: https://lore.kernel.org/r/20200807033248.8452-1-rdunlap@infradead.org
Architectures can have the requirement to add additional architecture
specific data to the VDSO data page which needs to be updated independent
of the timekeeper updates.
To protect these updates vs. concurrent readers and a conflicting update
through timekeeping, provide helper functions to make such updates safe.
vdso_update_begin() takes the timekeeper_lock to protect against a
potential update from timekeeper code and increments the VDSO sequence
count to signal data inconsistency to concurrent readers. vdso_update_end()
makes the sequence count even again to signal data consistency and drops
the timekeeper lock.
[ Sven: Add interrupt disable handling to the functions ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200804150124.41692-3-svens@linux.ibm.com
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.
Use the new seqcount_raw_spinlock_t data type, which allows to associate
a raw spinlock with the sequence counter. This enables lockdep to verify
that the raw spinlock used for writer serialization is held when the
write side critical section is entered.
If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-18-a.darwish@linutronix.de
The "ticks" parameter was added in commit 0f004f5a69 ("sched: Cure more
NO_HZ load average woes") since calc_global_nohz() was called and needed
the "ticks" argument.
But in commit c308b56b53 ("sched: Fix nohz load accounting -- again!")
it became unused as the function calc_global_nohz() dropped using "ticks".
Fixes: c308b56b53 ("sched: Fix nohz load accounting -- again!")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593628458-32290-1-git-send-email-paul.gortmaker@windriver.com
Mark the relevant functions noinstr, use the plain non-instrumented MSR
accessors. The only odd part is the instrumentation_begin()/end() pair around the
indirect machine_check_vector() call as objtool can't figure that out. The
possible invoked functions are annotated correctly.
Also use notrace variant of nmi_enter/exit(). If MCEs happen then hardware
latency tracing is the least of the worries.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135315.476734898@linutronix.de
Core:
- Consolidation of the vDSO build infrastructure to address the
difficulties of cross-builds for ARM64 compat vDSO libraries by
restricting the exposure of header content to the vDSO build.
This is achieved by splitting out header content into separate
headers. which contain only the minimaly required information which is
necessary to build the vDSO. These new headers are included from the
kernel headers and the vDSO specific files.
- Enhancements to the generic vDSO library allowing more fine grained
control over the compiled in code, further reducing architecture
specific storage and preparing for adopting the generic library by PPC.
- Cleanup and consolidation of the exit related code in posix CPU timers.
- Small cleanups and enhancements here and there
Drivers:
- The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support
- Correct the clock rate of PIT64b global clock
- setup_irq() cleanup
- Preparation for PWM and suspend support for the TI DM timer
- Expand the fttmr010 driver to support ast2600 systems
- The usual small fixes, enhancements and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B+QETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofJ5D/94s5fpaqiuNcaAsLq2D3DRIrTnqxx7
yEeAOPcbYV1bM1SgY/M83L5yGc2S8ny787e26abwRTCZhZV3eAmRTphIFFIZR0Xk
xS+i67odscbdJTRtztKj3uQ9rFxefszRuphyaa89pwSY9nnyMWLcahGSQOGs0LJK
hvmgwPjyM1drNfPxgPiaFg7vDr2XxNATpQr/FBt+BhelvVan8TlAfrkcNPiLr++Y
Axz925FP7jMaRRbZ1acji34gLiIAZk0jLCUdbix7YkPrqDB4GfO+v8Vez+fGClbJ
uDOYeR4r1+Be/BtSJtJ2tHqtsKCcAL6agtaE2+epZq5HbzaZFRvBFaxgFNF8WVcn
3FFibdEMdsRNfZTUVp5wwgOLN0UIqE/7LifE12oLEL2oFB5H2PiNEUw3E02XHO11
rL3zgHhB6Ke1sXKPCjSGdmIQLbxZmV5kOlQFy7XuSeo5fmRapVzKNffnKcftIliF
1HNtZbgdA+3tdxMFCqoo1QX+kotl9kgpslmdZ0qHAbaRb3xqLoSskbqEjFRMuSCC
8bjJrwboD9T5GPfwodSCgqs/58CaSDuqPFbIjCay+p90Fcg6wWAkZtyG04ZLdPRc
GgNNdN4gjTD9bnrRi8cH47z1g8OO4vt4K4SEbmjo8IlDW+9jYMxuwgR88CMeDXd7
hu7aKsr2I2q/WQ==
=5o9G
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping and timer updates from Thomas Gleixner:
"Core:
- Consolidation of the vDSO build infrastructure to address the
difficulties of cross-builds for ARM64 compat vDSO libraries by
restricting the exposure of header content to the vDSO build.
This is achieved by splitting out header content into separate
headers. which contain only the minimaly required information which
is necessary to build the vDSO. These new headers are included from
the kernel headers and the vDSO specific files.
- Enhancements to the generic vDSO library allowing more fine grained
control over the compiled in code, further reducing architecture
specific storage and preparing for adopting the generic library by
PPC.
- Cleanup and consolidation of the exit related code in posix CPU
timers.
- Small cleanups and enhancements here and there
Drivers:
- The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support
- Correct the clock rate of PIT64b global clock
- setup_irq() cleanup
- Preparation for PWM and suspend support for the TI DM timer
- Expand the fttmr010 driver to support ast2600 systems
- The usual small fixes, enhancements and cleanups all over the
place"
* tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
vdso: Fix clocksource.h macro detection
um: Fix header inclusion
arm64: vdso32: Enable Clang Compilation
lib/vdso: Enable common headers
arm: vdso: Enable arm to use common headers
x86/vdso: Enable x86 to use common headers
mips: vdso: Enable mips to use common headers
arm64: vdso32: Include common headers in the vdso library
arm64: vdso: Include common headers in the vdso library
arm64: Introduce asm/vdso/processor.h
arm64: vdso32: Code clean up
linux/elfnote.h: Replace elf.h with UAPI equivalent
scripts: Fix the inclusion order in modpost
common: Introduce processor.h
linux/ktime.h: Extract common header for vDSO
linux/jiffies.h: Extract common header for vDSO
linux/time64.h: Extract common header for vDSO
linux/time32.h: Extract common header for vDSO
linux/time.h: Extract common header for vDSO
...
seqlock consists of a sequence counter and a spinlock_t which is used to
serialize the writers. spinlock_t is substituted by a "sleeping" spinlock
on PREEMPT_RT enabled kernels which breaks the usage in the timekeeping
code as the writers are executed in hard interrupt and therefore
non-preemptible context even on PREEMPT_RT.
The spinlock in seqlock cannot be unconditionally replaced by a
raw_spinlock_t as many seqlock users have nesting spinlock sections or
other code which is not suitable to run in truly atomic context on RT.
Instead of providing a raw_seqlock API for a single use case, open code the
seqlock for the jiffies use case and implement it with a raw_spinlock_t and
a sequence counter.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.120587764@linutronix.de
While unlikely the divisor in scale64_check_overflow() could be >= 32bit in
scale64_check_overflow(). do_div() truncates the divisor to 32bit at least
on 32bit platforms.
Use div64_u64() instead to avoid the truncation to 32-bit.
[ tglx: Massaged changelog ]
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200120100523.45656-1-wenyang@linux.alibaba.com
The VDSO update for CLOCK_BOOTTIME has a overflow issue as it shifts the
nanoseconds based boot time offset left by the clocksource shift. That
overflows once the boot time offset becomes large enough. As a consequence
CLOCK_BOOTTIME in the VDSO becomes a random number causing applications to
misbehave.
Fix it by storing a timespec64 representation of the offset when boot time
is adjusted and add that to the MONOTONIC base time value in the vdso data
page. Using the timespec64 representation avoids a 64bit division in the
update code.
Fixes: 44f57d788e ("timekeeping: Provide a generic update_vsyscall() implementation")
Reported-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908221257580.1983@nanos.tec.linutronix.de
While this doesn't actually amount to a real difference, since the macro
evaluates to the same thing, every place else operates on ktime_t using
these functions, so let's not break the pattern.
Fixes: e3ff9c3678 ("timekeeping: Repair ktime_get_coarse*() granularity")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20190621203249.3909-1-Jason@zx2c4.com
Jason reported that the coarse ktime based time getters advance only once
per second and not once per tick as advertised.
The code reads only the monotonic base time, which advances once per
second. The nanoseconds are accumulated on every tick in xtime_nsec up to
a second and the regular time getters take this nanoseconds offset into
account, but the ktime_get_coarse*() implementation fails to do so.
Add the accumulated xtime_nsec value to the monotonic base time to get the
proper per tick advancing coarse tinme.
Fixes: b9ff604cff ("timekeeping: Add ktime_get_coarse_with_offset")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906132136280.1791@nanos.tec.linutronix.de
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAlzRrzoUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNc7hAApgsi+3Jf9i29mgrKdrTciZ35TegK
C8pTlOIndpBcmdwDakR50/PgfMHdHll8M9TReVNEjbe0S+Ww5GTE7eWtL3YqoPC2
MuXEqcriz6UNi5Xma6vCZrDznWLXkXnzMDoDoYGDSoKuUYxef0fuqxDBnERM60Ht
s52+0XvR5ZseBw7I1KIv/ix2fXuCGq6eCdqassm0rvLPQ7bq6nWzFAlNXOLud303
DjIWu6Op2EL0+fJSmG+9Z76zFjyEbhMIhw5OPDeH4eO3pxX29AIv0m0JlI7ZXxfc
/VVC3r5G4WrsWxwKMstOokbmsQxZ5pB3ZaceYpco7U+9N2e3SlpsNM9TV+Y/0ac/
ynhYa//GK195LpMXx1BmWmLpjBHNgL8MvQkVTIpDia0GT+5sX7+haDxNLGYbocmw
A/mR+KM2jAU3QzNseGh6c659j3K4tbMIFMNxt7pUBxVPLafcccNngFGTpzCwu5GU
b7y4d21g6g/3Irj14NYU/qS8dTjW0rYrCMDquTpxmMfZ2xYuSvQmnBw91NQzVBp2
98L2/fsUG3yOa5MApgv+ryJySsIM+SW+7leKS5tjy/IJINzyPEZ85l3o8ck8X4eT
nohpKc/ELmeyi3omFYq18ecvFf2YRS5jRnz89i9q65/3ESgGiC0wyGOhNTvjvsyv
k4jT0slIK614aGk=
=p8Fp
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"We've got a reasonably broad set of audit patches for the v5.2 merge
window, the highlights are below:
- The biggest change, and the source of all the arch/* changes, is
the patchset from Dmitry to help enable some of the work he is
doing around PTRACE_GET_SYSCALL_INFO.
To be honest, including this in the audit tree is a bit of a
stretch, but it does help move audit a little further along towards
proper syscall auditing for all arches, and everyone else seemed to
agree that audit was a "good" spot for this to land (or maybe they
just didn't want to merge it? dunno.).
- We can now audit time/NTP adjustments.
- We continue the work to connect associated audit records into a
single event"
* tag 'audit-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: (21 commits)
audit: fix a memory leak bug
ntp: Audit NTP parameters adjustment
timekeeping: Audit clock adjustments
audit: purge unnecessary list_empty calls
audit: link integrity evm_write_xattrs record to syscall event
syscall_get_arch: add "struct task_struct *" argument
unicore32: define syscall_get_arch()
Move EM_UNICORE to uapi/linux/elf-em.h
nios2: define syscall_get_arch()
nds32: define syscall_get_arch()
Move EM_NDS32 to uapi/linux/elf-em.h
m68k: define syscall_get_arch()
hexagon: define syscall_get_arch()
Move EM_HEXAGON to uapi/linux/elf-em.h
h8300: define syscall_get_arch()
c6x: define syscall_get_arch()
arc: define syscall_get_arch()
Move EM_ARCOMPACT and EM_ARCV2 to uapi/linux/elf-em.h
audit: Make audit_log_cap and audit_copy_inode static
audit: connect LOGIN record to its syscall record
...
Emit an audit record every time selected NTP parameters are modified
from userspace (via adjtimex(2) or clock_adjtime(2)). These parameters
may be used to indirectly change system clock, and thus their
modifications should be audited.
Such events will now generate records of type AUDIT_TIME_ADJNTPVAL
containing the following fields:
- op -- which value was adjusted:
- offset -- corresponding to the time_offset variable
- freq -- corresponding to the time_freq variable
- status -- corresponding to the time_status variable
- adjust -- corresponding to the time_adjust variable
- tick -- corresponding to the tick_usec variable
- tai -- corresponding to the timekeeping's TAI offset
- old -- the old value
- new -- the new value
Example records:
type=TIME_ADJNTPVAL msg=audit(1530616044.507:7): op=status old=64 new=8256
type=TIME_ADJNTPVAL msg=audit(1530616044.511:11): op=freq old=0 new=49180377088000
The records of this type will be associated with the corresponding
syscall records.
An overview of parameter changes that can be done via do_adjtimex()
(based on information from Miroslav Lichvar) and whether they are
audited:
__timekeeping_set_tai_offset() -- sets the offset from the
International Atomic Time
(AUDITED)
NTP variables:
time_offset -- can adjust the clock by up to 0.5 seconds per call
and also speed it up or slow down by up to about
0.05% (43 seconds per day) (AUDITED)
time_freq -- can speed up or slow down by up to about 0.05%
(AUDITED)
time_status -- can insert/delete leap seconds and it also enables/
disables synchronization of the hardware real-time
clock (AUDITED)
time_maxerror, time_esterror -- change error estimates used to
inform userspace applications
(NOT AUDITED)
time_constant -- controls the speed of the clock adjustments that
are made when time_offset is set (NOT AUDITED)
time_adjust -- can temporarily speed up or slow down the clock by up
to 0.05% (AUDITED)
tick_usec -- a more extreme version of time_freq; can speed up or
slow down the clock by up to 10% (AUDITED)
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Emit an audit record whenever the system clock is changed (i.e. shifted
by a non-zero offset) by a syscall from userspace. The syscalls than can
(at the time of writing) trigger such record are:
- settimeofday(2), stime(2), clock_settime(2) -- via
do_settimeofday64()
- adjtimex(2), clock_adjtime(2) -- via do_adjtimex()
The new records have type AUDIT_TIME_INJOFFSET and contain the following
fields:
- sec -- the 'seconds' part of the offset
- nsec -- the 'nanoseconds' part of the offset
Example record (time was shifted backwards by ~15.875 seconds):
type=TIME_INJOFFSET msg=audit(1530616049.652:13): sec=-16 nsec=124887145
The records of this type will be associated with the corresponding
syscall records.
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
[PM: fixed a line width problem in __audit_tk_injoffset()]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Several people reported testing failures after setting CLOCK_REALTIME close
to the limits of the kernel internal representation in nanoseconds,
i.e. year 2262.
The failures are exposed in subsequent operations, i.e. when arming timers
or when the advancing CLOCK_MONOTONIC makes the calculation of
CLOCK_REALTIME overflow into negative space.
Now people start to paper over the underlying problem by clamping
calculations to the valid range, but that's just wrong because such
workarounds will prevent detection of real issues as well.
It is reasonable to force an upper bound for the various methods of setting
CLOCK_REALTIME. Year 2262 is the absolute upper bound. Assume a maximum
uptime of 30 years which is plenty enough even for esoteric embedded
systems. That results in an upper bound of year 2232 for setting the time.
Once that limit is reached in reality this limit is only a small part of
the problem space. But until then this stops people from trying to paper
over the problem at the wrong places.
Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reported-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903231125480.2157@nanos.tec.linutronix.de
The timekeeping code uses a random mix of "unsigned long" and "unsigned
int" for the seqcount snapshots (ratio 14:12). Since the seqlock.h API is
entirely based on unsigned int, use that throughout.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Link: https://lkml.kernel.org/r/20190318195557.20773-1-linux@rasmusvillemoes.dk
struct timex is not y2038 safe.
Replace all uses of timex with y2038 safe __kernel_timex.
Note that struct __kernel_timex is an ABI interface definition.
We could define a new structure based on __kernel_timex that
is only available internally instead. Right now, there isn't
a strong motivation for this as the structure is isolated to
a few defined struct timex interfaces and such a structure would
be exactly the same as struct timex.
The patch was generated by the following coccinelle script:
virtual patch
@depends on patch forall@
identifier ts;
expression e;
@@
(
- struct timex ts;
+ struct __kernel_timex ts;
|
- struct timex ts = {};
+ struct __kernel_timex ts = {};
|
- struct timex ts = e;
+ struct __kernel_timex ts = e;
|
- struct timex *ts;
+ struct __kernel_timex *ts;
|
(memset \| copy_from_user \| copy_to_user \)(...,
- sizeof(struct timex))
+ sizeof(struct __kernel_timex))
)
@depends on patch forall@
identifier ts;
identifier fn;
@@
fn(...,
- struct timex *ts,
+ struct __kernel_timex *ts,
...) {
...
}
@depends on patch forall@
identifier ts;
identifier fn;
@@
fn(...,
- struct timex *ts) {
+ struct __kernel_timex *ts) {
...
}
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: linux-alpha@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This concludes the main part of the system call rework for 64-bit time_t,
which has spread over most of year 2018, the last six system calls being
- ppoll
- pselect6
- io_pgetevents
- recvmmsg
- futex
- rt_sigtimedwait
As before, nothing changes for 64-bit architectures, while 32-bit
architectures gain another entry point that differs only in the layout
of the timespec structure. Hopefully in the next release we can wire up
all 22 of those system calls on all 32-bit architectures, which gives
us a baseline version for glibc to start using them.
This does not include the clock_adjtime, getrusage/waitid, and
getitimer/setitimer system calls. I still plan to have new versions
of those as well, but they are not required for correct operation of
the C library since they can be emulated using the old 32-bit time_t
based system calls.
Aside from the system calls, there are also a few cleanups here,
removing old kernel internal interfaces that have become unused after
all references got removed. The arch/sh cleanups are part of this,
there were posted several times over the past year without a reaction
from the maintainers, while the corresponding changes made it into all
other architectures.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJcHCCRAAoJEGCrR//JCVInkqsP/3TuLgSyQwolFRXcoBOjR1Ar
JoX33GuDlAxHSqPadButVfflmRIWvL3aNMFFwcQM4uYgQ593FoHbmnusCdFgHcQ7
Q13pGo7szbfEFxydhnDMVust/hxd5C9Y5zNSJ+eMLGLLJXosEyjd9YjRoHDROWal
oDLqpPCArlLN1B1XFhjH8J847+JgS+hUrAfk3AOU0B2TuuFkBnRImlCGCR5JcgPh
XIpHRBOgEMP4kZ3LjztPfS3v/XJeGrguRcbD3FsPKdPeYO9QRUiw0vahEQRr7qXL
9hOgDq1YHPUQeUFhy3hJPCZdsDFzWoIE7ziNkZCZvGBw+qSw9i8KChGUt6PcSNlJ
nqKJY5Wneb4svu+kOdK7d8ONbTdlVYvWf5bj/sKoNUA4BVeIjNcDXplvr3cXiDzI
e40CcSQ3oLEvrIxMcoyNPPG63b+FYG8nMaCOx4dB4pZN7sSvZUO9a1DbDBtzxMON
xy5Kfk1n5gIHcfBJAya5CnMQ1Jm4FCCu/LHVanYvb/nXA/2jEegSm24Md17icE/Q
VA5jJqIdICExor4VHMsG0lLQxBJsv/QqYfT2OCO6Oykh28mjFqf+X+9Ctz1w6KVG
VUkY1u97x8jB0M4qolGO7ZGn6P1h0TpNVFD1zDNcDt2xI63cmuhgKWiV2pv5b7No
ty6insmmbJWt3tOOPyfb
=yIAT
-----END PGP SIGNATURE-----
Merge tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull y2038 updates from Arnd Bergmann:
"More syscalls and cleanups
This concludes the main part of the system call rework for 64-bit
time_t, which has spread over most of year 2018, the last six system
calls being
- ppoll
- pselect6
- io_pgetevents
- recvmmsg
- futex
- rt_sigtimedwait
As before, nothing changes for 64-bit architectures, while 32-bit
architectures gain another entry point that differs only in the layout
of the timespec structure. Hopefully in the next release we can wire
up all 22 of those system calls on all 32-bit architectures, which
gives us a baseline version for glibc to start using them.
This does not include the clock_adjtime, getrusage/waitid, and
getitimer/setitimer system calls. I still plan to have new versions of
those as well, but they are not required for correct operation of the
C library since they can be emulated using the old 32-bit time_t based
system calls.
Aside from the system calls, there are also a few cleanups here,
removing old kernel internal interfaces that have become unused after
all references got removed. The arch/sh cleanups are part of this,
there were posted several times over the past year without a reaction
from the maintainers, while the corresponding changes made it into all
other architectures"
* tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
timekeeping: remove obsolete time accessors
vfs: replace current_kernel_time64 with ktime equivalent
timekeeping: remove timespec_add/timespec_del
timekeeping: remove unused {read,update}_persistent_clock
sh: remove board_time_init() callback
sh: remove unused rtc_sh_get/set_time infrastructure
sh: sh03: rtc: push down rtc class ops into driver
sh: dreamcast: rtc: push down rtc class ops into driver
y2038: signal: Add compat_sys_rt_sigtimedwait_time64
y2038: signal: Add sys_rt_sigtimedwait_time32
y2038: socket: Add compat_sys_recvmmsg_time64
y2038: futex: Add support for __kernel_timespec
y2038: futex: Move compat implementation into futex.c
io_pgetevents: use __kernel_timespec
pselect6: use __kernel_timespec
ppoll: use __kernel_timespec
signal: Add restore_user_sigmask()
signal: Add set_user_sigmask()