With 5.9 kernel on ARM64, I found ftrace_dump output was broken but
it had no problem with normal output "cat /sys/kernel/debug/tracing/trace".
With investigation, it seems coping the data into temporal buffer seems to
break the align binary printf expects if the static buffer is not aligned
with 4-byte. IIUC, get_arg in bstr_printf expects that args has already
right align to be decoded and seq_buf_bprintf says ``the arguments are saved
in a 32bit word array that is defined by the format string constraints``.
So if we don't keep the align under copy to temporal buffer, the output
will be broken by shifting some bytes.
This patch fixes it.
Link: https://lkml.kernel.org/r/20201125225654.1618966-1-minchan@kernel.org
Cc: <stable@vger.kernel.org>
Fixes: 8e99cf91b9 ("tracing: Do not allocate buffer in trace_find_next_entry() in atomic")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This patch reverts commit 978defee11 ("tracing: Do a WARN_ON()
if start_thread() in hwlat is called when thread exists")
.start hook can be legally called several times if according
tracer is stopped
screen window 1
[root@localhost ~]# echo 1 > /sys/kernel/tracing/events/kmem/kfree/enable
[root@localhost ~]# echo 1 > /sys/kernel/tracing/options/pause-on-trace
[root@localhost ~]# less -F /sys/kernel/tracing/trace
screen window 2
[root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on
0
[root@localhost ~]# echo hwlat > /sys/kernel/debug/tracing/current_tracer
[root@localhost ~]# echo 1 > /sys/kernel/debug/tracing/tracing_on
[root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on
0
[root@localhost ~]# echo 2 > /sys/kernel/debug/tracing/tracing_on
triggers warning in dmesg:
WARNING: CPU: 3 PID: 1403 at kernel/trace/trace_hwlat.c:371 hwlat_tracer_start+0xc9/0xd0
Link: https://lkml.kernel.org/r/bd4d3e70-400d-9c82-7b73-a2d695e86b58@virtuozzo.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 978defee11 ("tracing: Do a WARN_ON() if start_thread() in hwlat is called when thread exists")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In the slow path of __rb_reserve_next() a nested event(s) can happen
between evaluating the timestamp delta of the current event and updating
write_stamp via local_cmpxchg(); in this case the delta is not valid
anymore and it should be set to 0 (same timestamp as the interrupting
event), since the event that we are currently processing is not the last
event in the buffer.
Link: https://lkml.kernel.org/r/X8IVJcp1gRE+FJCJ@xps-13-7390
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lwn.net/Articles/831207
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The write stamp, used to calculate deltas between events, was updated with
the stale "ts" value in the "info" structure, and not with the updated "ts"
variable. This caused the deltas between events to be inaccurate, and when
crossing into a new sub buffer, had time go backwards.
Link: https://lkml.kernel.org/r/20201124223917.795844-1-elavila@google.com
Cc: stable@vger.kernel.org
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Reported-by: "J. Avila" <elavila@google.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Tested-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When an interrupt allocation fails for N interrupts, it is pretty
common for the error handling code to free the same number of interrupts,
no matter how many interrupts have actually been allocated.
This may result in the domain freeing code to be unexpectedly called
for interrupts that have no mapping in that domain. Things end pretty
badly.
Instead, add some checks to irq_domain_free_irqs_hierarchy() to make sure
that thiss does not follow the hierarchy if no mapping exists for a given
interrupt.
Fixes: 6a6544e520 ("genirq/irqdomain: Remove auto-recursive hierarchy support")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201129135551.396777-1-maz@kernel.org
There is currently no way to convey the affinity of an interrupt
via irq_create_mapping(), which creates issues for devices that
expect that affinity to be managed by the kernel.
In order to sort this out, rename irq_create_mapping() to
irq_create_mapping_affinity() with an additional affinity parameter that
can be passed down to irq_domain_alloc_descs().
irq_create_mapping() is re-implemented as a wrapper around
irq_create_mapping_affinity().
No functional change.
Fixes: e75eafb9b0 ("genirq/msi: Switch to new irq spreading infrastructure")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201126082852.1178497-2-lvivier@redhat.com
idle path. Similar to the entry path the low level idle functions have to
be non-instrumentable.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/DpAUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXSLD/9klc0YimnEnROW6Q5Svb2IcyIutmXF
bOIY1bYYoKILOBj3wyvDUhmdMuq5zh7H9yG11hO8MaVVWVQcLcOMLdHTYm9dcdmF
xQk33+xqjuhRShB+nEmC9ayYtWogtH6W6uZ6WDtF9ZltMKU85n5ddGJ/Fvo+HoCb
NbOdHGJdJ3/3ZCeHnxOnxM+5/GwjkBuccTV/tXmb3yXrfU9DBySyQ4/UchcpF43w
LcEb0kiQbpZsBTByKJOQV8+RR654S0sILlvRwVXpmj94vrgGwhlVk1/9rz7tkOhF
ksoo1mTVu75LMt22G/hXxE63787yRvFdHjapf0+kCOAuhl992NK+xlGDH8o9DXcu
9y73D4bI0HnDFs20w6vs20iLvxECJiYHJqlgR5ZwFUToceaNgtiYr8kzuD7Zbae1
KG2E7BuNSwHWMtf97fGn44GZknPEOaKdDn4Wv6/bvKHxLm77qe11RKF70Stcz2AI
am13KmQzzsHGF5qNWwpElRUxSdxfJMR66RnOdTQULGrRedaZTFol/y2pnVzTSe3k
SZnlpL5kE7y92UYDogPb5wWA7b+YkJN0OdSkRFy1FH26ZG8E4M7ZJ2tql5Sw7pGM
lsTjXpAUphnK5rz7QcYE8KAZWj//fIAcElIrvdklVcBnS3IqjfksYW27B64133vx
cT1B/lA1PHXj6Q==
=raED
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two more places which invoke tracing from RCU disabled regions in the
idle path.
Similar to the entry path the low level idle functions have to be
non-instrumentable"
* tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
intel_idle: Fix intel_idle() vs tracing
sched/idle: Fix arch_cpu_idle() vs tracing
Trivial conflict in CAN, keep the net-next + the byteswap wrapper.
Conflicts:
drivers/net/can/usb/gs_usb.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl+fOigeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGoQ0H/RLJU2FMIjO0mzLX
9LqePQ9QmNWG4KeqxwWaKq90MinIbnSG3CDPKruu8RNh2Rr6nsEJmqg1DWyEiFRB
8gzsBXMAC1i2aPfOrOnCJEfP+L+svKlbSii475tNdZw2DhP+/FBT0RVCt3rRhrRs
atc8+dM7ViGLnlvRJ4LlVqA3d1kjOr5bsPYcIcnGIHY8mYWBLFzTSVgDdrcB9+3l
7lZud/zMhJ3dS0bcnbIUS1YpBxHCsgEaMFQYmcv3RruIaaFbh5THkfQUSmbmrAru
/EeVjwVMuvpvb2jxS1ofLx2in7t4tsNgItu4AfMmV0BurM5NhpqKo7mo/1nmR/X9
Q4tjPRc=
=cUbb
-----END PGP SIGNATURE-----
Backmerge tag 'v5.10-rc2' into arm/drivers
The SCMI pull request for the arm/drivers branch requires v5.10-rc2
because of dependencies with other git trees, so merge that in here.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl/A+VEACgkQUqAMR0iA
lPIwGw/9F/E2ZdX+Vgi3ZiR/5GdfVZeIW+QwhKXBQc8Jr9+p2JJ+UOPeeazKQA5l
bFt6GR67yjqtFS5gO76EPCQ6/Uu3cPA+A3HQRQZuE6p0zM+mrMXc/upLMy5DKi4Z
f4zkW8dYWSBpAWPvM9bb0gIKO9wVV6Aj1IyyZLfEghX/KrJPx0zutioO4ScYxhA9
YVITmnUQ6YzHEVE8CwWGV4lArC50ILGdIqNlZrkjuG3CuGTdyB2OY60P8XCy8bzn
W3WgRGI/bvfHwCPh8oYKm/5nM9JAVdhbEpoFQj8cMPKoH5DeSGNWfYXkali2gqhL
1Y2SntTcR7zclMcN0/gIn9ViVsma/eayAyawSYgQjmAdl6H/vv9B7x9ZswmK/b38
JzOzHwP+H3lXVg2yN4EbH3uDMTMjqflYuC7QiZ/HNa43KURXhoritw2hBRczhazp
mdyRQf4iv8NoYSthggD6LolCs+ay5NZpCeB3YXgnlpxiYFGCE+ykSz41AGdTyYTl
jOWVtK1VawFD0/FgpgF8XK7/gOXWeYb+4WeBYgGKgCJdneiB5eJt8eWT7zmpAPpG
FECexdAd4TAjD+EEbidiFWpMjJcY2TnOJp76O3/Wlo1QLbEgRHOklM/Rrq0zGg5b
vm3w0kobGZfpIJuzSOAHyErX0jGEVTq6yUi381jSQpf4bTttIpc=
=7IkU
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.10-rc6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fixes from Petr Mladek:
- do not lose trailing newline in pr_cont() calls
- two trivial fixes for a dead store and a config description
* tag 'printk-for-5.10-rc6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: finalize records with trailing newlines
printk: remove unneeded dead-store assignment
init/Kconfig: Fix CPU number in LOG_CPU_MAX_BUF_SHIFT description
Any record with a trailing newline (LOG_NEWLINE flag) cannot
be continued because the newline has been stripped and will
not be visible if the message is appended. This was already
handled correctly when committing in log_output() but was
not handled correctly when committing in log_store().
Fixes: f5f022e53b ("printk: reimplement log_cont using record extension")
Link: https://lore.kernel.org/r/20201126114836.14750-1-john.ogness@linutronix.de
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap particually while the device is attached to an
IOMMU.
This patch enables the support. Users can run specified number of threads
to do dma_map_page and dma_unmap_page on a specific NUMA node with the
specified duration. Then dma_map_benchmark will calculate the average
latency for map and unmap.
A difficulity for this benchmark is that dma_map/unmap APIs must run on
a particular device. Each device might have different backend of IOMMU or
non-IOMMU.
So we use the driver_override to bind dma_map_benchmark to a particual
device by:
For platform devices:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
For PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
Cc: Will Deacon <will@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
[hch: folded in two fixes from Colin Ian King <colin.king@canonical.com>]
Signed-off-by: Christoph Hellwig <hch@lst.de>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
At the moment we allow bypassing DMA ops only when we can do this for
the entire RAM. However there are configs with mixed type memory
where we could still allow bypassing IOMMU in most cases;
POWERPC with persistent memory is one example.
This adds an arch hook to determine where bypass can still work and
we invoke direct DMA API. The following patch checks the bus limit
on POWERPC to allow or disallow direct mapping.
This adds a ARCH_HAS_DMA_MAP_DIRECT config option to make the arch_xxxx
hooks no-op by default.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
powerpc/64s keeps a counter in the mm which counts bits set in
mm_cpumask as well as other things. This means it can't use generic code
to clear bits out of the mask and doesn't adjust the arch specific
counter.
Add an arch override that allows powerpc/64s to use
clear_tasks_mm_cpumask().
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201126102530.691335-4-npiggin@gmail.com
Provide a wrapper function to get the IMA hash of an inode. This helper
is useful in fingerprinting files (e.g executables on execution) and
using these fingerprints in detections like an executable unlinking
itself.
Since the ima_inode_hash can sleep, it's only allowed for sleepable
LSM hooks.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201124151210.1081188-3-kpsingh@chromium.org
we have supplied the inline function: of_cft() in cgroup.h.
So replace the direct use 'of->kn->priv' with inline func
of_cft(), which is more readable.
Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In realtime scenario, We do not want to have interference on the
isolated cpu cores. but when invoking alloc_workqueue() for percpu wq
on the housekeeping cpu, it kick a kworker on the isolated cpu.
alloc_workqueue
pwq_adjust_max_active
wake_up_worker
The comment in pwq_adjust_max_active() said:
"Need to kick a worker after thawed or an unbound wq's
max_active is bumped"
So it is unnecessary to kick a kworker for percpu's wq when invoking
alloc_workqueue(). this patch only kick a worker based on the actual
activation of delayed works.
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
modpost complains that module has no licence provided.
Provide it via meaningful MODULE_LICENSE().
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Instead of using the array-of-pointers trick to avoid having gcc mess up
the built-in module-version array stride, specify type alignment when
declaring entries to prevent gcc from increasing alignment.
This is essentially an alternative (one-line) fix to the problem
addressed by commit b4bc842802 ("module: deal with alignment issues in
built-in module versions").
gcc can increase the alignment of larger objects with static extent as
an optimisation, but this can be suppressed by using the aligned
attribute when declaring variables.
Note that we have been relying on this behaviour for kernel parameters
for 16 years and it indeed hasn't changed since the introduction of the
aligned attribute in gcc-3.1.
Link: https://lore.kernel.org/lkml/20201103175711.10731-1-johan@kernel.org
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
The current implementation uses a number of gotos to implement a loop
and different paths within the loop, which makes the code less readable
than it would be with an explicit while-loop. This patch also replaces a
chain of if/if-elses keyed on the same expression with a switch
statement.
No change in behaviour is intended.
Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201121015509.3594191-1-wedsonaf@google.com
Some unused macros could cause gcc warning:
kernel/audit.c:68:0: warning: macro "AUDIT_UNINITIALIZED" is not used
[-Wunused-macros]
kernel/auditsc.c:104:0: warning: macro "AUDIT_AUX_IPCPERM" is not used
[-Wunused-macros]
kernel/auditsc.c:82:0: warning: macro "AUDITSC_INVALID" is not used
[-Wunused-macros]
AUDIT_UNINITIALIZED and AUDITSC_INVALID are still meaningful and should
be in incorporated.
Just remove AUDIT_AUX_IPCPERM.
Thanks comments from Richard Guy Briggs and Paul Moore.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Richard Guy Briggs <rgb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: linux-audit@redhat.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paul Moore <paul@paul-moore.com>
Given .BTF section is not allocatable, it will get trimmed after module is
loaded. BPF system handles that properly by creating an independent copy of
data. But prevent any accidental misused by resetting the pointer to BTF data.
Fixes: 36e68442d1 ("bpf: Load and verify kernel module BTFs")
Suggested-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jessica Yu <jeyu@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/bpf/20201121070829.2612884-2-andrii@kernel.org
Trade one atomic op for a full memory barrier.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Get rid of the __call_single_node union and clean up the API a little
to avoid external code relying on the structure layout as much.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
At fork time currently, a local node can be allowed to fill completely
and allow the periodic load balancer to fix the problem. This can be
problematic in cases where a task creates lots of threads that idle until
woken as part of a worker poll causing a memory bandwidth problem.
However, a "real" workload suffers badly from this behaviour. The workload
in question is mostly NUMA aware but spawns large numbers of threads
that act as a worker pool that can be called from anywhere. These need
to spread early to get reasonable behaviour.
This patch limits how much a local node can fill before spilling over
to another node and it will not be a universal win. Specifically,
very short-lived workloads that fit within a NUMA node would prefer
the memory bandwidth.
As I cannot describe the "real" workload, the best proxy measure I found
for illustration was a page fault microbenchmark. It's not representative
of the workload but demonstrates the hazard of the current behaviour.
pft timings
5.10.0-rc2 5.10.0-rc2
imbalancefloat-v2 forkspread-v2
Amean elapsed-1 46.37 ( 0.00%) 46.05 * 0.69%*
Amean elapsed-4 12.43 ( 0.00%) 12.49 * -0.47%*
Amean elapsed-7 7.61 ( 0.00%) 7.55 * 0.81%*
Amean elapsed-12 4.79 ( 0.00%) 4.80 ( -0.17%)
Amean elapsed-21 3.13 ( 0.00%) 2.89 * 7.74%*
Amean elapsed-30 3.65 ( 0.00%) 2.27 * 37.62%*
Amean elapsed-48 3.08 ( 0.00%) 2.13 * 30.69%*
Amean elapsed-79 2.00 ( 0.00%) 1.90 * 4.95%*
Amean elapsed-80 2.00 ( 0.00%) 1.90 * 4.70%*
This is showing the time to fault regions belonging to threads. The target
machine has 80 logical CPUs and two nodes. Note the ~30% gain when the
machine is approximately the point where one node becomes fully utilised.
The slower results are borderline noise.
Kernel building shows similar benefits around the same balance point.
Generally performance was either neutral or better in the tests conducted.
The main consideration with this patch is the point where fork stops
spreading a task so some workloads may benefit from different balance
points but it would be a risky tuning parameter.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-5-mgorman@techsingularity.net
Currently, an imbalance is only allowed when a destination node
is almost completely idle. This solved one basic class of problems
and was the cautious approach.
This patch revisits the possibility that NUMA nodes can be imbalanced
until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat
superficial -- it's half the cores when HT is enabled. At higher
utilisations, balancing should continue as normal and keep things even
until scheduler domains are fully busy or over utilised.
Note that this is not expected to be a universal win. Any benchmark
that prefers spreading as wide as possible with limited communication
will favour the old behaviour as there is more memory bandwidth.
Workloads that communicate heavily in pairs such as netperf or tbench
benefit. For the tests I ran, the vast majority of workloads saw
a benefit so it seems to be a worthwhile trade-off.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-4-mgorman@techsingularity.net
In find_idlest_group(), the load imbalance is only relevant when the group
is either overloaded or fully busy but it is calculated unconditionally.
This patch moves the imbalance calculation to the context it is required.
Technically, it is a micro-optimisation but really the benefit is avoiding
confusing one type of imbalance with another depending on the group_type
in the next patch.
No functional change.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-3-mgorman@techsingularity.net
This is simply a preparation patch to make the following patches easier
to read. No functional change.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-2-mgorman@techsingularity.net
We call arch_cpu_idle() with RCU disabled, but then use
local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.
Switch all arch_cpu_idle() implementations to use
raw_local_irq_{en,dis}able() and carefully manage the
lockdep,rcu,tracing state like we do in entry.
(XXX: we really should change arch_cpu_idle() to not return with
interrupts enabled)
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org
Instead of storing the map per CPU provide and use per task storage. That
prepares for local kmaps which are preemptible.
The context switch code is preparatory and not yet in use because
kmap_atomic() runs with preemption disabled. Will be made usable in the
next step.
The context switch logic is safe even when an interrupt happens after
clearing or before restoring the kmaps. The kmap index in task struct is
not modified so any nesting kmap in an interrupt will use unused indices
and on return the counter is the same as before.
Also add an assert into the return to user space code. Going back to user
space with an active kmap local is a nono.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.372935758@linutronix.de
Now that the scheduler can deal with migrate disable properly, there is no
real compelling reason to make it only available for RT.
There are quite some code pathes which needlessly disable preemption in
order to prevent migration and some constructs like kmap_atomic() enforce
it implicitly.
Making it available independent of RT allows to provide a preemptible
variant of kmap_atomic() and makes the code more consistent in general.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Grudgingly-Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.269943012@linutronix.de
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl+69egeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGTSYH/ifRBlaxy5UiHFc0
2zdR7pkjWrYfDTTT3sazIAhdlzzcfnkUqgFxOP45F4ZIqeTzunH3sUY+5UlT9IX7
liUgnLxQ/1R9Gx8kPGQfu+tLCey78xVFydGsqJoW9sPRw2R+apMdGGa/lOrk+OXz
DXIN+dDnGFqwCCNJpK+rxQQhFf++IPpSI8z6Y23moOFhsDZrEziHuVFy2FGyRM6z
prZ/us/tcobE8ptCk1RmOxLoJ1DR6UxpA2vLimTE+JD8siOsSWPbjE0KudnWCnd5
BLqIjrsPJbSxyuzzK3v9dnO5wMv7tMDuMIuYM/MQTXDttNwtsqt/aP6gdnUCym7N
5eHEj5g=
=MuO1
-----END PGP SIGNATURE-----
Merge tag 'v5.10-rc5' into rdma.git for-next
For dependencies in following patches
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add parameter explanation to fix kernel-doc marks:
kernel/power/suspend.c:233: warning: Function parameter or member
'state' not described in 'suspend_valid_only_mem'
kernel/power/suspend.c:344: warning: Function parameter or member
'state' not described in 'suspend_prepare'
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
[ rjw: Change the proposed parameter descriptions. ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Architectures that support address tagging, such as arm64, may want to
expose fault address tag bits to the signal handler to help diagnose
memory errors. However, these bits have not been previously set,
and their presence may confuse unaware user applications. Therefore,
introduce a SA_EXPOSE_TAGBITS flag bit in sa_flags that a signal
handler may use to explicitly request that the bits are set.
The generic signal handler APIs expect to receive tagged addresses.
Architectures may specify how to untag addresses in the case where
SA_EXPOSE_TAGBITS is clear by defining the arch_untagged_si_addr
function.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://linux-review.googlesource.com/id/I16dd0ed2081f091fce97be0190cb8caa874c26cb
Link: https://lkml.kernel.org/r/13cf24d00ebdd8e1f55caf1821c7c29d54100191.1605904350.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Define a sa_flags bit, SA_UNSUPPORTED, which will never be supported
in the uapi. The purpose of this flag bit is to allow userspace to
distinguish an old kernel that does not clear unknown sa_flags bits
from a kernel that supports every flag bit.
In other words, if userspace does something like:
act.sa_flags |= SA_UNSUPPORTED;
sigaction(SIGSEGV, &act, 0);
sigaction(SIGSEGV, 0, &oldact);
and finds that SA_UNSUPPORTED remains set in oldact.sa_flags, it means
that the kernel cannot be trusted to have cleared unknown flag bits
from sa_flags, so no assumptions about flag bit support can be made.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ic2501ad150a3a79c1cf27fb8c99be342e9dffbcb
Link: https://lkml.kernel.org/r/bda7ddff8895a9bc4ffc5f3cf3d4d37a32118077.1605582887.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Previously we were not clearing non-uapi flag bits in
sigaction.sa_flags when storing the userspace-provided sa_flags or
when returning them via oldact. Start doing so.
This allows userspace to detect missing support for flag bits and
allows the kernel to use non-uapi bits internally, as we are already
doing in arch/x86 for two flag bits. Now that this change is in
place, we no longer need the code in arch/x86 that was hiding these
bits from userspace, so remove it.
This is technically a userspace-visible behavior change for sigaction, as
the unknown bits returned via oldact.sa_flags are no longer set. However,
we are free to define the behavior for unknown bits exactly because
their behavior is currently undefined, so for now we can define the
meaning of each of them to be "clear the bit in oldact.sa_flags unless
the bit becomes known in the future". Furthermore, this behavior is
consistent with OpenBSD [1], illumos [2] and XNU [3] (FreeBSD [4] and
NetBSD [5] fail the syscall if unknown bits are set). So there is some
precedent for this behavior in other kernels, and in particular in XNU,
which is probably the most popular kernel among those that I looked at,
which means that this change is less likely to be a compatibility issue.
Link: [1] f634a6a4b5/sys/kern/kern_sig.c (L278)
Link: [2] 76f19f5fdc/usr/src/uts/common/syscall/sigaction.c (L86)
Link: [3] a449c6a3b8/bsd/kern/kern_sig.c (L480)
Link: [4] eded70c370/sys/kern/kern_sig.c (L699)
Link: [5] 3365779bec/sys/kern/sys_sig.c (L473)
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://linux-review.googlesource.com/id/I35aab6f5be932505d90f3b3450c083b4db1eca86
Link: https://lkml.kernel.org/r/878dbcb5f47bc9b11881c81f745c0bef5c23f97f.1605235762.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
To prepare for adding a RT aware variant of softirq serialization and
processing move related code into one section so the necessary #ifdeffery
is reduced to one.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20201113141733.974214480@linutronix.de
- Make the conditional update of the overutilized state work correctly by
caching the relevant flags state before overwriting them and checking
them afterwards.
- Fix a data race in the wakeup path which caused loadavg on ARM64
platforms to become a random number generator.
- Fix the ordering of the iowaiter accounting operations so it can't be
decremented before it is incremented.
- Fix a bug in the deadline scheduler vs. priority inheritance when a
non-deadline task A has inherited the parameters of a deadline task B
and then blocks on a non-deadline task C.
The second inheritance step used the static deadline parameters of task
A, which are usually 0, instead of further propagating task B's
parameters. The zero initialized parameters trigger a bug in the
deadline scheduler.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+6edsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaJCEAC7VGr9IlWRzCI/173tKAXkLRrGXHVb
yOYc/YjLMCTcERNxqpf8uIURd/ATSHU/RMwfFcB558NedKZ/QKZDoKmLqeCXnVeM
e20tXv/fmpqRS7lgtmbBfhQ8mSDhst960oD1mHifdEwEBCCm7mLEaipTuTWjnZ0x
rOz70Hir1mSjsP0E7ZorsxCr1yExbrt+jZfKCe9D2kUSvlWHf1ipzAYNlqb/DsfG
n81G7q9LYV8NUhX3lt8oSZDq0K44aO6G6fEaP4EkfwsIAOh37yPHwuEuqDZCBmXw
rQ17XUU3jQ2MtubPvVEKG/6Z+hAUyOsAKynpq/RhzueXQm/9Ns6+qHX/xY8yh39y
S5qPd5DLRlac8f7cFwz2zPxP5E+xTJLONgRkuN1XlitMJZBxru9AzDNa0/6on8TM
OtvbvVR+bPUfHiHULk4fTz7fLcbgYgxbCgfGoFsVlfskOxnzgEG8WfuI2Up2rRJ0
nr1MCER+5fprciqPPs+18rVEFiC4mQSrV01cnwrNbpW8pqibZSomMilQ0oQvcTGL
VDEHkaDTa5YbR92Szq4rYbr7Sf0ihFU0EZUNVQnu7SujdVFxTdHb1yr8UYcYp09b
LqGFhr1FHBNYKbw3rEPx2R/FGuCii21oQkhz94ujDo1Np8EGVZYwFGh+iwbsa2Xn
K1u0HzqLTfTkMw==
=HiGq
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"A couple of scheduler fixes:
- Make the conditional update of the overutilized state work
correctly by caching the relevant flags state before overwriting
them and checking them afterwards.
- Fix a data race in the wakeup path which caused loadavg on ARM64
platforms to become a random number generator.
- Fix the ordering of the iowaiter accounting operations so it can't
be decremented before it is incremented.
- Fix a bug in the deadline scheduler vs. priority inheritance when a
non-deadline task A has inherited the parameters of a deadline task
B and then blocks on a non-deadline task C.
The second inheritance step used the static deadline parameters of
task A, which are usually 0, instead of further propagating task
B's parameters. The zero initialized parameters trigger a bug in
the deadline scheduler"
* tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix priority inheritance with multiple scheduling classes
sched: Fix rq->nr_iowait ordering
sched: Fix data-race in wakeup
sched/fair: Fix overutilized update in enqueue_task_fair()
lock/unlock.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+6dQ4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXLiD/91TaQTrgVht6NxF5CmvEpoYPzGg/TO
MeU7GmgeXv7ciIIDaPONsCLOGjGx5dSxk1QJ5OH+GhJ/lIXbtvbVUgbs7Ezjkl4d
SyjRHNv3wYV7VJySEgqxo+Rd6fCQN7rOf/J3ZbzPHtMl40qxSpz1FeXTRft7cn62
pkunRKM+f/tE/Ue96tiHCQVbwRxG3LEPIuXtLJQqtPvgMzb9zhC9mZpUcr5ZJyVD
BfX69iIs1W1XKql6yH/KuBUY9d3qRzzmhko6a2mMSL3+VX0j4o46LMlP4HRndKTT
5fEsUC3UOItqn7X4f3HHeQhckEAKXNGEMR30th84CligiEanQuIIHNNKazsnp87A
GT/XA8RO9+OAd5TRP4CEaFDUR0a0OOZjiw3vW6E77BObi31Se9/EO3Z7XpJC2BSl
vTsukplOmeilyspQJT7inVgZrtUlF0lpVV5YbMLjtbn5kzLu/75GZN3JagwCmflN
GKbix/8tnH0KrbizCE8ujUZl+aNjk2TM64mAzOXMlmg+yZ47af31hFs846tCMhv5
JHXkiMaXpWx3PapgGHM+4PJ1WDGTxm6mUvgAcyCVWgT4gxVXsppnkn0RLdRTGQ50
C9movzOn1K2fEe7Pp40arG0tm18l9Ukp7mqN1CL+givRKNb61Owybe6qh2qk0IWT
DAcgdC9h2rdTqQ==
=dsm4
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Thomas Gleixner:
"A single fix for lockdep which makes the recursion protection cover
graph lock/unlock"
* tag 'locking-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Put graph lock/unlock under lock_recursion protection
- Fix typos in seccomp selftests on powerpc and sh (Kees Cook)
- Fix PF_SUPERPRIV audit marking in seccomp and ptrace (Mickaël Salaün)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl+4FQgACgkQiXL039xt
wCYERA/8Cwb8p8PWbtOq6uUKsZ4kJQuMZee3crP+LM0B1a427hOCSuezoDtOY1wO
N7IEOr3ipHxoxZ1Zf2Ln4lGyktxlwzA8pqlGZyqR5zFIl+0HArbXdAfcRMcaeqUp
eSV30CNBD5XfGLc2R5s1qHnrshVBJFebCpgMfSCOQQWMpZ51nnaoFN8N8iSEx6PN
kYHC1C1WX5g3vtKo29xS2Y7KCThMOXvcNI7eFpVD0C4ZwEr8lywbTTzBBhXUIGBX
6NoNOV7kVxIjNLQ8x17F1OacrC6h4ZzNTl4MEYnMZ/Mw0NVB3MvoHQohwW+Y98Rf
97sPQPZjYeJ6xURolRsWX+kvXC7PyLYvfldsQi00QDfdc6RGu0pnsG4UuivsldlY
OhswE9Q/KKHmzXiHnZBmcw4NcSyhZiL3LYB1VZl3jDobeOhVKyHw72vo8Zrhhz8A
ksCDg3vNvOo/x2iH9GSUG4Fjk8coXRif8P6lH5Btw6V+x9ZlFiaW5WbSbP0G3PzJ
zS5nPu8PE6Sm70XlRn0BbRmIjV9AhEZqNNZoOsndrbR86klH6WolyCB4ifj2MKuR
ZwbeDblUrYyRne/Ll9XGQVDSFv8J5phxtDQM0phiGK0jOsvXqbl8RvlckCKqBwm0
7VgtEumU5vJTx01avrXw86Sj7B2IR4M1nTgpwWJ2EVs9U98Emew=
=1Mse
-----END PGP SIGNATURE-----
Merge tag 'seccomp-v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
"This gets the seccomp selftests running again on powerpc and sh, and
fixes an audit reporting oversight noticed in both seccomp and ptrace.
- Fix typos in seccomp selftests on powerpc and sh (Kees Cook)
- Fix PF_SUPERPRIV audit marking in seccomp and ptrace (Mickaël
Salaün)"
* tag 'seccomp-v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
selftests/seccomp: sh: Fix register names
selftests/seccomp: powerpc: Fix typo in macro variable name
seccomp: Set PF_SUPERPRIV when checking capability
ptrace: Set PF_SUPERPRIV when checking capability
Buffers that are passed to read_actions_logged() and write_actions_logged()
are in kernel memory; the sysctl core takes care of copying from/to
userspace.
Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Reviewed-by: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201120170545.1419332-1-jannh@google.com
Simplify task_file_seq_get_next() by removing two in/out arguments: task
and fstruct. Use info->task and info->files instead.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201120002833.2481110-1-songliubraving@fb.com
Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.
This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and filter means the cache will pass the syscall to the BPF filter.
For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]
This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.
Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/94e663fa53136f5a11f432c661794d1ee7060779.1605101222.git.yifeifz2@illinois.edu
SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and stored in a per-filter bitmap.
In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.
Nearly all seccomp filters are built from these cBPF instructions:
BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND | BPF_K
Each of these instructions are emulated. Any weirdness or loading
from a syscall argument will cause the emulator to bail.
The emulation is also halted if it reaches a return. In that case,
if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
Emulator structure and comments are from Kees [1] and Jann [2].
Emulation is done at attach time. If a filter depends on more
filters, and if the dependee does not guarantee to allow the
syscall, then we skip the emulation of this syscall.
[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/71c7be2db5ee08905f41c3be5c1ad6e2601ce88f.1602431034.git.yifeifz2@illinois.edu
The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.
We observed some common filters, such as docker's [4] or
systemd's [5], will make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.
The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.
When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.
Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant before proceeding to test the bit against the
bitmap with the syscall number as the index of the bit in the
bitmap, and if the bit is set, seccomp returns allow. The bitmaps
are all clear in this patch and will be initialized in the next
commit.
When only one architecture exists, the check against architecture
number is skipped, suggested by Kees Cook [7].
[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] ae0ef82b90/profiles/seccomp/default.json
[5] 6743a1caf4/src/shared/seccomp-util.c (L270)
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/
Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/10f91a367ec4fcdea7fc3f086de3f5f13a4a7436.1602431034.git.yifeifz2@illinois.edu
The commit 48021f9813 ("printk: handle blank console arguments
passed in.") prevented crash caused by empty console= parameter value.
Unfortunately, this value is widely used on Chromebooks to disable
the console output. The above commit caused performance regression
because the messages were pushed on slow console even though nobody
was watching it.
Use ttynull driver explicitly for console="" and console=null
parameters. It has been created for exactly this purpose.
It causes that preferred_console is set. As a result, ttySX and ttyX
are not used as a fallback. And only ttynull console gets registered by
default.
It still allows to register other consoles either by additional console=
parameters or SPCR. It prevents regression because it worked this way even
before. Also it is a sane semantic. Preventing output on all consoles
should be done another way, for example, by introducing mute_console
parameter.
Link: https://lore.kernel.org/r/20201006025935.GA597@jagdpanzerIV.localdomain
Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20201111135450.11214-3-pmladek@suse.com
Currently <crypto/sha.h> contains declarations for both SHA-1 and SHA-2,
and <crypto/sha3.h> contains declarations for SHA-3.
This organization is inconsistent, but more importantly SHA-1 is no
longer considered to be cryptographically secure. So to the extent
possible, SHA-1 shouldn't be grouped together with any of the other SHA
versions, and usage of it should be phased out.
Therefore, split <crypto/sha.h> into two headers <crypto/sha1.h> and
<crypto/sha2.h>, and make everyone explicitly specify whether they want
the declarations for SHA-1, SHA-2, or both.
This avoids making the SHA-1 declarations visible to files that don't
want anything to do with SHA-1. It also prepares for potentially moving
sha1.h into a new insecure/ or dangerous/ directory.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
It turns out that init_srcu_struct() can be invoked from usermode tasks,
and that fatal signals received by these tasks can cause memory-allocation
failures. These failures are not handled well by init_srcu_struct(),
so much so that NULL pointer dereferences can result. This commit
therefore causes init_srcu_struct() to take an early exit upon detection
of memory-allocation failure.
Link: https://lore.kernel.org/lkml/20200908144306.33355-1-aik@ozlabs.ru/
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The current memmory-allocation interface causes the following difficulties
for kvfree_rcu():
a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
complain about violation of the nesting rules, as in "BUG: Invalid
wait context". This Kconfig option checks for proper raw_spinlock
vs. spinlock nesting, in particular, it is not legal to acquire a
spinlock_t while holding a raw_spinlock_t.
This is a problem because kfree_rcu() uses raw_spinlock_t whereas the
"page allocator" internally deals with spinlock_t to access to its
zones. The code also can be broken from higher level of view:
<snip>
raw_spin_lock(&some_lock);
kfree_rcu(some_pointer, some_field_offset);
<snip>
b) If built with CONFIG_PREEMPT_RT, spinlock_t is converted into
sleeplock. This means that invoking the page allocator from atomic
contexts results in "BUG: scheduling while atomic".
c) Please note that call_rcu() is already invoked from raw atomic context,
so it is only reasonable to expaect that kfree_rcu() and kvfree_rcu()
will also be called from atomic raw context.
This commit therefore defers page allocation to a clean context using the
combination of an hrtimer and a workqueue. The hrtimer stage is required
in order to avoid deadlocks with the scheduler. This deferred allocation
is required only when kvfree_rcu()'s per-CPU page cache is empty.
Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/
Fixes: 3042f83f19 ("rcu: Support reclaim for head-less object")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
An outgoing CPU is marked offline in a stop-machine handler and most
of that CPU's services stop at that point, including IRQ work queues.
However, that CPU must take another pass through the scheduler and through
a number of CPU-hotplug notifiers, many of which contain RCU readers.
In the past, these readers were not a problem because the outgoing CPU
has interrupts disabled, so that rcu_read_unlock_special() would not
be invoked, and thus RCU would never attempt to queue IRQ work on the
outgoing CPU.
This changed with the advent of the CONFIG_RCU_STRICT_GRACE_PERIOD
Kconfig option, in which rcu_read_unlock_special() is invoked upon exit
from almost all RCU read-side critical sections. Worse yet, because
interrupts are disabled, rcu_read_unlock_special() cannot immediately
report a quiescent state and will therefore attempt to defer this
reporting, for example, by queueing IRQ work. Which fails with a splat
because the CPU is already marked as being offline.
But it turns out that there is no need to report this quiescent state
because rcu_report_dead() will do this job shortly after the outgoing
CPU makes its final dive into the idle loop. This commit therefore
makes rcu_read_unlock_special() refrain from queuing IRQ work onto
outgoing CPUs.
Fixes: 44bad5b3cc ("rcu: Do full report for .need_qs for strict GPs")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Jann Horn <jannh@google.com>
This commit fixes a typo in the rcu_blocking_is_gp() function's header
comment.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_cpu_starting() and rcu_report_dead() functions transition the
current CPU between online and offline state from an RCU perspective.
Unfortunately, this means that the rcu_cpu_starting() function's lock
acquisition and the rcu_report_dead() function's lock releases happen
while the CPU is offline from an RCU perspective, which can result
in lockdep-RCU splats about using RCU from an offline CPU. And this
situation can also result in too-short grace periods, especially in
guest OSes that are subject to vCPU preemption.
This commit therefore uses sequence-count-like synchronization to forgive
use of RCU while RCU thinks a CPU is offline across the full extent of
the rcu_cpu_starting() and rcu_report_dead() function's lock acquisitions
and releases.
One approach would have been to use the actual sequence-count primitives
provided by the Linux kernel. Unfortunately, the resulting code looks
completely broken and wrong, and is likely to result in patches that
break RCU in an attempt to address this appearance of broken wrongness.
Plus there is no net savings in lines of code, given the additional
explicit memory barriers required.
Therefore, this sequence count is instead implemented by a new ->ofl_seq
field in the rcu_node structure. If this counter's value is an odd
number, RCU forgives RCU read-side critical sections on other CPUs covered
by the same rcu_node structure, even if those CPUs are offline from
an RCU perspective. In addition, if a given leaf rcu_node structure's
->ofl_seq counter value is an odd number, rcu_gp_init() delays starting
the grace period until that counter value changes.
[ paulmck: Apply Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Testing showed that rcu_pending() can return 1 when offloaded callbacks
are ready to execute. This invokes RCU core processing, for example,
by raising RCU_SOFTIRQ, eventually resulting in a call to rcu_core().
However, rcu_core() explicitly avoids in any way manipulating offloaded
callbacks, which are instead handled by the rcuog and rcuoc kthreads,
which work independently of rcu_core().
One exception to this independence is that rcu_core() invokes
do_nocb_deferred_wakeup(), however, rcu_pending() also checks
rcu_nocb_need_deferred_wakeup() in order to correctly handle this case,
invoking rcu_core() when needed.
This commit therefore avoids needlessly invoking RCU core processing
by checking rcu_segcblist_ready_cbs() only on non-offloaded CPUs.
This reduces overhead, for example, by reducing softirq activity.
This change passed 30 minute tests of TREE01 through TREE09 each.
On TREE08, there is at most 150us from the time that rcu_pending() chose
not to invoke RCU core processing to the time when the ready callbacks
were invoked by the rcuoc kthread. This provides further evidence that
there is no need to invoke rcu_core() for offloaded callbacks that are
ready to invoke.
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Kim reported that perf-ftrace made his box unhappy. It turns out that
commit:
ff5c4f5cad ("rcu/tree: Mark the idle relevant functions noinstr")
removed one too many notrace qualifiers, probably due to there not being
a helpful comment.
This commit therefore reinstates the notrace and adds a comment to avoid
losing it again.
[ paulmck: Apply Steven Rostedt's feedback on the comment. ]
Fixes: ff5c4f5cad ("rcu/tree: Mark the idle relevant functions noinstr")
Reported-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_cpu_starting() checks to see if the RCU core expects a
quiescent state from the incoming CPU. However, the current interaction
between RCU quiescent-state reporting and CPU-hotplug operations should
mean that the incoming CPU never needs to report a quiescent state.
First, the outgoing CPU reports a quiescent state if needed. Second,
the race where the CPU is leaving just as RCU is initializing a new
grace period is handled by an explicit check for this condition. Third,
the CPU's leaf rcu_node structure's ->lock serializes these checks.
This means that if rcu_cpu_starting() ever feels the need to report
a quiescent state, then there is a bug somewhere in the CPU hotplug
code or the RCU grace-period handling code. This commit therefore
adds a WARN_ON_ONCE() to bring that bug to everyone's attention.
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit clarifies that the "p" and the "s" in the in the RCU_NOCB_CPU
config-option description refer to the "x" in the "rcuox/N" kthread name.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
[ paulmck: While in the area, update description and advice. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses
num_online_cpus() to determine whether there is only one CPU online. When
there is only a single CPU online, the simple fact that synchronize_rcu()
could be legally called implies that a full grace period has elapsed.
Therefore, in the single-CPU case, synchronize_rcu() simply returns
immediately. Unfortunately, num_online_cpus() is unreliable while a
CPU-hotplug operation is transitioning to or from single-CPU operation
because:
1. num_online_cpus() uses atomic_read(&__num_online_cpus) to
locklessly sample the number of online CPUs. The hotplug locks
are not held, which means that an incoming CPU can concurrently
update this count. This in turn means that an RCU read-side
critical section on the incoming CPU might observe updates
prior to the grace period, but also that this critical section
might extend beyond the end of the optimized synchronize_rcu().
This breaks RCU's fundamental guarantee.
2. In addition, num_online_cpus() does no ordering, thus providing
another way that RCU's fundamental guarantee can be broken by
the current code.
3. The most probable failure mode happens on outgoing CPUs.
The outgoing CPU updates the count of online CPUs in the
CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in
and of itself due to preemption being disabled at the call
to num_online_cpus(). Unfortunately, after that stop-machine
handler returns, the CPU takes one last trip through the
scheduler (which has RCU readers) and, after the resulting
context switch, one final dive into the idle loop. During this
time, RCU needs to keep track of two CPUs, but num_online_cpus()
will say that there is only one, which in turn means that the
surviving CPU will incorrectly ignore the outgoing CPU's RCU
read-side critical sections.
This problem is illustrated by the following litmus test in which P0()
corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU.
The herd7 tool confirms that the "exists" clause can be satisfied,
thus demonstrating that this breakage can happen according to the Linux
kernel memory model.
{
int x = 0;
atomic_t numonline = ATOMIC_INIT(1);
}
P0(int *x, atomic_t *numonline)
{
int r0;
WRITE_ONCE(*x, 1);
r0 = atomic_read(numonline);
if (r0 == 1) {
smp_mb();
} else {
synchronize_rcu();
}
WRITE_ONCE(*x, 2);
}
P1(int *x, atomic_t *numonline)
{
int r0; int r1;
atomic_inc(numonline);
smp_mb();
rcu_read_lock();
r0 = READ_ONCE(*x);
smp_rmb();
r1 = READ_ONCE(*x);
rcu_read_unlock();
}
locations [x;numonline;]
exists (1:r0=0 /\ 1:r1=2)
It is important to note that these problems arise only when the system
is transitioning to or from single-CPU operation.
One solution would be to hold the CPU-hotplug locks while sampling
num_online_cpus(), which was in fact the intent of the (redundant)
preempt_disable() and preempt_enable() surrounding this call to
num_online_cpus(). Actually blocking CPU hotplug would not only result
in excessive overhead, but would also unnecessarily impede CPU-hotplug
operations.
This commit therefore follows long-standing RCU tradition by maintaining
a separate RCU-specific set of CPU-hotplug books.
This separate set of books is implemented by a new ->n_online_cpus field
in the rcu_state structure that maintains RCU's count of the online CPUs.
This count is incremented early in the CPU-online process, so that
the critical transition away from single-CPU operation will occur when
there is only a single CPU. Similarly for the critical transition to
single-CPU operation, the counter is decremented late in the CPU-offline
process, again while there is only a single CPU. Because there is only
ever a single CPU when the ->n_online_cpus field undergoes the critical
1->2 and 2->1 transitions, full memory ordering and mutual exclusion is
provided implicitly and, better yet, for free.
In the case where the CPU is coming online, nothing will happen until
the current CPU helps it come online. Therefore, the new CPU will see
all accesses prior to the optimized grace period, which means that RCU
does not need to further delay this new CPU. In the case where the CPU
is going offline, the outgoing CPU is totally out of the picture before
the optimized grace period starts, which means that this outgoing CPU
cannot see any of the accesses following that grace period. Again,
RCU needs no further interaction with the outgoing CPU.
This does mean that synchronize_rcu() will unnecessarily do a few grace
periods the hard way just before the second CPU comes online and just
after the second-to-last CPU goes offline, but it is not worth optimizing
this uncommon case.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit simplifies the use of the rcu_segcblist_is_offloaded() API so
that its callers no longer need to check the RCU_NOCB_CPU Kconfig option.
Note that rcu_segcblist_is_offloaded() is defined in the header file,
which means that the generated code should be just as efficient as before.
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Some stalls are transient, so that system fully recovers. This commit
therefore allows users to configure the number of stalls that must happen
in order to trigger kernel panic.
Signed-off-by: chao <chao@eero.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Eugenio managed to tickle #PF from NMI context which resulted in
hitting a WARN in RCU through irqentry_enter() ->
__rcu_irq_enter_check_tick().
However, this situation is perfectly sane and does not warrant an
WARN. The #PF will (necessarily) be atomic and not require messing
with the tick state, so early return is correct. This commit
therefore removes the WARN.
Fixes: aaf2bc50df ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
Reported-by: "Eugenio Pérez" <eupm90@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
can and bpf (including the strncpy_from_user fix).
Current release - regressions:
- mac80211: fix memory leak of filtered powersave frames
- mac80211: free sta in sta_info_insert_finish() on errors to avoid
sleeping in atomic context
- netlabel: fix an uninitialized variable warning added in -rc4
Previous release - regressions:
- vsock: forward all packets to the host when no H2G is registered,
un-breaking AWS Nitro Enclaves
- net: Exempt multicast addresses from five-second neighbor lifetime
requirement, decreasing the chances neighbor tables fill up
- net/tls: fix corrupted data in recvmsg
- qed: fix ILT configuration of SRC block
- can: m_can: process interrupt only when not runtime suspended
Previous release - always broken:
- page_frag: Recover from memory pressure by not recycling pages
allocating from the reserves
- strncpy_from_user: Mask out bytes after NUL terminator
- ip_tunnels: Set tunnel option flag only when tunnel metadata is
present, always setting it confuses Open vSwitch
- bpf, sockmap:
- Fix partial copy_page_to_iter so progress can still be made
- Fix socket memory accounting and obeying SO_RCVBUF
- net: Have netpoll bring-up DSA management interface
- net: bridge: add missing counters to ndo_get_stats64 callback
- tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt
- enetc: Workaround MDIO register access HW bug
- net/ncsi: move netlink family registration to a subsystem init,
instead of tying it to driver probe
- net: ftgmac100: unregister NC-SI when removing driver to avoid crash
- lan743x: prevent interrupt storm on open
- lan743x: fix freeing skbs in the wrong context
- net/mlx5e: Fix socket refcount leak on kTLS RX resync
- net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097
- fix 21 unset return codes and other mistakes on error paths,
mostly detected by the Hulk Robot
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+226AACgkQMUZtbf5S
IruE1w/+JX3CqJwGIqyzyhwVshNaKxmX9gAOMJzkckjEohn8932zPaNq7kbmNYqt
5QsJoou3cXjFeoIEAkQA5fqR4stTZpZMnLO+7JnxxQ0vb2YBN+tIGQRNCnmd1Q0h
u9gb5+5AdORdlmk3E7oC8v50dzQRfboJXLEEZTo2uGJwUgLlEAiqTSV2w4YDHMhL
JtgtWA/fraL0CUc2WMoxuimg9NirbRuMijsU6+d/yExzznDpdoho/qsxL+Odu1NF
hSdaKirA8B8ml0pOd/b4mj+fm4+lKyXZBfSyLx4Ki1TqluEMLzDp7gQPRnU6yyJm
AOu4zsKxx6qitOX2qLQCNlEpkQp6LA2N2Zb1orliUV3Bsq2DJRhU35FgLcghtdRP
GTRSdKHr2BvMScOZ7dQo8l4TqVc3e/khSZDRGdvpsM275Dt0JyS/l7yAWxunPqMb
+/483/s75OuBRO57ULLJ/hR02TG37g/Jv5sI0sG/7oDpGfnulinQX+fxy9izyTEM
KYl0mAPSqhb6RcjE0YXWG0rhJN6FSvc/lwPQHjq8wPSkwEdD/FTb6/eYqbXDi1ld
UTYhFpkh1PQrwct14eSScMeJqTsNKvG0VV39/uZLZCzcqa3yOY5+oTzwaCFlMsy3
a5yGGxqoh7/FTM8t1ml21is9uZ31LAQEnNTMPv69pZPwAv5G5yE=
=SRwI
-----END PGP SIGNATURE-----
Merge tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.10-rc5, including fixes from the WiFi
(mac80211), can and bpf (including the strncpy_from_user fix).
Current release - regressions:
- mac80211: fix memory leak of filtered powersave frames
- mac80211: free sta in sta_info_insert_finish() on errors to avoid
sleeping in atomic context
- netlabel: fix an uninitialized variable warning added in -rc4
Previous release - regressions:
- vsock: forward all packets to the host when no H2G is registered,
un-breaking AWS Nitro Enclaves
- net: Exempt multicast addresses from five-second neighbor lifetime
requirement, decreasing the chances neighbor tables fill up
- net/tls: fix corrupted data in recvmsg
- qed: fix ILT configuration of SRC block
- can: m_can: process interrupt only when not runtime suspended
Previous release - always broken:
- page_frag: Recover from memory pressure by not recycling pages
allocating from the reserves
- strncpy_from_user: Mask out bytes after NUL terminator
- ip_tunnels: Set tunnel option flag only when tunnel metadata is
present, always setting it confuses Open vSwitch
- bpf, sockmap:
- Fix partial copy_page_to_iter so progress can still be made
- Fix socket memory accounting and obeying SO_RCVBUF
- net: Have netpoll bring-up DSA management interface
- net: bridge: add missing counters to ndo_get_stats64 callback
- tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt
- enetc: Workaround MDIO register access HW bug
- net/ncsi: move netlink family registration to a subsystem init,
instead of tying it to driver probe
- net: ftgmac100: unregister NC-SI when removing driver to avoid
crash
- lan743x:
- prevent interrupt storm on open
- fix freeing skbs in the wrong context
- net/mlx5e: Fix socket refcount leak on kTLS RX resync
- net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097
- fix 21 unset return codes and other mistakes on error paths, mostly
detected by the Hulk Robot"
* tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (115 commits)
fail_function: Remove a redundant mutex unlock
selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid()
net/smc: fix matching of existing link groups
ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module
libbpf: Fix VERSIONED_SYM_COUNT number parsing
net/mlx4_core: Fix init_hca fields offset
atm: nicstar: Unmap DMA on send error
page_frag: Recover from memory pressure
net: dsa: mv88e6xxx: Wait for EEPROM done after HW reset
mlxsw: core: Use variable timeout for EMAD retries
mlxsw: Fix firmware flashing
net: Have netpoll bring-up DSA management interface
atl1e: fix error return code in atl1e_probe()
atl1c: fix error return code in atl1c_probe()
ah6: fix error return code in ah6_input()
net: usb: qmi_wwan: Set DTR quirk for MR400
can: m_can: process interrupt only when not runtime suspended
can: flexcan: flexcan_chip_start(): fix erroneous flexcan_transceiver_enable() during bus-off recovery
...
Alexei Starovoitov says:
====================
1) libbpf should not attempt to load unused subprogs, from Andrii.
2) Make strncpy_from_user() mask out bytes after NUL terminator, from Daniel.
3) Relax return code check for subprograms in the BPF verifier, from Dmitrii.
4) Fix several sockmap issues, from John.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
fail_function: Remove a redundant mutex unlock
selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
libbpf: Fix VERSIONED_SYM_COUNT number parsing
bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list
bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
bpf, sockmap: Use truesize with sk_rmem_schedule()
bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
selftests/bpf: Fix error return code in run_getsockopt_test()
bpf: Relax return code check for subprograms
tools, bpftool: Add missing close before bpftool net attach exit
MAINTAINERS/bpf: Update Andrii's entry.
selftests/bpf: Fix unused attribute usage in subprogs_unused test
bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id
bpf: Fix passing zero to PTR_ERR() in bpf_btf_printf_prepare
libbpf: Don't attempt to load unused subprog as an entry-point BPF program
====================
Link: https://lore.kernel.org/r/20201119200721.288-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix a mutex_unlock() issue where before copy_from_user() is
not called mutex_locked.
Fixes: 4b1a29a7f5 ("error-injection: Support fault injection framework")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/bpf/160570737118.263807.8358435412898356284.stgit@devnote2
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.
A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
destination buffer. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.
The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.
This commit masks out the bytes following the NUL while preserving
long-sized stride in the fast path.
Fixes: 6ae08ae3de ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/21efc982b3e9f2f7b0379eed642294caaa0c27a7.1605642949.git.dxu@dxuuu.xyz
In order to make accurate predictions across CPUs and for all performance
states, Energy Aware Scheduling (EAS) needs frequency-invariant load
tracking signals.
EAS task placement aims to minimize energy consumption, and does so in
part by limiting the search space to only CPUs with the highest spare
capacity (CPU capacity - CPU utilization) in their performance domain.
Those candidates are the placement choices that will keep frequency at
its lowest possible and therefore save the most energy.
But without frequency invariance, a CPU's utilization is relative to the
CPU's current performance level, and not relative to its maximum
performance level, which determines its capacity. As a result, it will
fail to correctly indicate any potential spare capacity obtained by an
increase in a CPU's performance level. Therefore, a non-invariant
utilization signal would render the EAS task placement logic invalid.
Now that we properly report support for the Frequency Invariance Engine
(FIE) through arch_scale_freq_invariant() for arm and arm64 systems,
while also ensuring a re-evaluation of the EAS use conditions for
possible invariance status change, we can assert this is the case when
initializing EAS. Warn and bail out otherwise.
Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201027180713.7642-4-ionela.voinescu@arm.com
Add the rebuild_sched_domains_energy() function to wrap the functionality
that rebuilds the scheduling domains if any of the Energy Aware Scheduling
(EAS) initialisation conditions change. This functionality is used when
schedutil is added or removed or when EAS is enabled or disabled
through the sched_energy_aware sysctl.
Therefore, create a single function that is used in both these cases and
that can be later reused.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Quentin Perret <qperret@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20201027180713.7642-2-ionela.voinescu@arm.com
In case the user wants to stop controlling a uclamp constraint value
for a task, use the magic value -1 in sched_util_{min,max} with the
appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
the reset.
The advantage over the 'additional flag' approach (i.e. introducing
SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
exported via uapi. This avoids the need to document how this new flag
has be used in conjunction with the existing uclamp related flags.
The following subtle issue is fixed as well. When a uclamp constraint
value is set on a !user_defined uclamp_se it is currently first reset
and then set.
Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
stands for the 'sched class change' case.
The related condition 'if (uc_se->user_defined)' moved from
__setscheduler_uclamp() into uclamp_reset().
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yun Hsiang <hsiang023167@gmail.com>
Link: https://lkml.kernel.org/r/20201113113454.25868-1-dietmar.eggemann@arm.com
NUMA topologies where the shortest path between some two nodes requires
three or more hops (i.e. diameter > 2) end up being misrepresented in the
scheduler topology structures.
This is currently detected when booting a kernel with CONFIG_SCHED_DEBUG=y
+ sched_debug on the cmdline, although this will only yield a warning about
sched_group spans not matching sched_domain spans:
ERROR: groups don't span domain->span
Add an explicit warning for that case, triggered regardless of
CONFIG_SCHED_DEBUG, and decorate it with an appropriate comment.
The topology described in the comment can be booted up on QEMU by appending
the following to your usual QEMU incantation:
-smp cores=4 \
-numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
-numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
-numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
-numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
-numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20
A somewhat more realistic topology (6-node mesh) with the same affliction
can be conjured with:
-smp cores=6 \
-numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
-numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
-numa node,cpus=4,nodeid=4, -numa node,cpus=5,nodeid=5, \
-numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
-numa dist,src=0,dst=3,val=40, -numa dist,src=0,dst=4,val=30, \
-numa dist,src=0,dst=5,val=20, \
-numa dist,src=1,dst=2,val=20, -numa dist,src=1,dst=3,val=30, \
-numa dist,src=1,dst=4,val=20, -numa dist,src=1,dst=5,val=30, \
-numa dist,src=2,dst=3,val=20, -numa dist,src=2,dst=4,val=30, \
-numa dist,src=2,dst=5,val=40, \
-numa dist,src=3,dst=4,val=20, -numa dist,src=3,dst=5,val=30, \
-numa dist,src=4,dst=5,val=20
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com
One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:
BUG: unable to handle page fault for address: 0000607f820054db
CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
Workqueue: events cpuset_hotplug_workfn
RIP: build_sched_domains
Call Trace:
partition_sched_domains_locked
rebuild_sched_domains_locked
cpuset_hotplug_workfn
It happens with cgroup2 and exclusive cpusets only. This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:
cd $UNIFIED_ROOT
mkdir cg1
echo 4-7 > cg1/cpuset.cpus
echo root > cg1/cpuset.cpus.partition
# with smt/control reading 'on',
echo off > /sys/devices/system/cpu/smt/control
RIP maps to
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
from sd_init(). sd_id is calculated earlier in the same function:
cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
sd_id = cpumask_first(sched_domain_span(sd));
tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.
The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N. cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read
N's empty cpu_sibling_map in sd_init().
rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check. Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.
Fixes: 0ccea8feb9 ("cpuset: Make generate_sched_domains() work with partition")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
Oleksandr reported hitting the WARN in the 'task_rq(p) != rq' branch
of migration_cpu_stop(). Valentin noted that using cpu_of(rq) in that
case is just plain wrong to begin with, since per the earlier branch
that isn't the actual CPU of the task.
Replace both instances of is_cpu_allowed() by a direct p->cpus_mask
test using task_cpu().
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Debugged-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Qian reported that some fuzzer issuing sched_setaffinity() ends up stuck on
a wait_for_completion(). The problematic pattern seems to be:
affine_move_task()
// task_running() case
stop_one_cpu();
wait_for_completion(&pending->done);
Combined with, on the stopper side:
migration_cpu_stop()
// Task moved between unlocks and scheduling the stopper
task_rq(p) != rq &&
// task_running() case
dest_cpu >= 0
=> no complete_all()
This can happen with both PREEMPT and !PREEMPT, although !PREEMPT should
be more likely to see this given the targeted task has a much bigger window
to block and be woken up elsewhere before the stopper runs.
Make migration_cpu_stop() always look at pending affinity requests; signal
their completion if the stopper hits a rq mismatch but the task is
still within its allowed mask. When Migrate-Disable isn't involved, this
matches the previous set_cpus_allowed_ptr() vs migration_cpu_stop()
behaviour.
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/8b62fd1ad1b18def27f18e2ee2df3ff5b36d0762.camel@redhat.com
schedule_user() was traditionally used by the entry code's tail to
preempt userspace after the call to user_enter(). Indeed the call to
user_enter() used to be performed upon syscall exit slow path which was
right before the last opportunity to schedule() while resuming to
userspace. The context tracking state had to be saved on the task stack
and set back to CONTEXT_KERNEL temporarily in order to safely switch to
another task.
Only a few archs use it now (namely sparc64 and powerpc64) and those
implementing HAVE_CONTEXT_TRACKING_OFFSTACK definetly can't rely on it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117151637.259084-5-frederic@kernel.org
Detect calls to schedule() between user_enter() and user_exit(). Those
are symptoms of early entry code that either forgot to protect a call
to schedule() inside exception_enter()/exception_exit() or, in the case
of HAVE_CONTEXT_TRACKING_OFFSTACK, enabled interrupts or preemption in
a wrong spot.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117151637.259084-4-frederic@kernel.org
We already have a dedicated helper that handles reference count
checking so stop open-coding the reference count check in
switch_task_namespaces() and use the dedicated put_nsproxy() helper
instead.
Take the change to fix a whitespace issue too.
Signed-off-by: Hui Su <sh_def@163.com>
[christian.brauner@ubuntu.com: expand commit message]
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20201115180054.GA371317@rlk
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The variable tick_period is initialized to NSEC_PER_TICK / HZ during boot
and never updated again.
If NSEC_PER_TICK is not an integer multiple of HZ this computation is less
accurate than TICK_NSEC which has proper rounding in place.
Aside of the inaccuracy there is no reason for having this variable at
all. It's just a pointless indirection and all usage sites can just use the
TICK_NSEC constant.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.766643526@linutronix.de
calc_load_global() does not need the sequence count protection.
[ tglx: Split it up properly and added comments ]
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.660902274@linutronix.de
If jiffies are up to date already (caller lost the race against another
CPU) there is no point to change the sequence count. Doing that just forces
other CPUs into the seqcount retry loop in tick_nohz_next_event() for
nothing.
Just bail out early.
[ tglx: Rewrote most of it ]
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.462195901@linutronix.de
No point in doing calculations.
tick_next_period = last_jiffies_update + tick_period
Just check whether now is before tick_next_period to figure out whether
jiffies need an update.
Add a comment why the intentional data race in the quick check is safe or
not so safe in a 32bit corner case and why we don't worry about it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.337366695@linutronix.de
tick_broadcast_setup_oneshot() accesses tick_next_period twice without any
serialization. This is wrong in two aspects:
- Reading it twice might make the broadcast data inconsistent if the
variable is updated concurrently.
- On 32bit systems the access might see an partial update
Protect it with jiffies_lock. That's safe as none of the callchains leading
up to this function can create a lock ordering violation:
timer interrupt
run_local_timers()
hrtimer_run_queues()
hrtimer_switch_to_hres()
tick_init_highres()
tick_switch_to_oneshot()
tick_broadcast_switch_to_oneshot()
or
tick_check_oneshot_change()
tick_nohz_switch_to_nohz()
tick_switch_to_oneshot()
tick_broadcast_switch_to_oneshot()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.061341507@linutronix.de
The helper uses CLOCK_MONOTONIC_COARSE source of time that is less
accurate but more performant.
We have a BPF CGROUP_SKB firewall that supports event logging through
bpf_perf_event_output(). Each event has a timestamp and currently we use
bpf_ktime_get_ns() for it. Use of bpf_ktime_get_coarse_ns() saves ~15-20
ns in time required for event logging.
bpf_ktime_get_ns():
EgressLogByRemoteEndpoint 113.82ns 8.79M
bpf_ktime_get_coarse_ns():
EgressLogByRemoteEndpoint 95.40ns 10.48M
Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201117184549.257280-1-me@ubique.spb.ru
timens_on_fork() always return 0, and maybe not
need to judge the return value in copy_namespaces().
So make timens_on_fork() return nothing and do not
judge its return val in copy_namespaces().
Signed-off-by: Hui Su <sh_def@163.com>
Link: https://lore.kernel.org/r/20201117161750.GA45121@rlk
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Drop the dma_direct_set_offset export and move the declaration to
dma-map-ops.h now that the Allwinner drivers have stopped calling it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
The helper allows modification of certain bits on the linux_binprm
struct starting with the secureexec bit which can be updated using the
BPF_F_BPRM_SECUREEXEC flag.
secureexec can be set by the LSM for privilege gaining executions to set
the AT_SECURE auxv for glibc. When set, the dynamic linker disables the
use of certain environment variables (like LD_PRELOAD).
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201117232929.2156341-1-kpsingh@chromium.org
Replace the use of security_capable(current_cred(), ...) with
ns_capable_noaudit() which set PF_SUPERPRIV.
Since commit 98f368e9e2 ("kernel: Add noaudit variant of
ns_capable()"), a new ns_capable_noaudit() helper is available. Let's
use it!
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Fixes: e2cfabdfd0 ("seccomp: add system call filtering using BPF")
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201030123849.770769-3-mic@digikod.net
Commit 69f594a389 ("ptrace: do not audit capability check when outputing
/proc/pid/stat") replaced the use of ns_capable() with
has_ns_capability{,_noaudit}() which doesn't set PF_SUPERPRIV.
Commit 6b3ad6649a ("ptrace: reintroduce usage of subjective credentials in
ptrace_has_cap()") replaced has_ns_capability{,_noaudit}() with
security_capable(), which doesn't set PF_SUPERPRIV neither.
Since commit 98f368e9e2 ("kernel: Add noaudit variant of ns_capable()"), a
new ns_capable_noaudit() helper is available. Let's use it!
As a result, the signature of ptrace_has_cap() is restored to its original one.
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: stable@vger.kernel.org
Fixes: 6b3ad6649a ("ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()")
Fixes: 69f594a389 ("ptrace: do not audit capability check when outputing /proc/pid/stat")
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201030123849.770769-2-mic@digikod.net
Now that the RDMA core deals with devices that only do DMA mapping in
lower layers properly, there is no user for dma_virt_ops and it can be
removed.
Link: https://lore.kernel.org/r/20201106181941.1878556-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Pull RCU fix from Paul McKenney:
"A single commit that fixes a bug that was introduced a couple of merge
windows ago, but which rather more recently converged to an
agreed-upon fix. The bug is that interrupts can be incorrectly enabled
while holding an irq-disabled spinlock. This can of course result in
self-deadlocks.
The bug is a bit difficult to trigger. It requires that a preempted
task be blocking a preemptible-RCU grace period long enough to trigger
an RCU CPU stall warning. In addition, an interrupt must occur at just
the right time, and that interrupt's handler must acquire that same
irq-disabled spinlock. Still, a deadlock is a deadlock.
Furthermore, we do now have a fix, and that fix survives kernel test
robot, -next, and rcutorture testing. It has also been verified by
Sebastian as fixing the bug. Therefore..."
* 'urgent-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled
Add test cases for newly added resource APIs.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Now we have for 'other' and 'type' variables
other type return
0 0 REGION_DISJOINT
0 x REGION_INTERSECTS
x 0 REGION_DISJOINT
x x REGION_MIXED
Obviously it's easier to check 'type' for 0 first instead of
currently checked 'other'.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A warning was hit when running xfstests/generic/068 in a Hyper-V guest:
[...] ------------[ cut here ]------------
[...] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
[...] WARNING: CPU: 2 PID: 1350 at kernel/locking/lockdep.c:5280 check_flags.part.0+0x165/0x170
[...] ...
[...] Workqueue: events pwq_unbound_release_workfn
[...] RIP: 0010:check_flags.part.0+0x165/0x170
[...] ...
[...] Call Trace:
[...] lock_is_held_type+0x72/0x150
[...] ? lock_acquire+0x16e/0x4a0
[...] rcu_read_lock_sched_held+0x3f/0x80
[...] __send_ipi_one+0x14d/0x1b0
[...] hv_send_ipi+0x12/0x30
[...] __pv_queued_spin_unlock_slowpath+0xd1/0x110
[...] __raw_callee_save___pv_queued_spin_unlock_slowpath+0x11/0x20
[...] .slowpath+0x9/0xe
[...] lockdep_unregister_key+0x128/0x180
[...] pwq_unbound_release_workfn+0xbb/0xf0
[...] process_one_work+0x227/0x5c0
[...] worker_thread+0x55/0x3c0
[...] ? process_one_work+0x5c0/0x5c0
[...] kthread+0x153/0x170
[...] ? __kthread_bind_mask+0x60/0x60
[...] ret_from_fork+0x1f/0x30
The cause of the problem is we have call chain lockdep_unregister_key()
-> <irq disabled by raw_local_irq_save()> lockdep_unlock() ->
arch_spin_unlock() -> __pv_queued_spin_unlock_slowpath() -> pv_kick() ->
__send_ipi_one() -> trace_hyperv_send_ipi_one().
Although this particular warning is triggered because Hyper-V has a
trace point in ipi sending, but in general arch_spin_unlock() may call
another function having a trace point in it, so put the arch_spin_lock()
and arch_spin_unlock() after lock_recursion protection to fix this
problem and avoid similiar problems.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201113110512.1056501-1-boqun.feng@gmail.com
Glenn reported that "an application [he developed produces] a BUG in
deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS
task that was boosted by a SCHED_DEADLINE task boosts another CFS task
(nested priority inheritance).
------------[ cut here ]------------
kernel BUG at kernel/sched/deadline.c:1462!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
Hardware name: ...
RIP: 0010:enqueue_task_dl+0x335/0x910
Code: ...
RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
? intel_pstate_update_util_hwp+0x13/0x170
rt_mutex_setprio+0x1cc/0x4b0
task_blocks_on_rt_mutex+0x225/0x260
rt_spin_lock_slowlock_locked+0xab/0x2d0
rt_spin_lock_slowlock+0x50/0x80
hrtimer_grab_expiry_lock+0x20/0x30
hrtimer_cancel+0x13/0x30
do_nanosleep+0xa0/0x150
hrtimer_nanosleep+0xe1/0x230
? __hrtimer_init_sleeper+0x60/0x60
__x64_sys_nanosleep+0x8d/0xa0
do_syscall_64+0x4a/0x100
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fa58b52330d
...
---[ end trace 0000000000000002 ]—
He also provided a simple reproducer creating the situation below:
So the execution order of locking steps are the following
(N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
are mutexes that are enabled * with priority inheritance.)
Time moves forward as this timeline goes down:
N1 N2 D1
| | |
| | |
Lock(M1) | |
| | |
| Lock(M2) |
| | |
| | Lock(M2)
| | |
| Lock(M1) |
| (!!bug triggered!) |
Daniel reported a similar situation as well, by just letting ksoftirqd
run with DEADLINE (and eventually block on a mutex).
Problem is that boosted entities (Priority Inheritance) use static
DEADLINE parameters of the top priority waiter. However, there might be
cases where top waiter could be a non-DEADLINE entity that is currently
boosted by a DEADLINE entity from a different lock chain (i.e., nested
priority chains involving entities of non-DEADLINE classes). In this
case, top waiter static DEADLINE parameters could be null (initialized
to 0 at fork()) and replenish_dl_entity() would hit a BUG().
Fix this by keeping track of the original donor and using its parameters
when a task is boosted.
Reported-by: Glenn Elliott <glenn@aurora.tech>
Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
schedule() ttwu()
deactivate_task(); if (p->on_rq && ...) // false
atomic_dec(&task_rq(p)->nr_iowait);
if (prev->in_iowait)
atomic_inc(&rq->nr_iowait);
Allows nr_iowait to be decremented before it gets incremented,
resulting in more dodgy IO-wait numbers than usual.
Note that because we can now do ttwu_queue_wakelist() before
p->on_cpu==0, we lose the natural ordering and have to further delay
the decrement.
Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net
enqueue_task_fair() attempts to skip the overutilized update for new
tasks as their util_avg is not accurate yet. However, the flag we check
to do so is overwritten earlier on in the function, which makes the
condition pretty much a nop.
Fix this by saving the flag early on.
Fixes: 2802bf3cd9 ("sched/fair: Add over-utilization/tipping point indicator")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20201112111201.2081902-1-qperret@google.com
Now that the flags migration in the common syscall entry code is complete
and the code relies exclusively on thread_info::syscall_work, clean up the
accesses to TI flags in that path.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-10-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SYSCALL_AUDIT, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-9-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SYSCALL_EMU, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-8-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SYSCALL_TRACE, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-7-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SYSCALL_TRACEPOINT, use it in the generic entry code
and convert the code which uses the TIF specific helper functions to use
the new *_syscall_work() helpers which either resolve to the new mode for
users of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-6-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SECCOMP, use it in the generic entry code and convert
the code which uses the TIF specific helper functions to use the new
*_syscall_work() helpers which either resolve to the new mode for users of
the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-5-krisman@collabora.com
Prepare the common entry code to use the SYSCALL_WORK flags. They will
be defined in subsequent patches for each type of syscall
work. SYSCALL_WORK_ENTRY/EXIT are defined for the transition, as they
will replace the TIF_ equivalent defines.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-4-krisman@collabora.com
The functions event_{set,clear,}_no_set_filter_flag were only used in
replace_system_preds() [now, renamed to process_system_preds()].
Commit 80765597bc ("tracing: Rewrite filter logic to be simpler and
faster") removed the use of those functions in replace_system_preds().
Since then, the functions event_{set,clear,}_no_set_filter_flag were
unused. Fortunately, make CC=clang W=1 indicates this with
-Wunused-function warnings on those three functions.
So, clean up these obsolete unused functions.
Link: https://lkml.kernel.org/r/20201115155336.20248-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Calls to nla_strlcpy are now replaced by calls to nla_strscpy which is the new
name of this function.
Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The hrtimer_get_remaining() markup is documenting, instead,
__hrtimer_get_remaining(), as it is placed at the C file.
In order to properly document it, a kernel-doc markup is needed together
with the function prototype. So, add a new one, while preserving the
existing one, just fixing the function name.
The hrtimer_is_queued prototype has a typo: it is using
'=' instead of '-' to split: identifier - description
as required by kernel-doc markup.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/9dc87808c2fd07b7e050bafcd033c5ef05808fea.1605521731.git.mchehab+huawei@kernel.org
No users outside of the timer code. Move the caller below this function to
avoid a pointless forward declaration.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The kernel-doc parser complains:
kernel/time/timekeeping.c:1543: warning: Function parameter or member
'ts' not described in 'read_persistent_clock64'
kernel/time/timekeeping.c:764: warning: Function parameter or member
'tk' not described in 'timekeeping_forward_now'
kernel/time/timekeeping.c:1331: warning: Function parameter or member
'ts' not described in 'timekeeping_inject_offset'
kernel/time/timekeeping.c:1331: warning: Excess function parameter 'tv'
description in 'timekeeping_inject_offset'
Add the missing parameter documentations and rename the 'tv' parameter of
timekeeping_inject_offset() to 'ts' so it matches the implemention.
[ tglx: Reworded a few docs and massaged changelog ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-5-git-send-email-alex.shi@linux.alibaba.com
Address the following kernel-doc markup warnings:
kernel/time/timekeeping.c:1563: warning: Function parameter or member
'wall_time' not described in 'read_persistent_wall_and_boot_offset'
kernel/time/timekeeping.c:1563: warning: Function parameter or member
'boot_offset' not described in 'read_persistent_wall_and_boot_offset'
The parameters are described but miss the leading '@' and the colon after
the parameter names.
[ tglx: Massaged changelog ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-6-git-send-email-alex.shi@linux.alibaba.com
The kernel-doc parser complains about:
kernel/time/timekeeping.c:651: warning: Function parameter or member
'nb' not described in 'pvclock_gtod_register_notifier'
kernel/time/timekeeping.c:670: warning: Function parameter or member
'nb' not described in 'pvclock_gtod_unregister_notifier'
Add the missing parameter explanations.
[ tglx: Massaged changelog ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-3-git-send-email-alex.shi@linux.alibaba.com
Alex reported the following warning:
kernel/time/timekeeping.c:464: warning: Function parameter or member
'tkf' not described in '__ktime_get_fast_ns'
which is not entirely correct because the documented function is
ktime_get_mono_fast_ns() which does not have a parameter, but the
kernel-doc parser looks at the function declaration which follows the
comment and complains about the missing parameter documentation.
Aside of that the documentation for the rest of the NMI safe accessors is
either incomplete or missing.
- Move the function documentation to the right place
- Fixup the references and inconsistencies
- Add the missing documentation for ktime_get_raw_fast_ns()
Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Address the following warning:
kernel/time/timekeeping.c:415: warning: Function parameter or member
'tkf' not described in 'update_fast_timekeeper'
[ tglx: Remove the bogus ktime_get_mono_fast_ns() part ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-2-git-send-email-alex.shi@linux.alibaba.com
Various static functions in the timekeeping code have function comments
which pretend to be kernel-doc, but are incomplete and trigger parser
warnings.
As these functions are local to the timekeeping core code there is no need
to expose them via kernel-doc.
Remove the double star kernel-doc marker and remove excess newlines.
[ tglx: Massaged changelog and removed excess newlines ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-4-git-send-email-alex.shi@linux.alibaba.com
Address these kernel-doc warnings:
kernel/time/timeconv.c:79: warning: Function parameter or member
'totalsecs' not described in 'time64_to_tm'
kernel/time/timeconv.c:79: warning: Function parameter or member
'offset' not described in 'time64_to_tm'
kernel/time/timeconv.c:79: warning: Function parameter or member
'result' not described in 'time64_to_tm'
The parameters are described but lack colons after the parameter name.
[ tglx: Massaged changelog ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-1-git-send-email-alex.shi@linux.alibaba.com
PREEMPT_RT does not spin and wait until a running timer completes its
callback but instead it blocks on a sleeping lock to prevent a livelock in
the case that the task waiting for the callback completion preempted the
callback.
This cannot be done for timers flagged with TIMER_IRQSAFE. These timers can
be canceled from an interrupt disabled context even on RT kernels.
The expiry callback of such timers is invoked with interrupts disabled so
there is no need to use the expiry lock mechanism because obviously the
callback cannot be preempted even on RT kernels.
Do not use the timer_base::expiry_lock mechanism when waiting for a running
callback to complete if the timer is flagged with TIMER_IRQSAFE.
Also add a lockdep assertion for RT kernels to validate that the expiry
lock mechanism is always invoked in preemptible context.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201103190937.hga67rqhvknki3tp@linutronix.de
Use the "%ps" printk format string to resolve symbol names.
This works on all platforms, including ia64, ppc64 and parisc64 on which
one needs to dereference pointers to function descriptors instead of
function pointers.
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201104163401.GA3984@ls3530.fritz.box
- A set of commits which reduce the stack usage of various perf event
handling functions which allocated large data structs on stack causing
stack overflows in the worst case.
- Use the proper mechanism for detecting soft interrupts in the recursion
protection.
- Make the resursion protection simpler and more robust.
- Simplify the scheduling of event groups to make the code more robust and
prepare for fixing the issues vs. scheduling of exclusive event groups.
- Prevent event multiplexing and rotation for exclusive event groups
- Correct the perf event attribute exclusive semantics to take pinned
events, e.g. the PMU watchdog, into account
- Make the anythread filtering conditional for Intel's generic PMU
counters as it is not longer guaranteed to be supported on newer
CPUs. Check the corresponding CPUID leaf to make sure.
- Fixup a duplicate initialization in an array which was probably cause by
the usual copy & paste - forgot to edit mishap.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xIi0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofixD/4+4gc8DhOmAkMrN0Z9tiW8ebgMKmb9
wZRkMr5Osi0GzLJOPZ6SdY6jd0A3rMN/sW6P1DT6pDtcty4bKFoW5VZBuUDIAhel
BC4C93L3y1En/GEZu1GTy3LvsBwLBQTOoY4goDjbdAbk60S/0RTHOGyQsRsOQFe6
fVs3iXozAFuaR6I6N3dlxuJAE51zvr8MyBWaUoByNDB//1+lLNW+JfClaAOG1oXx
qZIg/niatBVGzSGgKNRUyh3g8G1HJtabsA/NZ4PH8ZHuYABfmj4lmmUPR77ICLfV
wMITEBG7eaktB8EqM9hvaoOZLA5kpXHO2JbCFSs4c4x11mlC8g7QMV3poCw33YoN
a5TmT1A3muri1riy1/Ee9lXACOq7/tf2+Xfn9o6dvDdBwd6s5pzlhLGR8gILp2lF
2bcg3IwYvHT/Kiurb/WGNpbCqQIPJpcUcfs3tNBCCtKegahUQNnGjxN3NVo9RCit
zfL6xIJ8eZiYnsxXx4NKm744AukWiql3aRNgRkOdBP5WC68xt6VLcxG1YZKUoDhy
jRSOCD/DuPSMSvAAgN7S8OWlPsKWBxVxxWYV+K8FpwhgzbQ3WbS3UDiYkhgjeOxu
OlM692oWpllKvQWlvYthr2Be6oPCRRi1vvADNNbTKzgHk5i61bwympsGl1EZx3Pz
2ROp7NJFRESnqw==
=FzCf
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A set of fixes for perf:
- A set of commits which reduce the stack usage of various perf
event handling functions which allocated large data structs on
stack causing stack overflows in the worst case
- Use the proper mechanism for detecting soft interrupts in the
recursion protection
- Make the resursion protection simpler and more robust
- Simplify the scheduling of event groups to make the code more
robust and prepare for fixing the issues vs. scheduling of
exclusive event groups
- Prevent event multiplexing and rotation for exclusive event groups
- Correct the perf event attribute exclusive semantics to take
pinned events, e.g. the PMU watchdog, into account
- Make the anythread filtering conditional for Intel's generic PMU
counters as it is not longer guaranteed to be supported on newer
CPUs. Check the corresponding CPUID leaf to make sure
- Fixup a duplicate initialization in an array which was probably
caused by the usual 'copy & paste - forgot to edit' mishap"
* tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix Add BW copypasta
perf/x86/intel: Make anythread filter support conditional
perf: Tweak perf_event_attr::exclusive semantics
perf: Fix event multiplexing for exclusive groups
perf: Simplify group_sched_in()
perf: Simplify group_sched_out()
perf/x86: Make dummy_iregs static
perf/arch: Remove perf_sample_data::regs_user_copy
perf: Optimize get_recursion_context()
perf: Fix get_recursion_context()
perf/x86: Reduce stack usage for x86_pmu::drain_pebs()
perf: Reduce stack usage of perf_output_begin()
- Address a load balancer regression by making the load balancer use the
same logic as the wakeup path to spread tasks in the LLC domain.
- Prefer the CPU on which a task run last over the local CPU in the fast
wakeup path for asymmetric CPU capacity systems to align with the
symmetric case. This ensures more locality and prevents massive
migration overhead on those asymetric systems
- Fix a memory corruption bug in the scheduler debug code caused by
handing a modified buffer pointer to kfree().
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xJIoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofyGD/9rUnLlC1h7jEufVa4yPG94DcEqiXT7
8B/zNRKnOmqQePCYUm+DS8njSFqpF9VjR+5zpos3bgYqwn7DyfV+hpxbbgS9NDh/
qRg5gxhTrR4uMyZN62Fex5JS4bP8mKO7oc0usgV2Ytsg3e4H+9DqYhuaA5GrJAxC
J3d1Hv/YBW2Uo+RZpB20aaJr0srN7bswTtPMxeeqo8q3Qh4pFcI+rmA4WphVAgHF
jQWaNP4YVTgNjqxy7nBp7zFHlSdRbLohldZFtueYmRo1mjmkyQ34Cg7etfBvN1Uf
iVYZLaInr0YPr0qR4FrQ3yI8ln/HESxshs0ARzMReYVT71mV//o5wftE18uCULQB
rRu9vYz+LBVhkdgx118jJdNJqyqk6Ca6h9ZLqyBKuckj9a39289bwWiS6D/6W51p
gurq58YTb2lRzyCnOVEULXehYRJkDI8EToiWppRVm9gy43OFPNox7n6TvNLW6BLS
I8msTVdqDYXXj4U1o4Mf9K5LBKlda+ARuBu87r7kH1BJLxXHnOHcEkmeN8O9k7eu
jdWfeDzDDjBjt/TU+X4f4RNjudUZrSPQrrESE5+XhfM4CwqcPXa2M/dGtPekW/ED
9IqxPvwkau+0Ym6gkuanfnmda+JVR/nLvZV0uFuUGd+2xMcRemZbZE6hTUiYvYPY
CAHpOhmeakbr6w==
=wFcU
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"A set of scheduler fixes:
- Address a load balancer regression by making the load balancer use
the same logic as the wakeup path to spread tasks in the LLC domain
- Prefer the CPU on which a task run last over the local CPU in the
fast wakeup path for asymmetric CPU capacity systems to align with
the symmetric case. This ensures more locality and prevents massive
migration overhead on those asymetric systems
- Fix a memory corruption bug in the scheduler debug code caused by
handing a modified buffer pointer to kfree()"
* tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Fix memory corruption caused by multiple small reads of flags
sched/fair: Prefer prev cpu in asymmetric wakeup path
sched/fair: Ensure tasks spreading in LLC during LB
- Prevent an unconditional interrupt enable in a futex helper function
which can be called from contexts which expect interrupts to stay
disabled across the call.
- Don't modify lockdep chain keys in the validation process as that
causes chain inconsistency.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xF7cTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoR9dD/0X7EmbGdTIL9ZgEQfch8fWY79uHqQ7
EXgVQb26KuWDzRutgk0qeSixQTYfhzoT2nPRGNpAQtyYYir0p2fXG2kstJEQZQzq
toK5VsL11NJKVhlO/1y5+RcufsgfjTqOpqygbMm1gz+7ejfe7kfJZUAEMwbWzZRn
a8HZ/VTQb/Z/F1xv+PuACkCp79ezzosL4hiN5QG0FEyiX47Pf8HuXTHKAD3SJgnc
6ZvCkDkCHOv3jGmQ68sXBQ2m/ciYnDs1D8J/SD9zmggLFs8+R0LKCNxI46HfgRDV
3oqx7OivazDvBmNXlSCFQQG+saIgRlWuVPV8PVTD3Ihmx25DzXhreSyxjyRhl3uL
WN7C3ztk6lJv0B/BbtpFceobXjE7IN71CjDIFCwii20dn5UTrgLaJwnW1YrX6qqb
+iz3cJs4bNLD1brGAlx8lqZxZ2omyXcPNQOi+vSkdTy8C/OYmypC1xusWSBpBtkp
1+V1uoYFtJWsBzKfmbXAPSSiw9ppIVe/w3J/LrCcFv3CAEaDlCkN0klOF3/ULsga
d+hMEUKagIqRXNeBdoEfY78LzSfIqkovy3rLNnWATu8fS2IdoRiwhliRxMCU9Ceh
4WU1F4QAqu42cd0zYVWnVIfDVP3Qx/ENl8/Jd0m7Z8jmmjhOQo5XO7o5PxP8uP3U
NvnoT5d5D2KysA==
=XuF7
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two fixes for the locking subsystem:
- Prevent an unconditional interrupt enable in a futex helper
function which can be called from contexts which expect interrupts
to stay disabled across the call
- Don't modify lockdep chain keys in the validation process as that
causes chain inconsistency"
* tag 'locking-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Avoid to modify chain keys in validate_chain()
futex: Don't enable IRQs unconditionally in put_pi_state()
This allows an exclusive wait_queue_entry to be added at the head of the
queue, instead of the tail as normal. Thus, it gets to consume events
first without allowing non-exclusive waiters to be woken at all.
The (first) intended use is for KVM IRQFD, which currently has
inconsistent behaviour depending on whether posted interrupts are
available or not. If they are, KVM will bypass the eventfd completely
and deliver interrupts directly to the appropriate vCPU. If not, events
are delivered through the eventfd and userspace will receive them when
polling on the eventfd.
By using add_wait_queue_priority(), KVM will be able to consistently
consume events within the kernel without accidentally exposing them
to userspace when they're supposed to be bypassed. This, in turn, means
that userspace doesn't have to jump through hoops to avoid listening
on the erroneously noisy eventfd and injecting duplicate interrupts.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027143944.648769-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Define watchdog_allowed_mask only when SOFTLOCKUP_DETECTOR is enabled.
Fixes: 7feeb9cd4f ("watchdog/sysctl: Clean up sysctl variable name space")
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20201106015025.1281561-1-santosh@fossix.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fix parsing of reboot= cmdline", v3.
The parsing of the reboot= cmdline has two major errors:
- a missing bound check can crash the system on reboot
- parsing of the cpu number only works if specified last
Fix both.
This patch (of 2):
This reverts commit 616feab753.
kstrtoint() and simple_strtoul() have a subtle difference which makes
them non interchangeable: if a non digit character is found amid the
parsing, the former will return an error, while the latter will just
stop parsing, e.g. simple_strtoul("123xyx") = 123.
The kernel cmdline reboot= argument allows to specify the CPU used for
rebooting, with the syntax `s####` among the other flags, e.g.
"reboot=warm,s31,force", so if this flag is not the last given, it's
silently ignored as well as the subsequent ones.
Fixes: 616feab753 ("kernel/reboot.c: convert simple_strtoul to kstrtoint")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201103214025.116799-2-mcroce@linux.microsoft.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-11-14
1) Add BTF generation for kernel modules and extend BTF infra in kernel
e.g. support for split BTF loading and validation, from Andrii Nakryiko.
2) Support for pointers beyond pkt_end to recognize LLVM generated patterns
on inlined branch conditions, from Alexei Starovoitov.
3) Implements bpf_local_storage for task_struct for BPF LSM, from KP Singh.
4) Enable FENTRY/FEXIT/RAW_TP tracing program to use the bpf_sk_storage
infra, from Martin KaFai Lau.
5) Add XDP bulk APIs that introduce a defer/flush mechanism to optimize the
XDP_REDIRECT path, from Lorenzo Bianconi.
6) Fix a potential (although rather theoretical) deadlock of hashtab in NMI
context, from Song Liu.
7) Fixes for cross and out-of-tree build of bpftool and runqslower allowing build
for different target archs on same source tree, from Jean-Philippe Brucker.
8) Fix error path in htab_map_alloc() triggered from syzbot, from Eric Dumazet.
9) Move functionality from test_tcpbpf_user into the test_progs framework so it
can run in BPF CI, from Alexander Duyck.
10) Lift hashtab key_size limit to be larger than MAX_BPF_STACK, from Florian Lehner.
Note that for the fix from Song we have seen a sparse report on context
imbalance which requires changes in sparse itself for proper annotation
detection where this is currently being discussed on linux-sparse among
developers [0]. Once we have more clarification/guidance after their fix,
Song will follow-up.
[0] https://lore.kernel.org/linux-sparse/CAHk-=wh4bx8A8dHnX612MsDO13st6uzAz1mJ1PaHHVevJx_ZCw@mail.gmail.com/T/https://lore.kernel.org/linux-sparse/20201109221345.uklbp3lzgq6g42zb@ltop.local/T/
* git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (66 commits)
net: mlx5: Add xdp tx return bulking support
net: mvpp2: Add xdp tx return bulking support
net: mvneta: Add xdp tx return bulking support
net: page_pool: Add bulk support for ptr_ring
net: xdp: Introduce bulking for xdp tx return path
bpf: Expose bpf_d_path helper to sleepable LSM hooks
bpf: Augment the set of sleepable LSM hooks
bpf: selftest: Use bpf_sk_storage in FENTRY/FEXIT/RAW_TP
bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP
bpf: Rename some functions in bpf_sk_storage
bpf: Folding omem_charge() into sk_storage_charge()
selftests/bpf: Add asm tests for pkt vs pkt_end comparison.
selftests/bpf: Add skb_pkt_end test
bpf: Support for pointers beyond pkt_end.
tools/bpf: Always run the *-clean recipes
tools/bpf: Add bootstrap/ to .gitignore
bpf: Fix NULL dereference in bpf_task_storage
tools/bpftool: Fix build slowdown
tools/runqslower: Build bpftool using HOSTCC
tools/runqslower: Enable out-of-tree build
...
====================
Link: https://lore.kernel.org/r/20201114020819.29584-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently verifier enforces return code checks for subprograms in the
same manner as it does for program entry points. This prevents returning
arbitrary scalar values from subprograms. Scalar type of returned values
is checked by btf_prepare_func_args() and hence it should be safe to
allow only scalars for now. Relax return code checks for subprograms and
allow any correct scalar values.
Fixes: 51c39bb1d5 (bpf: Introduce function-by-function verification)
Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201113171756.90594-1-me@ubique.spb.ru
Commit ba31c1a485 ("futex: Move futex exit handling into futex code")
introduced compat_exit_robust_list() with a full-fledged implementation for
CONFIG_COMPAT, and an empty-body function for !CONFIG_COMPAT.
However, compat_exit_robust_list() is only used in futex_mm_release() under
#ifdef CONFIG_COMPAT.
Hence for !CONFIG_COMPAT, make CC=clang W=1 warns:
kernel/futex.c:314:20:
warning: unused function 'compat_exit_robust_list' [-Wunused-function]
There is no need to declare the unused empty function for !CONFIG_COMPAT.
Simply remove it.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Link: https://lore.kernel.org/r/20201113172012.27221-1-lukas.bulwahn@gmail.com
- Spectre/Meltdown safelisting for some Qualcomm KRYO cores
- Fix RCU splat when failing to online a CPU due to a feature mismatch
- Fix a recently introduced sparse warning in kexec()
- Fix handling of CPU erratum 1418040 for late CPUs
- Ensure hot-added memory falls within linear-mapped region
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl+ubogQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNPD7B/9i5ao44AEJwjz0a68S/jD7kUD7i3xVkCNN
Y8i/i9mx44IAcf8pmyQh3ngaFlJuF2C6oC/SQFiDbmVeGeZXLXvXV7uGAqXG5Xjm
O2Svgr1ry176JWpsB7MNnZwzAatQffdkDjbjQCcUnUIKYcLvge8H2fICljujGcfQ
094vNmT9VerTWRbWDti3Ck/ug+sanVHuzk5BWdKx3jamjeTqo+sBZK/wgBr6UoYQ
mT3BFX42kLIGg+AzwXRDPlzkJymjYgQDbSwGsvny8qKdOEJbAUwWXYZ5sTs9J/gU
E9PT3VJI7BYtTd1uPEWkD645U3arfx3Pf2JcJlbkEp86qx4CUF9s
=T6k4
-----END PGP SIGNATURE-----
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
- Spectre/Meltdown safelisting for some Qualcomm KRYO cores
- Fix RCU splat when failing to online a CPU due to a feature mismatch
- Fix a recently introduced sparse warning in kexec()
- Fix handling of CPU erratum 1418040 for late CPUs
- Ensure hot-added memory falls within linear-mapped region
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: cpu_errata: Apply Erratum 845719 to KRYO2XX Silver
arm64: proton-pack: Add KRYO2XX silver CPUs to spectre-v2 safe-list
arm64: kpti: Add KRYO2XX gold/silver CPU cores to kpti safelist
arm64: Add MIDR value for KRYO2XX gold/silver CPU cores
arm64/mm: Validate hotplug range before creating linear mapping
arm64: smp: Tell RCU about CPUs that fail to come online
arm64: psci: Avoid printing in cpu_psci_cpu_die()
arm64: kexec_file: Fix sparse warning
arm64: errata: Fix handling of 1418040 with late CPU onlining
The value of variable ret is overwritten on the delete branch in the
test_create_synth_event() and we care more about the above error than
this delete portion. Remove it.
Link: https://lkml.kernel.org/r/1605283360-6804-1-git-send-email-kaixuxia@tencent.com
Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS is available, the ftrace call
will be able to set the ip of the calling function. This will improve the
performance of live kernel patching where it does not need all the regs to
be stored just to change the instruction pointer.
If all archs that support live kernel patching also support
HAVE_DYNAMIC_FTRACE_WITH_ARGS, then the architecture specific function
klp_arch_set_pc() could be made generic.
It is possible that an arch can support HAVE_DYNAMIC_FTRACE_WITH_ARGS but
not HAVE_DYNAMIC_FTRACE_WITH_REGS and then have access to live patching.
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: live-patching@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, the only way to get access to the registers of a function via a
ftrace callback is to set the "FL_SAVE_REGS" bit in the ftrace_ops. But as this
saves all regs as if a breakpoint were to trigger (for use with kprobes), it
is expensive.
The regs are already saved on the stack for the default ftrace callbacks, as
that is required otherwise a function being traced will get the wrong
arguments and possibly crash. And on x86, the arguments are already stored
where they would be on a pt_regs structure to use that code for both the
regs version of a callback, it makes sense to pass that information always
to all functions.
If an architecture does this (as x86_64 now does), it is to set
HAVE_DYNAMIC_FTRACE_WITH_ARGS, and this will let the generic code that it
could have access to arguments without having to set the flags.
This also includes having the stack pointer being saved, which could be used
for accessing arguments on the stack, as well as having the function graph
tracer not require its own trampoline!
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation to have arguments of a function passed to callbacks attached
to functions as default, change the default callback prototype to receive a
struct ftrace_regs as the forth parameter instead of a pt_regs.
For callbacks that set the FL_SAVE_REGS flag in their ftrace_ops flags, they
will now need to get the pt_regs via a ftrace_get_regs() helper call. If
this is called by a callback that their ftrace_ops did not have a
FL_SAVE_REGS flag set, it that helper function will return NULL.
This will allow the ftrace_regs to hold enough just to get the parameters
and stack pointer, but without the worry that callbacks may have a pt_regs
that is not completely filled.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Sleepable hooks are never called from an NMI/interrupt context, so it
is safe to use the bpf_d_path helper in LSM programs attaching to these
hooks.
The helper is not restricted to sleepable programs and merely uses the
list of sleepable hooks as the initial subset of LSM hooks where it can
be used.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201113005930.541956-3-kpsingh@chromium.org
Update the set of sleepable hooks with the ones that do not trigger
a warning with might_fault() when exercised with the correct kernel
config options enabled, i.e.
DEBUG_ATOMIC_SLEEP=y
LOCKDEP=y
PROVE_LOCKING=y
This means that a sleepable LSM eBPF program can be attached to these
LSM hooks. A new helper method bpf_lsm_is_sleepable_hook is added and
the set is maintained locally in bpf_lsm.c
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201113005930.541956-2-kpsingh@chromium.org
This patch enables the FENTRY/FEXIT/RAW_TP tracing program to use
the bpf_sk_storage_(get|delete) helper, so those tracing programs
can access the sk's bpf_local_storage and the later selftest
will show some examples.
The bpf_sk_storage is currently used in bpf-tcp-cc, tc,
cg sockops...etc which is running either in softirq or
task context.
This patch adds bpf_sk_storage_get_tracing_proto and
bpf_sk_storage_delete_tracing_proto. They will check
in runtime that the helpers can only be called when serving
softirq or running in a task context. That should enable
most common tracing use cases on sk.
During the load time, the new tracing_allowed() function
will ensure the tracing prog using the bpf_sk_storage_(get|delete)
helper is not tracing any bpf_sk_storage*() function itself.
The sk is passed as "void *" when calling into bpf_local_storage.
This patch only allows tracing a kernel function.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201112211313.2587383-1-kafai@fb.com
This patch adds the verifier support to recognize inlined branch conditions.
The LLVM knows that the branch evaluates to the same value, but the verifier
couldn't track it. Hence causing valid programs to be rejected.
The potential LLVM workaround: https://reviews.llvm.org/D87428
can have undesired side effects, since LLVM doesn't know that
skb->data/data_end are being compared. LLVM has to introduce extra boolean
variable and use inline_asm trick to force easier for the verifier assembly.
Instead teach the verifier to recognize that
r1 = skb->data;
r1 += 10;
r2 = skb->data_end;
if (r1 > r2) {
here r1 points beyond packet_end and
subsequent
if (r1 > r2) // always evaluates to "true".
}
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20201111031213.25109-2-alexei.starovoitov@gmail.com
Current release - regressions:
- arm64: dts: fsl-ls1028a-kontron-sl28: specify in-band mode for ENETC
Current release - bugs in new features:
- mptcp: provide rmem[0] limit offset to fix oops
Previous release - regressions:
- IPv6: Set SIT tunnel hard_header_len to zero to fix path MTU
calculations
- lan743x: correctly handle chips with internal PHY
- bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE
- mlx5e: Fix VXLAN port table synchronization after function reload
Previous release - always broken:
- bpf: Zero-fill re-used per-cpu map element
- net: udp: fix out-of-order packets when forwarding with UDP GSO
fraglists turned on
- fix UDP header access on Fast/frag0 UDP GRO
- fix IP header access and skb lookup on Fast/frag0 UDP GRO
- ethtool: netlink: add missing netdev_features_change() call
- net: Update window_clamp if SOCK_RCVBUF is set
- igc: Fix returning wrong statistics
- ch_ktls: fix multiple leaks and corner cases in Chelsio TLS offload
- tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 replies
- r8169: disable hw csum for short packets on all chip versions
- vrf: Fix fast path output packet handling with async Netfilter rules
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+thbIACgkQMUZtbf5S
Irsy0RAAhYIYDNMSkQhcVcQPMxbtStwgTtKrWxg/D2zh3Kg+B4oRgoNZnt9kmlHX
Su/aRWbTWBkDIMxIWBfRsO3z5zSQm4yLG1FTlfsOcWzOJcsntCO8SzikyxtnEZK8
Bpi7dOoKB6KF0V2YjM9AHh5fbXvS7KJfp/PjZ7Kpn5BEbFV8rKtIyiJxwXXZUr6O
ddM9Om4i0zf+dmsY1HVEyowPQMVB3vbn8F3dPk3ZrD8NVa53NtvMRxHKSsourRbZ
yp4LKZV+POKHPFglO4jhLymhyeiwb1qgA8wssk7EKu0bwPeOcER4Tpewh1ib4C/C
sRRzj0Wlw6dyPCkyNKx23D7dF/DrnLmXLUBhGS2mu2htSlWOH6w6rFQoVSNGGy9T
DKUlUVUPG80mgYdME6NLJ27GOGQzxoAvzWgpcL6dJs9jz8nQqABJeXvdjw/vc/XH
AOaKy4VwE3qf0W106JpUb+a/q0RJf7w3o4c1vLc/AZwpshNBOsrJBqrTk2E5Nrhd
mcQykaF++DbLPIyTqhHl0GpKapohThESyMvfc4WRBFBaCwgFdOY/t0Gz3GA2N8Jc
fuq9NOB1bfouaFGfzdkZ7RZJi3lFqZfv/XiJCh/knp1/lHAQPo4TuADcFDsjeEc9
yr48SRDnCqahAQ7bUP0b5i31SZzwAYb/HnwYuvf4LWFvHl9XG5A=
=AKM7
-----END PGP SIGNATURE-----
Merge tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Current release - regressions:
- arm64: dts: fsl-ls1028a-kontron-sl28: specify in-band mode for
ENETC
Current release - bugs in new features:
- mptcp: provide rmem[0] limit offset to fix oops
Previous release - regressions:
- IPv6: Set SIT tunnel hard_header_len to zero to fix path MTU
calculations
- lan743x: correctly handle chips with internal PHY
- bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE
- mlx5e: Fix VXLAN port table synchronization after function reload
Previous release - always broken:
- bpf: Zero-fill re-used per-cpu map element
- fix out-of-order UDP packets when forwarding with UDP GSO fraglists
turned on:
- fix UDP header access on Fast/frag0 UDP GRO
- fix IP header access and skb lookup on Fast/frag0 UDP GRO
- ethtool: netlink: add missing netdev_features_change() call
- net: Update window_clamp if SOCK_RCVBUF is set
- igc: Fix returning wrong statistics
- ch_ktls: fix multiple leaks and corner cases in Chelsio TLS offload
- tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 replies
- r8169: disable hw csum for short packets on all chip versions
- vrf: Fix fast path output packet handling with async Netfilter
rules"
* tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
lan743x: fix use of uninitialized variable
net: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO
net: udp: fix UDP header access on Fast/frag0 UDP GRO
devlink: Avoid overwriting port attributes of registered port
vrf: Fix fast path output packet handling with async Netfilter rules
cosa: Add missing kfree in error path of cosa_write
net: switch to the kernel.org patchwork instance
ch_ktls: stop the txq if reaches threshold
ch_ktls: tcb update fails sometimes
ch_ktls/cxgb4: handle partial tag alone SKBs
ch_ktls: don't free skb before sending FIN
ch_ktls: packet handling prior to start marker
ch_ktls: Correction in middle record handling
ch_ktls: missing handling of header alone
ch_ktls: Correction in trimmed_len calculation
cxgb4/ch_ktls: creating skbs causes panic
ch_ktls: Update cheksum information
ch_ktls: Correction in finding correct length
cxgb4/ch_ktls: decrypted bit is not enough
net/x25: Fix null-ptr-deref in x25_connect
...
Make the intel_pstate driver behave as expected when it operates in
the passive mode with HWP enabled and the "powersave" governor on
top of it.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl+tVWUSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxNIwQALNj2uV+CJL4DCvcMPFqvtB7bwwsk1mS
fqRk/wEhz5noE/x3uhD1DKlL3VPj1sDUVRmHKSdIgqrwSFX2zbw+cf2y6E94WdDz
/7x0Khj4mT5cfGHacItNBnkglCrxVXxSdU4DcPTgINlWM8iv6W8D3uK5OpFYDtKr
5shqf45U8/+fh7hGCtNnofAZEVU+YTDzY0jHnnIxD8FKXFLaDFj6jVGjgdXgBD5s
/XgsKz867SybzLuTW9O0SKDughMhmhaqXnHwtu9jvlw/3i1Wn16r2LeCWyIkoxWy
MRNXg4rOerJvK084gxJW9BWmCuA6NnKBtKvXNlqHzl14ept3Cf3dYtaQ50x7eYQB
osMWbBDdRjV1fo7SptMXQmn8sKxZgrjc0pSYicbiMOH3BkpIn5ed4+MPWfWN8pyb
piRkx17sFwPE7jI5Rkuv+EucisG8tNvWImE9gFENxtelF1rj7njV1xMYlevrD/9u
aYTNYUeRAc6DA2AF/mzXtqwXpDqoxa7X0UBl8JFmkLvcvORtR3XY6HlcYMgp820a
/Sh0rZmuUlssUnrhd1Kr6QRiMIrCihTnbhTXsY0oZH4QSYYJCS89qijngAtqobEt
K+eqsHHoGmVK3Ch+O+YpFo+GpH5Avk0b/DisX3Zu20hGEX4fvv4q7ZTuNSnHhgOL
ERjLBUQZUFaf
=zh3K
-----END PGP SIGNATURE-----
Merge tag 'pm-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Make the intel_pstate driver behave as expected when it operates in
the passive mode with HWP enabled and the 'powersave' governor on top
of it"
* tag 'pm-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Take CPUFREQ_GOV_STRICT_TARGET into account
cpufreq: Add strict_target to struct cpufreq_policy
cpufreq: Introduce CPUFREQ_GOV_STRICT_TARGET
cpufreq: Introduce governor flags
In bpf_pid_task_storage_update_elem(), it missed to
test the !task_storage_ptr(task) which then could trigger a NULL
pointer exception in bpf_local_storage_update().
Fixes: 4cf1bc1f10 ("bpf: Implement task local storage")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Roman Gushchin <guro@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201112001919.2028357-1-kafai@fb.com
Pull swiotlb fixes from Konrad Rzeszutek Wilk:
"Two tiny fixes for issues that make drivers under Xen unhappy under
certain conditions"
* 'stable/for-linus-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
swiotlb: remove the tbl_dma_addr argument to swiotlb_tbl_map_single
swiotlb: fix "x86: Don't panic if can not alloc buffer for swiotlb"
A bunch of functions in the new ringbuffer code take both a
printk_ringbuffer struct and a separate prb_data_ring. This is a relic
from an earlier version of the code when a second data ring was present.
Since this is no longer the case remove the extra function argument
from:
- data_make_reusable()
- data_push_tail()
- data_alloc()
- data_realloc()
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
The unsigned variable datasec_id is assigned a return value from the call
to check_pseudo_btf_id(), which may return negative error code.
This fixes the following coccicheck warning:
./kernel/bpf/verifier.c:9616:5-15: WARNING: Unsigned expression compared with zero: datasec_id > 0
Fixes: eaa6bcb71e ("bpf: Introduce bpf_per_cpu_ptr()")
Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/1605071026-25906-1-git-send-email-kaixuxia@tencent.com
Make sure btf_parse_module() is compiled out if module BTFs are not enabled.
Fixes: 36e68442d1 ("bpf: Load and verify kernel module BTFs")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201111040645.903494-1-andrii@kernel.org
'ret' in 2 functions are not used. and one of them is a void function.
So remove them to avoid gcc warning:
kernel/trace/ftrace.c:4166:6: warning: variable ‘ret’ set but not used
[-Wunused-but-set-variable]
kernel/trace/ftrace.c:5571:6: warning: variable ‘ret’ set but not used
[-Wunused-but-set-variable]
Link: https://lkml.kernel.org/r/1604674486-52350-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a new config RING_BUFFER_RECORD_RECURSION that will place functions that
recurse from the ring buffer into the ftrace recused_functions file.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Inspecting the data structures of the function graph tracer, I found that
the overrun value is unsigned long, which is 8 bytes on a 64 bit machine,
and not only that, the depth is an int (4 bytes). The overrun can be simply
an unsigned int (4 bytes) and pack the ftrace_graph_ret structure better.
The depth is moved up next to the func, as it is used more often with func,
and improves cache locality.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The try_invoke_on_locked_down_task() function requires that
interrupts be enabled, but it is called with interrupts disabled from
rcu_print_task_stall(), resulting in an "IRQs not enabled as expected"
diagnostic. This commit therefore updates rcu_print_task_stall()
to accumulate a list of the first few tasks while holding the current
leaf rcu_node structure's ->lock, then releases that lock and only then
uses try_invoke_on_locked_down_task() to attempt to obtain per-task
detailed information. Of course, as soon as ->lock is released, the
task might exit, so the get_task_struct() function is used to prevent
the task structure from going away in the meantime.
Link: https://lore.kernel.org/lkml/000000000000903d5805ab908fc4@google.com/
Fixes: 5bef8da66a ("rcu: Add per-task state to RCU CPU stall warnings")
Reported-by: syzbot+cb3b69ae80afd6535b0e@syzkaller.appspotmail.com
Reported-by: syzbot+f04854e1c5c9e913cc27@syzkaller.appspotmail.com
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Add kernel module listener that will load/validate and unload module BTF.
Module BTFs gets ID generated for them, which makes it possible to iterate
them with existing BTF iteration API. They are given their respective module's
names, which will get reported through GET_OBJ_INFO API. They are also marked
as in-kernel BTFs for tooling to distinguish them from user-provided BTFs.
Also, similarly to vmlinux BTF, kernel module BTFs are exposed through
sysfs as /sys/kernel/btf/<module-name>. This is convenient for user-space
tools to inspect module BTF contents and dump their types with existing tools:
[vmuser@archvm bpf]$ ls -la /sys/kernel/btf
total 0
drwxr-xr-x 2 root root 0 Nov 4 19:46 .
drwxr-xr-x 13 root root 0 Nov 4 19:46 ..
...
-r--r--r-- 1 root root 888 Nov 4 19:46 irqbypass
-r--r--r-- 1 root root 100225 Nov 4 19:46 kvm
-r--r--r-- 1 root root 35401 Nov 4 19:46 kvm_intel
-r--r--r-- 1 root root 120 Nov 4 19:46 pcspkr
-r--r--r-- 1 root root 399 Nov 4 19:46 serio_raw
-r--r--r-- 1 root root 4094095 Nov 4 19:46 vmlinux
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/bpf/20201110011932.3201430-5-andrii@kernel.org
Allocate ID for vmlinux BTF. This makes it visible when iterating over all BTF
objects in the system. To allow distinguishing vmlinux BTF (and later kernel
module BTF) from user-provided BTFs, expose extra kernel_btf flag, as well as
BTF name ("vmlinux" for vmlinux BTF, will equal to module's name for module
BTF). We might want to later allow specifying BTF name for user-provided BTFs
as well, if that makes sense. But currently this is reserved only for
in-kernel BTFs.
Having in-kernel BTFs exposed IDs will allow to extend BPF APIs that require
in-kernel BTF type with ability to specify BTF types from kernel modules, not
just vmlinux BTF. This will be implemented in a follow up patch set for
fentry/fexit/fmod_ret/lsm/etc.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201110011932.3201430-3-andrii@kernel.org
Adjust in-kernel BTF implementation to support a split BTF mode of operation.
Changes are mostly mirroring libbpf split BTF changes, with the exception of
start_id being 0 for in-kernel implementation due to simpler read-only mode.
Otherwise, for split BTF logic, most of the logic of jumping to base BTF,
where necessary, is encapsulated in few helper functions. Type numbering and
string offset in a split BTF are logically continuing where base BTF ends, so
most of the high-level logic is kept without changes.
Type verification and size resolution is only doing an added resolution of new
split BTF types and relies on already cached size and type resolution results
in the base BTF.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201110011932.3201430-2-andrii@kernel.org
The Energy Model supports power values expressed in milli-Watts or in an
'abstract scale'. Update the related comments is the code to reflect that
state.
Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are different platforms and devices which might use different scale
for the power values. Kernel sub-systems might need to check if all
Energy Model (EM) devices are using the same scale. Address that issue and
store the information inside EM for each device. Thanks to that they can
be easily compared and proper action triggered.
Suggested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull core dump fix from Al Viro:
"Fix for multithreaded coredump playing fast and loose with getting
registers of secondary threads; if a secondary gets caught in the
middle of exit(2), the conditition it will be stopped in for dumper to
examine might be unusual enough for things to go wrong.
Quite a few architectures are fine with that, but some are not."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
don't dump the threads that had been already exiting when zapped.
After reboot, it's not possible to use hotkeys to enter BIOS setup
and boot menu on some HP laptops.
BIOS folks identified the root cause is the missing _PTS call, and
BIOS is expecting _PTS to do proper reset.
Using S5 for reboot is default behavior under Windows, "A full
shutdown (S5) occurs when a system restart is requested" [1], so
let's do the same here.
[1] https://docs.microsoft.com/en-us/windows/win32/power/system-power-states
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The CFS wakeup code will only ever go through EAS / its fast path on
"regular" wakeups (i.e. not on forks or execs). These are currently gated
by a check against 'sd_flag', which would be SD_BALANCE_WAKE at wakeup.
However, we now have a flag that explicitly tells us whether a wakeup is a
"regular" one, so hinge those conditions on that flag instead.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201102184514.2733-4-valentin.schneider@arm.com
Only select_task_rq_fair() uses that parameter to do an actual domain
search, other classes only care about what kind of wakeup is happening
(fork, exec, or "regular") and thus just translate the flag into a wakeup
type.
WF_TTWU and WF_EXEC have just been added, use these along with WF_FORK to
encode the wakeup types we care about. For select_task_rq_fair(), we can
simply use the shiny new WF_flag : SD_flag mapping.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201102184514.2733-3-valentin.schneider@arm.com
To remove the sd_flag parameter of select_task_rq(), we need another way of
encoding wakeup types. There already is a WF_FORK flag, add the missing two.
With that said, we still need an easy way to turn WF_foo into
SD_bar (e.g. WF_TTWU into SD_BALANCE_WAKE). As suggested by Peter, let's
make our lives easier and make them match exactly, and throw in some
compile-time checks for good measure.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201102184514.2733-2-valentin.schneider@arm.com
Since ab93a4bc95 ("sched/fair: Remove distribute_running fromCFS
bandwidth"), there is nothing to protect between
raw_spin_lock_irqsave/store() in do_sched_cfs_slack_timer().
Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20201030144621.GA96974@rlk
In order to minimize the interference of migrate_disable() on lower
priority tasks, which can be deprived of runtime due to being stuck
below a higher priority task. Teach the RT/DL balancers to push away
these higher priority tasks when a lower priority task gets selected
to run on a freshly demoted CPU (pull).
This adds migration interference to the higher priority task, but
restores bandwidth to system that would otherwise be irrevocably lost.
Without this it would be possible to have all tasks on the system
stuck on a single CPU, each task preempted in a migrate_disable()
section with a single high priority task running.
This way we can still approximate running the M highest priority tasks
on the system.
Migrating the top task away is (ofcourse) still subject to
migrate_disable() too, which means the lower task is subject to an
interference equivalent to the worst case migrate_disable() section.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.499155098@infradead.org
There's a valid ->pi_lock recursion issue where the actual PI code
tries to wake up the stop task. Make lockdep aware so it doesn't
complain about this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.406912197@infradead.org
We want migrate_disable() tasks to get PULLs in order for them to PUSH
away the higher priority task.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.310519774@infradead.org
Replace a bunch of cpumask_any*() instances with
cpumask_any*_distribute(), by injecting this little bit of random in
cpu selection, we reduce the chance two competing balance operations
working off the same lowest_mask pick the same CPU.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.190759694@infradead.org
On CPU unplug tasks which are in a migrate disabled region cannot be pushed
to a different CPU until they returned to migrateable state.
Account the number of tasks on a runqueue which are in a migrate disabled
section and make the hotplug wait mechanism respect that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.067278757@infradead.org
Concurrent migrate_disable() and set_cpus_allowed_ptr() has
interesting features. We rely on set_cpus_allowed_ptr() to not return
until the task runs inside the provided mask. This expectation is
exported to userspace.
This means that any set_cpus_allowed_ptr() caller must wait until
migrate_enable() allows migrations.
At the same time, we don't want migrate_enable() to schedule, due to
patterns like:
preempt_disable();
migrate_disable();
...
migrate_enable();
preempt_enable();
And:
raw_spin_lock(&B);
spin_unlock(&A);
this means that when migrate_enable() must restore the affinity
mask, it cannot wait for completion thereof. Luck will have it that
that is exactly the case where there is a pending
set_cpus_allowed_ptr(), so let that provide storage for the async stop
machine.
Much thanks to Valentin who used TLA+ most effective and found lots of
'interesting' cases.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.921768277@infradead.org
Add the base migrate_disable() support (under protest).
While migrate_disable() is (currently) required for PREEMPT_RT, it is
also one of the biggest flaws in the system.
Notably this is just the base implementation, it is broken vs
sched_setaffinity() and hotplug, both solved in additional patches for
ease of review.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.818170844@infradead.org
Thread a u32 flags word through the *set_cpus_allowed*() callchain.
This will allow adding behavioural tweaks for future users.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.729082820@infradead.org
Since we now migrate tasks away before DYING, we should also move
bandwidth unthrottle, otherwise we can gain tasks from unthrottle
after we expect all tasks to be gone already.
Also; it looks like the RT balancers don't respect cpu_active() and
instead rely on rq->online in part, complete this. This too requires
we do set_rq_offline() earlier to match the cpu_active() semantics.
(The bigger patch is to convert RT to cpu_active() entirely)
Since set_rq_online() is called from sched_cpu_activate(), place
set_rq_offline() in sched_cpu_deactivate().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.639538965@infradead.org
With the new mechanism which kicks tasks off the outgoing CPU at the end of
schedule() the situation on an outgoing CPU right before the stopper thread
brings it down completely is:
- All user tasks and all unbound kernel threads have either been migrated
away or are not running and the next wakeup will move them to a online CPU.
- All per CPU kernel threads, except cpu hotplug thread and the stopper
thread have either been unbound or parked by the responsible CPU hotplug
callback.
That means that at the last step before the stopper thread is invoked the
cpu hotplug thread is the last legitimate running task on the outgoing
CPU.
Add a final wait step right before the stopper thread is kicked which
ensures that any still running tasks on the way to park or on the way to
kick themself of the CPU are either sleeping or gone.
This allows to remove the migrate_tasks() crutch in sched_cpu_dying(). If
sched_cpu_dying() detects that there is still another running task aside of
the stopper thread then it will explode with the appropriate fireworks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.547163969@infradead.org
Don't rely on the scheduler to force break affinity for us -- it will
stop doing that for per-cpu-kthreads.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.464718669@infradead.org
RT kernels need to ensure that all tasks which are not per CPU kthreads
have left the outgoing CPU to guarantee that no tasks are force migrated
within a migrate disabled section.
There is also some desire to (ab)use fine grained CPU hotplug control to
clear a CPU from active state to force migrate tasks which are not per CPU
kthreads away for power control purposes.
Add a mechanism which waits until all tasks which should leave the CPU
after the CPU active flag is cleared have moved to a different online CPU.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.377836842@infradead.org
In preparation for migrate_disable(), make sure only per-cpu kthreads
are allowed to run on !active CPUs.
This is ran (as one of the very first steps) from the cpu-hotplug
task which is a per-cpu kthread and completion of the hotplug
operation only requires such tasks.
This constraint enables the migrate_disable() implementation to wait
for completion of all migrate_disable regions on this CPU at hotplug
time without fear of any new ones starting.
This replaces the unlikely(rq->balance_callbacks) test at the tail of
context_switch with an unlikely(rq->balance_work), the fast path is
not affected.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.292709163@infradead.org
The intent of balance_callback() has always been to delay executing
balancing operations until the end of the current rq->lock section.
This is because balance operations must often drop rq->lock, and that
isn't safe in general.
However, as noted by Scott, there were a few holes in that scheme;
balance_callback() was called after rq->lock was dropped, which means
another CPU can interleave and touch the callback list.
Rework code to call the balance callbacks before dropping rq->lock
where possible, and otherwise splice the balance list onto a local
stack.
This guarantees that the balance list must be empty when we take
rq->lock. IOW, we'll only ever run our own balance callbacks.
Reported-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.203901269@infradead.org
Crashes in stop-machine are hard to connect to the calling code, add a
little something to help with that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.116513635@infradead.org
Reading /proc/sys/kernel/sched_domain/cpu*/domain0/flags mutliple times
with small reads causes oopses with slub corruption issues because the kfree is
free'ing an offset from a previous allocation. Fix this by adding in a new
pointer 'buf' for the allocation and kfree and use the temporary pointer tmp
to handle memory copies of the buf offsets.
Fixes: 5b9f8ff7b3 ("sched/debug: Output SD flag names rather than their values")
Reported-by: Jeff Bastian <jbastian@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20201029151103.373410-1-colin.king@canonical.com
During fast wakeup path, scheduler always check whether local or prev
cpus are good candidates for the task before looking for other cpus in
the domain. With commit b7a331615d ("sched/fair: Add asymmetric CPU
capacity wakeup scan") the heterogenous system gains a dedicated path
but doesn't try to reuse prev cpu whenever possible. If the previous
cpu is idle and belong to the LLC domain, we should check it 1st
before looking for another cpu because it stays one of the best
candidate and this also stabilizes task placement on the system.
This change aligns asymmetric path behavior with symmetric one and reduces
cases where the task migrates across all cpus of the sd_asym_cpucapacity
domains at wakeup.
This change does not impact normal EAS mode but only the overloaded case or
when EAS is not used.
- On hikey960 with performance governor (EAS disable)
./perf bench sched pipe -T -l 50000
mainline w/ patch
# migrations 999364 0
ops/sec 149313(+/-0.28%) 182587(+/- 0.40) +22%
- On hikey with performance governor
./perf bench sched pipe -T -l 50000
mainline w/ patch
# migrations 0 0
ops/sec 47721(+/-0.76%) 47899(+/- 0.56) +0.4%
According to test on hikey, the patch doesn't impact symmetric system
compared to current implementation (only tested on arm64)
Also read the uclamped value of task's utilization at most twice instead
instead each time we compare task's utilization with cpu's capacity.
Fixes: b7a331615d ("sched/fair: Add asymmetric CPU capacity wakeup scan")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20201029161824.26389-1-vincent.guittot@linaro.org
schbench shows latency increase for 95 percentile above since:
commit 0b0695f2b3 ("sched/fair: Rework load_balance()")
Align the behavior of the load balancer with the wake up path, which tries
to select an idle CPU which belongs to the LLC for a waking task.
calculate_imbalance() will use nr_running instead of the spare
capacity when CPUs share resources (ie cache) at the domain level. This
will ensure a better spread of tasks on idle CPUs.
Running schbench on a hikey (8cores arm64) shows the problem:
tip/sched/core :
schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
Latency percentiles (usec)
50.0th: 33
75.0th: 45
90.0th: 51
95.0th: 4152
*99.0th: 14288
99.5th: 14288
99.9th: 14288
min=0, max=14276
tip/sched/core + patch :
schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
Latency percentiles (usec)
50.0th: 34
75.0th: 47
90.0th: 52
95.0th: 78
*99.0th: 94
99.5th: 94
99.9th: 94
min=0, max=94
Fixes: 0b0695f2b3 ("sched/fair: Rework load_balance()")
Reported-by: Chris Mason <clm@fb.com>
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20201102102457.28808-1-vincent.guittot@linaro.org
Chris Wilson reported a problem spotted by check_chain_key(): a chain
key got changed in validate_chain() because we modify the ->read in
validate_chain() to skip checks for dependency adding, and ->read is
taken into calculation for chain key since commit f611e8cf98
("lockdep: Take read/write status in consideration when generate
chainkey").
Fix this by avoiding to modify ->read in validate_chain() based on two
facts: a) since we now support recursive read lock detection, there is
no need to skip checks for dependency adding for recursive readers, b)
since we have a), there is only one case left (nest_lock) where we want
to skip checks in validate_chain(), we simply remove the modification
for ->read and rely on the return value of check_deadlock() to skip the
dependency adding.
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201102053743.450459-1-boqun.feng@gmail.com
A new cpufreq governor flag will be added subsequently, so replace
the bool dynamic_switching fleid in struct cpufreq_governor with a
flags field and introduce CPUFREQ_GOV_DYNAMIC_SWITCHING to set for
the "dynamic switching" governors instead of it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Commit ce3d31ad3c ("arm64/smp: Move rcu_cpu_starting() earlier") ensured
that RCU is informed early about incoming CPUs that might end up calling
into printk() before they are online. However, if such a CPU fails the
early CPU feature compatibility checks in check_local_cpu_capabilities(),
then it will be powered off or parked without informing RCU, leading to
an endless stream of stalls:
| rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
| rcu: 2-O...: (0 ticks this GP) idle=002/1/0x4000000000000000 softirq=0/0 fqs=2593
| (detected by 0, t=5252 jiffies, g=9317, q=136)
| Task dump for CPU 2:
| task:swapper/2 state:R running task stack: 0 pid: 0 ppid: 1 flags:0x00000028
| Call trace:
| ret_from_fork+0x0/0x30
Ensure that the dying CPU invokes rcu_report_dead() prior to being powered
off or parked.
Cc: Qian Cai <cai@redhat.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Qian Cai <cai@redhat.com>
Link: https://lore.kernel.org/r/20201105222242.GA8842@willie-the-truck
Link: https://lore.kernel.org/r/20201106103602.9849-3-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
As commit:
39f23ce07b ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list")
does in unthrottle_cfs_rq(), throttle_cfs_rq() can also use the same
pattern as dequeue_task_fair().
No functional changes.
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Phil Auld <pauld@redhat.com>
Cc: Ben Segall <bsegall@google.com>
Link: https://lore.kernel.org/r/f11dd2e3ab35cc538e2eb57bf0c99b6eaffce127.1604973978.git.rocking@linux.alibaba.com
There is a bug when passing zero to PTR_ERR() and return.
Fix the smatch error.
Fixes: c4d0bfb450 ("bpf: Add bpf_snprintf_btf helper")
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/1604735144-686-1-git-send-email-wangqing@vivo.com
Currently perf_event_attr::exclusive can be used to ensure an
event(group) is the sole group scheduled on the PMU. One consequence
is that when you have a pinned event (say the watchdog) you can no
longer have regular exclusive event(group)s.
Inspired by the fact that !pinned events are considered less strict,
allow !pinned,exclusive events to share the PMU with pinned,!exclusive
events.
Pinned,exclusive is still fully exclusive.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201029162902.105962225@infradead.org
Commit 9e6302056f ("perf: Use hrtimers for event multiplexing")
placed the hrtimer (re)start call in the wrong place. Instead of
capturing all scheduling failures, it only considered the PMU failure.
The result is that groups using perf_event_attr::exclusive are no
longer rotated.
Fixes: 9e6302056f ("perf: Use hrtimers for event multiplexing")
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201029162902.038667689@infradead.org
Since event_sched_out() clears cpuctx->exclusive upon removal of an
exclusive event (and only group leaders can be exclusive), there is no
point in group_sched_out() trying to do it too. It is impossible for
cpuctx->exclusive to still be set here.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201029162901.904060564@infradead.org
struct perf_sample_data lives on-stack, we should be careful about it's
size. Furthermore, the pt_regs copy in there is only because x86_64 is a
trainwreck, solve it differently.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030151955.258178461@infradead.org
__perf_output_begin() has an on-stack struct perf_sample_data in the
unlikely case it needs to generate a LOST record. However, every call
to perf_output_begin() must already have a perf_sample_data on-stack.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201030151954.985416146@infradead.org
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+pR0MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoSWpD/93kyOy0L7NkIELgM6/OHipjsLC6K12
jMXifA5DfmIIm31sLLzLk08YPz4TOJU+lZKn1DGdqYLioMvvsJe3uZP/WQVyV81z
QnqdwOpdJVxq7JIjQW04eaVDWjXmxFGmbjKq/tbGexph0hckOtUq/7JOEukzgbLX
q5iZqYgFouU/G5HIW5Um21WiEVmzzjTFMp13zbqo3rMrG9vXIb5JQxm+TkBbx2P7
u2kS8R3OFScQcH6UyhaFBBmNFyUHtfbMKPESnhVSXUggQ0aVhNi6pjylE2ZaEIB+
d7C/yNGNwlC7rGPlh49W5gH5rogrX2Ft2YVrHf645q3Sj/GhbXZ1NqT4f1DJt5uM
tKjKFxJv6g1dT7ejlUQmseAtLI4ue2wj3C0qtPeHOnUlPHlaDlkTLE2oaCh97Mgn
mDpAZVnMOcLMRuBFt1J+fnoYcBHwXlfT1rAj//U6m6Pi5BbwIVwTsok5Vsysms4L
Tyx31zXee3XTp8FEWRL9gqH0b5zXxEUuHxZkqu4vdQDf+wkTi08Q4FpGnhaxGBrY
CHTwm42hmOK80hdQ6Vv4O38LkQcaEpTztVRexP7a89Q3zxmGlL5BsWGqIIoai0Fg
8UWmHvauG3kqAif6giWv3QNQ5v0rWksHNxU1yK0tRT6s03no3Kz1eoHHLar8+hoU
8+/AY23fcJookg==
=SPWC
-----END PGP SIGNATURE-----
Merge tag 'core-entry-notify-signal' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into tif-task_work.arch
Core changes to support TASK_NOTIFY_SIGNAL
* tag 'core-entry-notify-signal' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
task_work: Use TIF_NOTIFY_SIGNAL if available
entry: Add support for TIF_NOTIFY_SIGNAL
signal: Add task_sigpending() helper
The exit_pi_state_list() function calls put_pi_state() with IRQs disabled
and is not expecting that IRQs will be enabled inside the function.
Use the _irqsave() variant so that IRQs are restored to the original state
instead of being enabled unconditionally.
Fixes: 153fbd1226 ("futex: Fix more put_pi_state() vs. exit_pi_state_list() races")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201106085205.GA1159983@mwanda
Many comments in this module do not comply with the preferred multi-line
comment style as reported by 'scripts/checkpatch.pl':
WARNING: Block comments use * on subsequent lines
WARNING: Block comments use a trailing */ on a separate line
Fix those comments, along with (unreported for some reason?) the starts
of the multi-line comments not being /* on their own line...
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Some functions have the proper 'kernel-doc' comments but these don't start
with proper /** -- fix that, along with adding () to the function name on
the following lines to fully comply with the 'kernel-doc' format.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Some 'kernel-doc' function comments do not fully comply with the specified
format due to:
- missing () after the function name;
- "RETURNS:"/"Returns:" instead of "Return:" when documenting the function's
result.
- empty line before describing the function's arguments.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
current->group_leader->exit_signal may change during copy_process() if
current->real_parent exits.
Move the assignment inside tasklist_lock to avoid the race.
Signed-off-by: Eddy Wu <eddy_wu@trendmicro.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parser.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+oC4ATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXveD/4mIAq1umSj/YdThj65S6nAHjhW1MjO
VxGRNSEiUxOld6OnxZlyup1q4NvyYB1cVk71jLjdBIvEK/ANdq3Cd+PVB5tAAoNK
726zrPlOXx4NJulq2q17ygazZQgOtl/5QsYovS0PymCyyvwfz56yr/OfPRThCe+S
gJZs+zt8uunhEKz1J7lki/vn7D6bHRdaL8yTLabsPIQ296WXrUd2NS4fuanFNyLO
kxETMAECyy77q17dgClFFDheyLHAotoySLm6qSwNeWUPFwlCmoCbHgzSBoryIlH5
DMlwA5iuA5qVv/4jlZ0+WvVWeSWOyF0boZ1cZI+r4nitVvdwnnxTgXcyuTxDSggu
nygQqLlTIQ31dwQleAfKrooJfwk9nB5uLEQkTwdx0BQwdHhvb7usnNEcIMx8WVFf
86LorPlfbKrUTD7PWj7zg3x32GNv/nWjuDnzD0NW6BPKqVy6XaRZtaG2WyYfB9YW
9KXpfsxjDeTBrZK5b/YOna1ah00J+2hwIL7sUhVf24aOmmImy3+Yqe2bQOlR/JGs
yIBFOAPCWrl7AJOwziMD2NQGCTzk9hqKRyhy1WtPC/L0uohf16/xbnViyL1Ijhud
Hzu09qoo8EXYbSwRcHQ+qK39s9fwDK1yvPp+k4XGXS/v94YN0W5rd3o49bTMvhxE
T8CKfwngsBzeYg==
=r2wz
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
"A single fix for the perf core plugging a memory leak in the address
filter parser"
* tag 'perf-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix a memory leak in perf_event_parse_addr_filter()
underlying RT mutex was not handled correctly and triggering a BUG()
instead of treating it as another variant of retry condition.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+oCqcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUSvD/kBBV+cSHoLZqWzoYCySxnIgTX5xLgH
imdhBnGX4MZsHydD1zu9IKuHbOTdRdnf/aXk+6akqyRTqtAxqL3e1BW7zfX2aWo/
EOk5P36ZLzdqqLrc4Z/FZOLaUAx95t2dse0PA3jXOUTHaxr/bwih+/f1vR84xFBG
/Sv1TubPLU7koZI057SrZsHmhF7yc22BMvg30UFUgx8tZyXGBIUEYOYS6sgGBwQe
qXIxTHkpiRwCHu+i44QJiBdhPg8gEb4tsJQLlnmhPYz4m5v/w4HGe7ixZlBQU7f1
isY4GEuDHRRheOzvP0d5LgcRx6r1hNDDu8MWNjWp7N1sTTKMGnDnZ/a36TgBtlvc
Go+HVNCapfhSeul8qjZuL022a9Yv2NcsouwdPWfZpMyWuhkXYWnaVHWvuEKsB4UZ
z0b34q5pULwSz/R4BiqwoLZXdWNsN8x/C31F3v0jBNa/HwK5dTEAgeez0+HNJhDA
Ci/+CgggLuch6V6JYwOT1AEB4ymEXTuRK/d8IwlVADA3fN0wulUUAnbgVYHRWD4y
JaXJKWHnw/G8R7AI31aVHa70D64tAM+0UdsLM+4/idP2kc2Cv7EcCdIHz2OUfQBD
I1/RmugdCPsF5Zutb0dQFNumzxRGk6l+woa7EFnzpkO96GBJDCQiUZKvTFINdCh0
LoLv49vOBVDifQ==
=PX5t
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fix from Thomas Gleixner:
"A single fix for the futex code where an intermediate state in the
underlying RT mutex was not handled correctly and triggering a BUG()
instead of treating it as another variant of retry condition"
* tag 'locking-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Handle transient "ownerless" rtmutex state correctly
- Fix the fallout of the IPI as interrupt conversion in Kconfig and the
BCM2836 interrupt chip driver/
- Fixes for interrupt affinity setting and the handling of hierarchical
irq domains in the SiFive PLIC driver.
- Make the unmapped event handling in the TI SCI driver work correctly.
- A few minor fixes and cleanups in various chip drivers and Kconfig.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+oCiATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUIkEAC9wYn4x5fObSRpamwj0lagGkUsd0dB
JmKW+u/oUkhM1ZosqPegslA6RDc7LitONFcnqBD348JBrxyP94OnskhBWITv9Nx1
p4AkLS7HDs0pgOpam66qJdxAmLZ0F0vOPnEgr2VQWqaRV8qlTESNiCxT2wKF7mRd
qJn5pW3kSJnyfA66cpFVC3Db64KmdWQvv0BCnc1Wqq3odXgOuTvOkqzVrlqJZQ59
RrZudU0Lz9FtUujQ7AuP+RZY7Ti/AjmEaZDvcsnz3SR6vZGV8qG1f/88RP31d4qK
62cIOfz/bsl1uxAwnKGM4U84tzYua6djhRZXInlzL4/iYKlm5qVR8Qpotl7IxT5n
ntMJGhu/Evy987mq1maOR2rWyqVNU5BoVJUeHHibP8LANTKf7sdWOaNqieXQRKYS
ZnTmoRImBOFhxi0BuqKwwHqJILC4yOZpOa+ARMPqns2KzA4jpAvN+MwPWwpsBVaD
giVm8e8CQvFgMQjjRHcOkfernSsQs/fyQSYQY9qJI/IzVqTEFcYUONneelJGFB7R
iDFMURx7aQUPNf8p9c7eEx0BSBUBan+Quul4HQ1I+SnmOZg3QgRvlMm/zPxhnfyU
3NV4oi4c/fG+7ex+tR3igcfYPCK35BP+6iDG+GKTBjAHimQqk9XXsDYN/Wh3nW6f
UnsJIs9zdhPJsA==
=HUPi
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of fixes for interrupt chip drivers:
- Fix the fallout of the IPI as interrupt conversion in Kconfig and
the BCM2836 interrupt chip driver
- Fixes for interrupt affinity setting and the handling of
hierarchical irq domains in the SiFive PLIC driver
- Make the unmapped event handling in the TI SCI driver work
correctly
- A few minor fixes and cleanups in various chip drivers and Kconfig"
* tag 'irq-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
dt-bindings: irqchip: ti, sci-inta: Fix diagram indentation for unmapped events
irqchip/ti-sci-inta: Add support for unmapped event handling
dt-bindings: irqchip: ti, sci-inta: Update for unmapped event handling
irqchip/renesas-intc-irqpin: Merge irlm_bit and needs_irlm
irqchip/sifive-plic: Fix chip_data access within a hierarchy
irqchip/sifive-plic: Fix broken irq_set_affinity() callback
irqchip/stm32-exti: Add all LP timer exti direct events support
irqchip/bcm2836: Fix missing __init annotation
irqchip/mips: Drop selection of IRQ_DOMAIN_HIERARCHY
irqchip/mst: Make mst_intc_of_init static
irqchip/mst: MST_IRQ should depend on ARCH_MEDIATEK or ARCH_MSTARV7
genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_HIERARCHY
that the lockdep interrupt state needs not to be established before calling
the RCU check.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+oCN8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYgDD/42A/CTltvs1meUm8gpCbACe/mwY4Ms
4K35p9E/3ozyhjkXbP1odZ6I/rh1pKZYz+qxS7JYDEDgl+3ZfZCeb8WFNspBRvD1
uURn3ou5aVFHGutjF/5yEzrHP+Vkw8ViOvIjhZ1zpPQ0kWyEP7P5K+0wPFV1BVda
ugk6T0ouQo7Kl7c7g/EP9tr9UNorih9ZmkJcsZ7fkzDin+5Ine6v9J1VRumSkm11
4OU7My/c/0EpTuslnW4w4kVYk90B+o6dJccG5SVJVrRHvziD3qhM3slI2UqnIIb+
/8NTrUy79mikZQh6LG5QrPoX9/+tp68csoFwzE6MpVvV+GTlfnbX7xn2ZW0t7ZSA
FVYUxYSO3kCzvU2VfqzLc8HtgK/esoMpcgloB0FM9XDw2khlgvnJyTCAnsn44QIP
tIA+FHGv9oa+QW5NoiRpBtTAYImTfiwaTa5AXCBFijHXsJVMziWDqZ+AQwf58LP9
RjItnOu0ZKzvLHRYDfqpusIUytFv04YuAn2kJ1vDWL4nXeo3vma2at+s/zbPHOjW
LZDUeavv7e5pqMDFC1QDja4o5fSfK0Q7ddVP8Xm1yrnZz5jPmQNCGpLxZWx8wsjq
qxPEUdfqTFvRJQhLHkVznuQUrOVYPVlZfoPro47HNHeHs+qLI1QxmsJHhRu2vd1M
PwVVBz1vvk4Txg==
=ZSdr
-----END PGP SIGNATURE-----
Merge tag 'core-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull entry code fix from Thomas Gleixner:
"A single fix for the generic entry code to correct the wrong
assumption that the lockdep interrupt state needs not to be
established before calling the RCU check"
* tag 'core-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Fix the incorrect ordering of lockdep and RCU check
Gratian managed to trigger the BUG_ON(!newowner) in fixup_pi_state_owner().
This is one possible chain of events leading to this:
Task Prio Operation
T1 120 lock(F)
T2 120 lock(F) -> blocks (top waiter)
T3 50 (RT) lock(F) -> boosts T1 and blocks (new top waiter)
XX timeout/ -> wakes T2
signal
T1 50 unlock(F) -> wakes T3 (rtmutex->owner == NULL, waiter bit is set)
T2 120 cleanup -> try_to_take_mutex() fails because T3 is the top waiter
and the lower priority T2 cannot steal the lock.
-> fixup_pi_state_owner() sees newowner == NULL -> BUG_ON()
The comment states that this is invalid and rt_mutex_real_owner() must
return a non NULL owner when the trylock failed, but in case of a queued
and woken up waiter rt_mutex_real_owner() == NULL is a valid transient
state. The higher priority waiter has simply not yet managed to take over
the rtmutex.
The BUG_ON() is therefore wrong and this is just another retry condition in
fixup_pi_state_owner().
Drop the locks, so that T3 can make progress, and then try the fixup again.
Gratian provided a great analysis, traces and a reproducer. The analysis is
to the point, but it confused the hell out of that tglx dude who had to
page in all the futex horrors again. Condensed version is above.
[ tglx: Wrote comment and changelog ]
Fixes: c1e2f0eaf0 ("futex: Avoid violating the 10th rule of futex")
Reported-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87a6w6x7bb.fsf@ni.com
Link: https://lore.kernel.org/r/87sg9pkvf7.fsf@nanos.tec.linutronix.de
Conflicts:
include/asm-generic/atomic-instrumented.h
kernel/kprobes.c
Use the upstream atomic-instrumented.h checksum, and pick
the kprobes version of kernel/kprobes.c, which effectively
reverts this upstream workaround:
645f224e7ba2: ("kprobes: Tell lockdep about kprobe nesting")
Since the new code *should* be fine without nesting.
Knock on wood ...
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As shown through runtime testing, the "filename" allocation is not
always freed in perf_event_parse_addr_filter().
There are three possible ways that this could happen:
- It could be allocated twice on subsequent iterations through the loop,
- or leaked on the success path,
- or on the failure path.
Clean up the code flow to make it obvious that 'filename' is always
freed in the reallocation path and in the two return paths as well.
We rely on the fact that kfree(NULL) is NOP and filename is initialized
with NULL.
This fixes the leak. No other side effects expected.
[ Dan Carpenter: cleaned up the code flow & added a changelog. ]
[ Ingo Molnar: updated the changelog some more. ]
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Signed-off-by: "kiyin(尹亮)" <kiyin@tencent.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "Srivatsa S. Bhat" <srivatsa@csail.mit.edu>
Cc: Anthony Liguori <aliguori@amazon.com>
--
kernel/events/core.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
Introduce irq_domain_create_legacy() API which is functional equivalent
to the existing irq_domain_add_legacy(), but takes a pointer to the struct
fwnode_handle as a parameter.
This is useful for non OF systems.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20201030165919.86234-5-andriy.shevchenko@linux.intel.com
of_node_to_fwnode() should be used for conversion. Replace the open coded
variant of it in of_phandle_args_to_fwspec().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20201030165919.86234-4-andriy.shevchenko@linux.intel.com
Alexei Starovoitov says:
====================
pull-request: bpf 2020-11-06
1) Pre-allocated per-cpu hashmap needs to zero-fill reused element, from David.
2) Tighten bpf_lsm function check, from KP.
3) Fix bpftool attaching to flow dissector, from Lorenz.
4) Use -fno-gcse for the whole kernel/bpf/core.c instead of function attribute, from Ard.
* git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Update verification logic for LSM programs
bpf: Zero-fill re-used per-cpu map element
bpf: BPF_PRELOAD depends on BPF_SYSCALL
tools/bpftool: Fix attaching flow dissector
libbpf: Fix possible use after free in xsk_socket__delete
libbpf: Fix null dereference in xsk_socket__delete
libbpf, hashmap: Fix undefined behavior in hash_bits
bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE
tools, bpftool: Remove two unused variables.
tools, bpftool: Avoid array index warnings.
xsk: Fix possible memory leak at socket close
bpf: Add struct bpf_redir_neigh forward declaration to BPF helper defs
samples/bpf: Set rlimit for memlock to infinity in all samples
bpf: Fix -Wshadow warnings
selftest/bpf: Fix profiler test using CO-RE relocation for enums
====================
Link: https://lore.kernel.org/r/20201106221759.24143-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The watchpoint encoding masks for size and address were off-by-one bit
each, with the size mask using 1 unnecessary bit and the address mask
missing 1 bit. However, due to the way the size is shifted into the
encoded watchpoint, we were effectively wasting and never using the
extra bit.
For example, on x86 with PAGE_SIZE==4K, we have 1 bit for the is-write
bit, 14 bits for the size bits, and then 49 bits left for the address.
Prior to this fix we would end up with this usage:
[ write<1> | size<14> | wasted<1> | address<48> ]
Fix it by subtracting 1 bit from the GENMASK() end and start ranges of
size and address respectively. The added static_assert()s verify that
the masks are as expected. With the fixed version, we get the expected
usage:
[ write<1> | size<14> | address<49> ]
Functionally no change is expected, since that extra address bit is
insignificant for enabled architectures.
Acked-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the units of ->init_fract are milliseconds while those of
->gp_sleep are jiffies. For consistency with each other and with the
argument of schedule_timeout_idle(), this commit changes the units of
->init_fract to jiffies.
This change does affect the backoff algorithm, but only on systems where
HZ is not 1000, and even there the change makes more sense, given that the
current setup would "back off" to the same number of jiffies repeatedly.
In contrast, with this change, the number of jiffies waited increases
on each pass through the loop in the rcu_tasks_wait_gp() function.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If cur_ops->sync is NULL, rcu_torture_fwd_prog_nr() will nevertheless
attempt to call through it. This commit therefore flags cases where
neither need_resched() nor call_rcu() forward-progress testing
can be performed due to NULL function pointers, and also causes
rcu_torture_fwd_prog_nr() to take an early exit if cur_ops->sync()
is NULL.
Reported-by: Tom Rix <trix@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When executing the LOCK06 locktorture scenario featuring percpu-rwsem,
the RCU callback rcu_sync_func() may still be pending after locktorture
module is removed. This can in turn lead to the following Oops:
BUG: unable to handle page fault for address: ffffffffc00eb920
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 6500a067 P4D 6500a067 PUD 6500c067 PMD 13a36c067 PTE 800000013691c163
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:rcu_cblist_dequeue+0x12/0x30
Call Trace:
<IRQ>
rcu_core+0x1b1/0x860
__do_softirq+0xfe/0x326
asm_call_on_stack+0x12/0x20
</IRQ>
do_softirq_own_stack+0x5f/0x80
irq_exit_rcu+0xaf/0xc0
sysvec_apic_timer_interrupt+0x2e/0xb0
asm_sysvec_apic_timer_interrupt+0x12/0x20
This commit avoids tis problem by adding an exit hook in lock_torture_ops
and using it to call percpu_free_rwsem() for percpu rwsem torture during
the module-cleanup function, thus ensuring that rcu_sync_func() completes
before module exits.
It is also necessary to call the exit hook if lock_torture_init()
fails half-way, so this commit also adds an ->init_called field in
lock_torture_cxt to indicate that exit hook, if present, must be called.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In virtual environments on systems with hardware assist, inter-processor
interrupts must do very different things based on whether the target
vCPU is running or not. This commit therefore enables torture-test
stuttering to better test these running/not-running transitions.
Suggested-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_torture_cleanup() function fails to NULL out the reader_tasks
pointer after freeing it and its fakewriter_tasks loop has redundant
braces. This commit therefore cleans these up.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, stutter_wait() will happily spin waiting for the stutter
interval to end even if the caller is running at a real-time priority
level. This could starve normal-priority tasks for no good reason. This
commit therefore drops the calling task's priority to SCHED_OTHER MAX_NICE
if stutter_wait() needs to wait. But when it waits, stutter_wait()
returns true, which allows the caller to restore the priority if needed.
Callers that were already running at SCHED_OTHER MAX_NICE obviously
do not need any changes, but this commit also restores priority for
higher-priority callers.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If an rcutorture torture-test run is given a bad kvm.sh argument, the
test will complain to the console, which is good. What is bad is that
from the user's perspective, it will just hang for the time specified
by the --duration argument. This commit therefore forces an immediate
kernel shutdown if a rcu_torture_init()-time error occurs, thus avoiding
the appearance of a hang. It also forces a console splat in this case
to clearly indicate the presence of an error.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If an locktorture torture-test run is given a bad kvm.sh argument, the
test will complain to the console, which is good. What is bad is that
from the user's perspective, it will just hang for the time specified
by the --duration argument. This commit therefore forces an immediate
kernel shutdown if a lock_torture_init()-time error occurs, thus avoiding
the appearance of a hang. It also forces a console splat in this case
to clearly indicate the presence of an error.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Exclusive locks do not have readlock support, which means that a
locktorture run with the following module parameters will do nothing:
torture_type=mutex_lock nwriters_stress=0 nreaders_stress=1
This commit therefore rejects this combination for exclusive locks by
returning -EINVAL during module init.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If an refscale torture-test run is given a bad kvm.sh argument, the
test will complain to the console, which is good. What is bad is that
from the user's perspective, it will just hang for the time specified
by the --duration argument. This commit therefore forces an immediate
kernel shutdown if a ref_scale_init()-time error occurs, thus avoiding
the appearance of a hang. It also forces a console splat in this case
to clearly indicate the presence of an error.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If an rcuscale torture-test run is given a bad kvm.sh argument, the
test will complain to the console, which is good. What is bad is that
from the user's perspective, it will just hang for the time specified
by the --duration argument. This commit therefore forces an immediate
kernel shutdown if a rcu_scale_init()-time error occurs, thus avoiding
the appearance of a hang. It also forces a console splat in this case
to clearly indicate the presence of an error.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds the ability to test performance and scalability of RCU
Tasks Trace updaters.
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The scftorture tests currently use only smp_call_function() and
friends, which means that these tests cannot locate bugs caused by
interactions between different IPI vectors. This commit therefore adds
the rescheduling IPI to the mix.
Note that this commit permits resched_cpus() only when scftorture is
built in. This is a workaround. Longer term, this will use real wakeups
rather than resched_cpu().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The torture_stutter() function uses schedule_timeout_interruptible()
to time the stutter duration, but this can miss race conditions due to
its being time-synchronized with everything else that is based on the
timer wheels. This commit therefore converts torture_stutter() to use
the high-resolution timers via schedule_hrtimeout(), and also to fuzz
the stutter interval. While in the area, this commit also limits the
spin-loop portion of the stutter_wait() function's wait loop to two
jiffies, down from about one second.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Running locktorture scenario LOCK05 results in hangs:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock --duration 3 --configs LOCK05
The lock_torture_writer() kthreads set themselves to MAX_NICE while
running SCHED_OTHER. Other locktorture kthreads run at default niceness,
also SCHED_OTHER. This results in these other locktorture kthreads
indefinitely preempting the lock_torture_writer() kthreads. Note that
the cond_resched() in the stutter_wait() function's loop is ineffective
because this scenario is built with CONFIG_PREEMPT=y.
It is not clear that such indefinite preemption is supposed to happen, but
in the meantime this commit prevents kthreads running in stutter_wait()
from being completely CPU-bound, thus allowing the other threads to get
some CPU in a timely fashion. This commit also uses hrtimers to provide
very short sleeps to avoid degrading the sudden-on testing that stutter
is supposed to provide.
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a last_lock_release variable that tracks the time of
the last ->writeunlock() call, which allows easier diagnosing of lock
hangs when using a kernel debugger.
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, accessing /proc/cpuinfo sends IPIs to idle CPUs in order to
learn their clock frequency. Which is a bit strange, given that waking
them from idle likely significantly changes their clock frequency.
This commit therefore avoids sending /proc/cpuinfo-induced IPIs to
idle CPUs.
[ paulmck: Also check for idle in arch_freq_prepare_all(). ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
The current logic checks if the name of the BTF type passed in
attach_btf_id starts with "bpf_lsm_", this is not sufficient as it also
allows attachment to non-LSM hooks like the very function that performs
this check, i.e. bpf_lsm_verify_prog.
In order to ensure that this verification logic allows attachment to
only LSM hooks, the LSM_HOOK definitions in lsm_hook_defs.h are used to
generate a BTF_ID set. Upon verification, the attach_btf_id of the
program being attached is checked for presence in this set.
Fixes: 9e4e01dfd3 ("bpf: lsm: Implement attach, detach and execution")
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201105230651.2621917-1-kpsingh@chromium.org
The currently available bpf_get_current_task returns an unsigned integer
which can be used along with BPF_CORE_READ to read data from
the task_struct but still cannot be used as an input argument to a
helper that accepts an ARG_PTR_TO_BTF_ID of type task_struct.
In order to implement this helper a new return type, RET_PTR_TO_BTF_ID,
is added. This is similar to RET_PTR_TO_BTF_ID_OR_NULL but does not
require checking the nullness of returned pointer.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201106103747.2780972-6-kpsingh@chromium.org
Similar to bpf_local_storage for sockets and inodes add local storage
for task_struct.
The life-cycle of storage is managed with the life-cycle of the
task_struct. i.e. the storage is destroyed along with the owning task
with a callback to the bpf_task_storage_free from the task_free LSM
hook.
The BPF LSM allocates an __rcu pointer to the bpf_local_storage in
the security blob which are now stackable and can co-exist with other
LSMs.
The userspace map operations can be done by using a pid fd as a key
passed to the lookup, update and delete operations.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201106103747.2780972-3-kpsingh@chromium.org
Usage of spin locks was not allowed for tracing programs due to
insufficient preemption checks. The verifier does not currently prevent
LSM programs from using spin locks, but the helpers are not exposed
via bpf_lsm_func_proto.
Based on the discussion in [1], non-sleepable LSM programs should be
able to use bpf_spin_{lock, unlock}.
Sleepable LSM programs can be preempted which means that allowng spin
locks will need more work (disabling preemption and the verifier
ensuring that no sleepable helpers are called when a spin lock is held).
[1]: https://lore.kernel.org/bpf/20201103153132.2717326-1-kpsingh@chromium.org/T/#md601a053229287659071600d3483523f752cd2fb
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201106103747.2780972-2-kpsingh@chromium.org
This adds CONFIG_FTRACE_RECORD_RECURSION that will record to a file
"recursed_functions" all the functions that caused recursion while a
callback to the function tracer was running.
Link: https://lkml.kernel.org/r/20201106023548.102375687@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: live-patching@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that all callbacks are recursion safe, reverse the meaning of the
RECURSION flag and rename it from RECURSION_SAFE to simply RECURSION.
Now only callbacks that request to have recursion protecting it will
have the added trampoline to do so.
Also remove the outdated comment about "PER_CPU" when determining to
use the ftrace_ops_assist_func.
Link: https://lkml.kernel.org/r/20201028115613.742454631@goodmis.org
Link: https://lkml.kernel.org/r/20201106023547.904270143@goodmis.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If a ftrace callback requires "rcu_is_watching", then it adds the
FTRACE_OPS_FL_RCU flag and it will not be called if RCU is not "watching".
But this means that it will use a trampoline when called, and this slows
down the function tracing a tad. By checking rcu_is_watching() from within
the callback, it no longer needs the RCU flag set in the ftrace_ops and it
can be safely called directly.
Link: https://lkml.kernel.org/r/20201028115613.591878956@goodmis.org
Link: https://lkml.kernel.org/r/20201106023547.711035826@goodmis.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If a ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback instead of
just calling the callback directly.
The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless its ftrace_ops states
otherwise.
Link: https://lkml.kernel.org/r/20201028115613.444477858@goodmis.org
Link: https://lkml.kernel.org/r/20201106023547.466892083@goodmis.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If for some reason a function is called that triggers the recursion
detection of live patching, trigger a warning. By not executing the live
patch code, it is possible that the old unpatched function will be called
placing the system into an unknown state.
Link: https://lore.kernel.org/r/20201029145709.GD16774@alley
Link: https://lkml.kernel.org/r/20201106023547.312639435@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: live-patching@vger.kernel.org
Suggested-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If a ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback instead of
just calling the callback directly.
The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless its ftrace_ops states
otherwise.
Link: https://lkml.kernel.org/r/20201028115613.291169246@goodmis.org
Link: https://lkml.kernel.org/r/20201106023547.122802424@goodmis.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: live-patching@vger.kernel.org
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, if a callback is registered to a ftrace function and its
ftrace_ops does not have the RECURSION flag set, it is encapsulated in a
helper function that does the recursion for it.
Really, all the callbacks should have their own recursion protection for
performance reasons. But they should not all implement their own. Move the
recursion helpers to global headers, so that all callbacks can use them.
Link: https://lkml.kernel.org/r/20201028115612.460535535@goodmis.org
Link: https://lkml.kernel.org/r/20201106023546.166456258@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
make clang-analyzer on x86_64 defconfig caught my attention with:
kernel/printk/printk_ringbuffer.c:885:3: warning:
Value stored to 'desc' is never read [clang-analyzer-deadcode.DeadStores]
desc = to_desc(desc_ring, head_id);
^
Commit b6cf8b3f33 ("printk: add lockless ringbuffer") introduced
desc_reserve() with this unneeded dead-store assignment.
As discussed with John Ogness privately, this is probably just some minor
left-over from previous iterations of the ringbuffer implementation. So,
simply remove this unneeded dead assignment to make clang-analyzer happy.
As compilers will detect this unneeded assignment and optimize this anyway,
the resulting object code is identical before and after this change.
No functional change. No change to object code.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20201106034005.18822-1-lukas.bulwahn@gmail.com
Currently key_size of hashtab is limited to MAX_BPF_STACK.
As the key of hashtab can also be a value from a per cpu map it can be
larger than MAX_BPF_STACK.
The use-case for this patch originates to implement allow/disallow
lists for files and file paths. The maximum length of file paths is
defined by PATH_MAX with 4096 chars including nul.
This limit exceeds MAX_BPF_STACK.
Changelog:
v5:
- Fix cast overflow
v4:
- Utilize BPF skeleton in tests
- Rebase
v3:
- Rebase
v2:
- Add a test for bpf side
Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201029201442.596690-1-dev@der-flo.net
Zero-fill element values for all other cpus than current, just as
when not using prealloc. This is the only way the bpf program can
ensure known initial values for all cpus ('onallcpus' cannot be
set when coming from the bpf program).
The scenario is: bpf program inserts some elements in a per-cpu
map, then deletes some (or userspace does). When later adding
new elements using bpf_map_update_elem(), the bpf program can
only set the value of the new elements for the current cpu.
When prealloc is enabled, previously deleted elements are re-used.
Without the fix, values for other cpus remain whatever they were
when the re-used entry was previously freed.
A selftest is added to validate correct operation in above
scenario as well as in case of LRU per-cpu map element re-use.
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Signed-off-by: David Verbeiren <david.verbeiren@tessares.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201104112332.15191-1-david.verbeiren@tessares.net
Fix build error when BPF_SYSCALL is not set/enabled but BPF_PRELOAD is
by making BPF_PRELOAD depend on BPF_SYSCALL.
ERROR: modpost: "bpf_preload_ops" [kernel/bpf/preload/bpf_preload.ko] undefined!
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201105195109.26232-1-rdunlap@infradead.org
- Fix off-by-one error in retrieving the context buffer for trace_printk()
- Fix off-by-one error in stack nesting limit
- Fix recursion to not make all NMI code false positive as recursing
- Stop losing events in function tracing when transitioning between irq context
- Stop losing events in ring buffer when transitioning between irq context
- Fix return code of error pointer in parse_synth_field() to prevent
NULL pointer dereference.
- Fix false positive of NMI recursion in kprobe event handling
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX6QqmRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtWiAQCKf3xblmQDcge2BLQPy2H7ih5VCSSb
vDPrQKgLa35pSgEAhGj7mgAKxb6kDbae1BqPorfa/qn9Bi13XfRNTxBK2Qw=
=zKfd
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix off-by-one error in retrieving the context buffer for
trace_printk()
- Fix off-by-one error in stack nesting limit
- Fix recursion to not make all NMI code false positive as recursing
- Stop losing events in function tracing when transitioning between irq
context
- Stop losing events in ring buffer when transitioning between irq
context
- Fix return code of error pointer in parse_synth_field() to prevent
NULL pointer dereference.
- Fix false positive of NMI recursion in kprobe event handling
* tag 'trace-v5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
kprobes: Tell lockdep about kprobe nesting
tracing: Make -ENOMEM the default error for parse_synth_field()
ring-buffer: Fix recursion protection transitions between interrupt context
tracing: Fix the checking of stackidx in __ftrace_trace_stack
ftrace: Handle tracing when switching between context
ftrace: Fix recursion check for NMI test
tracing: Fix out of bounds write in get_trace_buf
- Unify the handling of managed and stateless device links in the
runtime PM framework and prevent runtime PM references to devices
from being leaked after device link removal (Rafael Wysocki).
- Fix two mistakes in the cpuidle documentation (Julia Lawall).
- Prevent the schedutil cpufreq governor from missing policy
limits updates in some cases (Viresh Kumar).
- Prevent static OPPs from being dropped by mistake (Viresh Kumar).
- Prevent helper function in the OPP framework from returning
prematurely (Viresh Kumar).
- Prevent opp_table_lock from being held too long during removal
of OPP tables with no more active references (Viresh Kumar).
- Drop redundant semicolon from the Intel RAPL power capping
driver (Tom Rix).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl+kAnUSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxND0P/j5e/wqFZ/34XEaUa4PIs27TDoJHq1Jh
PfUxqSBEVO6Xm8TMCtOIuS8gw4x7ehenXC8gFdivy+Fu+4QNkPKYggF5/LgWO8Gl
cMNzYwJ8shIBQWQZftIx01Bn5tAvk1YhV1mnSNf580Iy7FhKqojMnrvzQnpGD4jR
piBkvBcbIJWDk+T96RzbqnqmMD0euvlYfzg1KnsyqsOpoRl7kZoH4ahWYskdIxDY
NtA4SqQ8dvxjTwI8+a+JBb//ua9jjjDyjd7FUV87HMdcVh9rZKbmHKkk4Yv/3C4C
jZuCTV6zaWIHCbbkZwi+OTNE8q21GgWm1xqwnGWVTFSGrs8ZyfjLs73/g9S5zI3N
C/n4YUJxiYDVcnrAVwR7kSTEouiQ6mTuChOt7T3r1Ilx67rw4TDsI59Fk1Xn07bZ
1wfaPASUPsFAxF7vMhkdzsidhpR3BYNgMwmAWo9R4Yvw4evyRN4tywoUQJ1X27zg
ERir6mFVz6XCnXPrJhyx0bWLo+VD8J/arhfIPSTqHR7wn0tgc23aVeYy7wvfYiiu
QQJqQ58Aoa0Z7ZpeAv3xbSung+eqQ/dDC9FnXqxdZCI6brYUhrZ19OhsFnEyzGUR
IDRzcP1/72AxhidpjO71txOJSJFTqKF2LzL/wuS6kMbYwmFjYfj18xX5GvOUuXmL
hO+E+QdleQFo
=lKQA
-----END PGP SIGNATURE-----
Merge tag 'pm-5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix the device links support in runtime PM, correct mistakes in
the cpuidle documentation, fix the handling of policy limits changes
in the schedutil cpufreq governor, fix assorted issues in the OPP
(operating performance points) framework and make one janitorial
change.
Specifics:
- Unify the handling of managed and stateless device links in the
runtime PM framework and prevent runtime PM references to devices
from being leaked after device link removal (Rafael Wysocki).
- Fix two mistakes in the cpuidle documentation (Julia Lawall).
- Prevent the schedutil cpufreq governor from missing policy limits
updates in some cases (Viresh Kumar).
- Prevent static OPPs from being dropped by mistake (Viresh Kumar).
- Prevent helper function in the OPP framework from returning
prematurely (Viresh Kumar).
- Prevent opp_table_lock from being held too long during removal of
OPP tables with no more active references (Viresh Kumar).
- Drop redundant semicolon from the Intel RAPL power capping driver
(Tom Rix)"
* tag 'pm-5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: runtime: Resume the device earlier in __device_release_driver()
PM: runtime: Drop pm_runtime_clean_up_links()
PM: runtime: Drop runtime PM references to supplier on link removal
powercap/intel_rapl: remove unneeded semicolon
Documentation: PM: cpuidle: correct path name
Documentation: PM: cpuidle: correct typo
cpufreq: schedutil: Don't skip freq update if need_freq_update is set
opp: Reduce the size of critical section in _opp_table_kref_release()
opp: Fix early exit from dev_pm_opp_register_set_opp_helper()
opp: Don't always remove static OPPs in _of_add_opp_table_v1()
Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary, irqentry_state_t can carry the necessary information as well.
Move it to common code and extend irqentry_state_t to carry lockdep state.
[ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
between the IRQ and NMI exceptions, and add kernel documentation for
struct irqentry_state_t ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201102205320.1458656-7-ira.weiny@intel.com
When an exception/interrupt hits kernel space and the kernel is not
currently in the idle task then RCU must be watching.
irqentry_enter() validates this via rcu_irq_enter_check_tick(), which in
turn invokes lockdep when taking a lock. But at that point lockdep does not
yet know about the fact that interrupts have been disabled by the CPU,
which triggers a lockdep splat complaining about inconsistent state.
Invoking trace_hardirqs_off() before rcu_irq_enter_check_tick() defeats the
point of rcu_irq_enter_check_tick() because trace_hardirqs_off() uses RCU.
So use the same sequence as for the idle case and tell lockdep about the
irq state change first, invoke the RCU check and then do the lockdep and
tracer update.
Fixes: a5497bab5f ("entry: Provide generic interrupt entry/exit code")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87y2jhl19s.fsf@nanos.tec.linutronix.de
Since the kprobe handlers have protection that prohibits other handlers from
executing in other contexts (like if an NMI comes in while processing a
kprobe, and executes the same kprobe, it will get fail with a "busy"
return). Lockdep is unaware of this protection. Use lockdep's nesting api to
differentiate between locks taken in INT3 context and other context to
suppress the false warnings.
Link: https://lore.kernel.org/r/20201102160234.fa0ae70915ad9e2b21c08b85@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Let's handle the successful call of mod_verify_sig() right after that call,
making the *switch* statement only handle the real errors, and then move
the comment from the first *case* before *switch* itself and the comment
before *default* after it. Fix the comment style, add article/comma/dash,
spell out "nomem" as "lack of memory" in these comments, while at it...
Suggested-by: Joe Perches <joe@perches.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Let's move the common handling of the non-fatal errors after the *switch*
statement -- this avoids *goto*s inside that *switch*...
Suggested-by: Joe Perches <joe@perches.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
The 'reason' variable in module_sig_check() points to 3 strings across
the *switch* statement, all needlessly starting with the same text.
Let's put the starting text into the pr_notice() call -- it saves 21
bytes of the object code (x86 gcc 10.2.1).
Suggested-by: Joe Perches <joe@perches.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
kcov_common_handle is a method that is used to obtain a "default" KCOV
remote handle of the current process. The handle can later be passed
to kcov_remote_start in order to collect coverage for the processing
that is initiated by one process, but done in another. For details see
Documentation/dev-tools/kcov.rst and comments in kernel/kcov.c.
Presently, if kcov_common_handle is called in an IRQ context, it will
return a handle for the interrupted process. This may lead to
unreliable and incorrect coverage collection.
Adjust the behavior of kcov_common_handle in the following way. If it
is called in a task context, return the common handle for the
currently running task. Otherwise, return 0.
Signed-off-by: Aleksandr Nogikh <nogikh@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The default value for refscale.nreaders is -1, which results in the code
setting the value to three-quarters of the number of CPUs. On single-CPU
systems, this results in three-quarters of the value one, which the C
language's integer arithmetic rounds to zero. This in turn results in
a divide-by-zero error.
This commit therefore adds bounds checking to the refscale module
parameters, so that if they are less than one, they are set to the
value one.
Reported-by: kernel test robot <lkp@intel.com>
Tested-by "Chen, Rong A" <rong.a.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
At the end of the test and after rcu_torture_writer() stalls, rcutorture
invokes show_rcu_gp_kthreads() in order to dump out information on the
RCU grace-period kthread. This makes a lot of sense when testing vanilla
RCU, but not so much for the other flavors. This commit therefore allows
per-flavor kthread-dump functions to be specified.
[ paulmck: Apply feedback from kernel test robot <lkp@intel.com>. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The infinite for-loop in rcu_tasks_wait_gp() has its only exit at the
top of the loop, so this commit does the straightforward conversion to
a while-loop, thus saving a few lines.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The lockdep_is_held() macro is defined as:
#define lockdep_is_held(lock) lock_is_held(&(lock)->dep_map)
This hides away the dereference, so that builds with !LOCKDEP don't break.
This works in current kernels because the RCU_LOCKDEP_WARN() eliminates
its condition at preprocessor time in !LOCKDEP kernels. However, later
patches in this series will cause the compiler to see this condition even
in !LOCKDEP kernels. This commit prepares for this upcoming change by
switching from lock_is_held() to lockdep_is_held().
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
CC: jiangshanlai@gmail.com
CC: paulmck@kernel.org
CC: josh@joshtriplett.org
CC: rostedt@goodmis.org
CC: mathieu.desnoyers@efficios.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Avoid setting up watchpoints on NULL pointers, as otherwise we would
crash inside the KCSAN runtime (when checking for value changes) instead
of the instrumented code.
Because that may be confusing, skip any address less than PAGE_SIZE.
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In preparation of supporting only addresses not within the NULL page,
change the selftest to never use addresses that are less than PAGE_SIZE.
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
parse_synth_field() returns a pointer and requires that errors get
surrounded by ERR_PTR(). The ret variable is initialized to zero, but should
never be used as zero, and if it is, it could cause a false return code and
produce a NULL pointer dereference. It makes no sense to set ret to zero.
Set ret to -ENOMEM (the most common error case), and have any other errors
set it to something else. This removes the need to initialize ret on *every*
error branch.
Fixes: 761a8c58db ("tracing, synthetic events: Replace buggy strcat() with seq_buf operations")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The recursion protection of the ring buffer depends on preempt_count() to be
correct. But it is possible that the ring buffer gets called after an
interrupt comes in but before it updates the preempt_count(). This will
trigger a false positive in the recursion code.
Use the same trick from the ftrace function callback recursion code which
uses a "transition" bit that gets set, to allow for a single recursion for
to handle transitions between contexts.
Cc: stable@vger.kernel.org
Fixes: 567cd4da54 ("ring-buffer: User context bit recursion checking")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
removed various __user annotations from function signatures as part of
its refactoring.
It also removed the __user annotation for proc_dohung_task_timeout_secs()
at its declaration in sched/sysctl.h, but not at its definition in
kernel/hung_task.c.
Hence, sparse complains:
kernel/hung_task.c:271:5: error: symbol 'proc_dohung_task_timeout_secs' redeclared with different type (incompatible argument 3 (different address spaces))
Adjust the annotation at the definition fitting to that refactoring to make
sparse happy again, which also resolves this warning from sparse:
kernel/hung_task.c:277:52: warning: incorrect type in argument 3 (different address spaces)
kernel/hung_task.c:277:52: expected void *
kernel/hung_task.c:277:52: got void [noderef] __user *buffer
No functional change. No change in object code.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ignatov <rdna@fb.com>
Link: https://lkml.kernel.org/r/20201028130541.20320-1-lukas.bulwahn@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a small race window when a delayed work is being canceled and
the work still might be queued from the timer_fn:
CPU0 CPU1
kthread_cancel_delayed_work_sync()
__kthread_cancel_work_sync()
__kthread_cancel_work()
work->canceling++;
kthread_delayed_work_timer_fn()
kthread_insert_work();
BUG: kthread_insert_work() should not get called when work->canceling is
set.
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201014083030.16895-1-qiang.zhang@windriver.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This testcase
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <pthread.h>
#include <assert.h>
void *tf(void *arg)
{
return NULL;
}
int main(void)
{
int pid = fork();
if (!pid) {
kill(getpid(), SIGSTOP);
pthread_t th;
pthread_create(&th, NULL, tf, NULL);
return 0;
}
waitpid(pid, NULL, WSTOPPED);
ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE);
waitpid(pid, NULL, 0);
ptrace(PTRACE_CONT, pid, 0,0);
waitpid(pid, NULL, 0);
int status;
int thread = waitpid(-1, &status, 0);
assert(thread > 0 && thread != pid);
assert(status == 0x80137f);
return 0;
}
fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap().
This is because task_join_group_stop() has 2 problems when current is traced:
1. We can't rely on the "JOBCTL_STOP_PENDING" check, a stopped tracee
can be woken up by debugger and it can clone another thread which
should join the group-stop.
We need to check group_stop_count || SIGNAL_STOP_STOPPED.
2. If SIGNAL_STOP_STOPPED is already set, we should not increment
sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread
should stop without another do_notify_parent_cldstop() report.
To clarify, the problem is very old and we should blame
ptrace_init_task(). But now that we have task_join_group_stop() it makes
more sense to fix this helper to avoid the code duplication.
Reported-by: syzbot+3485e3773f7da290eecc@syzkaller.appspotmail.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <christian@brauner.io>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201019134237.GA18810@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cpufreq policy's frequency limits (min/max) can get changed at any
point of time, while schedutil is trying to update the next frequency.
Though the schedutil governor has necessary locking and support in place
to make sure we don't miss any of those updates, there is a corner case
where the governor will find that the CPU is already running at the
desired frequency and so may skip an update.
For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
and is running at 1 GHz currently. Schedutil tries to update the
frequency to 1.2 GHz, during this time the policy limits get changed as
policy->min = 1.4 GHz. As schedutil (and cpufreq core) does clamp the
frequency at various instances, we will eventually set the frequency to
1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq.
Now lets say the policy limits get changed back at this time with
policy->min as 1 GHz. The next time schedutil is invoked by the
scheduler, we will reevaluate the next frequency (because
need_freq_update will get set due to limits change event) and lets say
we want to set the frequency to 1.2 GHz again. At this point
sugov_update_next_freq() will find the next_freq == current_freq and
will abort the update, while the CPU actually runs at 1.4 GHz.
Until now need_freq_update was used as a flag to indicate that the
policy's frequency limits have changed, and that we should consider the
new limits while reevaluating the next frequency.
This patch fixes the above mentioned issue by extending the purpose of
the need_freq_update flag. If this flag is set now, the schedutil
governor will not try to abort a frequency change even if next_freq ==
current_freq.
As similar behavior is required in the case of
CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
set to false if that flag is set for the driver.
We also don't need to consider the need_freq_update flag in
sugov_update_single() anymore to handle the special case of busy CPU, as
we won't abort a frequency update anymore.
Reported-by: zhuguangqing <zhuguangqing@xiaomi.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rearrange code to avoid a branch ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The array size is FTRACE_KSTACK_NESTING, so the index FTRACE_KSTACK_NESTING
is illegal too. And fix two typos by the way.
Link: https://lkml.kernel.org/r/20201031085714.2147-1-hqjagain@gmail.com
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The tbl_dma_addr argument is used to check the DMA boundary for the
allocations, and thus needs to be a dma_addr_t. swiotlb-xen instead
passed a physical address, which could lead to incorrect results for
strange offsets. Fix this by removing the parameter entirely and hard
code the DMA address for io_tlb_start instead.
Fixes: 91ffe4ad53 ("swiotlb-xen: introduce phys_to_dma/dma_to_phys translations")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
kernel/dma/swiotlb.c:swiotlb_init gets called first and tries to
allocate a buffer for the swiotlb. It does so by calling
memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);
If the allocation must fail, no_iotlb_memory is set.
Later during initialization swiotlb-xen comes in
(drivers/xen/swiotlb-xen.c:xen_swiotlb_init) and given that io_tlb_start
is != 0, it thinks the memory is ready to use when actually it is not.
When the swiotlb is actually needed, swiotlb_tbl_map_single gets called
and since no_iotlb_memory is set the kernel panics.
Instead, if swiotlb-xen.c:xen_swiotlb_init knew the swiotlb hadn't been
initialized, it would do the initialization itself, which might still
succeed.
Fix the panic by setting io_tlb_start to 0 on swiotlb initialization
failure, and also by setting no_iotlb_memory to false on swiotlb
initialization success.
Fixes: ac2cbab21f ("x86: Don't panic if can not alloc buffer for swiotlb")
Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When an interrupt or NMI comes in and switches the context, there's a delay
from when the preempt_count() shows the update. As the preempt_count() is
used to detect recursion having each context have its own bit get set when
tracing starts, and if that bit is already set, it is considered a recursion
and the function exits. But if this happens in that section where context
has changed but preempt_count() has not been updated, this will be
incorrectly flagged as a recursion.
To handle this case, create another bit call TRANSITION and test it if the
current context bit is already set. Flag the call as a recursion if the
TRANSITION bit is already set, and if not, set it and continue. The
TRANSITION bit will be cleared normally on the return of the function that
set it, or if the current context bit is clear, set it and clear the
TRANSITION bit to allow for another transition between the current context
and an even higher one.
Cc: stable@vger.kernel.org
Fixes: edc15cafcb ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The code that checks recursion will work to only do the recursion check once
if there's nested checks. The top one will do the check, the other nested
checks will see recursion was already checked and return zero for its "bit".
On the return side, nothing will be done if the "bit" is zero.
The problem is that zero is returned for the "good" bit when in NMI context.
This will set the bit for NMIs making it look like *all* NMI tracing is
recursing, and prevent tracing of anything in NMI context!
The simple fix is to return "bit + 1" and subtract that bit on the end to
get the real bit.
Cc: stable@vger.kernel.org
Fixes: edc15cafcb ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The nesting count of trace_printk allows for 4 levels of nesting. The
nesting counter starts at zero and is incremented before being used to
retrieve the current context's buffer. But the index to the buffer uses the
nesting counter after it was incremented, and not its original number,
which in needs to do.
Link: https://lkml.kernel.org/r/20201029161905.4269-1-hqjagain@gmail.com
Cc: stable@vger.kernel.org
Fixes: 3d9622c12c ("tracing: Add barrier to trace_printk() buffer nesting modification")
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Prevent undefined behaviour in the timespec64_to_ns() conversion which
is used for converting user supplied time input to nanoseconds. It
lacked overflow protection.
- Mark sched_clock_read_begin/retry() to prevent recursion in the tracer
- Remove unused debug functions in the hrtimer and timerlist code
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+evaETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoT69D/0Z3ZGdNQu3mAPYdMSaW7R3MfXnLKno
biRCwre/T88Exv+Yu612iLthTlfNlxZbofAgUc5BvCwjAjtt+uXlmS8mwXNsbGFd
516hisUnjJkjbcHI6WxrQQ7+YsK2XLjOeV90TpLs9zMgI94kQ+uzEzqKc6ZaRQP4
Zi/emqgHeO/RoObNBWjNh9McrHqXllXU/LeqD7y82JXdLPik78PniKFkh7B1d6u8
RL7kW1TeblkNaHMVjO4/lM9zR1hkZr4GPYbEHIbVN8FMQ10BAD2iswUR72U2mqx7
zf2Jt+09UEzPRu2gOr4Lrvo6bJQKClu4ZUnRt4apU4BeTgyG0gi7pgy0VZ3M1xtq
KUnN8dNSMBwTurv9GuGTdiNpfFad2nsnSAra8r7GlZDHxNe1wWIPJmyu/FV1xohD
yUgRq2s7rvCb894oz159hyIwkI/ZNbFr/AJTCvdzowgf9nf4WS5/Y2/JVLF28LRT
4xPhvuVUfrhHJajoZayuVvxYSVEPZFek2SgUEIFi6b1+FiMFwV6guv/f31D5GQ+E
NHgvDMbDsVmzGt++c4zFHoqWvBrgC8Os8AEP8s2oczfvwMt2A8g7nIrGieAux+Jm
eLO1DfXaXpQi/MmTBf/WD1dKwjuoLV9dT0Fp2y2docu9SXm/bMgRhxX7fm3uDge7
W87XWkfIxd5fzA==
=Gdoq
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A few fixes for timers/timekeeping:
- Prevent undefined behaviour in the timespec64_to_ns() conversion
which is used for converting user supplied time input to
nanoseconds. It lacked overflow protection.
- Mark sched_clock_read_begin/retry() to prevent recursion in the
tracer
- Remove unused debug functions in the hrtimer and timerlist code"
* tag 'timers-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Prevent undefined behaviour in timespec64_to_ns()
timers: Remove unused inline funtion debug_timer_free()
hrtimer: Remove unused inline function debug_hrtimer_free()
time/sched_clock: Mark sched_clock_read_begin/retry() as notrace
caused by recursion when enabling or disabling a tracer on RISC-V (probably
all architectures which patch through stop machine).
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+evLQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQR6D/49PIepVEn63HmgsKdePUShUFzYb+v6
mPV0iwaDCdW6BKXVEuHkzGdz3dB5ytdIZP6tJLV5a44tYxPhYESO+O4FwAJnqBxD
JGWLUELxqeEESnjBHww26/YptxqRwcVir2G0WeRDIO0eV2JE424ptdRSisgRVC2s
Kcu41D9Ry5SLayEKN29GyRXQPrMwLCUTAzeziN4TwJeRB2exSUMg6iDFiGC2uZy6
IzSu9BXak5T61LL7q5urElkY8a+sGLx6N29QBtSNxazYeXCzCoEtX+1IQ2e9b3PO
4abFsEtYJ9Mbw4R+iKm0lIpUMKmr/GOPWBs/PDTqK0aYhpc8UnKXTCr6hWLTvtkE
Anb/08AqN7YtvgbBR+bEN4fPOP4jL5kQVjLbNlIKm5ktRjZ5Sfd9OKOc//c/Bs3e
Auqe2u+UW2REDsB3lNiVFY+I/1nQUVwKLwfOsBHJW1KzH6O+BhJrJgigUERTeyTf
B5eLsOIxKgP1vprA/qQgPc/VT25PFIMOtxr7KhH1qPBej9R/wGEOdeOGSdJoiNYM
aH3kCTUMU0NP8IUSyVQFt4v2IhZ5aF+3rQmyVXifGRl54QCXKgb5gzqcQ/L2SFn+
I4SkB35hAoEzJ9S4vtx5ZtWPXAO20V644qRMalYv5X739zMVZQ2AyHGB9rhhktmX
a8ZDIVFCh0ssmw==
=38RZ
-----END PGP SIGNATURE-----
Merge tag 'smp-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fix from Thomas Gleixner:
"A single fix for stop machine.
Mark functions no trace to prevent a crash caused by recursion when
enabling or disabling a tracer on RISC-V (probably all architectures
which patch through stop machine)"
* tag 'smp-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
stop_machine, rcu: Mark functions as notrace
- Fix incorrect failure injection handling on the fuxtex code
- Prevent a preemption warning in lockdep when tracking local_irq_enable()
and interrupts are already enabled
- Remove more raw_cpu_read() usage from lockdep which causes state
corruption on !X86 architectures.
- Make the nr_unused_locks accounting in lockdep correct again.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+evEUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYod93EADG90GmRBYQxn6y2eQUKE/9f5SMiMFJ
KdvuqNBqHOhYm3iUPbZJcb0P/JZi1NHP0fBFMishdESGi96tD96K7T04WD0gmjtm
ArWFroe8uzEYtY9atlEwM0Nrvq0w8ZBLv9x1adXzJ59vB8/8Uq+wzYioSWn9yMcv
ye3jfVyAlM7ouFHDQAA36s/nhvZfxms4C0t+6S3gjVTIp/6riGuYh5t7dbXUMlnu
nGLiIJFjU+ekurweVDGpqD/nAxYfqf3UxebWnrosf7iu6suwYwaPFZGZ/kxlbr5e
qWx0B1RuhjAoefVJlPTkHmuhd0SnH/Gm/tTNkQ3LidJhPTIhLJlb7zffwyZlc510
VdaUipfZ6bNqDD6/dK6fKJtdKSE4w/z3pT53954NUD5zw/jIcHlgnaQieh72DH+F
1EKqmsNrwHAxYfMndQxLGdIoBScUAFzHzDnzsY9KKS2cfhChljzLa2nDIfMsDfKQ
aROugzEbZPQEb1iWUEOF3XopcuZzZQCaPlLDLvAnsBeYEPm0gdmbKFPFsDjOyBVX
/Qc41O7DyHKcoiLX2zM2c7CxnV5J6YEZz3jQSZLFlpH9Ih7jwAl9/6VirggNUNvV
YVsgM/myhYQtJBqHHojNppZFFW3KdgfxWuY7+qt7Ox5w/ck5qYQwRnoB4FROwVHV
pzcYTBE5qkQnIw==
=S01o
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"A couple of locking fixes:
- Fix incorrect failure injection handling in the fuxtex code
- Prevent a preemption warning in lockdep when tracking
local_irq_enable() and interrupts are already enabled
- Remove more raw_cpu_read() usage from lockdep which causes state
corruption on !X86 architectures.
- Make the nr_unused_locks accounting in lockdep correct again"
* tag 'locking-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Fix nr_unused_locks accounting
locking/lockdep: Remove more raw_cpu_read() usage
futex: Fix incorrect should_fail_futex() handling
lockdep: Fix preemption WARN for spurious IRQ-enable
Hi Linus,
Please, pull the following patches that replace zero-length arrays with
flexible-array members.
Thanks
--
Gustavo
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAl+cjRUACgkQRwW0y0cG
2zGWAhAAjUfTsAmXWhKNaWFSCYR0Q822puTUWOKfiBd+jjGaO04luTtr2gjv2Dkb
Vgad8H4N8oZU79xfh5JZ5PUyScaso8wE6ZJTh2PLKXpKmNd213f5x/pIt78CCDTa
Y1L/eR41mmveTL3VNS3sf6WaZpT9owxJKGIY8JgdiOmSjxJQpX5zdaC1KYso4eXr
lIXIRo9VLEmVLhhHhZi+QmX6+aQ05E1D9K0ENe4/uEnRsV525W78iwZ4fYeLzr+A
krEOdgx6sPgzajPYnHoayrrcKNKxD5YY1SWuVSm2tqYYIhlRoK3f5xgLOd10RiHE
YMgx8aWzGmGJwoUhgp1bo/l9EZ7O8OWRqM/GOP4x6Wgjdhqw2x5jgskmhsKNGEXu
/BlbS+qL5aUrMCxhvNbApuZW6xBiBbva76MH3vU9vFhZbVz1CHLQdGI0tfxggYWS
jc2UPgoxL9OQlf3jSc+gK7RMFhBGNWn2Aiy8GQas3BxPYXuYPvwOj+irDOG/qZ9D
VZ5swUw4+th+DsF5K53mEFeLv0fONMgL9Ka5bNR6+k6HG0WNLYYVOiet3xYUDo1f
eZbMZthfc+QW7R8cwG0WuFk6rC6mLqE+A9nQuLZoJD+VMuJd4pwW9+6EW8nDX08w
FS4/o92xUFJfOCgaLRS61FSAuSmFENieN+yoKMK/Uf6PJVdNMb4=
=vyu3
-----END PGP SIGNATURE-----
Merge tag 'flexible-array-conversions-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull more flexible-array member conversions from Gustavo A. R. Silva:
"Replace zero-length arrays with flexible-array members"
* tag 'flexible-array-conversions-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
printk: ringbuffer: Replace zero-length array with flexible-array member
net/smc: Replace zero-length array with flexible-array member
net/mlx5: Replace zero-length array with flexible-array member
mei: hw: Replace zero-length array with flexible-array member
gve: Replace zero-length array with flexible-array member
Bluetooth: btintel: Replace zero-length array with flexible-array member
scsi: target: tcmu: Replace zero-length array with flexible-array member
ima: Replace zero-length array with flexible-array member
enetc: Replace zero-length array with flexible-array member
fs: Replace zero-length array with flexible-array member
Bluetooth: Replace zero-length array with flexible-array member
params: Replace zero-length array with flexible-array member
tracepoint: Replace zero-length array with flexible-array member
platform/chrome: cros_ec_proto: Replace zero-length array with flexible-array member
platform/chrome: cros_ec_commands: Replace zero-length array with flexible-array member
mailbox: zynqmp-ipi-message: Replace zero-length array with flexible-array member
dmaengine: ti-cppi5: Replace zero-length array with flexible-array member
Almost all machines use GENERIC_CLOCKEVENTS, so it feels wrong to
require each one to select that symbol manually.
Instead, enable it whenever CONFIG_LEGACY_TIMER_TICK is disabled as
a simplification. It should be possible to select both
GENERIC_CLOCKEVENTS and LEGACY_TIMER_TICK from an architecture now
and decide at runtime between the two.
For the clockevents arch-support.txt file, this means that additional
architectures are marked as TODO when they have at least one machine
that still uses LEGACY_TIMER_TICK, rather than being marked 'ok' when
at least one machine has been converted. This means that both m68k and
arm (for riscpc) revert to TODO.
At this point, we could just always enable CONFIG_GENERIC_CLOCKEVENTS
rather than leaving it off when not needed. I built an m68k
defconfig kernel (using gcc-10.1.0) and found that this would add
around 5.5KB in kernel image size:
text data bss dec hex filename
3861936 1092236 196656 5150828 4e986c obj-m68k/vmlinux-no-clockevent
3866201 1093832 196184 5156217 4ead79 obj-m68k/vmlinux-clockevent
On Arm (MACH_RPC), that difference appears to be twice as large,
around 11KB on top of an 6MB vmlinux.
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
There are no more users of xtime_update aside from legacy_timer_tick(),
so fold it into that function and remove the declaration.
update_process_times() is now only called inside of the kernel/time/
code, so the declaration can be moved there.
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>