WSL2-Linux-Kernel

History

Juergen Gross a5aabace5f locking/csd_lock: Add more data to CSD lock debugging

In order to help identifying problems with IPI handling and remote function execution add some more data to IPI debugging code.

There have been multiple reports of CPUs looping long times (many seconds) in smp_call_function_many() waiting for another CPU executing a function like tlb flushing. Most of these reports have been for cases where the kernel was running as a guest on top of KVM or Xen (there are rumours of that happening under VMWare, too, and even on bare metal).

Finding the root cause hasn't been successful yet, even after more than 2 years of chasing this bug by different developers.

Commit: `35feb60474` ("kernel/smp: Provide CSD lock timeout diagnostics") tried to address this by adding some debug code and by issuing another IPI when a hang was detected. This helped mitigating the problem (the repeated IPI unlocks the hang), but the root cause is still unknown.

Current available data suggests that either an IPI wasn't sent when it should have been, or that the IPI didn't result in the target CPU executing the queued function (due to the IPI not reaching the CPU, the IPI handler not being called, or the handler not seeing the queued request).

Try to add more diagnostic data by introducing a global atomic counter which is being incremented when doing critical operations (before and after queueing a new request, when sending an IPI, and when dequeueing a request). The counter value is stored in percpu variables which can be printed out when a hang is detected.

The data of the last event (consisting of sequence counter, source CPU, target CPU, and event type) is stored in a global variable. When a new event is to be traced, the data of the last event is stored in the event related percpu location and the global data is updated with the new event's data. This allows to track two events in one data location: one by the value of the event data (the event before the current one), and one by the location itself (the current event).

A typical printout with a detected hang will look like this:

    csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
    csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
    csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
    csd: cnt(00008cd): ffff->0006 idle
    csd: cnt(0003668): 0001->0006 queue
    csd: cnt(0003669): 0001->0006 ipi
    csd: cnt(0003e0f): 0007->000a queue
    csd: cnt(0003e10): 0001->ffff ping
    csd: cnt(0003e71): 0003->0000 ping
    csd: cnt(0003e72): ffff->0006 gotipi
    csd: cnt(0003e73): ffff->0006 handle
    csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
    csd: cnt(0003e7f): 0004->0006 ping
    csd: cnt(0003e80): 0001->ffff pinged
    csd: cnt(0003eb2): 0005->0001 noipi
    csd: cnt(0003eb3): 0001->0006 queue
    csd: cnt(0003eb4): 0001->0006 noipi
    csd: cnt now: 0003f00

The idea is to print only relevant entries. Those are all events which are associated with the hang (so sender side events for the source CPU of the hanging request, and receiver side events for the target CPU), and the related events just before those (for adding data needed to identify a possible race). Printing all available data would be possible, but this would add large amounts of data printed on larger configurations.

Signed-off-by: Juergen Gross <jgross@suse.com>
[ Minor readability edits. Breaks col80 but is far more readable. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20210301101336.7797-4-jgross@suse.com

2021-03-06 12:49:48 +01:00
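The "two events in one data location" idea from the commit message can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel's implementation: the names `trace_event`, `ev_pack`, `slot`, `last_event`, the `NR_CPUS` value, the three event types, and the bit layout are all hypothetical, and a plain array stands in for percpu storage. Each new event swaps itself into the global record and saves the previous global record in the slot that belongs to the new event's CPU and type.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 8   /* hypothetical CPU count for this sketch */

/* Hypothetical event types; names loosely follow the sample printout. */
enum ev_type { EV_QUEUE, EV_IPI, EV_DEQUEUE, EV_NR };

/* Assumed layout: counter in bits 63..32, src in 31..20, dst in 19..8, type in 7..0. */
static inline uint64_t ev_pack(uint32_t cnt, unsigned src, unsigned dst, unsigned type)
{
    return ((uint64_t)cnt << 32) | ((uint64_t)(src & 0xfff) << 20) |
           ((uint64_t)(dst & 0xfff) << 8) | (uint64_t)(type & 0xff);
}

static inline uint32_t ev_cnt(uint64_t ev)     { return (uint32_t)(ev >> 32); }
static inline unsigned ev_src(uint64_t ev)     { return (unsigned)((ev >> 20) & 0xfff); }
static inline unsigned ev_dst(uint64_t ev)     { return (unsigned)((ev >> 8) & 0xfff); }
static inline unsigned ev_type_of(uint64_t ev) { return (unsigned)(ev & 0xff); }

/* Global record of the most recent event, including the sequence counter. */
static _Atomic uint64_t last_event;

/* Stand-in for the percpu locations: one slot per CPU and event type. */
static uint64_t slot[NR_CPUS][EV_NR];

/*
 * Trace a new event on behalf of `cpu`: atomically replace the global
 * record with the new event (counter + 1) and store the *previous*
 * global record in the slot belonging to the new event.
 */
static void trace_event(unsigned cpu, unsigned src, unsigned dst, enum ev_type type)
{
    uint64_t old, next;

    assert(cpu < NR_CPUS && type < EV_NR);

    old = atomic_load(&last_event);
    do {
        next = ev_pack(ev_cnt(old) + 1, src, dst, (unsigned)type);
        /* On failure, `old` is refreshed with the current global value. */
    } while (!atomic_compare_exchange_weak(&last_event, &old, next));

    slot[cpu][type] = old;
}
```

After calling `trace_event(1, 1, 6, EV_QUEUE)` and then `trace_event(1, 1, 6, EV_IPI)`, the slot `slot[1][EV_IPI]` holds the queue event: the slot's position says which event happened (an IPI event on CPU 1), and the slot's value says what happened immediately before it.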
..
ABI	A handful of late-arriving documentation fixes, nothing all that notable.	2021-02-26 14:21:18 -08:00
PCI	Documentation: PCI: Add PCI endpoint NTB function user guide	2021-02-23 14:15:45 -06:00
RCU	It has been a relatively quiet cycle in docsland.	2021-02-22 10:57:46 -08:00
accounting	…
admin-guide	locking/csd_lock: Add more data to CSD lock debugging	2021-03-06 12:49:48 +01:00
arm	Documentation: ARM: fix reference to DT format documentation	2021-01-28 15:37:43 -07:00
arm64	…
block	block/bfq: update comments and default value in docs for fifo_expire	2021-03-02 11:25:38 -07:00
bpf	…
cdrom	…
core-api	Merge branch 'akpm' (patches from Andrew)	2021-02-24 16:20:38 -08:00
cpu-freq	…
crypto	…
dev-tools	kasan: clarify that only first bug is reported in HW_TAGS	2021-02-26 09:41:03 -08:00
devicetree	Devicetree fixes for v5.12-rc:	2021-03-05 12:12:28 -08:00
doc-guide	docs: Document cross-referencing using relative path	2021-02-04 16:24:12 -07:00
driver-api	Char/Misc driver patches for 5.12-rc1	2021-02-24 10:25:37 -08:00
fault-injection	…
fb	…
features	Documentation: features: refresh feature list	2021-02-25 11:25:57 -07:00
filesystems	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2021-02-27 08:07:12 -08:00
firmware-guide	Merge branch 'acpi-messages'	2021-02-15 17:04:53 +01:00
firmware_class	…
fpga	…
gpu	It has been a relatively quiet cycle in docsland.	2021-02-22 10:57:46 -08:00
hid	…
hwmon	hwmon: add Texas Instruments TPS23861 driver	2021-02-12 07:02:55 -08:00
i2c	i2c: testunit: add support for block process calls	2021-02-12 11:11:04 +01:00
ia64	…
ide	…
iio	…
infiniband	…
input	Documentation: input: define ABS_PRESSURE/ABS_MT_PRESSURE resolution as grams	2021-01-28 16:43:04 -07:00
isdn	…
kbuild	Kbuild updates for v5.12	2021-02-25 10:17:31 -08:00
kernel-hacking	docs: kernel-hacking: be more civil	2021-02-11 10:00:40 -07:00
leds	…
litmus-tests	…
livepatch	…
locking	…
m68k	…
maintainer	…
mhi	…
mips	…
misc-devices	…
netlabel	…
networking	Staging/IIO driver patches for 5.12-rc1	2021-02-20 21:36:51 -08:00
nios2	…
nvdimm	…
openrisc	…
parisc	…
pcmcia	…
power	It has been a relatively quiet cycle in docsland.	2021-02-22 10:57:46 -08:00
powerpc	docs: powerpc: Fix tables in syscall64-abi.rst	2021-02-25 13:04:24 -07:00
process	A handful of late-arriving documentation fixes, nothing all that notable.	2021-02-26 14:21:18 -08:00
riscv	…
s390	…
scheduler	It has been a relatively quiet cycle in docsland.	2021-02-22 10:57:46 -08:00
scsi	SCSI misc on 20210219	2021-02-22 10:24:58 -08:00
security	Keyrings miscellany	2021-02-23 16:09:23 -08:00
sh	…
sound	ALSA: jack: implement software jack injection via debugfs	2021-02-02 10:37:07 +01:00
sparc	…
sphinx	docs: Enable usage of relative paths to docs on automarkup	2021-02-04 16:23:43 -07:00
sphinx-static	…
spi	…
staging	…
target	…
timers	…
trace	Char/Misc driver patches for 5.12-rc1	2021-02-24 10:25:37 -08:00
translations	A handful of late-arriving documentation fixes, nothing all that notable.	2021-02-26 14:21:18 -08:00
usb	…
userspace-api	Char/Misc driver patches for 5.12-rc1	2021-02-24 10:25:37 -08:00
virt	* Doc fixes	2021-03-04 11:26:17 -08:00
vm	mm/debug_vm_pgtable/basic: add validation for dirtiness after write protect	2021-02-24 13:38:27 -08:00
w1	…
watchdog	…
x86	Documentation/x86/boot.rst: Correct the example of SETUP_INDIRECT	2021-01-28 15:25:31 -07:00
xtensa	…
.gitignore	…
COPYING-logo	…
Changes	…
CodingStyle	…
Kconfig	…
Makefile	kbuild: remove PYTHON variable	2021-02-01 10:37:19 +09:00
SubmittingPatches	…
asm-annotations.rst	…
atomic_bitops.txt	…
atomic_t.txt	…
conf.py	Fix unaesthetic indentation	2021-02-22 14:35:04 -07:00
docutils.conf	…
dontdiff	…
index.rst	…
logo.gif	…
memory-barriers.txt	…
watch_queue.rst	…