Provide static call to control IRQ preemption (called in CONFIG_PREEMPT)
so that we can override its behaviour when preempt= is overriden.
Since the default behaviour is full preemption, its call is
initialized to provide IRQ preemption when preempt= isn't passed.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-8-frederic@kernel.org
Commit 2991552447 ("entry: Drop usage of TIF flags in the generic syscall
code") introduced a bug on architectures using the generic syscall entry
code, in which processes stopped by PTRACE_SYSCALL do not trap on syscall
return after receiving a TIF_SINGLESTEP.
The reason is that the meaning of TIF_SINGLESTEP flag is overloaded to
cause the trap after a system call is executed, but since the above commit,
the syscall call handler only checks for the SYSCALL_WORK flags on the exit
work.
Split the meaning of TIF_SINGLESTEP such that it only means single-step
mode, and create a new type of SYSCALL_WORK to request a trap immediately
after a syscall in single-step mode. In the current implementation, the
SYSCALL_WORK flag shadows the TIF_SINGLESTEP flag for simplicity.
Update x86 to flip this bit when a tracer enables single stepping.
Fixes: 2991552447 ("entry: Drop usage of TIF flags in the generic syscall code")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Huey <me@kylehuey.com>
Link: https://lore.kernel.org/r/87h7mtc9pr.fsf_-_@collabora.com
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/YJxsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjpyEACBdW+YjenjTbkUPeEXzQgkBkTZUYw3g007
DPcUT1g8PQZXYXlQvBKCvGhhIr7/KVcjepKoowiNQfBNGcIPJTVopW58nzpqAfTQ
goI2WYGn5EKFFKBPvtH04cJD/Wo8muXdxynKtqyZbnGGgZjQxPrE259b8dpHjBSR
6L7HHkk0D1oU/5b6h6Ocpg9mc/0iIUCZylySAYY3eGO0JaVPJaXgZSJZYgHxCHll
Lb+/y/fXdtm/0PmQ3ko0ev54g3yEWqZIX0NsZW1asrButIy+KLzQ2Mz1xFLFDMag
prtIfwb8tzgc4dFPY090C/azjCh5CPpxqYS6FkRwS0p86n6OhkyXrqfily5Hs4/B
NC7CBPBSH/j+NKUK7CYZcpTzTpxPjUr9p0anUdlvMJz8FhTb/3YEEZ1UTeWOeHmk
Yo5SxnFghLeZZeZ1ok6rdymnVa7WEX12SCLGQX31BB2mld0tNbKb4b+FsBF6OUMk
IUaX6OjwDFVRaysC88BQ4hjcIP1HxsViG4/VZDX15gjAAH2Pvb+7tev+lcDcOhjz
TCD4GNFspTFzRhh9nT7oxQ679qCh9G9zHbzuIRewnrS6iqvo5SJQB3dR2yrWZRRH
ySkQFiHpYOlnLJYv0jg9COlGwo2FUdcvKhCvkjQKKBz48rzW/IC0LwKdRQWZDFk3
FKGzP/NBig==
=cadT
-----END PGP SIGNATURE-----
Merge tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block
Pull TIF_NOTIFY_SIGNAL updates from Jens Axboe:
"This sits on top of of the core entry/exit and x86 entry branch from
the tip tree, which contains the generic and x86 parts of this work.
Here we convert the rest of the archs to support TIF_NOTIFY_SIGNAL.
With that done, we can get rid of JOBCTL_TASK_WORK from task_work and
signal.c, and also remove a deadlock work-around in io_uring around
knowing that signal based task_work waking is invoked with the sighand
wait queue head lock.
The motivation for this work is to decouple signal notify based
task_work, of which io_uring is a heavy user of, from sighand. The
sighand lock becomes a huge contention point, particularly for
threaded workloads where it's shared between threads. Even outside of
threaded applications it's slower than it needs to be.
Roman Gershman <romger@amazon.com> reported that his networked
workload dropped from 1.6M QPS at 80% CPU to 1.0M QPS at 100% CPU
after io_uring was changed to use TIF_NOTIFY_SIGNAL. The time was all
spent hammering on the sighand lock, showing 57% of the CPU time there
[1].
There are further cleanups possible on top of this. One example is
TIF_PATCH_PENDING, where a patch already exists to use
TIF_NOTIFY_SIGNAL instead. Hopefully this will also lead to more
consolidation, but the work stands on its own as well"
[1] https://github.com/axboe/liburing/issues/215
* tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block: (28 commits)
io_uring: remove 'twa_signal_ok' deadlock work-around
kernel: remove checking for TIF_NOTIFY_SIGNAL
signal: kill JOBCTL_TASK_WORK
io_uring: JOBCTL_TASK_WORK is no longer used by task_work
task_work: remove legacy TWA_SIGNAL path
sparc: add support for TIF_NOTIFY_SIGNAL
riscv: add support for TIF_NOTIFY_SIGNAL
nds32: add support for TIF_NOTIFY_SIGNAL
ia64: add support for TIF_NOTIFY_SIGNAL
h8300: add support for TIF_NOTIFY_SIGNAL
c6x: add support for TIF_NOTIFY_SIGNAL
alpha: add support for TIF_NOTIFY_SIGNAL
xtensa: add support for TIF_NOTIFY_SIGNAL
arm: add support for TIF_NOTIFY_SIGNAL
microblaze: add support for TIF_NOTIFY_SIGNAL
hexagon: add support for TIF_NOTIFY_SIGNAL
csky: add support for TIF_NOTIFY_SIGNAL
openrisc: add support for TIF_NOTIFY_SIGNAL
sh: add support for TIF_NOTIFY_SIGNAL
um: add support for TIF_NOTIFY_SIGNAL
...
This is the same as syscall_exit_to_user_mode() but without calling
exit_to_user_mode(). This can be used if there is an architectural reason
to avoid the combo function, e.g. restarting a syscall without returning to
userspace. Before returning to user space the caller has to invoke
exit_to_user_mode().
[ tglx: Amended comments ]
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-6-svens@linux.ibm.com
Called from architecture specific code when syscall_exit_to_user_mode() is
not suitable. It simply calls __exit_to_user_mode().
This way __exit_to_user_mode() can still be inlined because it is declared
static __always_inline.
[ tglx: Amended comments and moved it to a different place in the header ]
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-5-svens@linux.ibm.com
To be called from architecture specific code if the combo interfaces are
not suitable. It simply calls __enter_from_user_mode(). This way
__enter_from_user_mode will still be inlined because it is declared static
__always_inline.
[ tglx: Amend comments and move it to a different location in the header ]
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-4-svens@linux.ibm.com
Syscall User Dispatch (SUD) must take precedence over seccomp and
ptrace, since the use case is emulation (it can be invoked with a
different ABI) such that seccomp filtering by syscall number doesn't
make sense in the first place. In addition, either the syscall is
dispatched back to userspace, in which case there is no resource for to
trace, or the syscall will be executed, and seccomp/ptrace will execute
next.
Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as
well, just to prevent a trace exit event when dispatch was triggered.
For that, the on_syscall_dispatch() examines context to skip the
tracepoint, audit and other work.
[ tglx: Add a comment on the exit side ]
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-5-krisman@collabora.com
Now that the flags migration in the common syscall entry code is complete
and the code relies exclusively on thread_info::syscall_work, clean up the
accesses to TI flags in that path.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-10-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SYSCALL_AUDIT, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-9-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SYSCALL_EMU, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-8-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SYSCALL_TRACE, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-7-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SYSCALL_TRACEPOINT, use it in the generic entry code
and convert the code which uses the TIF specific helper functions to use
the new *_syscall_work() helpers which either resolve to the new mode for
users of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-6-krisman@collabora.com
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.
Define SYSCALL_WORK_SECCOMP, use it in the generic entry code and convert
the code which uses the TIF specific helper functions to use the new
*_syscall_work() helpers which either resolve to the new mode for users of
the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-5-krisman@collabora.com
Prepare the common entry code to use the SYSCALL_WORK flags. They will
be defined in subsequent patches for each type of syscall
work. SYSCALL_WORK_ENTRY/EXIT are defined for the transition, as they
will replace the TIF_ equivalent defines.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-4-krisman@collabora.com
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+pR0MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoSWpD/93kyOy0L7NkIELgM6/OHipjsLC6K12
jMXifA5DfmIIm31sLLzLk08YPz4TOJU+lZKn1DGdqYLioMvvsJe3uZP/WQVyV81z
QnqdwOpdJVxq7JIjQW04eaVDWjXmxFGmbjKq/tbGexph0hckOtUq/7JOEukzgbLX
q5iZqYgFouU/G5HIW5Um21WiEVmzzjTFMp13zbqo3rMrG9vXIb5JQxm+TkBbx2P7
u2kS8R3OFScQcH6UyhaFBBmNFyUHtfbMKPESnhVSXUggQ0aVhNi6pjylE2ZaEIB+
d7C/yNGNwlC7rGPlh49W5gH5rogrX2Ft2YVrHf645q3Sj/GhbXZ1NqT4f1DJt5uM
tKjKFxJv6g1dT7ejlUQmseAtLI4ue2wj3C0qtPeHOnUlPHlaDlkTLE2oaCh97Mgn
mDpAZVnMOcLMRuBFt1J+fnoYcBHwXlfT1rAj//U6m6Pi5BbwIVwTsok5Vsysms4L
Tyx31zXee3XTp8FEWRL9gqH0b5zXxEUuHxZkqu4vdQDf+wkTi08Q4FpGnhaxGBrY
CHTwm42hmOK80hdQ6Vv4O38LkQcaEpTztVRexP7a89Q3zxmGlL5BsWGqIIoai0Fg
8UWmHvauG3kqAif6giWv3QNQ5v0rWksHNxU1yK0tRT6s03no3Kz1eoHHLar8+hoU
8+/AY23fcJookg==
=SPWC
-----END PGP SIGNATURE-----
Merge tag 'core-entry-notify-signal' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into tif-task_work.arch
Core changes to support TASK_NOTIFY_SIGNAL
* tag 'core-entry-notify-signal' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
task_work: Use TIF_NOTIFY_SIGNAL if available
entry: Add support for TIF_NOTIFY_SIGNAL
signal: Add task_sigpending() helper
Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary, irqentry_state_t can carry the necessary information as well.
Move it to common code and extend irqentry_state_t to carry lockdep state.
[ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
between the IRQ and NMI exceptions, and add kernel documentation for
struct irqentry_state_t ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201102205320.1458656-7-ira.weiny@intel.com
Add TIF_NOTIFY_SIGNAL handling in the generic entry code, which if set,
will return true if signal_pending() is used in a wait loop. That causes an
exit of the loop so that notify_signal tracehooks can be run. If the wait
loop is currently inside a system call, the system call is restarted once
task_work has been processed.
In preparation for only having arch_do_signal() handle syscall restarts if
_TIF_SIGPENDING isn't set, rename it to arch_do_signal_or_restart(). Pass
in a boolean that tells the architecture specific signal handler if it
should attempt to get a signal, or just process a potential syscall
restart.
For !CONFIG_GENERIC_ENTRY archs, add the TIF_NOTIFY_SIGNAL handling to
get_signal(). This is done to minimize the needed architecture changes to
support this feature.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-3-axboe@kernel.dk
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+ENW0ACgkQEsHwGGHe
VUrWjw/+O0S/Bf7RQ2OIDnaHGo5u9k+T+FiklYtTO4klYqtfNEt/DFWVOIThVXBQ
ma4I8Hspj+zUzlq2kqSeqJ2PiikTxRNDqkCUwZhqEFgbXS6/pt8VXXdPniKjeXge
ZE4lcD1RIyDFxzVlKvVaYt1KryZZVVSRqRIChejLrujN23fI6riWfa0W4Bq54J6m
fdiujuDJQ9oroak36dF5Ah6g4g8gL8hBLU9Oyzla9V+1O3GSZuDlwTgDsxZZkmC5
LN4spxwd9tOXOmWhbH7vFfRtQL79KUHkHbUuUvZzZsJ/zs85bxhMa+fUAfjWAEja
brMpD1GZKOcjUM7xzQ9HngMcKD8lWmlsTBTAO9drD89Z949ntjIA4uCY3d3RTJ1q
NoYCV8Xw+8Q8e+zjnMW0tph39LCUEeuccT7t09XP5IF5UEXi5T5S14WoCu5Shnt9
VTQ44NrAxpP7ZNWMpBTaxmr3aXABbdgnvDIxqrohqgQnCnPkWlBJ9FdKj8sQ3y9B
K010ihIb1pWnmTyKGIC3GOWNjwtCpqz9z3gya76tI7EzAejVS6yUqwMohjaWq6JZ
Tz/TtTSTUyczKiCCqoOf7P+5LKrhxjWS8IVBeMqMTeN7osCCIT69U+cox1Ih3DST
pBfy7R3+FXKLHVi/iQv8E+fl3//pTGppKv4MM/wab0E6L+KhqEo=
=NYxb
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
"Misc minor cleanups"
* tag 'x86_cleanups_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Fix typo in comments for syscall_enter_from_user_mode()
x86/resctrl: Fix spelling in user-visible warning messages
x86/entry/64: Do not include inst.h in calling.h
x86/mpparse: Remove duplicate io_apic.h include
Just to help myself and others with finding the correct function names,
fix a typo for "usermode" vs "user_mode".
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200919080936.259819-1-keescook@chromium.org
Andy reported that the syscall treacing for 32bit fast syscall fails:
# ./tools/testing/selftests/x86/ptrace_syscall_32
...
[RUN] SYSEMU
[FAIL] Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732)
...
[RUN] SYSCALL
[FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
The eason is that the conversion to generic entry code moved the retrieval
of the sixth argument (EBP) after the point where the syscall entry work
runs, i.e. ptrace, seccomp, audit...
Unbreak it by providing a split up version of syscall_enter_from_user_mode().
- syscall_enter_from_user_mode_prepare() establishes state and enables
interrupts
- syscall_enter_from_user_mode_work() runs the entry work
Replace the call to syscall_enter_from_user_mode() in the 32bit fast
syscall C-entry with the split functions and stick the EBP retrieval
between them.
Fixes: 27d6b4d14f ("x86/entry: Use generic syscall entry function")
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87k0xdjbtt.fsf@nanos.tec.linutronix.de
Like the syscall entry/exit code interrupt/exception entry after the real
low level ASM bits should not be different accross architectures.
Provide a generic version based on the x86 code.
irqentry_enter() is called after the low level entry code and
irqentry_exit() must be invoked right before returning to the low level
code which just contains the actual return logic. The code before
irqentry_enter() and irqentry_exit() must not be instrumented. Code after
irqentry_enter() and before irqentry_exit() can be instrumented.
irqentry_enter() invokes irqentry_enter_from_user_mode() if the
interrupt/exception came from user mode. If if entered from kernel mode it
handles the kernel mode variant of establishing state for lockdep, RCU and
tracing depending on the kernel context it interrupted (idle, non-idle).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220519.723703209@linutronix.de
Like syscall entry all architectures have similar and pointlessly different
code to handle pending work before returning from a syscall to user space.
1) One-time syscall exit work:
- rseq syscall exit
- audit
- syscall tracing
- tracehook (single stepping)
2) Preparatory work
- Exit to user mode loop (common TIF handling).
- Architecture specific one time work arch_exit_to_user_mode_prepare()
- Address limit and lockdep checks
3) Final transition (lockdep, tracing, context tracking, RCU). Invokes
arch_exit_to_user_mode() to handle e.g. speculation mitigations
Provide a generic version based on the x86 code which has all the RCU and
instrumentation protections right.
Provide a variant for interrupt return to user mode as well which shares
the above #2 and #3 work items.
After syscall_exit_to_user_mode() and irqentry_exit_to_user_mode() the
architecture code just has to return to user space. The code after
returning from these functions must not be instrumented.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220519.613977173@linutronix.de
On syscall entry certain work needs to be done:
- Establish state (lockdep, context tracking, tracing)
- Conditional work (ptrace, seccomp, audit...)
This code is needlessly duplicated and different in all
architectures.
Provide a generic version based on the x86 implementation which has all the
RCU and instrumentation bits right.
As interrupt/exception entry from user space needs parts of the same
functionality, provide a function for this as well.
syscall_enter_from_user_mode() and irqentry_enter_from_user_mode() must be
called right after the low level ASM entry. The calling code must be
non-instrumentable. After the functions returns state is correct and the
subsequent functions can be instrumented.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220519.513463269@linutronix.de