coarse grained, hardware based, forward edge Control-Flow-Integrity mechanism
where any indirect CALL/JMP must target an ENDBR instruction or suffer #CP.
Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation is
limited to 2 instructions (and typically fewer) on branch targets not starting
with ENDBR. CET-IBT also limits speculation of the next sequential instruction
after the indirect CALL/JMP [1].
CET-IBT is fundamentally incompatible with retpolines, but provides, as
described above, speculation limits itself.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmI/LI8VHHBldGVyekBp
bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6ZnkP/2QCgQLTu6oRxv9O020CHwlaSEeD
1Hoy3loum5q5hAi1Ik3dR9p0H5u64c9qbrBVxaFoNKaLt5GKrtHaDSHNk2L/CFHX
urpH65uvTLxbyZzcahkAahoJ71XU+m7PcrHLWMunw9sy10rExYVsUOlFyoyG6XCF
BDCNZpdkC09ZM3vwlWGMZd5Pp+6HcZNPyoV9tpvWAS2l+WYFWAID7mflbpQ+tA8b
y/hM6b3Ud0rT2ubuG1iUpopgNdwqQZ+HisMPGprh+wKZkYwS2l8pUTrz0MaBkFde
go7fW16kFy2HQzGm6aIEBmfcg0palP/mFVaWP0zS62LwhJSWTn5G6xWBr3yxSsht
9gWCiI0oDZuTg698MedWmomdG2SK6yAuZuqmdKtLLoWfWgviPEi7TDFG/cKtZdAW
ag8GM8T4iyYZzpCEcWO9GWbjo6TTGq30JBQefCBG47GjD0csv2ubXXx0Iey+jOwT
x3E8wnv9dl8V9FSd/tMpTFmje8ges23yGrWtNpb5BRBuWTeuGiBPZED2BNyyIf+T
dmewi2ufNMONgyNp27bDKopY81CPAQq9cVxqNm9Cg3eWPFnpOq2KGYEvisZ/rpEL
EjMQeUBsy/C3AUFAleu1vwNnkwP/7JfKYpN00gnSyeQNZpqwxXBCKnHNgOMTXyJz
beB/7u2KIUbKEkSN
=jZfK
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CET-IBT (Control-Flow-Integrity) support from Peter Zijlstra:
"Add support for Intel CET-IBT, available since Tigerlake (11th gen),
which is a coarse grained, hardware based, forward edge
Control-Flow-Integrity mechanism where any indirect CALL/JMP must
target an ENDBR instruction or suffer #CP.
Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation
is limited to 2 instructions (and typically fewer) on branch targets
not starting with ENDBR. CET-IBT also limits speculation of the next
sequential instruction after the indirect CALL/JMP [1].
CET-IBT is fundamentally incompatible with retpolines, but provides,
as described above, speculation limits itself"
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
* tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
kvm/emulate: Fix SETcc emulation for ENDBR
x86/Kconfig: Only allow CONFIG_X86_KERNEL_IBT with ld.lld >= 14.0.0
x86/Kconfig: Only enable CONFIG_CC_HAS_IBT for clang >= 14.0.0
kbuild: Fixup the IBT kbuild changes
x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
x86: Remove toolchain check for X32 ABI capability
x86/alternative: Use .ibt_endbr_seal to seal indirect calls
objtool: Find unused ENDBR instructions
objtool: Validate IBT assumptions
objtool: Add IBT/ENDBR decoding
objtool: Read the NOENDBR annotation
x86: Annotate idtentry_df()
x86,objtool: Move the ASM_REACHABLE annotation to objtool.h
x86: Annotate call_on_stack()
objtool: Rework ASM_REACHABLE
x86: Mark __invalid_creds() __noreturn
exit: Mark do_group_exit() __noreturn
x86: Mark stop_this_cpu() __noreturn
objtool: Ignore extra-symbol code
objtool: Rename --duplicate to --lto
...
Core
----
- Introduce XDP multi-buffer support, allowing the use of XDP with
jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
- Speed up netns dismantling (5x) and lower the memory cost a little.
Remove unnecessary per-netns sockets. Scope some lists to a netns.
Cut down RCU syncing. Use batch methods. Allow netdev registration
to complete out of order.
- Support distinguishing timestamp types (ingress vs egress) and
maintaining them across packet scrubbing points (e.g. redirect).
- Continue the work of annotating packet drop reasons throughout
the stack.
- Switch netdev error counters from an atomic to dynamically
allocated per-CPU counters.
- Rework a few preempt_disable(), local_irq_save() and busy waiting
sections problematic on PREEMPT_RT.
- Extend the ref_tracker to allow catching use-after-free bugs.
BPF
---
- Introduce "packing allocator" for BPF JIT images. JITed code is
marked read only, and used to be allocated at page granularity.
Custom allocator allows for more efficient memory use, lower
iTLB pressure and prevents identity mapping huge pages from
getting split.
- Make use of BTF type annotations (e.g. __user, __percpu) to enforce
the correct probe read access method, add appropriate helpers.
- Convert the BPF preload to use light skeleton and drop
the user-mode-driver dependency.
- Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
its use as a packet generator.
- Allow local storage memory to be allocated with GFP_KERNEL if called
from a hook allowed to sleep.
- Introduce fprobe (multi kprobe) to speed up mass attachment (arch
bits to come later).
- Add unstable conntrack lookup helpers for BPF by using the BPF
kfunc infra.
- Allow cgroup BPF progs to return custom errors to user space.
- Add support for AF_UNIX iterator batching.
- Allow iterator programs to use sleepable helpers.
- Support JIT of add, and, or, xor and xchg atomic ops on arm64.
- Add BTFGen support to bpftool which allows to use CO-RE in kernels
without BTF info.
- Large number of libbpf API improvements, cleanups and deprecations.
Protocols
---------
- Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
- Adjust TSO packet sizes based on min_rtt, allowing very low latency
links (data centers) to always send full-sized TSO super-frames.
- Make IPv6 flow label changes (AKA hash rethink) more configurable,
via sysctl and setsockopt. Distinguish between server and client
behavior.
- VxLAN support to "collect metadata" devices to terminate only
configured VNIs. This is similar to VLAN filtering in the bridge.
- Support inserting IPv6 IOAM information to a fraction of frames.
- Add protocol attribute to IP addresses to allow identifying where
given address comes from (kernel-generated, DHCP etc.)
- Support setting socket and IPv6 options via cmsg on ping6 sockets.
- Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
Define dscp_t and stop taking ECN bits into account in fib-rules.
- Add support for locked bridge ports (for 802.1X).
- tun: support NAPI for packets received from batched XDP buffs,
doubling the performance in some scenarios.
- IPv6 extension header handling in Open vSwitch.
- Support IPv6 control message load balancing in bonding, prevent
neighbor solicitation and advertisement from using the wrong port.
Support NS/NA monitor selection similar to existing ARP monitor.
- SMC
- improve performance with TCP_CORK and sendfile()
- support auto-corking
- support TCP_NODELAY
- MCTP (Management Component Transport Protocol)
- add user space tag control interface
- I2C binding driver (as specified by DMTF DSP0237)
- Multi-BSSID beacon handling in AP mode for WiFi.
- Bluetooth:
- handle MSFT Monitor Device Event
- add MGMT Adv Monitor Device Found/Lost events
- Multi-Path TCP:
- add support for the SO_SNDTIMEO socket option
- lots of selftest cleanups and improvements
- Increase the max PDU size in CAN ISOTP to 64 kB.
Driver API
----------
- Add HW counters for SW netdevs, a mechanism for devices which
offload packet forwarding to report packet statistics back to
software interfaces such as tunnels.
- Select the default NIC queue count as a fraction of number of
physical CPU cores, instead of hard-coding to 8.
- Expose devlink instance locks to drivers. Allow device layer of
drivers to use that lock directly instead of creating their own
which always runs into ordering issues in devlink callbacks.
- Add header/data split indication to guide user space enabling
of TCP zero-copy Rx.
- Allow configuring completion queue event size.
- Refactor page_pool to enable fragmenting after allocation.
- Add allocation and page reuse statistics to page_pool.
- Improve Multiple Spanning Trees support in the bridge to allow
reuse of topologies across VLANs, saving HW resources in switches.
- DSA (Distributed Switch Architecture):
- replay and offload of host VLAN entries
- offload of static and local FDB entries on LAG interfaces
- FDB isolation and unicast filtering
New hardware / drivers
----------------------
- Ethernet:
- LAN937x T1 PHYs
- Davicom DM9051 SPI NIC driver
- Realtek RTL8367S, RTL8367RB-VB switch and MDIO
- Microchip ksz8563 switches
- Netronome NFP3800 SmartNICs
- Fungible SmartNICs
- MediaTek MT8195 switches
- WiFi:
- mt76: MediaTek mt7916
- mt76: MediaTek mt7921u USB adapters
- brcmfmac: Broadcom BCM43454/6
- Mobile:
- iosm: Intel M.2 7360 WWAN card
Drivers
-------
- Convert many drivers to the new phylink API built for split PCS
designs but also simplifying other cases.
- Intel Ethernet NICs:
- add TTY for GNSS module for E810T device
- improve AF_XDP performance
- GTP-C and GTP-U filter offload
- QinQ VLAN support
- Mellanox Ethernet NICs (mlx5):
- support xdp->data_meta
- multi-buffer XDP
- offload tc push_eth and pop_eth actions
- Netronome Ethernet NICs (nfp):
- flow-independent tc action hardware offload (police / meter)
- AF_XDP
- Other Ethernet NICs:
- at803x: fiber and SFP support
- xgmac: mdio: preamble suppression and custom MDC frequencies
- r8169: enable ASPM L1.2 if system vendor flags it as safe
- macb/gem: ZynqMP SGMII
- hns3: add TX push mode
- dpaa2-eth: software TSO
- lan743x: multi-queue, mdio, SGMII, PTP
- axienet: NAPI and GRO support
- Mellanox Ethernet switches (mlxsw):
- source and dest IP address rewrites
- RJ45 ports
- Marvell Ethernet switches (prestera):
- basic routing offload
- multi-chain TC ACL offload
- NXP embedded Ethernet switches (ocelot & felix):
- PTP over UDP with the ocelot-8021q DSA tagging protocol
- basic QoS classification on Felix DSA switch using dcbnl
- port mirroring for ocelot switches
- Microchip high-speed industrial Ethernet (sparx5):
- offloading of bridge port flooding flags
- PTP Hardware Clock
- Other embedded switches:
- lan966x: PTP Hardward Clock
- qca8k: mdio read/write operations via crafted Ethernet packets
- Qualcomm 802.11ax WiFi (ath11k):
- add LDPC FEC type and 802.11ax High Efficiency data in radiotap
- enable RX PPDU stats in monitor co-exist mode
- Intel WiFi (iwlwifi):
- UHB TAS enablement via BIOS
- band disablement via BIOS
- channel switch offload
- 32 Rx AMPDU sessions in newer devices
- MediaTek WiFi (mt76):
- background radar detection
- thermal management improvements on mt7915
- SAR support for more mt76 platforms
- MBSSID and 6 GHz band on mt7915
- RealTek WiFi:
- rtw89: AP mode
- rtw89: 160 MHz channels and 6 GHz band
- rtw89: hardware scan
- Bluetooth:
- mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
- Microchip CAN (mcp251xfd):
- multiple RX-FIFOs and runtime configurable RX/TX rings
- internal PLL, runtime PM handling simplification
- improve chip detection and error handling after wakeup
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmI7YBcACgkQMUZtbf5S
IrveSBAAmSNJlUK6vPsnNzs7IhsZnfI/AUjm2TCLZnlhKttbpI4A/4Pohk33V7RS
FGX7f8kjEfhUwrIiLDgeCnztNHRECrCmk6aZc/jLEvecmTauJ+f6kjShkDY/wix+
AkPHmrZnQeLPAEVuljDdV+sL6ik08+zQL7PazIYHsaSKKC0MGQptRwcri8PLRAKE
KPBAhVhleq2rAZ/ntprSN52F4Af6rpFTrPIWuN8Bqdbc9dy5094LT0mpOOWYvgr3
/DLvvAPuLemwyIQkjWknVKBRUAQcmNPC+BY3J8K3LRaiNhekGqOFan46BfqP+k2J
6DWu0Qrp2yWt4BMOeEToZR5rA6v5suUAMIBu8PRZIDkINXQMlIxHfGjZyNm0rVfw
7edNri966yus9OdzwPa32MIG3oC6PnVAwYCJAjjBMNS8sSIkp7wgHLkgWN4UFe2H
K/e6z8TLF4UQ+zFM0aGI5WZ+9QqWkTWEDF3R3OhdFpGrznna0gxmkOeV2YvtsgxY
cbS0vV9Zj73o+bYzgBKJsw/dAjyLdXoHUGvus26VLQ78S/VGunVKtItwoxBAYmZo
krW964qcC89YofzSi8RSKLHuEWtNWZbVm8YXr75u6jpr5GhMBu0CYefLs+BuZcxy
dw8c69cGneVbGZmY2J3rBhDkchbuICl8vdUPatGrOJAoaFdYKuw=
=ELpe
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"The sprinkling of SPI drivers is because we added a new one and Mark
sent us a SPI driver interface conversion pull request.
Core
----
- Introduce XDP multi-buffer support, allowing the use of XDP with
jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
- Speed up netns dismantling (5x) and lower the memory cost a little.
Remove unnecessary per-netns sockets. Scope some lists to a netns.
Cut down RCU syncing. Use batch methods. Allow netdev registration
to complete out of order.
- Support distinguishing timestamp types (ingress vs egress) and
maintaining them across packet scrubbing points (e.g. redirect).
- Continue the work of annotating packet drop reasons throughout the
stack.
- Switch netdev error counters from an atomic to dynamically
allocated per-CPU counters.
- Rework a few preempt_disable(), local_irq_save() and busy waiting
sections problematic on PREEMPT_RT.
- Extend the ref_tracker to allow catching use-after-free bugs.
BPF
---
- Introduce "packing allocator" for BPF JIT images. JITed code is
marked read only, and used to be allocated at page granularity.
Custom allocator allows for more efficient memory use, lower iTLB
pressure and prevents identity mapping huge pages from getting
split.
- Make use of BTF type annotations (e.g. __user, __percpu) to enforce
the correct probe read access method, add appropriate helpers.
- Convert the BPF preload to use light skeleton and drop the
user-mode-driver dependency.
- Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
its use as a packet generator.
- Allow local storage memory to be allocated with GFP_KERNEL if
called from a hook allowed to sleep.
- Introduce fprobe (multi kprobe) to speed up mass attachment (arch
bits to come later).
- Add unstable conntrack lookup helpers for BPF by using the BPF
kfunc infra.
- Allow cgroup BPF progs to return custom errors to user space.
- Add support for AF_UNIX iterator batching.
- Allow iterator programs to use sleepable helpers.
- Support JIT of add, and, or, xor and xchg atomic ops on arm64.
- Add BTFGen support to bpftool which allows to use CO-RE in kernels
without BTF info.
- Large number of libbpf API improvements, cleanups and deprecations.
Protocols
---------
- Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
- Adjust TSO packet sizes based on min_rtt, allowing very low latency
links (data centers) to always send full-sized TSO super-frames.
- Make IPv6 flow label changes (AKA hash rethink) more configurable,
via sysctl and setsockopt. Distinguish between server and client
behavior.
- VxLAN support to "collect metadata" devices to terminate only
configured VNIs. This is similar to VLAN filtering in the bridge.
- Support inserting IPv6 IOAM information to a fraction of frames.
- Add protocol attribute to IP addresses to allow identifying where
given address comes from (kernel-generated, DHCP etc.)
- Support setting socket and IPv6 options via cmsg on ping6 sockets.
- Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
Define dscp_t and stop taking ECN bits into account in fib-rules.
- Add support for locked bridge ports (for 802.1X).
- tun: support NAPI for packets received from batched XDP buffs,
doubling the performance in some scenarios.
- IPv6 extension header handling in Open vSwitch.
- Support IPv6 control message load balancing in bonding, prevent
neighbor solicitation and advertisement from using the wrong port.
Support NS/NA monitor selection similar to existing ARP monitor.
- SMC
- improve performance with TCP_CORK and sendfile()
- support auto-corking
- support TCP_NODELAY
- MCTP (Management Component Transport Protocol)
- add user space tag control interface
- I2C binding driver (as specified by DMTF DSP0237)
- Multi-BSSID beacon handling in AP mode for WiFi.
- Bluetooth:
- handle MSFT Monitor Device Event
- add MGMT Adv Monitor Device Found/Lost events
- Multi-Path TCP:
- add support for the SO_SNDTIMEO socket option
- lots of selftest cleanups and improvements
- Increase the max PDU size in CAN ISOTP to 64 kB.
Driver API
----------
- Add HW counters for SW netdevs, a mechanism for devices which
offload packet forwarding to report packet statistics back to
software interfaces such as tunnels.
- Select the default NIC queue count as a fraction of number of
physical CPU cores, instead of hard-coding to 8.
- Expose devlink instance locks to drivers. Allow device layer of
drivers to use that lock directly instead of creating their own
which always runs into ordering issues in devlink callbacks.
- Add header/data split indication to guide user space enabling of
TCP zero-copy Rx.
- Allow configuring completion queue event size.
- Refactor page_pool to enable fragmenting after allocation.
- Add allocation and page reuse statistics to page_pool.
- Improve Multiple Spanning Trees support in the bridge to allow
reuse of topologies across VLANs, saving HW resources in switches.
- DSA (Distributed Switch Architecture):
- replay and offload of host VLAN entries
- offload of static and local FDB entries on LAG interfaces
- FDB isolation and unicast filtering
New hardware / drivers
----------------------
- Ethernet:
- LAN937x T1 PHYs
- Davicom DM9051 SPI NIC driver
- Realtek RTL8367S, RTL8367RB-VB switch and MDIO
- Microchip ksz8563 switches
- Netronome NFP3800 SmartNICs
- Fungible SmartNICs
- MediaTek MT8195 switches
- WiFi:
- mt76: MediaTek mt7916
- mt76: MediaTek mt7921u USB adapters
- brcmfmac: Broadcom BCM43454/6
- Mobile:
- iosm: Intel M.2 7360 WWAN card
Drivers
-------
- Convert many drivers to the new phylink API built for split PCS
designs but also simplifying other cases.
- Intel Ethernet NICs:
- add TTY for GNSS module for E810T device
- improve AF_XDP performance
- GTP-C and GTP-U filter offload
- QinQ VLAN support
- Mellanox Ethernet NICs (mlx5):
- support xdp->data_meta
- multi-buffer XDP
- offload tc push_eth and pop_eth actions
- Netronome Ethernet NICs (nfp):
- flow-independent tc action hardware offload (police / meter)
- AF_XDP
- Other Ethernet NICs:
- at803x: fiber and SFP support
- xgmac: mdio: preamble suppression and custom MDC frequencies
- r8169: enable ASPM L1.2 if system vendor flags it as safe
- macb/gem: ZynqMP SGMII
- hns3: add TX push mode
- dpaa2-eth: software TSO
- lan743x: multi-queue, mdio, SGMII, PTP
- axienet: NAPI and GRO support
- Mellanox Ethernet switches (mlxsw):
- source and dest IP address rewrites
- RJ45 ports
- Marvell Ethernet switches (prestera):
- basic routing offload
- multi-chain TC ACL offload
- NXP embedded Ethernet switches (ocelot & felix):
- PTP over UDP with the ocelot-8021q DSA tagging protocol
- basic QoS classification on Felix DSA switch using dcbnl
- port mirroring for ocelot switches
- Microchip high-speed industrial Ethernet (sparx5):
- offloading of bridge port flooding flags
- PTP Hardware Clock
- Other embedded switches:
- lan966x: PTP Hardward Clock
- qca8k: mdio read/write operations via crafted Ethernet packets
- Qualcomm 802.11ax WiFi (ath11k):
- add LDPC FEC type and 802.11ax High Efficiency data in radiotap
- enable RX PPDU stats in monitor co-exist mode
- Intel WiFi (iwlwifi):
- UHB TAS enablement via BIOS
- band disablement via BIOS
- channel switch offload
- 32 Rx AMPDU sessions in newer devices
- MediaTek WiFi (mt76):
- background radar detection
- thermal management improvements on mt7915
- SAR support for more mt76 platforms
- MBSSID and 6 GHz band on mt7915
- RealTek WiFi:
- rtw89: AP mode
- rtw89: 160 MHz channels and 6 GHz band
- rtw89: hardware scan
- Bluetooth:
- mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
- Microchip CAN (mcp251xfd):
- multiple RX-FIFOs and runtime configurable RX/TX rings
- internal PLL, runtime PM handling simplification
- improve chip detection and error handling after wakeup"
* tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits)
llc: fix netdevice reference leaks in llc_ui_bind()
drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
ice: don't allow to run ice_send_event_to_aux() in atomic ctx
ice: fix 'scheduling while atomic' on aux critical err interrupt
net/sched: fix incorrect vlan_push_eth dest field
net: bridge: mst: Restrict info size queries to bridge ports
net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init()
drivers: net: xgene: Fix regression in CRC stripping
net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT
net: dsa: fix missing host-filtered multicast addresses
net/mlx5e: Fix build warning, detected write beyond size of field
iwlwifi: mvm: Don't fail if PPAG isn't supported
selftests/bpf: Fix kprobe_multi test.
Revert "rethook: x86: Add rethook x86 implementation"
Revert "arm64: rethook: Add arm64 rethook implementation"
Revert "powerpc: Add rethook support"
Revert "ARM: rethook: Add rethook arm implementation"
netdevice: add missing dm_private kdoc
net: bridge: mst: prevent NULL deref in br_mst_info_size()
selftests: forwarding: Use same VRF for port and VLAN upper
...
- New user_events interface. User space can register an event with the kernel
describing the format of the event. Then it will receive a byte in a page
mapping that it can check against. A privileged task can then enable that
event like any other event, which will change the mapped byte to true,
telling the user space application to start writing the event to the
tracing buffer.
- Add new "ftrace_boot_snapshot" kernel command line parameter. When set,
the tracing buffer will be saved in the snapshot buffer at boot up when
the kernel hands things over to user space. This will keep the traces that
happened at boot up available even if user space boot up has tracing as
well.
- Have TRACE_EVENT_ENUM() also update trace event field type descriptions.
Thus if a static array defines its size with an enum, the user space trace
event parsers can still know how to parse that array.
- Add new TRACE_CUSTOM_EVENT() macro. This acts the same as the
TRACE_EVENT() macro, but will attach to an existing tracepoint. This will
make one tracepoint be able to trace different content and not be stuck at
only what the original TRACE_EVENT() macro exports.
- Fixes to tracing error logging.
- Better saving of cmdlines to PIDs when tracing (use the wakeup events for
mapping).
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYjiO3RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhQzAQDtek5p80p/zkMGymm14wSH6qq0NdgN
Kv7fTBwEewUa0gD/UCOVLw4Oj+JtHQhCa3sCGZopmRv0BT1+4UQANqosKQY=
=Au08
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- New user_events interface. User space can register an event with the
kernel describing the format of the event. Then it will receive a
byte in a page mapping that it can check against. A privileged task
can then enable that event like any other event, which will change
the mapped byte to true, telling the user space application to start
writing the event to the tracing buffer.
- Add new "ftrace_boot_snapshot" kernel command line parameter. When
set, the tracing buffer will be saved in the snapshot buffer at boot
up when the kernel hands things over to user space. This will keep
the traces that happened at boot up available even if user space boot
up has tracing as well.
- Have TRACE_EVENT_ENUM() also update trace event field type
descriptions. Thus if a static array defines its size with an enum,
the user space trace event parsers can still know how to parse that
array.
- Add new TRACE_CUSTOM_EVENT() macro. This acts the same as the
TRACE_EVENT() macro, but will attach to an existing tracepoint. This
will make one tracepoint be able to trace different content and not
be stuck at only what the original TRACE_EVENT() macro exports.
- Fixes to tracing error logging.
- Better saving of cmdlines to PIDs when tracing (use the wakeup events
for mapping).
* tag 'trace-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (30 commits)
tracing: Have type enum modifications copy the strings
user_events: Add trace event call as root for low permission cases
tracing/user_events: Use alloc_pages instead of kzalloc() for register pages
tracing: Add snapshot at end of kernel boot up
tracing: Have TRACE_DEFINE_ENUM affect trace event types as well
tracing: Fix strncpy warning in trace_events_synth.c
user_events: Prevent dyn_event delete racing with ioctl add/delete
tracing: Add TRACE_CUSTOM_EVENT() macro
tracing: Move the defines to create TRACE_EVENTS into their own files
tracing: Add sample code for custom trace events
tracing: Allow custom events to be added to the tracefs directory
tracing: Fix last_cmd_set() string management in histogram code
user_events: Fix potential uninitialized pointer while parsing field
tracing: Fix allocation of last_cmd in last_cmd_set()
user_events: Add documentation file
user_events: Add sample code for typical usage
user_events: Add self-test for validator boundaries
user_events: Add self-test for perf_event integration
user_events: Add self-test for dynamic_events integration
user_events: Add self-test for ftrace integration
...
Alexei Starovoitov says:
====================
pull-request: bpf-next 2022-03-21 v2
We've added 137 non-merge commits during the last 17 day(s) which contain
a total of 143 files changed, 7123 insertions(+), 1092 deletions(-).
The main changes are:
1) Custom SEC() handling in libbpf, from Andrii.
2) subskeleton support, from Delyan.
3) Use btf_tag to recognize __percpu pointers in the verifier, from Hao.
4) Fix net.core.bpf_jit_harden race, from Hou.
5) Fix bpf_sk_lookup remote_port on big-endian, from Jakub.
6) Introduce fprobe (multi kprobe) _without_ arch bits, from Masami.
The arch specific bits will come later.
7) Introduce multi_kprobe bpf programs on top of fprobe, from Jiri.
8) Enable non-atomic allocations in local storage, from Joanne.
9) Various var_off ptr_to_btf_id fixed, from Kumar.
10) bpf_ima_file_hash helper, from Roberto.
11) Add "live packet" mode for XDP in BPF_PROG_RUN, from Toke.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (137 commits)
selftests/bpf: Fix kprobe_multi test.
Revert "rethook: x86: Add rethook x86 implementation"
Revert "arm64: rethook: Add arm64 rethook implementation"
Revert "powerpc: Add rethook support"
Revert "ARM: rethook: Add rethook arm implementation"
bpftool: Fix a bug in subskeleton code generation
bpf: Fix bpf_prog_pack when PMU_SIZE is not defined
bpf: Fix bpf_prog_pack for multi-node setup
bpf: Fix warning for cast from restricted gfp_t in verifier
bpf, arm: Fix various typos in comments
libbpf: Close fd in bpf_object__reuse_map
bpftool: Fix print error when show bpf map
bpf: Fix kprobe_multi return probe backtrace
Revert "bpf: Add support to inline bpf_get_func_ip helper on x86"
bpf: Simplify check in btf_parse_hdr()
selftests/bpf/test_lirc_mode2.sh: Exit with proper code
bpf: Check for NULL return from bpf_get_btf_vmlinux
selftests/bpf: Test skipping stacktrace
bpf: Adjust BPF stack helper functions to accommodate skip > 0
bpf: Select proper size for bpf_prog_pack
...
====================
Link: https://lore.kernel.org/r/20220322050159.5507-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adding ftrace_set_filter_ips function to be able to set filter on
multiple ip addresses at once.
With the kprobe multi attach interface we have cases where we need to
initialize ftrace_ops object with thousands of functions, so having
single function diving into ftrace_hash_move_and_update_ops with
ftrace_lock is faster.
The functions ips are passed as unsigned long array with count.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735282673.1084943.18310504594134769804.stgit@devnote2
Currently a lot of ftrace code assumes __fentry__ is at sym+0. However
with Intel IBT enabled the first instruction of a function will most
likely be ENDBR.
Change ftrace_location() to not only return the __fentry__ location
when called for the __fentry__ location, but also when called for the
sym+0 location.
Then audit/update all callsites of this function to consistently use
these new semantics.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.227581603@infradead.org
Add ftrace_boot_snapshot kernel parameter that will take a snapshot at the
end of boot up just before switching over to user space (it happens during
the kernel freeing of init memory).
This is useful when there's interesting data that can be collected from
kernel start up, but gets overridden by user space start up code. With
this option, the ring buffer content from the boot up traces gets saved in
the snapshot at the end of boot up. This trace can be read from:
/sys/kernel/tracing/snapshot
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Clean up the following clang-w1 warning:
kernel/trace/ftrace.c:7827: warning: Function parameter or member 'ops'
not described in 'unregister_ftrace_function'.
kernel/trace/ftrace.c:7805: warning: Function parameter or member 'ops'
not described in 'register_ftrace_function'.
Link: https://lkml.kernel.org/r/20220307004303.26399-1-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
As of commit
c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
the following sequence becomes possible:
p->__state = TASK_INTERRUPTIBLE;
__schedule()
deactivate_task(p);
ttwu()
READ !p->on_rq
p->__state=TASK_WAKING
trace_sched_switch()
__trace_sched_switch_state()
task_state_index()
return 0;
TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
the trace event.
Prevent this by pushing the value read from __schedule() down the trace
event.
Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
When building with clang + CONFIG_DYNAMIC_FTRACE=n + W=1, there is a
warning:
kernel/trace/ftrace.c:7194:20: error: unused function 'ftrace_startup_enable' [-Werror,-Wunused-function]
static inline void ftrace_startup_enable(int command) { }
^
1 error generated.
Clang warns on instances of static inline functions in .c files with W=1
after commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
The ftrace_startup_enable() stub has been unused since
commit e1effa0144 ("ftrace: Annotate the ops operation on update"),
where its use outside of the CONFIG_DYNAMIC_TRACE section was replaced
by ftrace_startup_all(). Remove it to resolve the warning.
Link: https://lkml.kernel.org/r/20220214192847.488166-1-nathan@kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The latest merge of the tracing tree sorts the mcount table at build time.
But s390 appears to do things differently (like always) and replaces the
sorted table back to the original unsorted one. As the ftrace algorithm
depends on it being sorted, bad things happen when it is not, and s390
experienced those bad things.
Add a new config to tell the boot if the mcount table is sorted or not,
and allow s390 to opt out of it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYezl+RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qs5wAQCO2K1uB+XI4rown8zKLrGxOiFIXChi
Ds5GKJwYlcOnwQEAqKK3mAodmMy6ydbeY1OY1IDdTkOCAd752FDIe1Qozws=
=aF8W
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Fix s390 breakage from sorting mcount tables.
The latest merge of the tracing tree sorts the mcount table at build
time. But s390 appears to do things differently (like always) and
replaces the sorted table back to the original unsorted one. As the
ftrace algorithm depends on it being sorted, bad things happen when it
is not, and s390 experienced those bad things.
Add a new config to tell the boot if the mcount table is sorted or
not, and allow s390 to opt out of it"
* tag 'trace-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix assuming build time sort works for s390
To speed up the boot process, as mcount_loc needs to be sorted for ftrace
to work properly, sorting it at build time is more efficient than boot up
and can save milliseconds of time. Unfortunately, this change broke s390
as it will modify the mcount_loc location after the sorting takes place
and will put back the unsorted locations. Since the sorting is skipped at
boot up if it is believed that it was sorted at run time, ftrace can crash
as its algorithms are dependent on the list being sorted.
Add a new config BUILDTIME_MCOUNT_SORT that is set when
BUILDTIME_TABLE_SORT but not if S390 is set. Use this config to determine
if sorting should take place at boot up.
Link: https://lore.kernel.org/all/yt9dee51ctfn.fsf@linux.ibm.com/
Fixes: 72b3942a17 ("scripts: ftrace - move the sort-processing in ftrace_init")
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
New:
- The Real Time Linux Analysis (RTLA) tool is added to the tools directory.
- Can safely filter on user space pointers with: field.ustring ~ "match-string"
- eprobes can now be filtered like any other event.
- trace_marker(_raw) now uses stream_open() to allow multiple threads to safely
write to it. Note, this could possibly break existing user space, but we will
not know until we hear about it, and then can revert the change if need be.
- New field in events to display when bottom halfs are disabled.
- Sorting of the ftrace functions are now done at compile time instead of
at bootup.
Infrastructure changes to support future efforts:
- Added __rel_loc type for trace events. Similar to __data_loc but the offset
to the dynamic data is based off of the location of the descriptor and not
the beginning of the event. Needed for user defined events.
- Some simplification of event trigger code.
- Make synthetic events process its callback better to not hinder other
event callbacks that are registered. Needed for user defined events.
And other small fixes and clean ups.
-
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYeGvcxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrZtAP9ICjJxX54MTErElhhUL/NFLV7wqhJi
OIAgmp6jGVRqPAD+JxQtBnGH+3XMd71ioQkTfQ1rp+jBz2ERBj2DmELUAg0=
=zmda
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New:
- The Real Time Linux Analysis (RTLA) tool is added to the tools
directory.
- Can safely filter on user space pointers with: field.ustring ~
"match-string"
- eprobes can now be filtered like any other event.
- trace_marker(_raw) now uses stream_open() to allow multiple threads
to safely write to it. Note, this could possibly break existing
user space, but we will not know until we hear about it, and then
can revert the change if need be.
- New field in events to display when bottom halfs are disabled.
- Sorting of the ftrace functions are now done at compile time
instead of at bootup.
Infrastructure changes to support future efforts:
- Added __rel_loc type for trace events. Similar to __data_loc but
the offset to the dynamic data is based off of the location of the
descriptor and not the beginning of the event. Needed for user
defined events.
- Some simplification of event trigger code.
- Make synthetic events process its callback better to not hinder
other event callbacks that are registered. Needed for user defined
events.
And other small fixes and cleanups"
* tag 'trace-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits)
tracing: Add ustring operation to filtering string pointers
rtla: Add rtla timerlat hist documentation
rtla: Add rtla timerlat top documentation
rtla: Add rtla timerlat documentation
rtla: Add rtla osnoise hist documentation
rtla: Add rtla osnoise top documentation
rtla: Add rtla osnoise man page
rtla: Add Documentation
rtla/timerlat: Add timerlat hist mode
rtla: Add timerlat tool and timelart top mode
rtla/osnoise: Add the hist mode
rtla/osnoise: Add osnoise top mode
rtla: Add osnoise tool
rtla: Helper functions for rtla
rtla: Real-Time Linux Analysis tool
tracing/osnoise: Properly unhook events if start_per_cpu_kthreads() fails
tracing: Remove duplicate warnings when calling trace_create_file()
tracing/kprobes: 'nmissed' not showed correctly for kretprobe
tracing: Add test for user space strings when filtering on string pointers
tracing: Have syscall trace events use trace_event_buffer_lock_reserve()
...
Now that ftrace function pointers are sorted at compile time, add a test
that makes sure they are sorted at run time. This test is only run if it is
configured in.
Link: https://lkml.kernel.org/r/20211206151858.4d21a24d@gandalf.local.home
Cc: Yinan Liu <yinan@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When the kernel starts, the initialization of ftrace takes
up a portion of the time (approximately 6~8ms) to sort mcount
addresses. We can save this time by moving mcount-sorting to
compile time.
Link: https://lkml.kernel.org/r/20211212113358.34208-2-yinan@linux.alibaba.com
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding ops cleanup to unregister_ftrace_direct_multi,
so it can be reused in another register call.
Link: https://lkml.kernel.org/r/20211206182032.87248-3-jolsa@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Fixes: f64dd4627e ("ftrace: Add multi direct register/unregister interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now when we have *direct_multi interface the direct_functions
hash is no longer owned just by direct_ops. It's also used by
any other ftrace_ops passed to *direct_multi interface.
Thus to find out that we are unregistering the last function
from direct_ops, we need to check directly direct_ops's hash.
Link: https://lkml.kernel.org/r/20211206182032.87248-2-jolsa@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Fixes: f64dd4627e ("ftrace: Add multi direct register/unregister interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We can't call unregister_ftrace_function under ftrace_lock.
Link: https://lkml.kernel.org/r/20211109114217.1645296-1-jolsa@kernel.org
Fixes: ed29271894 ("ftrace/direct: Do not disable when switching direct callers")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge more updates from Andrew Morton:
"87 patches.
Subsystems affected by this patch series: mm (pagecache and hugetlb),
procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
sysvfs, kcov, gdb, resource, selftests, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
selftests/kselftest/runner/run_one(): allow running non-executable files
virtio-mem: disallow mapping virtio-mem memory via /dev/mem
kernel/resource: disallow access to exclusive system RAM regions
kernel/resource: clean up and optimize iomem_is_exclusive()
scripts/gdb: handle split debug for vmlinux
kcov: replace local_irq_save() with a local_lock_t
kcov: avoid enable+disable interrupts if !in_task()
kcov: allocate per-CPU memory on the relevant node
Documentation/kcov: define `ip' in the example
Documentation/kcov: include types.h in the example
sysv: use BUILD_BUG_ON instead of runtime check
kernel/fork.c: unshare(): use swap() to make code cleaner
seq_file: fix passing wrong private data
seq_file: move seq_escape() to a header
signal: remove duplicate include in signal.h
crash_dump: remove duplicate include in crash_dump.h
crash_dump: fix boolreturn.cocci warning
hfs/hfsplus: use WARN_ON for sanity check
...
Move core_kernel_data() into sections.h and rename it to
is_kernel_core_data(), also make it return bool value, then update all the
callers.
Link: https://lkml.kernel.org/r/20210930071143.63410-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- kprobes: Restructured stack unwinder to show properly on x86 when a stack
dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only denying
others). There's been pressure to allow non root to tracefs in a
controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function tracer
instead of having its own trampoline (this change will happen on an arch
by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform calculations
against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent warnings
from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over if
branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYYBdxhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qp1sAQD2oYFwaG3sx872gj/myBcHIBSKdiki
Hry5csd8zYDBpgD+Poylopt5JIbeDuoYw/BedgEXmscZ8Qr7VzjAXdnv/Q4=
=Loz8
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a
stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only
denying others). There's been pressure to allow non root to tracefs
in a controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function
tracer instead of having its own trampoline (this change will happen
on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform
calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent
warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over
if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits)
tracing/histogram: Fix semicolon.cocci warnings
tracing/histogram: Fix documentation inline emphasis warning
tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together
tracing: Show size of requested perf buffer
bootconfig: Initialize ret in xbc_parse_tree()
ftrace: do CPU checking after preemption disabled
ftrace: disable preemption when recursion locked
tracing/histogram: Document expression arithmetic and constants
tracing/histogram: Optimize division by a power of 2
tracing/histogram: Covert expr to const if both operands are constants
tracing/histogram: Simplify handling of .sym-offset in expressions
tracing: Fix operator precedence for hist triggers expression
tracing: Add division and multiplication support for hist triggers
tracing: Add support for creating hist trigger variables from literal
selftests/ftrace: Stop tracing while reading the trace file by default
MAINTAINERS: Update KPROBES and TRACING entries
test_kprobes: Move it from kernel/ to lib/
docs, kprobes: Remove invalid URL and add new reference
samples/kretprobes: Fix return value if register_kretprobe() failed
lib/bootconfig: Fix the xbc_get_info kerneldoc
...
Some functions had kernel-doc that used a comma instead of a hash to
separate the function name from the one line description.
Also, the "ftrace_is_dead()" had an incomplete description.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As the documentation explained, ftrace_test_recursion_trylock()
and ftrace_test_recursion_unlock() were supposed to disable and
enable preemption properly, however currently this work is done
outside of the function, which could be missing by mistake.
And since the internal using of trace_test_and_set_recursion()
and trace_clear_recursion() also require preemption disabled, we
can just merge the logical.
This patch will make sure the preemption has been disabled when
trace_test_and_set_recursion() return bit >= 0, and
trace_clear_recursion() will enable the preemption if previously
enabled.
Link: https://lkml.kernel.org/r/13bde807-779c-aa4c-0672-20515ae365ea@linux.alibaba.com
CC: Petr Mladek <pmladek@suse.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Miroslav Benes <mbenes@suse.cz>
Reported-by: Abaci <abaci@linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
[ Removed extra line in comment - SDR ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This symbol is not used outside of ftrace.c, so marks it static.
Fixes the following sparse warning:
kernel/trace/ftrace.c:579:5: warning: symbol 'ftrace_profile_pages_init'
was not declared. Should it be static?
Link: https://lkml.kernel.org/r/1634640534-18280-1-git-send-email-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: cafb168a1c ("tracing: make the function profiler per cpu")
Signed-off-by: chongjiapeng <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently to switch a set of "multi" direct trampolines from one
trampoline to another, a full shutdown of the current set needs to be
done, followed by an update to what trampoline the direct callers would
call, and then re-enabling the callers. This leaves a time when the
functions will not be calling anything, and events may be missed.
Instead, use a trick to allow all the functions with direct trampolines
attached will always call either the new or old trampoline while the
switch is happening. To do this, first attach a "dummy" callback via
ftrace to all the functions that the current direct trampoline is attached
to. This will cause the functions to call the "list func" instead of the
direct trampoline. The list function will call the direct trampoline
"helper" that will set the function it should call as it returns back to
the ftrace trampoline.
At this moment, the direct caller descriptor can safely update the direct
call trampoline. The list function will pick either the new or old
function (depending on the memory coherency model of the architecture).
Now removing the dummy function from each of the locations of the direct
trampoline caller, will put back the direct call, but now to the new
trampoline.
A better visual is:
[ Changing direct call from my_direct_1 to my_direct_2 ]
<traced_func>:
call my_direct_1
||||||||||||||||||||
vvvvvvvvvvvvvvvvvvvv
<traced_func>:
call ftrace_caller
<ftrace_caller>:
[..]
call ftrace_ops_list_func
ftrace_ops_list_func()
{
ops->func() -> direct_helper -> set rax to my_direct_1 or my_direct_2
}
call rax (to either my_direct_1 or my_direct_2
||||||||||||||||||||
vvvvvvvvvvvvvvvvvvvv
<traced_func>:
call my_direct_2
Link: https://lore.kernel.org/all/20211014162819.5c85618b@gandalf.local.home/
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Adding interface to modify registered direct function
for ftrace_ops. Adding following function:
modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
The function changes the currently registered direct
function for all attached functions.
Link: https://lkml.kernel.org/r/20211008091336.33616-8-jolsa@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Adding interface to register multiple direct functions
within single call. Adding following functions:
register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
unregister_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
The register_ftrace_direct_multi registers direct function (addr)
with all functions in ops filter. The ops filter can be updated
before with ftrace_set_filter_ip calls.
All requested functions must not have direct function currently
registered, otherwise register_ftrace_direct_multi will fail.
The unregister_ftrace_direct_multi unregisters ops related direct
functions.
Link: https://lkml.kernel.org/r/20211008091336.33616-7-jolsa@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Factor out the code that adds (ip, addr) tuple to direct_functions
hash in new ftrace_add_rec_direct function. It will be used in
following patches.
Link: https://lkml.kernel.org/r/20211008091336.33616-6-jolsa@kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In an effort to enable -Wcast-function-type in the top-level Makefile to
support Control Flow Integrity builds, all function casts need to be
removed.
This means that ftrace_ops_list_func() can no longer be defined as
ftrace_ops_no_ops(). The reason for ftrace_ops_no_ops() is to use that when
an architecture calls ftrace_ops_list_func() with only two parameters
(called from assembly). And to make sure there's no C side-effects, those
archs call ftrace_ops_no_ops() which only has two parameters, as
ftrace_ops_list_func() has four parameters.
Instead of a typecast, use vmlinux.lds.h to define ftrace_ops_list_func() to
arch_ftrace_ops_list_func() that will define the proper set of parameters.
Link: https://lore.kernel.org/r/20200614070154.6039-1-oscar.carter@gmx.com
Link: https://lkml.kernel.org/r/20200617165616.52241bde@oasis.local.home
Link: https://lore.kernel.org/all/20211005053922.GA702049@embeddedor/
Requested-by: Oscar Carter <oscar.carter@gmx.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
While writing an email explaining the "bit = 0" logic for a discussion on
making ftrace_test_recursion_trylock() disable preemption, I discovered a
path that makes the "not do the logic if bit is zero" unsafe.
The recursion logic is done in hot paths like the function tracer. Thus,
any code executed causes noticeable overhead. Thus, tricks are done to try
to limit the amount of code executed. This included the recursion testing
logic.
Having recursion testing is important, as there are many paths that can
end up in an infinite recursion cycle when tracing every function in the
kernel. Thus protection is needed to prevent that from happening.
Because it is OK to recurse due to different running context levels (e.g.
an interrupt preempts a trace, and then a trace occurs in the interrupt
handler), a set of bits are used to know which context one is in (normal,
softirq, irq and NMI). If a recursion occurs in the same level, it is
prevented*.
Then there are infrastructure levels of recursion as well. When more than
one callback is attached to the same function to trace, it calls a loop
function to iterate over all the callbacks. Both the callbacks and the
loop function have recursion protection. The callbacks use the
"ftrace_test_recursion_trylock()" which has a "function" set of context
bits to test, and the loop function calls the internal
trace_test_and_set_recursion() directly, with an "internal" set of bits.
If an architecture does not implement all the features supported by ftrace
then the callbacks are never called directly, and the loop function is
called instead, which will implement the features of ftrace.
Since both the loop function and the callbacks do recursion protection, it
was seemed unnecessary to do it in both locations. Thus, a trick was made
to have the internal set of recursion bits at a more significant bit
location than the function bits. Then, if any of the higher bits were set,
the logic of the function bits could be skipped, as any new recursion
would first have to go through the loop function.
This is true for architectures that do not support all the ftrace
features, because all functions being traced must first go through the
loop function before going to the callbacks. But this is not true for
architectures that support all the ftrace features. That's because the
loop function could be called due to two callbacks attached to the same
function, but then a recursion function inside the callback could be
called that does not share any other callback, and it will be called
directly.
i.e.
traced_function_1: [ more than one callback tracing it ]
call loop_func
loop_func:
trace_recursion set internal bit
call callback
callback:
trace_recursion [ skipped because internal bit is set, return 0 ]
call traced_function_2
traced_function_2: [ only traced by above callback ]
call callback
callback:
trace_recursion [ skipped because internal bit is set, return 0 ]
call traced_function_2
[ wash, rinse, repeat, BOOM! out of shampoo! ]
Thus, the "bit == 0 skip" trick is not safe, unless the loop function is
call for all functions.
Since we want to encourage architectures to implement all ftrace features,
having them slow down due to this extra logic may encourage the
maintainers to update to the latest ftrace features. And because this
logic is only safe for them, remove it completely.
[*] There is on layer of recursion that is allowed, and that is to allow
for the transition between interrupt context (normal -> softirq ->
irq -> NMI), because a trace may occur before the context update is
visible to the trace recursion logic.
Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
Cc: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: stable@vger.kernel.org
Fixes: edc15cafcb ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Most of ARCHs use empty ftrace_dyn_arch_init(), introduce a weak common
ftrace_dyn_arch_init() to cleanup them.
Link: https://lkml.kernel.org/r/20210909090216.1955240-1-o451686892@gmail.com
Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390)
Acked-by: Helge Deller <deller@gmx.de> (parisc)
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When building the files in the tracefs file system, do not by default set
any permissions for OTH (other). This will make it easier for admins who
want to define a group for accessing tracefs and not having to first
disable all the permission bits for "other" in the file system.
As tracing can leak sensitive information, it should never by default
allowing all users access. An admin can still set the permission bits for
others to have access, which may be useful for creating a honeypot and
seeing who takes advantage of it and roots the machine.
Link: https://lkml.kernel.org/r/20210818153038.864149276@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of having the logic that does trace_pid_list open coded, wrap it in
abstract functions. This will allow a rewrite of the logic that implements
the trace_pid_list without affecting the users.
Note, this causes a change in behavior. Every time a pid is written into
the set_*_pid file, it creates a new list and uses RCU to update it. If
pid_max is lowered, but there was a pid currently in the list that was
higher than pid_max, those pids will now be removed on updating the list.
The old behavior kept that from happening.
The rewrite of the pid_list logic will no longer depend on pid_max,
and will return the old behavior.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Implementing live patching on s390 requires each function's prologue to
contain a very special kind of nop, which gcc and clang don't generate.
However, the current code assumes that if CC_USING_NOP_MCOUNT is
defined, then whatever the compiler generates is good enough.
Move the CC_USING_NOP_MCOUNT check into the new ftrace_need_init_nop()
macro, that the architectures can override.
An alternative solution is to disable using -mnop-mcount in the
Makefile, however, this makes the build logic (even) more complicated
and forces the arch-specific code to deal with the useless __fentry__
symbol.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210728212546.128248-2-iii@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The variable ret is being initialized with a value that is never
read, it is being updated later on. The assignment is redundant and
can be removed.
Link: https://lkml.kernel.org/r/20210721120915.122278-1-colin.king@canonical.com
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
synchronize_rcu_tasks_rude() triggers IPIs and forces rescheduling on
all CPUs. It is a costly operation and, when targeting nohz_full CPUs,
very disrupting (hence the name). So avoid calling it when 'old_hash'
doesn't need to be freed.
Link: https://lkml.kernel.org/r/20210721114726.1545103-1-nsaenzju@redhat.com
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Using list_move() instead of list_del() + list_add().
Link: https://lkml.kernel.org/r/20210608031108.2820996-1-libaokun1@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
It was reported that a bug on arm64 caused a bad ip address to be used for
updating into a nop in ftrace_init(), but the error path (rightfully)
returned -EINVAL and not -EFAULT, as the bug caused more than one error to
occur. But because -EINVAL was returned, the ftrace_bug() tried to report
what was at the location of the ip address, and read it directly. This
caused the machine to panic, as the ip was not pointing to a valid memory
address.
Instead, read the ip address with copy_from_kernel_nofault() to safely
access the memory, and if it faults, report that the address faulted,
otherwise report what was in that location.
Link: https://lore.kernel.org/lkml/20210607032329.28671-1-mark-pk.tsai@mediatek.com/
Cc: stable@vger.kernel.org
Fixes: 05736a427f ("ftrace: warn on failure to disable mcount callers")
Reported-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that there's a library that accesses the tracefs file system,
(libtracefs), the way the files are interacted with is slightly
different than the command line. For instance, the write() system
call is used directly instead of an echo. This exposes some old bugs.
If a probe is written to "set_ftrace_filter" without any white space
after it, it will be ignored. This is because the write expects
that a string written to it that does not end with white spaces thinks
there is more to come. But if the file is closed, the release function
needs to finish it. The "set_ftrace_filter" release function handles
the filter part of the "set_ftrace_filter" file, but did not handle
the probe part.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYJP4OBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quzjAQCoFQXkJtYhwlMk0dTxclrsQlm0t93H
pHwJA9Zyxe25UgD8D/rpG/wtHaSSuP6omEDbqvshpNdszqKb0Nt+UM116QU=
=niJ6
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix probes written to the set_ftrace_filter file
Now that there's a library that accesses the tracefs file system
(libtracefs), the way the files are interacted with is slightly
different than the command line. For instance, the write() system call
is used directly instead of an echo. This exposes some old bugs.
If a probe is written to "set_ftrace_filter" without any white space
after it, it will be ignored. This is because the write expects that a
string written to it that does not end with white spaces thinks there
is more to come. But if the file is closed, the release function needs
to finish it. The "set_ftrace_filter" release function handles the
filter part of the "set_ftrace_filter" file, but did not handle the
probe part"
* tag 'trace-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Handle commands when closing set_ftrace_filter file
# echo switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
will cause switch_mm to stop tracing by the traceoff command.
# echo -n switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
does nothing.
The reason is that the parsing in the write function only processes
commands if it finished parsing (there is white space written after the
command). That's to handle:
write(fd, "switch_mm:", 10);
write(fd, "traceoff", 8);
cases, where the command is broken over multiple writes.
The problem is if the file descriptor is closed, then the write call is
not processed, and the command needs to be processed in the release code.
The release code can handle matching of functions, but does not handle
commands.
Cc: stable@vger.kernel.org
Fixes: eda1e32855 ("tracing: handle broken names in ftrace filter")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
New feature:
The "func-no-repeats" option in tracefs/options directory. When set
the function tracer will detect if the current function being traced
is the same as the previous one, and instead of recording it, it will
keep track of the number of times that the function is repeated in a row.
And when another function is recorded, it will write a new event that
shows the function that repeated, the number of times it repeated and
the time stamp of when the last repeated function occurred.
Enhancements:
In order to implement the above "func-no-repeats" option, the ring
buffer timestamp can now give the accurate timestamp of the event
as it is being recorded, instead of having to record an absolute
timestamp for all events. This helps the histogram code which no longer
needs to waste ring buffer space.
New validation logic to make sure all trace events that access
dereferenced pointers do so in a safe way, and will warn otherwise.
Fixes:
No longer limit the PIDs of tasks that are recorded for "saved_cmdlines"
to PID_MAX_DEFAULT (32768), as systemd now allows for a much larger
range. This caused the mapping of PIDs to the task names to be dropped
for all tasks with a PID greater than 32768.
Change trace_clock_global() to never block. This caused a deadlock.
Clean ups:
Typos, prototype fixes, and removing of duplicate or unused code.
Better management of ftrace_page allocations.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYI/1vBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiL0AP9EemIC5TDh2oihqLRNeUjdTu0ryEoM
HRFqxozSF985twD/bfkt86KQC8rLHwxTbxQZ863bmdaC6cMGFhWiF+H/MAs=
=psYt
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New feature:
- A new "func-no-repeats" option in tracefs/options directory.
When set the function tracer will detect if the current function
being traced is the same as the previous one, and instead of
recording it, it will keep track of the number of times that the
function is repeated in a row. And when another function is
recorded, it will write a new event that shows the function that
repeated, the number of times it repeated and the time stamp of
when the last repeated function occurred.
Enhancements:
- In order to implement the above "func-no-repeats" option, the ring
buffer timestamp can now give the accurate timestamp of the event
as it is being recorded, instead of having to record an absolute
timestamp for all events. This helps the histogram code which no
longer needs to waste ring buffer space.
- New validation logic to make sure all trace events that access
dereferenced pointers do so in a safe way, and will warn otherwise.
Fixes:
- No longer limit the PIDs of tasks that are recorded for
"saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows
for a much larger range. This caused the mapping of PIDs to the
task names to be dropped for all tasks with a PID greater than
32768.
- Change trace_clock_global() to never block. This caused a deadlock.
Clean ups:
- Typos, prototype fixes, and removing of duplicate or unused code.
- Better management of ftrace_page allocations"
* tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits)
tracing: Restructure trace_clock_global() to never block
tracing: Map all PIDs to command lines
ftrace: Reuse the output of the function tracer for func_repeats
tracing: Add "func_no_repeats" option for function tracing
tracing: Unify the logic for function tracing options
tracing: Add method for recording "func_repeats" events
tracing: Add "last_func_repeats" to struct trace_array
tracing: Define new ftrace event "func_repeats"
tracing: Define static void trace_print_time()
ftrace: Simplify the calculation of page number for ftrace_page->records some more
ftrace: Store the order of pages allocated in ftrace_page
tracing: Remove unused argument from "ring_buffer_time_stamp()
tracing: Remove duplicate struct declaration in trace_events.h
tracing: Update create_system_filter() kernel-doc comment
tracing: A minor cleanup for create_system_filter()
kernel: trace: Mundane typo fixes in the file trace_events_filter.c
tracing: Fix various typos in comments
scripts/recordmcount.pl: Make vim and emacs indent the same
scripts/recordmcount.pl: Make indent spacing consistent
tracing: Add a verifier to check string pointers for trace events
...
Commit b40c6eabfc ("ftrace: Simplify the calculation of page number for
ftrace_page->records") simplified the calculation of the number of pages
needed for each page group without having any empty pages, but it can be
simplified even further.
Link: https://lore.kernel.org/lkml/CAHk-=wjt9b7kxQ2J=aDNKbR1QBMB3Hiqb_hYcZbKsxGRSEb+gQ@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of saving the size of the records field of the ftrace_page, store
the order it uses to allocate the pages, as that is what is needed to know
in order to free the pages. This simplifies the code.
Link: https://lore.kernel.org/lkml/CAHk-=whyMxheOqXAORt9a7JK9gc9eHTgCJ55Pgs4p=X3RrQubQ@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ change log written by Steven Rostedt ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Needed to merge trace/ftrace/urgent to get:
Commit 59300b36f8 ("ftrace: Check if pages were allocated before calling free_pages()")
To clean up the code that is affected by it as well.
The function addresses that are traced by ftrace are stored in pages,
and the size is held in a variable. If there's some error in creating
them, the allocate ones will be freed. In this case, it is possible that
the order of pages to be freed may end up being negative due to a size of
zero passed to get_count_order(), and then that negative number will cause
free_pages() to free a very large section. Make sure that does not happen.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYGR30BQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qnbDAP9yEhTLcDRUi3VLWnEq19Dt4Lsg86Bf
QRpbWG6Ze9EbZQEAgYAOe1fsNCNEIMXXh/4nlKVpKKH+vviS0ux9Z6uhpQQ=
=Veyq
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Add check of order < 0 before calling free_pages()
The function addresses that are traced by ftrace are stored in pages,
and the size is held in a variable. If there's some error in creating
them, the allocate ones will be freed. In this case, it is possible
that the order of pages to be freed may end up being negative due to a
size of zero passed to get_count_order(), and then that negative
number will cause free_pages() to free a very large section.
Make sure that does not happen"
* tag 'trace-v5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Check if pages were allocated before calling free_pages()
It is possible that on error pg->size can be zero when getting its order,
which would return a -1 value. It is dangerous to pass in an order of -1
to free_pages(). Check if order is greater than or equal to zero before
calling free_pages().
Link: https://lore.kernel.org/lkml/20210330093916.432697c7@gandalf.local.home/
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The following sequence of commands:
register_ftrace_direct(ip, addr1);
modify_ftrace_direct(ip, addr1, addr2);
unregister_ftrace_direct(ip, addr2);
will cause the kernel to warn:
[ 30.179191] WARNING: CPU: 2 PID: 1961 at kernel/trace/ftrace.c:5223 unregister_ftrace_direct+0x130/0x150
[ 30.180556] CPU: 2 PID: 1961 Comm: test_progs W O 5.12.0-rc2-00378-g86bc10a0a711-dirty #3246
[ 30.182453] RIP: 0010:unregister_ftrace_direct+0x130/0x150
When modify_ftrace_direct() changes the addr from old to new it should update
the addr stored in ftrace_direct_funcs. Otherwise the final
unregister_ftrace_direct() won't find the address and will cause the splat.
Fixes: 0567d68091 ("ftrace: Add modify_ftrace_direct()")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20210316195815.34714-1-alexei.starovoitov@gmail.com