Core code:
- Cure a couple of incorrectness issues in the posix CPU timer code to
prevent that the tick dependency for NOHZ full is kept alive for no
reason.
- Avoid expensive double reprogramming of the clockevent device in
hrtimer_start_range_ns().
- Avoid pointless SMP function calls when the clock was set to avoid
disturbing CPUs which do not have any affected timers queued.
- Make the clocksource watchdog test work correctly when CONFIG_HZ is
less than 100.
Drivers:
- Prefer the ARM architected timer over the Exynos timer which is way
more expensive to access.
- Add device tree bindings for new Ingenic SoCs
- The usual improvements and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnxcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZAmEAC0R5+9RkWAOpx0JWC7dQxFuIZoUZD1
8Inqs0ZFLX/LNrkjbBmZ/0XFvoU38+Eqd4Gqy3I748TgzMcB/0NHUUnXaugJE35J
UzqdHkzhSReVinDHgRcIGMxkdj+JGomhlOM7RjTcdTixUWBfi1iXJBsit/tIWYq9
R9r/cthpEQzFu7BCZCdAsgRBaLNinCH9cCP0qXNS3hDwHFtziszjBhIrElAaEK4J
IqrW/tsoxOlX/yQ/5TI8+E9DKgrr4pf+uNVjMJC0QBDodJrXRrkez9lFg6zJJggw
adtzSFgL/OrbwsuEKAREpkSTwWSdQGdtLy2i9fx16jmh78YiTilHYe2A1ZD5v3zd
dxfHUexnsgXcn4Im9w+sxLGxf2RQ6SsfVgd+R0lyOKLBFltnmbQWcvVl/6ZUa4Cc
je+yuh9DTr0ksUiCnm0spP8AMtsSaHKJUP+MqHbXgo83AutVIFXr4zyQimM1lkY1
TPeXtbKlhd76jDgdcKU85tiyLsJrIImEJDzLtOEvSAk37yO6S9PHzLUMbg9Yp/vp
Li4aUaMNmytCJTvKeSu6Wivzmyxqf4zSJus/fWLIJJjg/NSlNNhFDc0vkKPzaxM7
mg2VIcPrQKzQ1ZAap1kTdj1JNDWuANlV6pjE+zwaJbMtdHhvELyWnqC6EA05MsVj
Txw2VVFaQcqKXA==
=eP5w
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for timekeeping, timers and related drivers:
Core code:
- Cure a couple of correctness issues in the posix CPU timer code to
prevent that the tick dependency for NOHZ full is kept alive for no
reason.
- Avoid expensive double reprogramming of the clockevent device in
hrtimer_start_range_ns().
- Avoid pointless SMP function calls when the clock was set to avoid
disturbing CPUs which do not have any affected timers queued.
- Make the clocksource watchdog test work correctly when CONFIG_HZ is
less than 100.
Drivers:
- Prefer the ARM architected timer over the Exynos timer which is way
more expensive to access.
- Add device tree bindings for new Ingenic SoCs
- The usual improvements and cleanups all over the place"
* tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
clocksource: Make clocksource watchdog test safe for slow-HZ systems
dt-bindings: timer: Add ABIs for new Ingenic SoCs
clocksource/drivers/fttmr010: Pass around less pointers
clocksource/drivers/mediatek: Optimize systimer irq clear flow on shutdown
clocksource/drivers/ingenic: Use bitfield macro helpers
clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel
dt-bindings: timer: convert rockchip,rk-timer.txt to YAML
clocksource/drivers/exynos_mct: Mark MCT device as CLOCK_EVT_FEAT_PERCPU
clocksource/drivers/exynos_mct: Prioritise Arm arch timer on arm64
hrtimer: Unbreak hrtimer_force_reprogram()
hrtimer: Use raw_cpu_ptr() in clock_was_set()
hrtimer: Avoid more SMP function calls in clock_was_set()
hrtimer: Avoid unnecessary SMP function calls in clock_was_set()
hrtimer: Add bases argument to clock_was_set()
time/timekeeping: Avoid invoking clock_was_set() twice
timekeeping: Distangle resume and clock-was-set events
timerfd: Provide timerfd_resume()
hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case
hrtimer: Ensure timerfd notification for HIGHRES=n
hrtimer: Consolidate reprogramming code
...
- The biggest change in this cycle is scheduler support for asymmetric
scheduling affinity, to support the execution of legacy 32-bit tasks on
AArch32 systems that also have 64-bit-only CPUs.
Architectures can fill in this functionality by defining their
own task_cpu_possible_mask(p). When this is done, the scheduler will
make sure the task will only be scheduled on CPUs that support it.
(The actual arm64 specific changes are not part of this tree.)
For other architectures there will be no change in functionality.
- Add cgroup SCHED_IDLE support
- Increase node-distance flexibility & delay determining it until a CPU
is brought online. (This enables platforms where node distance isn't
final until the CPU is only.)
- Deadline scheduler enhancements & fixes
- Misc fixes & cleanups.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmEsrDgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gMxBAAmzXPnDm1pDBBUaEwc+DynNGHNxZcBO5E
CaNyfywp4GMA+OC3JzUgDg1B9uvKQRdBGtv6SZ8OcyhJMfmkEvjt5/wYUrcdtQVP
TA2lt80/Is8LQMnvcz7X0gmsLt+fXWQTF8ik1KT4wsi/k03Xw8BH11zHct6sV2QN
NNQ+7BEjqU1HA1UXJFiaoGtWF0gdh29VyE5dSzfAis79L0XUQadS512LJKin/AK0
wYz8E+L7QIrjhfX9FQdOrR6da4TK6jAXyEY6a9dpaMHnFdtxuwhT4/BPtovNTeeY
yxEZm3qSZbpghWHsMEa6Z4GIeLE6aNi3wcHt10fgdZDdotSRsNZuF6gi4A8nhRC+
6wm+fCcFGEIBCL6eE/16Wms6YMdFfuiEAgtJGNy7GGyfH3/mS6u8eylXbLZncYXn
DFHY+xUvmVZSzoPzcnYXEy4FB3kywNL7WBFxyhdXf5/EvWmmtHi4K3jVQ8jaqvhL
MDk3NX9Hd0ariff3zUltWhMY5ouj6bIbBZmWWnD3s1xQT68VvE563cq0qH15dlnr
j5M71eNRWvoOdZKzflgjRZzmdQtsZQ51tiMA6W6ZRfwYkHjb70qiia0r5GFf41X1
MYelmcaA8+RjKrQ5etxzzDjoXl0xDXiZric6gRQHjG1Y1Zm2rVaoD+vkJGD5TQJ0
2XTOGQgAxh4=
=VdGE
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- The biggest change in this cycle is scheduler support for asymmetric
scheduling affinity, to support the execution of legacy 32-bit tasks
on AArch32 systems that also have 64-bit-only CPUs.
Architectures can fill in this functionality by defining their own
task_cpu_possible_mask(p). When this is done, the scheduler will make
sure the task will only be scheduled on CPUs that support it.
(The actual arm64 specific changes are not part of this tree.)
For other architectures there will be no change in functionality.
- Add cgroup SCHED_IDLE support
- Increase node-distance flexibility & delay determining it until a CPU
is brought online. (This enables platforms where node distance isn't
final until the CPU is only.)
- Deadline scheduler enhancements & fixes
- Misc fixes & cleanups.
* tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
eventfd: Make signal recursion protection a task bit
sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case
sched: Introduce dl_task_check_affinity() to check proposed affinity
sched: Allow task CPU affinity to be restricted on asymmetric systems
sched: Split the guts of sched_setaffinity() into a helper function
sched: Introduce task_struct::user_cpus_ptr to track requested affinity
sched: Reject CPU affinity changes based on task_cpu_possible_mask()
cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
sched: Cgroup SCHED_IDLE support
sched/topology: Skip updating masks for non-online nodes
sched: Replace deprecated CPU-hotplug functions.
sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
sched: Fix UCLAMP_FLAG_IDLE setting
sched/deadline: Fix missing clock update in migrate_task_rq_dl()
sched/fair: Avoid a second scan of target in select_idle_cpu
sched/fair: Use prev instead of new target as recent_used_cpu
sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
...
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmEo3WcTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFbnuD/9Vj3dnXRgZ9LHFuQHp5Vy5yaGfujwP
kIMN7bJiL0pCZ1zHapyn0cidZuTScDK2qMzGVR9Wrj/uYHEI+T08IEfaq/2CU3pS
PdygHgqQi+WtJj4dks1OphVdlF0+46B/zy2YY0LE39zFUDO38VUIgb/gTiQ8JHIr
BaDGtohlXmK8bDXo2XqMunSp7uQoRX92mv+bPjvIpsM60RnhPzhX0HeO/xhXO1PP
OpswQG3nOTX1alZOXuSKdH7e3zfO+pLnC5NYeo5R4rys5xRiShU19OSAJfiPwf2w
06OX7uLT7aF5LPQPjOP4MA9dHPWWeIuYnKzUpQZZ+D+zjuaxhF62saJG1Um0yQns
qqPZgTWhxzkJ16bj94T2UZW0bufFKM5cEfBPFw9ZyzQshX5jIE2V5ecV+F/AdtTj
EA55m/5N8NZMsNgnEOqn6o8TkWoV1seHKJXqja0+ZY9/Xs8GkmjHLhjNBCwGZIbw
8S64Szp+yex3m1mRS+L6T0Gudic5WdFxOKQAzvvJcmT6u+ZGBzU0FjNTNNdnu3tp
f9mANYxT1vNfYplRuNhMCp9oioSRpO2vfDQLCBRtsCy0QlKCum6qb7E/BcWRlwn1
SOAFmjeXpr+VXGX+Rjp/bzCdHlNOVbR4ODWJ+h5D7QZaQwQ6FafkEq1D2yjPR0OR
i722ECoM++vZ4w==
=CfcC
-----END PGP SIGNATURE-----
Merge tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"This starts with a couple of fixes for potential deadlocks in the
fowner/fasync handling.
The next patch removes the old mandatory locking code from the kernel
altogether.
The last patch cleans up rw_verify_area a bit more after the mandatory
locking removal"
* tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: clean up after mandatory file locking support removal
fs: remove mandatory file locking support
fcntl: fix potential deadlock for &fasync_struct.fa_lock
fcntl: fix potential deadlocks for &fown_struct.lock
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmTZcACgkQnJ2qBz9k
QNkkmAgArW6XoF1CePds/ZaC9vfg/nk66/zVo0n+J8xXjMWAPxcKbWFfV0uWVixq
yk4lcLV47a2Mu/B/1oLNd3vrSmhwU+srWqNwOFn1nv+lP/6wJqr8oztRHn/0L9Q3
ZSRrukSejbQ6AvTL/WzTNnCjjCc2ne3Kyko6W41aU6uyJuzhSM32wbx7qlV6t54Z
iint9OrB4gM0avLohNafTUq6I+tEGzBMNwpCG/tqCmkcvDcv3rTDVAnPSCTm0Tx2
hdrYDcY/rLxo93pDBaW1rYA/fohR+mIVye6k2TjkPAL6T1x+rxeT5qnc+YijH5yF
sFPDhlD+ZsfOLi8stWXLOJ+8+gLODg==
=pDBR
-----END PGP SIGNATURE-----
Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fs hole punching vs cache filling race fixes from Jan Kara:
"Fix races leading to possible data corruption or stale data exposure
in multiple filesystems when hole punching races with operations such
as readahead.
This is the series I was sending for the last merge window but with
your objection fixed - now filemap_fault() has been modified to take
invalidate_lock only when we need to create new page in the page cache
and / or bring it uptodate"
* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
filesystems/locking: fix Malformed table warning
cifs: Fix race between hole punch and page fault
ceph: Fix race between hole punch and page fault
fuse: Convert to using invalidate_lock
f2fs: Convert to using invalidate_lock
zonefs: Convert to using invalidate_lock
xfs: Convert double locking of MMAPLOCK to use VFS helpers
xfs: Convert to use invalidate_lock
xfs: Refactor xfs_isilocked()
ext2: Convert to using invalidate_lock
ext4: Convert to use mapping->invalidate_lock
mm: Add functions to lock invalidate_lock for two mappings
mm: Protect operations adding pages to page cache with invalidate_lock
documentation: Sync file_operations members with reality
mm: Fix comments mentioning i_mutex
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmPKcACgkQnJ2qBz9k
QNk6RwgA2GJlJ6g2I/iT64ii4l8nWVEyp1OI/m2DvEaoci+SLp/oIh/VbEMsdj7D
Fvoajo6KRu53/Ucwg+u9i5tygpCcu5X52DB1qZWzNZKxl98EdXRTb9e6n3jNDBti
dz4EAd/YZhADiS5sX9kZLbSE1Kzirx8ULsftZ3T31rkU9n1pJD6tH7qyHDot3oV+
XnONRBb2coJ5E7qeRqa8T35uG9a4JGgnDICg3qISPXn+ZNX4u61vXQu/r2YzTGn5
0b6G3X8YTCLti4w14/eGm9bMw3J5TLqRDg/d73B34ju6xcKPP83j/UmHFyxpNnGD
mg6mi8PyeEiwPbglSdVBaFQoOroK9Q==
=Hket
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF and isofs updates from Jan Kara:
"Several smaller fixes and cleanups in UDF and isofs"
* tag 'fs_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf_get_extendedattr() had no boundary checks.
isofs: joliet: Fix iocharset=utf8 mount option
udf: Fix iocharset=utf8 mount option
udf: Get rid of 0-length arrays in struct fileIdentDesc
udf: Get rid of 0-length arrays
udf: Remove unused declaration
udf: Check LVID earlier
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmO6kACgkQnJ2qBz9k
QNmP4wf+N4n3pDgCfEBPTFBMzghIqLfFSdRx713cj7u6Bi7qwZB90ChdgGC/J2lh
KVh9KvzneSqo4CoaYYLECDkM9cLGZuX0YEWX7L0+LivEoIVuoFK0OKDcyptKl5FK
fKi5tK287NlgXh9PqxuPlYUsaxQKh7PSqO4+9IgHFX/P7NMbxdRiGg8Nb+rp9XhP
IDzfR/qNqlyjYHJUae5GuE2hsbX8lFAkSnmQiz8ya6c3IZdc8GmaVL1jJI+ynre/
CMHysDV2QPS5UcMjO2YnoG3Osl18TOTsutnCDPzYI2qr4UGB+4RxfCgEVVdc6F3v
p4VfqfJB09sHpkRBrN9Lz+MgxxkVkQ==
=CRIk
-----END PGP SIGNATURE-----
Merge tag 'fiemap_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull FIEMAP cleanups from Jan Kara:
"FIEMAP cleanups from Christoph transitioning all remaining filesystems
supporting FIEMAP (ext2, hpfs) to iomap API and removing the old
helper"
* tag 'fiemap_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs: remove generic_block_fiemap
hpfs: use iomap_fiemap to implement ->fiemap
ext2: use iomap_fiemap to implement ->fiemap
ext2: make ext2_iomap_ops available unconditionally
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmOoUACgkQnJ2qBz9k
QNm8XggA6cbU79idj2HgDG5Wj4X3cd1+NlpPP2HytvTomwqYHJmHPAg0b44AsfHT
ToHeYhIqb78MbLHn8fl4g5+O01raqihbUyIcRNcgdeQcle7Phuv/hDlsr8EOhfea
mec4vottwaoiNn+4o7ciUAIri0stbVL8VPpnOy0X7iCmX7OC2vUvhQQGwvyfmXZm
+F/8jppfvfsxl73SbWhG+sgD83D4ASZ+BGwC1Fx9tHitOVTJzg+CR2C1J4seaK6E
r8+u/K6YKSrCGgA2dOVgzYyp2vHPGbHS4Z3H1HDXS+mtqxHJYqPR54kRT1BqCEQX
Nm4CP7+u6a+86HLpBZi1NhvYrgLTig==
=LJaZ
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"fsnotify speedups when notification actually isn't used and support
for identifying processes which caused fanotify events through pidfd
instead of normal pid"
* tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: optimize the case of no marks of any type
fsnotify: count all objects with attached connectors
fsnotify: count s_fsnotify_inode_refs for attached connectors
fsnotify: replace igrab() with ihold() on attach connector
fanotify: add pidfd support to the fanotify API
fanotify: introduce a generic info record copying helper
fanotify: minor cosmetic adjustments to fid labels
kernel/pid.c: implement additional checks upon pidfd_create() parameters
kernel/pid.c: remove static qualifier from pidfd_create()
The recursion protection for eventfd_signal() is based on a per CPU
variable and relies on the !RT semantics of spin_lock_irqsave() for
protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
disables preemption nor interrupts which allows the spin lock held section
to be preempted. If the preempting task invokes eventfd_signal() as well,
then the recursion warning triggers.
Paolo suggested to protect the per CPU variable with a local lock, but
that's heavyweight and actually not necessary. The goal of this protection
is to prevent the task stack from overflowing, which can be achieved with a
per task recursion protection as well.
Replace the per CPU variable with a per task bit similar to other recursion
protection bits like task_struct::in_page_owner. This works on both !RT and
RT kernels and removes as a side effect the extra per CPU storage.
No functional change for !RT kernels.
Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmEntw0THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi8CFB/4/1LVBiC2P9tqIr0S4rLaCBW91Xm4v
oYbqoEuzzzl9FPYndSka2hq5x8wg+0nBCXiejafYVbIZsvE/UN+C5+H1mCD5NwyO
imXHJ3lqKuZRHrGCkMSM3TJuOijPIU2gqVR+xb0vIfqjr0mU6YgLvvRBcY0QNimQ
gLPoMwFGYwGWSLdcBfnHYSGWzmJk4rE94SSkL9Rg1NjkPslBahOrpA/GwNbltGsU
+jYIWAZwpfgu2SCWPdUdYpA/Rw518WjGjZ9pOuZmFKg8R2mSJ4LVb5wZ4bsi1b5j
CGc4KjNV+koeSMBlBex1EDdVXvqxkviNiWP1jm4FKz/fWcpD5DWv5467
=t1nf
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two memory management fixes for the filesystem"
* tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client:
ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()
ceph: correctly handle releasing an embedded cap flush
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEmQy4ACgkQxWXV+ddt
WDuCRRAAmuO+6Zsl5MSq0hBnpec/VBN6lTi9VPt184BjW1IWsqwR1Ax8dVQEKgCm
gzkGYEuVq2L5p+/ugWKKftAbmUU85Jf3AIsv81SCJQosRkxVXAdbrZOv00yUZy6/
5YOdO+9u61otvtO6LcZz9l+0LcpSmrBwEszluyIS+nArgQyZwX2aZTjcScDJvB9+
1y7Eo6eIbqbcJOf4mLDIJh0bHaiA7HB6jYJkbsnz51wBU2ETATzNzAoyP5ReTPGc
1s0uxrpY37kHcUUTd6q8VLDTM6Ei4vF2zQm0jWcrw0K3hM6yPuH+GiEADoV/xsls
6pbtss1E81rHEQjcK8brf6CxbOak8/WXV0gRia/3avkFteVlax+NJxRdVhksuJln
siGlQqASX3vYdNL0nG+U0ml1Y9C1ZXTXu4lGjS6rtT9oeV+YSccG2UjIT9LEtuON
W/zE4bUMqCddcZFEPH5jNK+ChGS8mmfs+UFFR+W/JzMIO8Uji5/K44FZDFBo0Oc/
3JEgk7ZV4D+8SblBPMxJx0fZqbE8ggKM+IN5CAyscINOWOxrmRiaFFRygRX0TLDB
2uts9owItW6zvaTRY6RclVeCvJ6ARQli4pv7YxZmH85hhtCbn515imWvLWw4+tSg
QwrtDnPVMSJTdzFHvsmeE9lM6Vaw0ur70Ysyd29k/XJu3WwRdkM=
=jN7s
-----END PGP SIGNATURE-----
Merge tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix that I think qualifies for a late merge. It's a revert of
a one-liner fix that meanwhile got backported to stable kernels and we
got reports from users.
The broken fix prevents creating compressed inline extents, which
could be noticeable on space consumption.
Technically it's a regression as the patch was merged in 5.14-rc1 but
got propagated to several stable kernels and has higher exposure than
a 'typical' development cycle bug"
* tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: compression: don't try to compress if we don't have enough pages"
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a664 and 1b6b26ae70 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88 ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kcalloc() is called to allocate memory for m->m_info, and if it fails,
ceph_mdsmap_destroy() behind the label out_err will be called:
ceph_mdsmap_destroy(m);
In ceph_mdsmap_destroy(), m->m_info is dereferenced through:
kfree(m->m_info[i].export_targets);
To fix this possible null-pointer dereference, check m->m_info before the
for loop to free m->m_info[i].export_targets.
[ jlayton: fix up whitespace damage
only kfree(m->m_info) if it's non-NULL ]
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The ceph_cap_flush structures are usually dynamically allocated, but
the ceph_cap_snap has an embedded one.
When force umounting, the client will try to remove all the session
caps. During this, it will free them, but that should not be done
with the ones embedded in a capsnap.
Fix this by adding a new boolean that indicates that the cap flush is
embedded in a capsnap, and skip freeing it if that's set.
At the same time, switch to using list_del_init() when detaching the
i_list and g_list heads. It's possible for a forced umount to remove
these objects but then handle_cap_flushsnap_ack() races in and does the
list_del_init() again, corrupting memory.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52283
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This reverts commit f216562731.
[BUG]
It's no longer possible to create compressed inline extent after commit
f216562731 ("btrfs: compression: don't try to compress if we don't
have enough pages").
[CAUSE]
For compression code, there are several possible reasons we have a range
that needs to be compressed while it's no more than one page.
- Compressed inline write
The data is always smaller than one sector and the test lacks the
condition to properly recognize a non-inline extent.
- Compressed subpage write
For the incoming subpage compressed write support, we require page
alignment of the delalloc range.
And for 64K page size, we can compress just one page into smaller
sectors.
For those reasons, the requirement for the data to be more than one page
is not correct, and is already causing regression for compressed inline
data writeback. The idea of skipping one page to avoid wasting CPU time
could be revisited in the future.
[FIX]
Fix it by reverting the offending commit.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net
Fixes: f216562731 ("btrfs: compression: don't try to compress if we don't have enough pages")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 3efee0567b4a ("fs: remove mandatory file locking support") removes
some operations in functions rw_verify_area().
As these functions are now simplified, do some syntactic clean-up as
follow-up to the removal as well, which was pointed out by compiler
warnings and static analysis.
No functional change.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
When parsing the ExtendedAttr data, malicous or corrupt attribute length
could cause kernel hangs and buffer overruns in some special cases.
Link: https://lore.kernel.org/r/20210822093332.25234-1-stian.skjelstad@gmail.com
Signed-off-by: Stian Skjelstad <stian.skjelstad@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.
I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.
This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
There is an existing lock hierarchy of
&dev->event_lock --> &fasync_struct.fa_lock --> &f->f_owner.lock
from the following call chain:
input_inject_event():
spin_lock_irqsave(&dev->event_lock,...);
input_handle_event():
input_pass_values():
input_to_handler():
evdev_events():
evdev_pass_values():
spin_lock(&client->buffer_lock);
__pass_event():
kill_fasync():
kill_fasync_rcu():
read_lock(&fa->fa_lock);
send_sigio():
read_lock_irqsave(&fown->lock,...);
&dev->event_lock is HARDIRQ-safe, so interrupts have to be disabled
while grabbing &fasync_struct.fa_lock, otherwise we invert the lock
hierarchy. However, since kill_fasync which calls kill_fasync_rcu is
an exported symbol, it may not necessarily be called with interrupts
disabled.
As kill_fasync_rcu may be called with interrupts disabled (for
example, in the call chain above), we replace calls to
read_lock/read_unlock on &fasync_struct.fa_lock in kill_fasync_rcu
with read_lock_irqsave/read_unlock_irqrestore.
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Syzbot reports a potential deadlock in do_fcntl:
========================================================
WARNING: possible irq lock inversion dependency detected
5.12.0-syzkaller #0 Not tainted
--------------------------------------------------------
syz-executor132/8391 just changed the state of lock:
ffff888015967bf8 (&f->f_owner.lock){.+..}-{2:2}, at: f_getown_ex fs/fcntl.c:211 [inline]
ffff888015967bf8 (&f->f_owner.lock){.+..}-{2:2}, at: do_fcntl+0x8b4/0x1200 fs/fcntl.c:395
but this lock was taken by another, HARDIRQ-safe lock in the past:
(&dev->event_lock){-...}-{2:2}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&dev->event_lock --> &new->fa_lock --> &f->f_owner.lock
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&f->f_owner.lock);
local_irq_disable();
lock(&dev->event_lock);
lock(&new->fa_lock);
<Interrupt>
lock(&dev->event_lock);
*** DEADLOCK ***
This happens because there is a lock hierarchy of
&dev->event_lock --> &new->fa_lock --> &f->f_owner.lock
from the following call chain:
input_inject_event():
spin_lock_irqsave(&dev->event_lock,...);
input_handle_event():
input_pass_values():
input_to_handler():
evdev_events():
evdev_pass_values():
spin_lock(&client->buffer_lock);
__pass_event():
kill_fasync():
kill_fasync_rcu():
read_lock(&fa->fa_lock);
send_sigio():
read_lock_irqsave(&fown->lock,...);
However, since &dev->event_lock is HARDIRQ-safe, interrupts have to be
disabled while grabbing &f->f_owner.lock, otherwise we invert the lock
hierarchy.
Hence, we replace calls to read_lock/read_unlock on &f->f_owner.lock,
with read_lock_irq/read_unlock_irq.
Reported-and-tested-by: syzbot+e6d5398a02c516ce5e70@syzkaller.appspotmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmEg5JwTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFfX3D/94AsPqwbnIyLgP2KjoltTKtnSsOZba
tzTKwrA7vPB9b0aFg0i42omwRtGQk5jPaB8J+tBx+oc/JjO+Dpj5FOXOI9PjFr//
2zXBuMdvYWaPlsl9nw5dFzhybdPnv+HgNXA8PRqy/DifU7/oLWxHzQ4sGvr1XXNc
vow5KzLe1K4R+hQBwqKLh/9VVE4+foVTAwXWcFBi/RrDf5X97BF7s98JYsNxfpbn
EYSyLr818TCUlivAUzx+y0JD+qUknZLuNYWZ2HkYvI29y5h1gUCVmn1orlCDdDq/
G3SCF8M66l4xGW+bpzrEabqIDl+0IWhX+grU+UVdiMH4YuXmY4HPdNgKYIpJolyy
zdg8cCO3OAHTJiwkj5jSfxFHQnzPr58LSQGDfrIrNjcDKlQbx3c+8R5yMy7Ar7jZ
XAM6h88PAuBknTDWBQogZtuSqKbV8D2LsAUVgRKA7iKSwYXXUmZdW+UDiOavsHmg
n2fbi1sPP7woQKyXzFddwxG3+2Nzby8BE7xuyHTXdOrQJNE1PvICH+WVo2m8+FCe
uLKLXNf4UECO2MaSyd6k90v3AVty4u1EDTbitgcERztWGzazZdTVlmRdpwy35V+B
wskKALzjGgqFoSwkEAnXQdk4+Gk9EtnrXD+HYfmxGYKk87fOBCgJtqDWYw3RHK7n
giBfvji9HK3R8A==
=xZLO
-----END PGP SIGNATURE-----
Merge tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull mandatory file locking deprecation warning from Jeff Layton:
"As discussed on the list, this patch just adds a new warning for folks
who still have mandatory locking enabled and actually mount with '-o
mand'. I'd like to get this in for v5.14 so we can push this out into
stable kernels and hopefully reach folks who have mounts with -o mand.
For now, I'm operating under the assumption that we'll fully remove
this support in v5.15, but we can move that out if any legitimate
users of this facility speak up between now and then"
* tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: warn about impending deprecation of mandatory locks
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEgaZ8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmFCEACzSqGGsFj7kxxxX45n8r9+3cnfSN2HRnBe
rcKEEypK7ZyohfIfCQ+Qxhi6NCkYYtV3Enr7ErKUAbe8XpO0Eza84h0bjEBozSUa
ZxyDVOVPHF4r0P6pGmYKNNijlPImx7ORL8kAfGTDLi0f3Qc1Chmkej1K9g5SBjIz
qxzvv+I6L2OIokZwjmw/zCOD9GMvFtfPbL4xG1MEdXxH+aXtitkcnGlh7ZWxfFnz
Mw4/QiyBOsya2clhvqT0FckED63wy5h0es/R0WMleIYQbSeORF2YZGejd7LoaeQj
5PMvTeqv1f85lBo9BuUxp23/KDdwy0MOaKktOMFGWOkAvRH0ezjebNtUr6yBtXnt
UyD/Wpdhpo0x6ZcpqB7UnwADKgd41GjPfQKqH/zNZta1vhfldg0mzuN6wfQpumg6
j+fp/7uY/lwo5fpOF4CLwVrQEENi4Yir0aA4M6s53UknmrO26H0qHRM/uiHhmRVr
gG5s5ZwJgabepb9VP7fCQ7NLMpfiijTPFEUahBgB5QTtVomUhsWY/XyODoGkF0PS
frKe6+0IakmkX+fzNV3vTzchew8ySesQacU6y1ugl0cPqL9si1uqzYdv6+bOgry4
NKlsRAqeYQAivQSdxdAE96+zIWnwJqQ6GKyvb/9XgPACGSWkzm7s5yXdXWKLwoBH
FSPqTXs6Tw==
=/PEi
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few small fixes that should go into this release:
- Fix never re-assigning an initial error value for io_uring_enter()
for SQPOLL, if asked to do nothing
- Fix xa_alloc_cycle() return value checking, for cases where we have
wrapped around
- Fix for a ctx pin issue introduced in this cycle (Pavel)"
* tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
io_uring: fix xa_alloc_cycle() error return value check
io_uring: pin ctx on fallback execution
io_uring: only assign io_uring_enter() SQPOLL error in actual error case
We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros
have disabled it. Warn the stragglers that still use "-o mand" that
we'll be dropping support for that mount option.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
We currently check for ret != 0 to indicate error, but '1' is a valid
return and just indicates that the allocation succeeded with a wrap.
Correct the check to be for < 0, like it was before the xarray
conversion.
Cc: stable@vger.kernel.org
Fixes: 61cf93700f ("io_uring: Convert personality_idr to XArray")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEdNDEACgkQxWXV+ddt
WDvfDw//cDnR8HZEtrHwHX9qHitcYs6pubdwwGAsFlSZ/wh0iX05TxUjho4gGYMZ
Kp9PXipMOEdxNLJ8oaPkI+i8vIXxTWWqAm5ZePkV0cjg+vTgqqKf9NLcMtS34kP4
/GeQJgul9oreTMbXCx219J0B6lKpl6Iv0sCaSFyN09GIPNI8F6nyDbJTA3JKTRWb
ElP8mGvdUFFcOKsG6Wh6BU/WVU/My7d+HumApsRXB2lDwmMambAkX0iGpRElGrbD
ub+5ya0WeO8DB6KsVa4W8cMO5sWV9L9FcXMtGlwLbIkOxFdHvP7CT1pvH3TZe9Wy
mr8oAL01IktuNjZgQ5sUn+yISf+LuHnWjhpu+QBRuylZiwfpMwSPb0geLcrXcYGj
i8ERlmJvwbm6dAQlQDbA3yZKH6+FzePyTR99std2LK9JtbqBaFeSS6WM05SpRUDJ
FNHCLOzsswzBUE54nkqsb+A8tBXpcxnvQkrU+nJeDNUYM9w6S5mbCGeZjxe/n+ov
TGprz1ar2Ppm9YMH0zj6wOM690nJZYNrAvtUmeCl1xlLERYIoV2jXS1SzkaMdcQu
u3UVVsOCghPN5krEac4jgiGBdVHvjVnJb/qGBNTpj3aDX29PADHUJr7TmzGJBa8F
ePWqDngDYi+cVrm9JMls9UJhaCVmhzLAXtN5X3+fKfe5bNQE4gU=
=aR16
-----END PGP SIGNATURE-----
Merge tag 'for-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix for cross-rename, adding a missing check for directory
and subvolume, this could lead to a crash"
* tag 'for-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: prevent rename2 from exchanging a subvol with a directory from different parents
I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88 ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88 ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The BFQ scheduler and ioprio_check_cap() both assume that the RT
priority class (IOPRIO_CLASS_RT) can have up to 8 different priority
levels, similarly to the BE class (IOPRIO_CLASS_iBE). This is
controlled using the IOPRIO_BE_NR macro , which is badly named as the
number of levels also applies to the RT class.
Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8,
to make things clear. Keep the old IOPRIO_BE_NR macro definition as an
alias for IOPRIO_NR_LEVELS.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
inode_detach_wb references the "main" bdi of the inode. With the
recent change to move the bdi from the request_queue to the gendisk
this causes a guaranteed use after free when using certain cgroup
configurations. The big itself is older through as any non-default
inode reference (e.g. an open file descriptor) could have injected
this use after free even before that.
Fixes: 52ebea749a ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Reported-by: syzbot <syzbot+1fb38bb7d3ce0fa3e1c4@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816122614.601358-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The dev_t is used as the inode hash, so we should only released it
once then block device inode is gone from the inode cache. Move it
to bdev_free_inode to ensure that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816122614.601358-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cross-rename lacks a check when that would prevent exchanging a
directory and subvolume from different parent subvolume. This causes
data inconsistencies and is caught before commit by tree-checker,
turning the filesystem to read-only.
Calling the renameat2 with RENAME_EXCHANGE flags like
renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1))
on two paths:
namesrc = dir1/subvol1/dir2
namedest = subvol2/subvol3
will cause key order problem with following write time tree-checker
report:
[1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258
[1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5
[1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561
[1194842.338772] item 0 key (256 1 0) itemoff 16123 itemsize 160
[1194842.338793] inode generation 3 size 16 mode 40755
[1194842.338801] item 1 key (256 12 256) itemoff 16111 itemsize 12
[1194842.338809] item 2 key (256 84 2248503653) itemoff 16077 itemsize 34
[1194842.338817] dir oid 258 type 2
[1194842.338823] item 3 key (256 84 2363071922) itemoff 16043 itemsize 34
[1194842.338830] dir oid 257 type 2
[1194842.338836] item 4 key (256 96 2) itemoff 16009 itemsize 34
[1194842.338843] item 5 key (256 96 3) itemoff 15975 itemsize 34
[1194842.338852] item 6 key (257 1 0) itemoff 15815 itemsize 160
[1194842.338863] inode generation 6 size 8 mode 40755
[1194842.338869] item 7 key (257 12 256) itemoff 15801 itemsize 14
[1194842.338876] item 8 key (257 84 2505409169) itemoff 15767 itemsize 34
[1194842.338883] dir oid 256 type 2
[1194842.338888] item 9 key (257 96 2) itemoff 15733 itemsize 34
[1194842.338895] item 10 key (258 12 256) itemoff 15719 itemsize 14
[1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected
[1194842.339245] ------------[ cut here ]------------
[1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs]
[1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793
[1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
[1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
[1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
[1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282
[1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978
[1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970
[1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003
[1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000
[1194842.512092] FS: 0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000
[1194842.512095] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0
[1194842.512103] Call Trace:
[1194842.512128] ? run_one_async_free+0x10/0x10 [btrfs]
[1194842.631729] btree_csum_one_bio+0x1ac/0x1d0 [btrfs]
[1194842.631837] run_one_async_start+0x18/0x30 [btrfs]
[1194842.631938] btrfs_work_helper+0xd5/0x1d0 [btrfs]
[1194842.647482] process_one_work+0x262/0x5e0
[1194842.647520] worker_thread+0x4c/0x320
[1194842.655935] ? process_one_work+0x5e0/0x5e0
[1194842.655946] kthread+0x135/0x160
[1194842.655953] ? set_kthread_struct+0x40/0x40
[1194842.655965] ret_from_fork+0x1f/0x30
[1194842.672465] irq event stamp: 1729
[1194842.672469] hardirqs last enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0
[1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0
[1194842.672482] softirqs last enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a
[1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0
The corrupted data will not be written, and filesystem can be unmounted
and mounted again (all changes since the last commit will be lost).
Add the missing check for new_ino so that all non-subvolumes must reside
under the same parent subvolume. There's an exception allowing to
exchange two subvolumes from any parents as the directory representing a
subvolume is only a logical link and does not have any other structures
related to the parent subvolume, unlike files, directories etc, that
are always in the inode namespace of the parent subvolume.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.7+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Fix support for NFIT "virtual" ranges (BIOS-defined memory disks)
- Fix recovery from failed label storage areas on NVDIMM devices
- Miscellaneous cleanups from Ira's investigation of dax_direct_access
paths preparing for stray-write protection.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYRhC0wAKCRDfioYZHlFs
Z6InAQD+duS9GS5DnnFInmRDj/rMRQFVB4X25mmSlViYOR0gNwEAtJQP03CGAp+G
+DP7/nu2HrIhx8Ng8vTsu8ZnO8ge7Qw=
=zmii
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A couple of fixes for long standing bugs, a warning fixup, and some
miscellaneous dax cleanups.
The bugs were recently found due to new platforms looking to use the
ACPI NFIT "virtual" device definition, and new error injection
capabilities to trigger error responses to label area requests. Ira's
cleanups have been long pending, I neglected to send them earlier, and
see no harm in including them now. This has all appeared in -next with
no reported issues.
Summary:
- Fix support for NFIT "virtual" ranges (BIOS-defined memory disks)
- Fix recovery from failed label storage areas on NVDIMM devices
- Miscellaneous cleanups from Ira's investigation of
dax_direct_access paths preparing for stray-write protection"
* tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
tools/testing/nvdimm: Fix missing 'fallthrough' warning
libnvdimm/region: Fix label activation vs errors
ACPI: NFIT: Fix support for virtual SPA ranges
dax: Ensure errno is returned from dax_direct_access
fs/dax: Clarify nr_pages to dax_direct_access()
fs/fuse: Remove unneeded kaddr parameter
If an SQPOLL based ring is newly created and an application issues an
io_uring_enter(2) system call on it, then we can return a spurious
-EOWNERDEAD error. This happens because there's nothing to submit, and
if the caller doesn't specify any other action, the initial error
assignment of -EOWNERDEAD never gets overwritten. This causes us to
return it directly, even if it isn't valid.
Move the error assignment into the actual failure case instead.
Cc: stable@vger.kernel.org
Fixes: d9d05217cb ("io_uring: stop SQPOLL submit on creator's death")
Reported-by: Sherlock Holo sherlockya@gmail.com
Link: https://github.com/axboe/liburing/issues/413
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- fix to revert to the historic write behavior (Bart Van Assche)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmEXaXgLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNQLg/+Lagdz93ESb7VZHGKMQQkyMM4Zx8DBv3eMRaIAw19
jK87v15tGrrcse/JLmBWo1s3d5HDZGKOYhsUsv2dAqsa3P7S5p7Hihz4WSGlEQAS
UnqqHUafVTPwBqHgt1StF9BpE6QH2zovlJeHnSok6fPvJcUvC5h9Z83mgNW2SUf/
zut1GnqVp82jaDDfJymLIFpT4hRjfj2CpsMa38YU/M0Bunhn87tUFKHVzpdnTG9G
v0iLXuGfax1KWJCX3Sf4Pw9vCCTzIUHmWrbH/8X/AywYe5enhuHfTFQAxn623jAg
TzFoU/ByR3Je4zhDmci20Kdgay3LREgjGO3iloZG2KcnRJZOSzYU+SX5IWQZvLon
JWDqDzr8iR7DIdrfNjIbehYj9DRdlxn1iUr8mvCVK6uxN2deyiLHamD2kqv9fklW
D6TOHHkwrCF8k+jQfAc9l5+vk98UsJwFyT9BYatA6U/jtffxlsf7OuN0LHRtzu7a
4zdy5U/7tqT7W4PHy4/ICZN2ka2mm1c5I7JyjEgdj0Qongml4m7g/3vxSEKPJCeB
Rj2SCA8163RqYTywEUO5lcjpTbwZBG4pPx6PMGIrhCGGnqdl+RcNVy3Kt2LEdbiq
WXq7hQGoOsZLRkloej1B2D9x9mqyYPLzT+w/xzd5iJKVrLv06LHyi/d0GCKTHUNp
XN8=
=dsC9
-----END PGP SIGNATURE-----
Merge tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfs
Pull configfs fix from Christoph Hellwig:
- fix to revert to the historic write behavior (Bart Van Assche)
* tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfs:
configfs: restore the kernel v5.13 text attribute write behavior
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmEW5MQACgkQiiy9cAdy
T1H6zQwAwOyQrMGA3VtsatmhhQo89oHbkUB4q4+tY7xabPNgTRAPT9BpRSFsrW4y
njeiaP1T7nmiU1yJsYQD/JwRv005pBO/xa7sMsLb0RSca//kzin4WTTzxom3RUBW
hU29PvAxl+AjaRvbo6VpSn6xuHH1BDcZU+YfWtX3c6tE30sdzwjyMu7rEhivNGCf
0ukIZuIaEJrZmCwMZHT8+qE1dBOKEad8I39POG1v+mybQCWJvo4MAUDyYHZYKTJz
6e3JDARI19G8hfq1oVAM/g5gIBRBEISug3jenOMVG/QnBnRsBvyuTIunoq4ba+S6
qp3jv3p24DaUe9FPp5sYyAhuJHo0rzSwGBv/SdikJA+3xb8k2E3rECAUbf4j/NmV
/sj0tN/6/Z05/6L4ZgqUjdS2KfLztDvgzTGnv/LsB097Nhb8hVIhsKoqzypmyAcH
5AsjXFETMCclWE7oE8DkzaAOJqKD7MGJ6KXCjbt8JcFb/b6L/QQbRyRB5ggupgVn
Ic7gn/iM
=RlpY
-----END PGP SIGNATURE-----
Merge tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four CIFS/SMB3 Fixes, all for stable, two relating to deferred close,
and one for the 'modefromsid' mount option (when 'idsfromsid' not
specified)"
* tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Call close synchronously during unlink/rename/lease break.
cifs: Handle race conditions during rename
cifs: use the correct max-length for dentry_path_raw()
cifs: create sd context must be a multiple of 8
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEWwP4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgqrEAC93z9w25k1mqVZLjMJgp1IKT3KsB1xfn+i
QwLz1OVpveyf5fKzZEK6xPn6DhX22FmGca0FhPNYANlNJNgp8X1cT0REf+hdDbYJ
IqF14Ue9DO5mbBiutb9aEYoTSZjg1HaZFPRfu71QkFZbTuVx8CoMwDsl1OXG7bcO
5G67khXZFkChmaJVgaANfR4EKKqHXrVt+EOoOyBYdSZLWloWu2KuODXt/FtmllHc
kf7bPTw0Za5f6oSJXX52B+RlMyK3HrA9FZZor8alyUd0GVsgu5vwY2vFNSyft8t2
RazK/oqU+hBmV4wNVrAYAByYiq+j3lVLvE3AXPI81cXLyPyCVG0GqqRcycuUaMjV
WikJyo6dGQh70yOXw3BhQojPsEL1OPcyMxIi+3g/TKL9kIYOBaO7Es+KwoEKD4fX
fAkucTsb/Tc3WsYI3ELOedy9WuIMzAgysZxcTUxphKmRtwFNXLOYmQJEW5dK6MxZ
sZfve0JCQsFdebHOjIIanPRJSwksTt0Xfdm1g+R8sDBfnLV81PbGY+8lrRgUiBJ9
DYDd0hRPFnBuWT8h1qToUqKcUHL73IefXstMnstdtWwrL1FZs4JsPGYdZMxqfXSk
Z/50aHSmB9KaouFaIrxM4rndrAgP2BJwXOu2f1YbiPTcKNpmdYTX5jT7mfaLSWX5
ZpF6fkDiTA==
=lx8b
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A bit bigger than the previous weeks, but mostly just a few stable
bound fixes. In detail:
- Followup fixes to patches from last week for io-wq, turns out they
weren't complete (Hao)
- Two lockdep reported fixes out of the RT camp (me)
- Sync the io_uring-cp example with liburing, as a few bug fixes
never made it to the kernel carried version (me)
- SQPOLL related TIF_NOTIFY_SIGNAL fix (Nadav)
- Use WRITE_ONCE() when writing sq flags (Nadav)
- io_rsrc_put_work() deadlock fix (Pavel)"
* tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block:
tools/io_uring/io_uring-cp: sync with liburing example
io_uring: fix ctx-exit io_rsrc_put_work() deadlock
io_uring: drop ctx->uring_lock before flushing work item
io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker()
io-wq: fix bug of creating io-wokers unconditionally
io_uring: rsrc ref lock needs to be IRQ safe
io_uring: Use WRITE_ONCE() when writing to sq_flags
io_uring: clear TIF_NOTIFY_SIGNAL when running task work
and a reference handling fix from Jeff that should address some memory
corruption reports in the snaprealm area. Both marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmEVaqsTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi/DBCACd7+mnAXIwajwoDdXFIJT7/tfimdvU
cMrh6ciZNtEKxm23flQ1AFJXlXR/nlZRspfOmlmsl9bB4TAlXnhJ/s4JaiuOMMTh
OQ4oz0vAbGELkPsXB/FXGSSk1wTFEjCocFsJwoYiUkYjD7Qt12BZKNkFYgj/MVc2
wyJ5K1buqBLVFDU+CymqDzc07YpG1zn888o7UGWFTyevldRAHl2euxqbnr0S4qb9
OS5UKO3aFCEt5PT9RKRHygCGjuHym/fgXgPm9aNY4rYBE9qOXloVUOD5bhMHBJ2E
g506xhOurqbGv4O9oj+gvBwtQwY/TF8BvCA79koQSHNIYQsC/bcXenST
=m8x8
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to avoid a soft lockup in ceph_check_delayed_caps() from Luis
and a reference handling fix from Jeff that should address some memory
corruption reports in the snaprealm area.
Both marked for stable"
* tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-client:
ceph: take snap_empty_lock atomically with snaprealm refcount change
ceph: reduce contention in ceph_check_delayed_caps()
Pull ucounts fix from Eric Biederman:
"This fixes the ucount sysctls on big endian architectures.
The counts were expanded to be longs instead of ints, and the sysctl
code was overlooked, so only the low 32bit were being processed. On
litte endian just processing the low 32bits is fine, but on 64bit big
endian processing just the low 32bits results in the high order bits
instead of the low order bits being processed and nothing works
proper.
This change took a little bit to mature as we have the SYSCTL_ZERO,
and SYSCTL_INT_MAX macros that are only usable for sysctls operating
on ints, but unfortunately are not obviously broken. Which resulted in
the versions of this change working on big endian and not on little
endian, because the int SYSCTL_ZERO when extended 64bit wound up being
0x100000000. So we only allowed values greater than 0x100000000 and
less than 0faff. Which unfortunately broken everything that tried to
set the sysctls. (First reported with the windows subsystem for
linux).
I have tested this on x86_64 64bit after first reproducing the
problems with the earlier version of this change, and then verifying
the problems do not exist when we use appropriate long min and max
values for extra1 and extra2"
* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: add missing data type changes
During unlink/rename/lease break, deferred work for close is
scheduled immediately but in an asynchronous manner which might
lead to race with actual(unlink/rename) commands.
This change will schedule close synchronously which will avoid
the race conditions with other commands.
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org # 5.13
Signed-off-by: Steve French <stfrench@microsoft.com>
When rename is executed on directory which has files for which
close is deferred, then rename will fail with EACCES.
This patch will try to close all deferred files when EACCES is received
and retry rename on a directory.
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Cc: stable@vger.kernel.org # 5.13
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Just check inode_unhashed on the whole device bdev inode instead,
and provide a helper to check for that information.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently iocharset=utf8 mount option is broken. To use UTF-8 as iocharset,
it is required to use utf8 mount option.
Fix iocharset=utf8 mount option to use be equivalent to the utf8 mount
option.
If UTF-8 as iocharset is used then s_nls_iocharset is set to NULL. So
simplify code around, remove s_utf8 field as to distinguish between UTF-8
and non-UTF-8 it is needed just to check if s_nls_iocharset is set to NULL
or not.
Link: https://lore.kernel.org/r/20210808162453.1653-5-pali@kernel.org
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently iocharset=utf8 mount option is broken. To use UTF-8 as iocharset,
it is required to use utf8 mount option.
Fix iocharset=utf8 mount option to use be equivalent to the utf8 mount
option.
If UTF-8 as iocharset is used then s_nls_map is set to NULL. So simplify
code around, remove UDF_FLAG_NLS_MAP and UDF_FLAG_UTF8 flags as to
distinguish between UTF-8 and non-UTF-8 it is needed just to check if
s_nls_map set to NULL or not.
Link: https://lore.kernel.org/r/20210808162453.1653-4-pali@kernel.org
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Get rid of 0-length arrays in struct fileIdentDesc. This requires a bit
of cleaning up as the second variable length array in this structure is
often used and the code abuses the fact that the first two arrays have
the same type and offset in struct fileIdentDesc.
Signed-off-by: Jan Kara <jack@suse.cz>
Declare variable length arrays using [] instead of the old-style
declarations using arrays with 0 members. Also comment out entries in
structures beyond the first variable length array (we still do keep them
in comments as a reminder there are further entries in the structure
behind the variable length array). Accessing such entries needs a
careful offset math anyway so it is safer to not have them declared.
Signed-off-by: Jan Kara <jack@suse.cz>
We were checking validity of LVID entries only when getting
implementation use information from LVID in udf_sb_lvidiu(). However if
the LVID is suitably corrupted, it can cause problems also to code such
as udf_count_free() which doesn't use udf_sb_lvidiu(). So check validity
of LVID already when loading it from the disk and just disable LVID
altogether when it is not valid.
Reported-by: syzbot+7fbfe5fed73ebb675748@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>