Граф коммитов

77959 Коммитов

Автор SHA1 Сообщение Дата
Sun Ke c93ccd63b1 cachefiles: fix error return code in cachefiles_ondemand_copen()
The cache_size field of copen is specified by the user daemon.
If cache_size < 0, then the OPEN request is expected to fail,
while copen itself shall succeed. However, returning 0 is indeed
unexpected when cache_size is an invalid error code.

Fix this by returning error when cache_size is an invalid error code.

Changes
=======
v4: update the code suggested by Dan
v3: update the commit log suggested by Jingbo.

Fixes: c838305450 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Sun Ke <sunke32@huawei.com>
Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20220818111935.1683062-1-sunke32@huawei.com/ # v2
Link: https://lore.kernel.org/r/20220818125038.2247720-1-sunke32@huawei.com/ # v3
Link: https://lore.kernel.org/r/20220826023515.3437469-1-sunke32@huawei.com/ # v4
2022-08-31 16:41:10 +01:00
Enzo Matsumiya 27893dfc12 cifs: fix small mempool leak in SMB2_negotiate()
In some cases of failure (dialect mismatches) in SMB2_negotiate(), after
the request is sent, the checks would return -EIO when they should be
rather setting rc = -EIO and jumping to neg_exit to free the response
buffer from mempool.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Cc: stable@vger.kernel.org
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-30 20:08:13 -05:00
Steve French 3e3761f1ec smb3: use filemap_write_and_wait_range instead of filemap_write_and_wait
When doing insert range and collapse range we should be
writing out the cached pages for the ranges affected but not
the whole file.

Fixes: c3a72bb213 ("smb3: Move the flush out of smb2_copychunk_range() into its callers")
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-30 17:10:29 -05:00
David Howells 9c8b7a293f smb3: fix temporary data corruption in insert range
insert range doesn't discard the affected cached region
so can risk temporarily corrupting file data.

Also includes some minor cleanup (avoiding rereading
inode size repeatedly unnecessarily) to make it clearer.

Cc: stable@vger.kernel.org
Fixes: 7fe6fe95b9 ("cifs: add FALLOC_FL_INSERT_RANGE support")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-28 22:34:08 -05:00
Steve French fa30a81f25 smb3: fix temporary data corruption in collapse range
collapse range doesn't discard the affected cached region
so can risk temporarily corrupting the file data. This
fixes xfstest generic/031

I also decided to merge a minor cleanup to this into the same patch
(avoiding rereading inode size repeatedly unnecessarily) to make it
clearer.

Cc: stable@vger.kernel.org
Fixes: 5476b5dd82 ("cifs: add support for FALLOC_FL_COLLAPSE_RANGE")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-28 22:34:07 -05:00
David Howells c3a72bb213 smb3: Move the flush out of smb2_copychunk_range() into its callers
Move the flush out of smb2_copychunk_range() into its callers.  This will
allow the pagecache to be invalidated between the flush and the operation
in smb3_collapse_range() and smb3_insert_range().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-28 22:34:07 -05:00
Linus Torvalds b467192ec7 Seventeen hotfixes. Mostly memory management things. Ten patches are
cc:stable, addressing pre-6.0 issues.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYwvgrAAKCRDdBJ7gKXxA
 jlweAQC9dzE08Elxl4F7Uvxe+62JWVeflBRrT7sJ6jU1Gu3QcQEAhhI1Xit3/MGq
 pRytDBObGADxlA67c9eNq6J5pCT/7gE=
 =pD67
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more hotfixes from Andrew Morton:
 "Seventeen hotfixes.  Mostly memory management things.

  Ten patches are cc:stable, addressing pre-6.0 issues"

* tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  .mailmap: update Luca Ceresoli's e-mail address
  mm/mprotect: only reference swap pfn page if type match
  squashfs: don't call kmalloc in decompressors
  mm/damon/dbgfs: avoid duplicate context directory creation
  mailmap: update email address for Colin King
  asm-generic: sections: refactor memory_intersects
  bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
  ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
  Revert "memcg: cleanup racy sum avoidance code"
  mm/zsmalloc: do not attempt to free IS_ERR handle
  binder_alloc: add missing mmap_lock calls when using the VMA
  mm: re-allow pinning of zero pfns (again)
  vmcoreinfo: add kallsyms_num_syms symbol
  mailmap: update Guilherme G. Piccoli's email addresses
  writeback: avoid use-after-free after removing device
  shmem: update folio if shmem_replace_page() updates the page
  mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
2022-08-28 14:49:59 -07:00
Phillip Lougher 1f13dff09f squashfs: don't call kmalloc in decompressors
The decompressors may be called while in an atomic section.  So move the
kmalloc() out of this path, and into the "page actor" init function.

This fixes a regression introduced by commit
f268eedddf ("squashfs: extend "page actor" to handle missing pages")

Link: https://lkml.kernel.org/r/20220822215430.15933-1-phillip@squashfs.org.uk
Fixes: f268eedddf ("squashfs: extend "page actor" to handle missing pages")
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-28 14:02:45 -07:00
Heming Zhao 550842cc60 ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
After commit 0737e01de9 ("ocfs2: ocfs2_mount_volume does cleanup job
before return error"), any procedure after ocfs2_dlm_init() fails will
trigger crash when calling ocfs2_dlm_shutdown().

ie: On local mount mode, no dlm resource is initialized.  If
ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling will call
ocfs2_dlm_shutdown(), then does dlm resource cleanup job, which will
trigger kernel crash.

This solution should bypass uninitialized resources in
ocfs2_dlm_shutdown().

Link: https://lkml.kernel.org/r/20220815085754.20417-1-heming.zhao@suse.com
Fixes: 0737e01de9 ("ocfs2: ocfs2_mount_volume does cleanup job before return error")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-28 14:02:45 -07:00
Khazhismel Kumykov f87904c075 writeback: avoid use-after-free after removing device
When a disk is removed, bdi_unregister gets called to stop further
writeback and wait for associated delayed work to complete.  However,
wb_inode_writeback_end() may schedule bandwidth estimation dwork after
this has completed, which can result in the timer attempting to access the
just freed bdi_writeback.

Fix this by checking if the bdi_writeback is alive, similar to when
scheduling writeback work.

Since this requires wb->work_lock, and wb_inode_writeback_end() may get
called from interrupt, switch wb->work_lock to an irqsafe lock.

Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
Fixes: 45a2966fd6 ("writeback: fix bandwidth estimate for spiky workload")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-28 14:02:43 -07:00
Linus Torvalds 8379c0b31f for-6.0-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmMLY9oACgkQxWXV+ddt
 WDue/w/8C3ZF8nLAI/sMrUpef2vSD62bvkKRRS45wzR2uod6yc0Fle9upzBssJQZ
 qO3mQ53+QV+imCq7dY5mmtmwCUJNmbV5gbiMoF1OoV9TYtpZb/NIDklSX8se2eJX
 drdAWQr2pYwU2M4duA4IEW08TvQ2TFh0JiqMi0aYM5apyL80uv3WniOu+xpRipA3
 CMFAnDqayIgQ5OIsedqNy2MBLBopodUL5PZv/H7/g6KSKIuAZP9zgg1eKPfaz2t3
 HO183ubmMbVtxgxeu+EnvCkg/iQ5hQiuGmyi0FLYMs/A6/NglwBnIJU5jCMQhcp6
 HO5+FSUn6lHQetVzt2uHb9Lo+gX4FtCaHqVv1bXT62lnmDsZO1D7RVSg1Fra+CY+
 jJmi8vvIbfbYlSZPZlJANoWe8ODOMVPk+pM4SFHlxOWGAY6HViX2RfHnIjNj5x9O
 iDSTGvH6++nBF1Wu2/Xja/VKZ1avxRyTu2srW8JOF62j/tTU/EoPJcO9rxXOBBmC
 Hi4UmJ690p3h5xZeeiyE8CmaSlPtfdCcnc/97FnusEjBao9O7THX0PCDVJX6VBkm
 hVk01Z6+az1UNcD18KecvCpKYF/At4WpjaUGgf7q+LBfJXuXA6jfzOVDJMKV3TFd
 n1yMFg+duGj90l8gT0aa/VQiBlUlnzQKz6ceqyKkPccwveNis6I=
 =p8YV
 -----END PGP SIGNATURE-----

Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Fixes:

   - check that subvolume is writable when changing xattrs from security
     namespace

   - fix memory leak in device lookup helper

   - update generation of hole file extent item when merging holes

   - fix space cache corruption and potential double allocations; this
     is a rare bug but can be serious once it happens, stable backports
     and analysis tool will be provided

   - fix error handling when deleting root references

   - fix crash due to assert when attempting to cancel suspended device
     replace, add message what to do if mount fails due to missing
     replace item

  Regressions:

   - don't merge pages into bio if their page offset is not contiguous

   - don't allow large NOWAIT direct reads, this could lead to short
     reads eg. in io_uring"

* tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add info when mount fails due to stale replace target
  btrfs: replace: drop assert for suspended replace
  btrfs: fix silent failure when deleting root reference
  btrfs: fix space cache corruption and potential double allocations
  btrfs: don't allow large NOWAIT direct reads
  btrfs: don't merge pages into bio if their page offset is not contiguous
  btrfs: update generation of hole file extent item when merging holes
  btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
  btrfs: check if root is readonly while setting security xattr
2022-08-28 10:44:04 -07:00
Linus Torvalds c7bb3fbc1b 5 cifs/smb3 fixes, three for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmMKi7cACgkQiiy9cAdy
 T1GV7gv/aQR1eQpk1wOnctwFWWaoRLKiiAtq18fg78JfGqjTyJaclxB+c1TV6PKY
 WPbQq0TqLgAi46QKTtgDNnxexXzIz7pSBZGsM02PVb4hGfeO7GRcG7PVuPSrRCvE
 nq/p2BT9FyxqyL0pXox6GLSZ86IYYnjmI1HPseE9WqD33hsNIgdwpLNlh1fWzv04
 kFU/TZ1hboOky/plnhjkvyR4QyK9oUX44qj8kuwWN838gSruKI/+wVprIIKMuiFg
 2bNwFdwQeM/7mWnCEHOdyhO9Upy0WCVQHjqq3+LkE52xgdUxd0my5wQKdO5LiTv3
 AZGJ4XGsTc3aaWPZ/zVMMmt9JDFRe84JmQOdk0Ff2MxKTWmQYo+ZznArRUuYf8cc
 6jPK1SefQEqqrtA4h/c+osNfAJJmpL3KxbQ4Hpj55cQZzY32iPuLsW2JyQKY+cUG
 7j+8Av5wmM8iwJzH/B2DL3TIhY1xhPAsAsd6BXhF+iDzcCG7XgDx8VUTFPhp8QsD
 W8re2OMz
 =tss7
 -----END PGP SIGNATURE-----

Merge tag '6.0-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cfis fixes from Steve French:

 - two locking fixes (zero range, punch hole)

 - DFS 9 fix (padding), affecting some servers

 - three minor cleanup changes

* tag '6.0-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Add helper function to check smb1+ server
  cifs: Use help macro to get the mid header size
  cifs: Use help macro to get the header preamble size
  cifs: skip extra NULL byte in filenames
  smb3: missing inode locks in punch hole
  smb3: missing inode locks in zero range
2022-08-28 10:35:16 -07:00
Zhang Xiaoxu d291e703f4 cifs: Add helper function to check smb1+ server
SMB1 server's header_preamble_size is not 0, add use is_smb1 function
to simplify the code, no actual functional changes.

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-24 22:30:09 -05:00
Zhang Xiaoxu b6b3624d01 cifs: Use help macro to get the mid header size
It's better to use MID_HEADER_SIZE because the unfolded expression
too long. No actual functional changes, minor readability improvement.

Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-24 22:30:04 -05:00
Zhang Xiaoxu 9789de8bdc cifs: Use help macro to get the header preamble size
It's better to use HEADER_PREAMBLE_SIZE because the unfolded expression
too long. No actual functional changes, minor readability improvement.

Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-24 22:29:59 -05:00
Paulo Alcantara a1d2eb51f0 cifs: skip extra NULL byte in filenames
Since commit:
 cifs: alloc_path_with_tree_prefix: do not append sep. if the path is empty
alloc_path_with_tree_prefix() function was no longer including the
trailing separator when @path is empty, although @out_len was still
assuming a path separator thus adding an extra byte to the final
filename.

This has caused mount issues in some Synology servers due to the extra
NULL byte in filenames when sending SMB2_CREATE requests with
SMB2_FLAGS_DFS_OPERATIONS set.

Fix this by checking if @path is not empty and then add extra byte for
separator.  Also, do not include any trailing NULL bytes in filename
as MS-SMB2 requires it to be 8-byte aligned and not NULL terminated.

Cc: stable@vger.kernel.org
Fixes: 7eacba3b00 ("cifs: alloc_path_with_tree_prefix: do not append sep. if the path is empty")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-24 12:22:24 -05:00
Linus Torvalds 062d26ad0b fs.fixes.v6.0-rc3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYwSG9QAKCRCRxhvAZXjc
 or0AAP0ddEPI06qpWdQEvrv2wBJtpZ/3DG3mmAAlYVhVWXwKdwEA8AoYyRkcVaba
 Um476CdoNti4BwIUA5j7PZw625ax+AM=
 =FAYy
 -----END PGP SIGNATURE-----

Merge tag 'fs.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull file_remove_privs() fix from Christian Brauner:
 "As part of Stefan's and Jens' work to add async buffered write
  support to xfs we refactored file_remove_privs() and added
  __file_remove_privs() to avoid calling __remove_privs() when
  IOCB_NOWAIT is passed.

  While debugging a recent performance regression report I found that
  during review we missed that commit faf99b5635 ("fs: add
  __remove_file_privs() with flags parameter") accidently changed
  behavior when dentry_needs_remove_privs() returns zero.

  Before the commit it would still call inode_has_no_xattr() setting
  the S_NOSEC bit and thereby avoiding even calling into
  dentry_needs_remove_privs() the next time this function is called.
  After that commit inode_has_no_xattr() would only be called if
  __remove_privs() had to be called.

  Restore the old behavior. This is likely the cause of the performance
  regression"

* tag 'fs.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fs: __file_remove_privs(): restore call to inode_has_no_xattr()
2022-08-23 19:17:26 -07:00
Linus Torvalds 95607ad99b Thirteen fixes, almost all for MM. Seven of these are cc:stable and the
remainder fix up the changes which went into this -rc cycle.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYwQZcgAKCRDdBJ7gKXxA
 jnCxAQCk8L6PPm0L2KvKr5Vu3M/T0o9SvfxfM5yho80zM68fHQD/eLxz+nd3m+N5
 K7Mdbcb2u6F46qQaS+S5RialEWKpsw8=
 =WtBo
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Thirteen fixes, almost all for MM.

  Seven of these are cc:stable and the remainder fix up the changes
  which went into this -rc cycle"

* tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  kprobes: don't call disarm_kprobe() for disabled kprobes
  mm/shmem: shmem_replace_page() remember NR_SHMEM
  mm/shmem: tmpfs fallocate use file_modified()
  mm/shmem: fix chattr fsflags support in tmpfs
  mm/hugetlb: support write-faults in shared mappings
  mm/hugetlb: fix hugetlb not supporting softdirty tracking
  mm/uffd: reset write protection when unregister with wp-mode
  mm/smaps: don't access young/dirty bit if pte unpresent
  mm: add DEVICE_ZONE to FOR_ALL_ZONES
  kernel/sys_ni: add compat entry for fadvise64_64
  mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  Revert "zram: remove double compression logic"
  get_maintainer: add Alan to .get_maintainer.ignore
2022-08-23 13:33:08 -07:00
Anand Jain f2c3bec215 btrfs: add info when mount fails due to stale replace target
If the replace target device reappears after the suspended replace is
cancelled, it blocks the mount operation as it can't find the matching
replace-item in the metadata. As shown below,

   BTRFS error (device sda5): replace devid present without an active replace item

To overcome this situation, the user can run the command

   btrfs device scan --forget <replace target device>

and try the mount command again. And also, to avoid repeating the issue,
superblock on the devid=0 must be wiped.

   wipefs -a device-path-to-devid=0.

This patch adds some info when this situation occurs.

Reported-by: Samuel Greiner <samuel@balkonien.org>
Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 22:15:21 +02:00
Anand Jain 59a3991984 btrfs: replace: drop assert for suspended replace
If the filesystem mounts with the replace-operation in a suspended state
and try to cancel the suspended replace-operation, we hit the assert. The
assert came from the commit fe97e2e173 ("btrfs: dev-replace: replace's
scrub must not be running in suspended state") that was actually not
required. So just remove it.

 $ mount /dev/sda5 /btrfs

    BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
    BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'

 $ mount -o degraded /dev/sda5 /btrfs <-- success.

 $ btrfs replace cancel /btrfs

    kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
    kernel: ------------[ cut here ]------------
    kernel: kernel BUG at fs/btrfs/ctree.h:3750!

After the patch:

 $ btrfs replace cancel /btrfs

    BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled

Fixes: fe97e2e173 ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 22:15:21 +02:00
Filipe Manana 47bf225a8d btrfs: fix silent failure when deleting root reference
At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
up returning from the function with a value of 0 (success). This happens
because the function returns the value stored in the variable 'err',
which is 0, while the error value we got from btrfs_search_slot() is
stored in the 'ret' variable.

So fix it by setting 'err' with the error value.

Fixes: 8289ed9f93 ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 22:15:21 +02:00
Omar Sandoval ced8ecf026 btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:

1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
   space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
   and also marked as free in the free space tree, ranges that were
   marked as allocated twice in the extent tree, or ranges that were
   marked as free twice in the free space tree. If the latter made it
   onto disk, the next reboot would hit the BUG_ON() in
   add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
   in-memory space cache (dumped with drgn) disagreed with the free
   space tree.

All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa55 ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.

Specifically, the race is as follows:

1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
   ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
   add_to_free_space_tree() adds the deleted extent back to the free
   space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
   btrfs_cache_block_group() queues up the block group to get cached.
   block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
   switch_commit_roots(). It sets block_group->last_byte_to_unpin to
   block_group->progress, which is block_group->start because the block
   group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
   were already switched, load_free_space_tree() sees the deleted extent
   as free and adds it to the space cache. It finishes caching and sets
   block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
   TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
   transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
   commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
    switch_commit_roots(). This time, the block group has already been
    cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
    btrfs_finish_extent_commit(), which calls unpin_extent_range() for
    the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
    transaction B!), so it adds the deleted extent to the space cache
    again!

This explains all of our symptoms above:

* If the sequence of events is exactly as described above, when the free
  space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
  and 11, then step 11 will silently re-add that space to the space
  cache as free even though it is actually allocated. Then, if that
  space is allocated *again*, the free space tree will be corrupted
  (namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
  to get worse as extents are deleted and reallocated.

The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.

The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.

A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:

1. Factor btrfs_caching_ctl_wait_done() out of
   btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
   that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
   btrfs_wait_space_cache_v1_finished() to
   btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
   space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
   space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
   `bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
   btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
   used outside of block-group.c anymore.

Fixes: d0c2f4fa55 ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 22:13:54 +02:00
David Howells ba0803050d smb3: missing inode locks in punch hole
smb3 fallocate punch hole was not grabbing the inode or filemap_invalidate
locks so could have race with pagemap reinstantiating the page.

Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-23 03:15:27 -05:00
David Howells c919c164fc smb3: missing inode locks in zero range
smb3 fallocate zero range was not grabbing the inode or filemap_invalidate
locks so could have race with pagemap reinstantiating the page.

Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-23 01:09:08 -05:00
Linus Torvalds 072e51356c NFS client bugfixes for Linux 6.0
Highlights include:
 
 Stable fixes
 - NFS: Fix another fsync() issue after a server reboot
 
 Bugfixes
 - NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
 - NFS: Fix missing unlock in nfs_unlink()
 - Add sanity checking of the file type used by __nfs42_ssc_open
 - Fix a case where we're failing to set task->tk_rpc_status
 
 Cleanups
 - Remove the flag NFS_CONTEXT_RESEND_WRITES that got obsoleted by the
   fsync() fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmMDk3wACgkQZwvnipYK
 APJLpw/+ONqG16L5W31/BzGJ80DlG9CERMad7Yt8+lk+ih574k/OrCotHThMyBm9
 2TfY3S8zD9QoLnsPesDKeoc6AYyL3el0Wo2vKmWlGvrirvrzNt9nMc61CDMs2IHT
 kN7gjO2P1LCZln8GTE87C4tI3Pg0Cwr4UUlyHHjMSKdYuJckJugj1gDvblSjn5h4
 bGKGEJ9G71G1REn013sVqmQ6huvQ3iif07X5NaN7T5e+TpNFet/0AlTmrA9zsUDI
 WPm+efP+ieTmihvhqOSYdV31uHN/ECx4p60ITzAlWwPYPyXr1M0r9acUGX10ENna
 eX1B9nyxbUAzO6rxxPgXi3LXgvmgRDVEmbSs5IL985XR2zsVR+AF3dgAwoJXqV9y
 7mAtoiwyqe3idvaK+mHU4OWCSqdhZbauJJ+Jc0ZHZHy2vzHPS2CWcpvXHjVTw63R
 txOkUFL89SwnqJv03N6CZt4OyY1av97dDOEvPqHuRx4NyfT3v/QvF5W3V/UvLnt2
 hTPNGIRUPZU1lpfqEgd7NXWO6LLtkWK2MciRGVnSFf2S5uKYqvlbPsVqWc6CviXc
 Mu4o2RoctkIwxexSfHY0p7UQrbu3OvYgTuIzgy6cIZ2GK70L29UpJYBe5YEh9Qru
 J/Pgn1ZSdGgDgwqzR8S92PTbbKq1caOqnGReFdyJDnCetb6LrrA=
 =tpKO
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
"Stable fixes:
   - NFS: Fix another fsync() issue after a server reboot

  Bugfixes:
   - NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
   - NFS: Fix missing unlock in nfs_unlink()
   - Add sanity checking of the file type used by __nfs42_ssc_open
   - Fix a case where we're failing to set task->tk_rpc_status

  Cleanups:
   - Remove the NFS_CONTEXT_RESEND_WRITES flag that got obsoleted by the
     fsync() fix"

* tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: RPC level errors should set task->tk_rpc_status
  NFSv4.2 fix problems with __nfs42_ssc_open
  NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
  NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES
  NFS: Remove a bogus flag setting in pnfs_write_done_resend_to_mds
  NFS: Fix another fsync() issue after a server reboot
  NFS: Fix missing unlock in nfs_unlink()
2022-08-22 11:40:01 -07:00
Linus Torvalds d3cd67d671 fs.idmapped.fixes.v6.0-rc3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYwNkcwAKCRCRxhvAZXjc
 ohHlAQC0Sb0jZxLDHeKXr6lHR+a+jOYTisM/8GkCygBhYBqlFgD+KclaIVJp9v/1
 O88/iv91XfomkDSxNknv+MxoWE2i7Ao=
 =hvvJ
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull idmapping fixes from Christian Brauner:

 - Since Seth joined as co-maintainer for idmapped mounts we decided to
   use a shared git tree. Konstantin suggested we use vfs/idmapping.git
   on kernel.org under the vfs/ namespace. So this updates the tree in
   the maintainers file.

 - Ensure that POSIX ACLs checking, getting, and setting works correctly
   for filesystems mountable with a filesystem idmapping that want to
   support idmapped mounts.

   Since no filesystems mountable with an fs_idmapping do yet support
   idmapped mounts there is no problem. But this could change in the
   future, so add a check to refuse to create idmapped mounts when the
   mounter is not privileged over the mount's idmapping.

 - Check that caller is privileged over the idmapping that will be
   attached to a mount.

   Currently no FS_USERNS_MOUNT filesystems support idmapped mounts,
   thus this is not a problem as only CAP_SYS_ADMIN in init_user_ns is
   allowed to set up idmapped mounts. But this could change in the
   future, so add a check to refuse to create idmapped mounts when the
   mounter is not privileged over the mount's idmapping.

 - Fix POSIX ACLs for ntfs3. While looking at our current POSIX ACL
   handling in the context of some overlayfs work I went through a range
   of other filesystems checking how they handle them currently and
   encountered a few bugs in ntfs3.

   I've sent this some time ago and the fixes haven't been picked up
   even though the pull request for other ntfs3 fixes got sent after.
   This should really be fixed as right now POSIX ACLs are broken in
   certain circumstances for ntfs3.

* tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  ntfs: fix acl handling
  fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
  MAINTAINERS: update idmapping tree
  acl: handle idmapped mounts for idmapped filesystems
2022-08-22 11:33:02 -07:00
Linus Torvalds b20ee4813f File locking bugfix for v6.0
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmMDXnITHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFZDRD/9IvIV0OVtEW79KpHzTHs1mfJ5ImQsH
 2kYnIoamjgM47IUTtwgPDQp41iORqqETSUVUlHxGZdWygl6mOvXT1ImMr7EHIA/I
 GOJ3j6Sol2gfpn2/Oh1QASI07wJ7xeyXQEFw5eFne54EgkR6E+UgPIaGzXpdVePf
 3w26eJZnk2XfjIXiaMa4M7KB+Hb4ornc1u8dWOGrMUPBTw83OwCSCND+aZqVObp4
 Vj1pB9BpK1KQBoh/nquPvjyywB+C3MZWTYVq7T7WN9r3OwTVhnFT0hLbH/ww4dgs
 Xfm0px/9G3cDt7mnaF0X8ynpj1+Hfu+yVTd5WmANXeYBXm8z+Nc0I6mdY8edaMPk
 dOEllm+uKAykv9dj8JeuF+9XZkrvOg31GAbnJWq04YI1AvDuEm7Z7STNSTAbR7TI
 SoJox0oQd9LpqBrUaD0QyOJyHy2Bhr0AOAq3z7e13ZpEdK9ITsk8Q49vsqfA57Js
 xxzpCna+ClhKsLHCoRtqAy+Kosp2+52dH0T+1BNsBNQjwD2apHZwlPm/6mJzHOOd
 E875Uswj567qlP00AzeYCCq0zdgqdFUwedJe34G34XJRNksyeOouRqmOZbkfF5AX
 niID7OSuDVdvqPGDlkzQPDkewciBjzmTf1I80m5/S1uAqzRCtB/Q9ULIiYVQGcom
 lvq/ZmyMSJfPaQ==
 =kfx9
 -----END PGP SIGNATURE-----

Merge tag 'filelock-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking fix from Jeff Layton:
 "Just a single patch for a bugfix in the flock() codepath, introduced
  by a patch that went in recently"

* tag 'filelock-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: Fix dropped call to ->fl_release_private()
2022-08-22 10:40:09 -07:00
Josef Bacik 79d3d1d12e btrfs: don't allow large NOWAIT direct reads
Dylan and Jens reported a problem where they had an io_uring test that
was returning short reads, and bisected it to ee5b46a353 ("btrfs:
increase direct io read size limit to 256 sectors").

The root cause is their test was doing larger reads via io_uring with
NOWAIT and async.  This was triggering a page fault during the direct
read, however the first page was able to work just fine and thus we
submitted a 4k read for a larger iocb.

Btrfs allows for partial IO's in this case specifically because we don't
allow page faults, and thus we'll attempt to do any io that we can,
submit what we could, come back and fault in the rest of the range and
try to do the remaining IO.

However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
partial dio is done, which is incorrect.  In the sync case we can exit
the iomap code, submit more io's, and return with the amount of IO we
were able to complete successfully.

We were always doing short reads in this case, but for NOWAIT we were
getting saved by the fact that we were limiting direct reads to
sectorsize, and if we were larger than that we would return EAGAIN.

Fix the regression by simply returning EAGAIN in the NOWAIT case with
larger reads, that way io_uring can retry and get the larger IO and have
the fault logic handle everything properly.

This still leaves the AIO short read case, but that existed before this
change.  The way to properly fix this would be to handle partial iocb
completions, but that's a lot of work, for now deal with the regression
in the most straightforward way possible.

Reported-by: Dylan Yudaken <dylany@fb.com>
Fixes: ee5b46a353 ("btrfs: increase direct io read size limit to 256 sectors")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:08:07 +02:00
Qu Wenruo 4a445b7b61 btrfs: don't merge pages into bio if their page offset is not contiguous
[BUG]
Zygo reported on latest development branch, he could hit
ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
corrupt one disk, and let btrfs to recover the data during read/scrub).

And The following minimal reproducer can cause extent state leakage at
rmmod time:

  mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
  mount $dev1 $mnt
  fsstress -w -d $mnt -n 25 -s 1660807876
  sync
  fssum -A -f -w /tmp/fssum.saved $mnt
  umount $mnt

  # Wipe the dev1 but keeps its super block
  xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
  mount $dev1 $mnt
  fssum -r /tmp/fssum.saved $mnt > /dev/null
  umount $mnt
  rmmod btrfs

This will lead to the following extent states leakage:

  BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
  BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
  BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
  BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
  BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
  BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
  BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
  BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1

[CAUSE]
Since commit 7aa51232e2 ("btrfs: pass a btrfs_bio to
btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
determine the file offset of a page.

But that usage assume that, one bio has all its page having a continuous
page offsets.

Unfortunately that's not true, btrfs only requires the logical bytenr
contiguous when assembling its bios.

From above script, we have one bio looks like this:

  fssum-27671  submit_one_bio: bio logical=217739264 len=36864
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664

Note that the 1st and the 2nd page has non-contiguous page offsets.

This means, at repair time, we will have completely wrong file offset
passed in:

   kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192

Since the file offset is incorrect, we latter incorrectly set the extent
states, and no way to really release them.

Thus later it causes the leakage.

In fact, this can be even worse, since the file offset is incorrect, we
can hit cases like the incorrect file offset belongs to a HOLE, and
later cause btrfs_num_copies() to trigger error, finally hit
BUG_ON()/ASSERT() later.

[FIX]
Add an extra condition in btrfs_bio_add_page() for uncompressed IO.

Now we will have more strict requirement for bio pages:

- They should all have the same mapping
  (the mapping check is already implied by the call chain)

- Their logical bytenr should be adjacent
  This is the same as the old condition.

- Their page_offset() (file offset) should be adjacent
  This is the new check.
  This would result a slightly increased amount of bios from btrfs
  (needs holes and inside the same stripe boundary to trigger).

  But this would greatly reduce the confusion, as it's pretty common
  to assume a btrfs bio would only contain continuous page cache.

Later we may need extra cleanups, as we no longer needs to handle gaps
between page offsets in endio functions.

Currently this should be the minimal patch to fix commit 7aa51232e2
("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 7aa51232e2 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:06:58 +02:00
Filipe Manana e6e3dec6c3 btrfs: update generation of hole file extent item when merging holes
When punching a hole into a file range that is adjacent with a hole and we
are not using the no-holes feature, we expand the range of the adjacent
file extent item that represents a hole, to save metadata space.

However we don't update the generation of hole file extent item, which
means a full fsync will not log that file extent item if the fsync happens
in a later transaction (since commit 7f30c07288 ("btrfs: stop copying
old file extents when doing a full fsync")).

For example, if we do this:

    $ mkfs.btrfs -f -O ^no-holes /dev/sdb
    $ mount /dev/sdb /mnt
    $ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
    $ sync

We end up with 2 file extent items in our file:

1) One that represents the hole for the file range [0, 2M), with a
   generation of 7;

2) Another one that represents an extent covering the range [2M, 4M).

After that if we do the following:

    $ xfs_io -c "fpunch 2M 2M" /mnt/foobar

We end up with a single file extent item in the file, which represents a
hole for the range [0, 4M) and with a generation of 7 - because we end
dropping the data extent for range [2M, 4M) and then update the file
extent item that represented the hole at [0, 2M), by increasing
length from 2M to 4M.

Then doing a full fsync and power failing:

    $ xfs_io -c "fsync" /mnt/foobar
    <power failure>

will result in the full fsync not logging the file extent item that
represents the hole for the range [0, 4M), because its generation is 7,
which is lower than the generation of the current transaction (8).
As a consequence, after mounting again the filesystem (after log replay),
the region [2M, 4M) does not have a hole, it still points to the
previous data extent.

So fix this by always updating the generation of existing file extent
items representing holes when we merge/expand them. This solves the
problem and it's the same approach as when we merge prealloc extents that
got written (at btrfs_mark_extent_written()). Setting the generation to
the current transaction's generation is also what we do when merging
the new hole extent map with the previous one or the next one.

A test case for fstests, covering both cases of hole file extent item
merging (to the left and to the right), will be sent soon.

Fixes: 7f30c07288 ("btrfs: stop copying old file extents when doing a full fsync")
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:06:42 +02:00
Zixuan Fu 9ea0106a7a btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if
the path is invalid. In this case, btrfs_get_dev_args_from_path()
returns directly without freeing args->uuid and args->fsid allocated
before, which causes memory leak.

To fix these possible leaks, when btrfs_get_bdev_and_sb() fails,
btrfs_put_dev_args_from_path() is called to clean up the memory.

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Fixes: faa775c41d ("btrfs: add a btrfs_get_dev_args_from_path helper")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Zixuan Fu <r33s3n6@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:06:33 +02:00
Goldwyn Rodrigues b51111271b btrfs: check if root is readonly while setting security xattr
For a filesystem which has btrfs read-only property set to true, all
write operations including xattr should be denied. However, security
xattr can still be changed even if btrfs ro property is true.

This happens because xattr_permission() does not have any restrictions
on security.*, system.*  and in some cases trusted.* from VFS and
the decision is left to the underlying filesystem. See comments in
xattr_permission() for more details.

This patch checks if the root is read-only before performing the set
xattr operation.

Testcase:

  DEV=/dev/vdb
  MNT=/mnt

  mkfs.btrfs -f $DEV
  mount $DEV $MNT
  echo "file one" > $MNT/f1

  setfattr -n "security.one" -v 2 $MNT/f1
  btrfs property set /mnt ro true

  setfattr -n "security.one" -v 1 $MNT/f1

  umount $MNT

CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:06:30 +02:00
Christian Brauner 0c3bc7899e
ntfs: fix acl handling
While looking at our current POSIX ACL handling in the context of some
overlayfs work I went through a range of other filesystems checking how they
handle them currently and encountered ntfs3.

The posic_acl_{from,to}_xattr() helpers always need to operate on the
filesystem idmapping. Since ntfs3 can only be mounted in the initial user
namespace the relevant idmapping is init_user_ns.

The posix_acl_{from,to}_xattr() helpers are concerned with translating between
the kernel internal struct posix_acl{_entry} and the uapi struct
posix_acl_xattr_{header,entry} and the kernel internal data structure is cached
filesystem wide.

Additional idmappings such as the caller's idmapping or the mount's idmapping
are handled higher up in the VFS. Individual filesystems usually do not need to
concern themselves with these.

The posix_acl_valid() helper is concerned with checking whether the values in
the kernel internal struct posix_acl can be represented in the filesystem's
idmapping. IOW, if they can be written to disk. So this helper too needs to
take the filesystem's idmapping.

Fixes: be71b5cba2 ("fs/ntfs3: Add attrib operations")
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: ntfs3@lists.linux.dev
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-08-22 12:52:23 +02:00
Linus Torvalds 367bcbc5b5 5 cifs/smb3 fixes, one for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmMBX5YACgkQiiy9cAdy
 T1GNGgv/bH2tKQ+xOSoHgAUB8x5C1qWCba1KqmF/1e86QWkEaOy3APPH0D4whm2T
 KsO9bI2xLAO7TnsGjr+llPRWLE4uyL3tg4PRuoHJdLlq0UibvWcd7crtRfJtyNOm
 E8S0ouJ0+0Dv026c3KKovYrnHtin/I/KLZp57W1qmslyJEaHflxW4RqEOROLYTjK
 FpEtKyMO1/ivcK6s8l2RcNMcHqkx+HZvzNub4hAdUJOTf8DJ2lLc/PjfDtB4HQlE
 d+Pant1zdZaehZcymzk6x8oq1LIgfVgkFTWJyrr/FLJq2S8E/XbC54tdzTfIySsY
 F5So874fBm6Eal+0qxdm4cYbrE+YjeKwyTHlY836D0InOHxT2noE8i0KSNCP9sfj
 K/gt4SmxkxMASUycZ5rmOvhw/RghOSWtTNTQeiU1fheuXBaCe+dKgdAEQXdqQ0Kt
 I4gCuT6eGRdz9J6AVVIzsaYZFODRlBh50AMS3xkrRlqJ1jxn0E3Zuu3DTTleeknW
 IgGiZ7LC
 =dod2
 -----END PGP SIGNATURE-----

Merge tag '6.0-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:

 - memory leak fix

 - two small cleanups

 - trivial strlcpy removal

 - update missing entry for cifs headers in MAINTAINERS file

* tag '6.0-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: move from strlcpy with unused retval to strscpy
  cifs: Fix memory leak on the deferred close
  cifs: remove useless parameter 'is_fsctl' from SMB2_ioctl()
  cifs: remove unused server parameter from calc_smb_size()
  cifs: missing directory in MAINTAINERS file
2022-08-21 10:21:16 -07:00
Peter Xu f369b07c86 mm/uffd: reset write protection when unregister with wp-mode
The motivation of this patch comes from a recent report and patchfix from
David Hildenbrand on hugetlb shared handling of wr-protected page [1].

With the reproducer provided in commit message of [1], one can leverage
the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect
not only the attacker process, but also the whole system.

The lazy-reset mechanism of uffd-wp was used to make unregister faster,
meanwhile it has an assumption that any leftover pgtable entries should
only affect the process on its own, so not only the user should be aware
of anything it does, but also it should not affect outside of the process.

But it seems that this is not true, and it can also be utilized to make
some exploit easier.

So far there's no clue showing that the lazy-reset is important to any
userfaultfd users because normally the unregister will only happen once
for a specific range of memory of the lifecycle of the process.

Considering all above, what this patch proposes is to do explicit pte
resets when unregister an uffd region with wr-protect mode enabled.

It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
right before ioctl(UFFDIO_UNREGISTER) for the user.  So potentially it'll
make the unregister slower.  From that pov it's a very slight abi change,
but hopefully nothing should break with this change either.

Regarding to the change itself - core of uffd write [un]protect operation
is moved into a separate function (uffd_wp_range()) and it is reused in
the unregister code path.

Note that the new function will not check for anything, e.g.  ranges or
memory types, because they should have been checked during the previous
UFFDIO_REGISTER or it should have failed already.  It also doesn't check
mmap_changing because we're with mmap write lock held anyway.

I added a Fixes upon introducing of uffd-wp shmem+hugetlbfs because that's
the only issue reported so far and that's the commit David's reproducer
will start working (v5.19+).  But the whole idea actually applies to not
only file memories but also anonymous.  It's just that we don't need to
fix anonymous prior to v5.19- because there's no known way to exploit.

IOW, this patch can also fix the issue reported in [1] as the patch 2 does.

[1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/

Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
Fixes: b1f9e87686 ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-20 15:17:45 -07:00
Peter Xu efd4149342 mm/smaps: don't access young/dirty bit if pte unpresent
These bits should only be valid when the ptes are present.  Introducing
two booleans for it and set it to false when !pte_present() for both pte
and pmd accountings.

The bug is found during code reading and no real world issue reported, but
logically such an error can cause incorrect readings for either smaps or
smaps_rollup output on quite a few fields.

For example, it could cause over-estimate on values like Shared_Dirty,
Private_Dirty, Referenced.  Or it could also cause under-estimate on
values like LazyFree, Shared_Clean, Private_Clean.

Link: https://lkml.kernel.org/r/20220805160003.58929-1-peterx@redhat.com
Fixes: b1d4d9e0cb ("proc/smaps: carefully handle migration entries")
Fixes: c94b6923fa ("/proc/PID/smaps: Add PMD migration entry parsing")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-20 15:17:45 -07:00
Olga Kornievskaia fcfc8be1e9 NFSv4.2 fix problems with __nfs42_ssc_open
A destination server while doing a COPY shouldn't accept using the
passed in filehandle if its not a regular filehandle.

If alloc_file_pseudo() has failed, we need to decrement a reference
on the newly created inode, otherwise it leaks.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: ec4b092508 ("NFS: inter ssc open")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-19 20:31:57 -04:00
NeilBrown f16857e62b NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
nfs_unlink() calls d_delete() twice if it receives ENOENT from the
server - once in nfs_dentry_handle_enoent() from nfs_safe_remove and
once in nfs_dentry_remove_handle_error().

nfs_rmddir() also calls it twice - the nfs_dentry_handle_enoent() call
is direct and inside a region locked with ->rmdir_sem

It is safe to call d_delete() twice if the refcount > 1 as the dentry is
simply unhashed.
If the refcount is 1, the first call sets d_inode to NULL and the second
call crashes.

This patch guards the d_delete() call from nfs_dentry_handle_enoent()
leaving the one under ->remdir_sem in case that is important.

In mainline it would be safe to remove the d_delete() call.  However in
older kernels to which this might be backported, that would change the
behaviour of nfs_unlink().  nfs_unlink() used to unhash the dentry which
resulted in nfs_dentry_handle_enoent() not calling d_delete().  So in
older kernels we need the d_delete() in nfs_dentry_remove_handle_error()
when called from nfs_unlink() but not when called from nfs_rmdir().

To make the code work correctly for old and new kernels, and from both
nfs_unlink() and nfs_rmdir(), we protect the d_delete() call with
simple_positive().  This ensures it is never called in a circumstance
where it could crash.

Fixes: 3c59366c20 ("NFS: don't unhash dentry during unlink/rename")
Fixes: 9019fb391d ("NFS: Label the dentry with a verifier in nfs_rmdir() and nfs_unlink()")
Signed-off-by: NeilBrown <neilb@suse.de>
Tested-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-19 20:31:36 -04:00
Linus Torvalds 50cd95ac46 execve fix for v6.0-rc2
- Replace remaining kmap() uses with kmap_local_page() (Fabio M. De Francesco)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmL/3lEWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJu9xD/9SivxMYldHN8E+KrVUaOUZB6OF
 k97f+C4d438/kiZKIIGUqHv2Zbq4w9IFX3YQ9cmbN9dArEbTnvuGH/RTGI2L4yK1
 uS4BG+9RZZgZfZ+I2JOxX1elqo2qloEkQQJMSvLSXFqElrb6eQWDSdizWMWto+Yb
 l3Fsfq3zNf2cbxcCmlSaOZ4FFS3t5Yc7r/ArIRNPXVXIMcZkBdqj0fJ7nR7PplJY
 YKg7ioAIYUoY2nH97hIWvMWTP+DJc9vXp9+SRUB45qph9gkOBBl34aOC39iMIkEH
 AhPcGPf09DCZAu9sSfx3jH/YVV2jJg2DNbRw9bQlmfu+fdQBYCw1a2mhenpc0rUI
 OA4R26KMCI7338aWjDJy9N+kY2fhm0J8L2NxZ4ySo0ZMQf3VxbPsXDh6ijX69ijr
 DllH09o18RCIxhCRLOfGa3Je93FGnd2b/CI1z1CdQqQ/mdCWRvl3jRCpOXstqcAq
 5ptU0tIEeY8+hor6rd2RfT5xqd1LjIrNdE0m5UUuNEv30IP2ldKIBWLH6jvZ5LJ2
 U/YWJ8/GR+SLrpMNIX6fUAS+2VZNHUbUj2cfmB87OpBmKOcp+vJwf65OzujIfT21
 OeMBkjql8DQlXhHk41SMDPrzzx76TJ0IaK0JCRPeUOggZwBX7txfUW5Xyi2UxseD
 SRT6R+SXibr0vQyemQ==
 =DTae
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fix from Kees Cook:

 - Replace remaining kmap() uses with kmap_local_page() (Fabio M. De
   Francesco)

* tag 'execve-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Replace kmap{,_atomic}() with kmap_local_page()
2022-08-19 14:02:24 -07:00
Linus Torvalds 42c54d5491 for-6.0-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmL/dJgACgkQxWXV+ddt
 WDtUmg/9Fje7L+jAtQLqvJYvvGCFCCs8gkm+Rwh9MLYHcEDnBtRWzDWYTKcfO9WW
 mPmiNlNPAbnAw/9apeJDvFOrd4Vqr8ZD3Vr0flXl5EJ9QXSTL6JXfaiLS1o6rQ0q
 OXqxVTh5B5aEmfsEhyJBLzsZGdISJbr60dAE8/ZDX9wo8cDavZ6YxToIroUeizGO
 dx2DY5A7OQRKTEf4aJgu4zcm1Sq+U5A8M1pcfU4Rhb/YapVgcCc2wrdrw4NOkNJu
 B/NZFcD0BcF2bZW7uHT5vr8rGzj2xZUCkuggBQ6/0/h7OdRXIzHCanMw4lPy602v
 IfZASF7eYkq7FHRbANj/WAYHOFl41vS+whqZ+sEI/qrW+ZyrODIzS67sbU/Bsa7+
 ZoL6RSlIbcMgz7XVqcc5d8bKNK/Hc3MCyMWlLT0XScm05BiJy5O2iEZ7UOJ4r8E+
 6J/pFD1bNdacp6UrzwEjyXuSufvp4pJdNWn5ttIWYTBygXMT8AJ403yIheiV3KA3
 SkoMj4A54tF3G7NhkzfR5sC7hlgcA0njJLzloicmNgP7E12vmcDL2rTdTt/Yl0cw
 3w6ztJS+1sWocFSJ43lE2muHlj7wW7QDYPvul1t6yWgO9wtWECo7c/ASeRl0zPkP
 atAJtsr3uV9/aA2ae9QeWzut1W3hbE5NFQOLPcF/iXN+kxMmesI=
 =HgV0
 -----END PGP SIGNATURE-----

Merge tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few short fixes and a lockdep warning fix (needs moving some code):

   - tree-log replay fixes:
      - fix error handling when looking up extent refs
      - fix warning when setting inode number of links

   - relocation fixes:
      - reset block group read-only status when relocation fails
      - unset control structure if transaction fails when starting
        to process a block group
      - add lockdep annotations to fix a warning during relocation
        where blocks temporarily belong to another tree and can lead
        to reversed dependencies

   - tree-checker verifies that extent items don't overlap"

* tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: tree-checker: check for overlapping extent items
  btrfs: fix warning during log replay when bumping inode link count
  btrfs: fix lost error handling when looking up extended ref on log replay
  btrfs: fix lockdep splat with reloc root extent buffers
  btrfs: move lockdep class helpers to locking.c
  btrfs: unset reloc control if transaction commit fails in prepare_to_relocate()
  btrfs: reset RO counter on block group if we fail to relocate
2022-08-19 13:33:48 -07:00
Linus Torvalds a3a78b6332 4 ksmbd server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmL+83EACgkQiiy9cAdy
 T1EB9Qv/ROymhtx17C1kvaxZYzXPjMKNl3OVPpvLNbGaUbkuf3Yp/5ajrixVF/hW
 m+xNhRbX8TmYNPhlnnLLZusiLkrCgyKMhVttQvAjuLhtvzZLQWOuoNlG/cwyUoiv
 X51fVkYO4psfVw3LdWG2tZT1dTI0oDoPNWdKlMpz9Rc31GYWTteHsEdcpaGgEVaL
 LCBKe1+21s9qWX8+mARymFZKyyHkrPwvGkL6PepUPR3pj96PsCPm3mgiU4/BJVOA
 E/Bxk/dtL7WsX4TXldzjXWCvcTHUpC5uApYYWywTGOamur9uycRUHrtlIj7+/Unl
 NMqSKkAr5X+kCmAdT+rxhF0WQbNxbsmbI9X0zJOtQIU0UbaB7j/GIAmCiGZ258NM
 xoFcAB627bCjd0XgXmxKE3e2hG8VVnqQOLS+em7KmWrrdT+ieTCY4R9m08YNCzn1
 bIhg0vD/HKFBeu48MrhzquTGXvu0bpbYjRoh5eerZSXTOPU6zgiZiOuYBGqJMmvy
 lJVvZzU3
 =3FeD
 -----END PGP SIGNATURE-----

Merge tag '5.20-rc2-ksmbd-smb3-server-fixes' of git://git.samba.org/ksmbd

Pull ksmbd server fixes from Steve French:

 - important sparse file fix

 - allocation size fix

 - fix incorrect rc on bad share

 - share config fix

* tag '5.20-rc2-ksmbd-smb3-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: don't remove dos attribute xattr on O_TRUNC open
  ksmbd: remove unnecessary generic_fillattr in smb2_open
  ksmbd: request update to stale share config
  ksmbd: return STATUS_BAD_NETWORK_NAME error status if share is not configured
2022-08-19 13:26:52 -07:00
Wolfram Sang 13609a8b3a cifs: move from strlcpy with unused retval to strscpy
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-19 11:02:26 -05:00
Zhang Xiaoxu ca08d0eac0 cifs: Fix memory leak on the deferred close
xfstests on smb21 report kmemleak as below:

  unreferenced object 0xffff8881767d6200 (size 64):
    comm "xfs_io", pid 1284, jiffies 4294777434 (age 20.789s)
    hex dump (first 32 bytes):
      80 5a d0 11 81 88 ff ff 78 8a aa 63 81 88 ff ff  .Z......x..c....
      00 71 99 76 81 88 ff ff 00 00 00 00 00 00 00 00  .q.v............
    backtrace:
      [<00000000ad04e6ea>] cifs_close+0x92/0x2c0
      [<0000000028b93c82>] __fput+0xff/0x3f0
      [<00000000d8116851>] task_work_run+0x85/0xc0
      [<0000000027e14f9e>] do_exit+0x5e5/0x1240
      [<00000000fb492b95>] do_group_exit+0x58/0xe0
      [<00000000129a32d9>] __x64_sys_exit_group+0x28/0x30
      [<00000000e3f7d8e9>] do_syscall_64+0x35/0x80
      [<00000000102e8a0b>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

When cancel the deferred close work, we should also cleanup the struct
cifs_deferred_close.

Fixes: 9e992755be ("cifs: Call close synchronously during unlink/rename/lease break.")
Fixes: e3fc065682 ("cifs: Deferred close performance improvements")
Cc: stable@vger.kernel.org
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-19 10:59:53 -05:00
Stefan Roesch 41191cf6bf
fs: __file_remove_privs(): restore call to inode_has_no_xattr()
This restores the call to inode_has_no_xattr() in the function
__file_remove_privs(). In case the dentry_meeds_remove_privs() returned
0, the function inode_has_no_xattr() was not called.

Signed-off-by: Stefan Roesch <shr@fb.com>
Fixes: faf99b5635 ("fs: add __remove_file_privs() with flags parameter")
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220816153158.1925040-1-shr@fb.com
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-08-18 09:39:33 +02:00
Enzo Matsumiya 400d0ad63b cifs: remove useless parameter 'is_fsctl' from SMB2_ioctl()
SMB2_ioctl() is always called with is_fsctl = true, so doesn't make any
sense to have it at all.

Thus, always set SMB2_0_IOCTL_IS_FSCTL flag on the request.

Also, as per MS-SMB2 3.3.5.15 "Receiving an SMB2 IOCTL Request", servers
must fail the request if the request flags is zero anyway.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-17 23:30:49 -05:00
Enzo Matsumiya 68ed14496b cifs: remove unused server parameter from calc_smb_size()
This parameter is unused by the called function

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-17 18:07:13 -05:00
Linus Torvalds 3b06a27557 ntfs3 for 6.0
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmL7xrEACgkQqbAzH4Mk
 B7Y3iw/+KZLidS4J9Lq+BhoDj21brkrvE+iiBhLClN8LQNn7jwkF+mTpYcGxvm+3
 fgrT5YTUt6OBx9N06qjUnKKcwBrOq1xFo6hdlCVHQiZbEUgYsSglE6xVdyt5RICb
 AVD6T7uKnHq4cQHe8796h23/Vfd3z2rAetvz5w/6Gd+mOpguLj+sx7QGE7R6o3V/
 uvbxj8z7cFkrVy6b7H5gaDUFzAhVxBzZY+2P1CUOg4uUy0YI0PeUbJli8zt/qORP
 Mr5mTEeKc1sxWzuDASppjgCPjdQVN+jgy7hQEpC6SLDR5HgtjncCCRE+dA1ZcnQm
 PQCG1Xn8CSII8bDCu6Lvr6KtxhRBdG/wb99zpn50wBmb6CzMJGOGmBwrPMMhW8Zo
 8ZBYHCY5YgDuXNkFpQrivayrADGaLhmAl1BTjXTDCQU7MoxxFsPO8D/swufvNf3W
 5eC5ezQ8FY3sSHRuDvVGHe8djvgsGvfxQAMrbfMJqEBFuPg3EYjJOeRpZK6NUyk4
 U31Jtz0hYSuU0dnoEaZFQ23/K7/vl2kile6VlNPApFR+y8OtSARN++Za58xdtg2y
 H8XmEbuN/g8XxPXz55Smf4Y8RzaIZ2S56aA19nBqza5o1a6gQUr2SomTHGRLJsFQ
 5xLyuUBrZRDxS8jcxa5TTfj7CNFBJkaxtXU8M66vIzXhXcm5+9U=
 =iz2l
 -----END PGP SIGNATURE-----

Merge tag 'ntfs3_for_6.0' of https://github.com/Paragon-Software-Group/linux-ntfs3

Pull ntfs3 updates from Konstantin Komarov:

 - implement FALLOC_FL_INSERT_RANGE

 - fix some logic errors

 - fixed xfstests (tested on x86_64): generic/064 generic/213
   generic/300 generic/361 generic/449 generic/485

 - some dead code removed or refactored

* tag 'ntfs3_for_6.0' of https://github.com/Paragon-Software-Group/linux-ntfs3: (39 commits)
  fs/ntfs3: uninitialized variable in ntfs_set_acl_ex()
  fs/ntfs3: Remove unused function wnd_bits
  fs/ntfs3: Make ni_ins_new_attr return error
  fs/ntfs3: Create MFT zone only if length is large enough
  fs/ntfs3: Refactoring attr_insert_range to restore after errors
  fs/ntfs3: Refactoring attr_punch_hole to restore after errors
  fs/ntfs3: Refactoring attr_set_size to restore after errors
  fs/ntfs3: New function ntfs_bad_inode
  fs/ntfs3: Make MFT zone less fragmented
  fs/ntfs3: Check possible errors in run_pack in advance
  fs/ntfs3: Added comments to frecord functions
  fs/ntfs3: Fill duplicate info in ni_add_name
  fs/ntfs3: Make static function attr_load_runs
  fs/ntfs3: Add new argument is_mft to ntfs_mark_rec_free
  fs/ntfs3: Remove unused mi_mark_free
  fs/ntfs3: Fix very fragmented case in attr_punch_hole
  fs/ntfs3: Fix work with fragmented xattr
  fs/ntfs3: Make ntfs_fallocate return -ENOSPC instead of -EFBIG
  fs/ntfs3: extend ni_insert_nonresident to return inserted ATTR_LIST_ENTRY
  fs/ntfs3: Check reserved size for maximum allowed
  ...
2022-08-17 14:51:22 -07:00
Linus Torvalds ae2a823643 dcache: move the DCACHE_OP_COMPARE case out of the __d_lookup_rcu loop
__d_lookup_rcu() is one of the hottest functions in the kernel on
certain loads, and it is complicated by filesystems that might want to
have their own name compare function.

We can improve code generation by moving the test of DCACHE_OP_COMPARE
outside the loop, which makes the loop itself much simpler, at the cost
of some code duplication.  But both cases end up being simpler, and the
"native" direct case-sensitive compare particularly so.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-17 14:33:03 -07:00
David Howells 932c29a10d locks: Fix dropped call to ->fl_release_private()
Prior to commit 4149be7bda, sys_flock() would allocate the file_lock
struct it was going to use to pass parameters, call ->flock() and then call
locks_free_lock() to get rid of it - which had the side effect of calling
locks_release_private() and thus ->fl_release_private().

With commit 4149be7bda, however, this is no longer the case: the struct
is now allocated on the stack, and locks_free_lock() is no longer called -
and thus any remaining private data doesn't get cleaned up either.

This causes afs flock to cause oops.  Kasan catches this as a UAF by the
list_del_init() in afs_fl_release_private() for the file_lock record
produced by afs_fl_copy_lock() as the original record didn't get delisted.
It can be reproduced using the generic/504 xfstest.

Fix this by reinstating the locks_release_private() call in sys_flock().
I'm not sure if this would affect any other filesystems.  If not, then the
release could be done in afs_flock() instead.

Changes
=======
ver #2)
 - Don't need to call ->fl_release_private() after calling the security
   hook, only after calling ->flock().

Fixes: 4149be7bda ("fs/lock: Don't allocate file_lock in flock_make_lock().")
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/166075758809.3532462.13307935588777587536.stgit@warthog.procyon.org.uk/ # v1
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2022-08-17 15:08:58 -04:00
Josef Bacik 899b7f69f2 btrfs: tree-checker: check for overlapping extent items
We're seeing a weird problem in production where we have overlapping
extent items in the extent tree.  It's unclear where these are coming
from, and in debugging we realized there's no check in the tree checker
for this sort of problem.  Add a check to the tree-checker to make sure
that the extents do not overlap each other.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-17 16:20:25 +02:00
Filipe Manana 769030e118 btrfs: fix warning during log replay when bumping inode link count
During log replay, at add_link(), we may increment the link count of
another inode that has a reference that conflicts with a new reference
for the inode currently being processed.

During log replay, at add_link(), we may drop (unlink) a reference from
some inode in the subvolume tree if that reference conflicts with a new
reference found in the log for the inode we are currently processing.

After the unlink, If the link count has decreased from 1 to 0, then we
increment the link count to prevent the inode from being deleted if it's
evicted by an iput() call, because we may have references to add to that
inode later on (and we will fixup its link count later during log replay).

However incrementing the link count from 0 to 1 triggers a warning:

  $ cat fs/inode.c
  (...)
  void inc_nlink(struct inode *inode)
  {
        if (unlikely(inode->i_nlink == 0)) {
                 WARN_ON(!(inode->i_state & I_LINKABLE));
                 atomic_long_dec(&inode->i_sb->s_remove_count);
        }
  (...)

The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
never set during log replay.

Most of the time, the warning isn't triggered even if we dropped the last
reference of the conflicting inode, and this is because:

1) The conflicting inode was previously marked for fixup, through a call
   to link_to_fixup_dir(), which increments the inode's link count;

2) And the last iput() on the inode has not triggered eviction of the
   inode, nor was eviction triggered after the iput(). So at add_link(),
   even if we unlink the last reference of the inode, its link count ends
   up being 1 and not 0.

So this means that if eviction is triggered after link_to_fixup_dir() is
called, at add_link() we will read the inode back from the subvolume tree
and have it with a correct link count, matching the number of references
it has on the subvolume tree. So if when we are at add_link() the inode
has exactly one reference only, its link count is 1, and after the unlink
its link count becomes 0.

So fix this by using set_nlink() instead of inc_nlink(), as the former
accepts a transition from 0 to 1 and it's what we use in other similar
contexts (like at link_to_fixup_dir().

Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
bump the link count from 0 to 1.

The warning is actually harmless, but it may scare users. Josef also ran
into it recently.

CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-17 16:19:50 +02:00
Filipe Manana 7a6b75b799 btrfs: fix lost error handling when looking up extended ref on log replay
During log replay, when processing inode references, if we get an error
when looking up for an extended reference at __add_inode_ref(), we ignore
it and proceed, returning success (0) if no other error happens after the
lookup. This is obviously wrong because in case an extended reference
exists and it encodes some name not in the log, we need to unlink it,
otherwise the filesystem state will not match the state it had after the
last fsync.

So just make __add_inode_ref() return an error it gets from the extended
reference lookup.

Fixes: f186373fef ("btrfs: extended inode refs")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-17 16:19:45 +02:00
Josef Bacik b40130b23c btrfs: fix lockdep splat with reloc root extent buffers
We have been hitting the following lockdep splat with btrfs/187 recently

  WARNING: possible circular locking dependency detected
  5.19.0-rc8+ #775 Not tainted
  ------------------------------------------------------
  btrfs/752500 is trying to acquire lock:
  ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

  but task is already holding lock:
  ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
	 down_write_nested+0x41/0x80
	 __btrfs_tree_lock+0x24/0x110
	 btrfs_init_new_buffer+0x7d/0x2c0
	 btrfs_alloc_tree_block+0x120/0x3b0
	 __btrfs_cow_block+0x136/0x600
	 btrfs_cow_block+0x10b/0x230
	 btrfs_search_slot+0x53b/0xb70
	 btrfs_lookup_inode+0x2a/0xa0
	 __btrfs_update_delayed_inode+0x5f/0x280
	 btrfs_async_run_delayed_root+0x24c/0x290
	 btrfs_work_helper+0xf2/0x3e0
	 process_one_work+0x271/0x590
	 worker_thread+0x52/0x3b0
	 kthread+0xf0/0x120
	 ret_from_fork+0x1f/0x30

  -> #1 (btrfs-tree-01){++++}-{3:3}:
	 down_write_nested+0x41/0x80
	 __btrfs_tree_lock+0x24/0x110
	 btrfs_search_slot+0x3c3/0xb70
	 do_relocation+0x10c/0x6b0
	 relocate_tree_blocks+0x317/0x6d0
	 relocate_block_group+0x1f1/0x560
	 btrfs_relocate_block_group+0x23e/0x400
	 btrfs_relocate_chunk+0x4c/0x140
	 btrfs_balance+0x755/0xe40
	 btrfs_ioctl+0x1ea2/0x2c90
	 __x64_sys_ioctl+0x88/0xc0
	 do_syscall_64+0x38/0x90
	 entry_SYSCALL_64_after_hwframe+0x63/0xcd

  -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
	 __lock_acquire+0x1122/0x1e10
	 lock_acquire+0xc2/0x2d0
	 down_write_nested+0x41/0x80
	 __btrfs_tree_lock+0x24/0x110
	 btrfs_lock_root_node+0x31/0x50
	 btrfs_search_slot+0x1cb/0xb70
	 replace_path+0x541/0x9f0
	 merge_reloc_root+0x1d6/0x610
	 merge_reloc_roots+0xe2/0x260
	 relocate_block_group+0x2c8/0x560
	 btrfs_relocate_block_group+0x23e/0x400
	 btrfs_relocate_chunk+0x4c/0x140
	 btrfs_balance+0x755/0xe40
	 btrfs_ioctl+0x1ea2/0x2c90
	 __x64_sys_ioctl+0x88/0xc0
	 do_syscall_64+0x38/0x90
	 entry_SYSCALL_64_after_hwframe+0x63/0xcd

  other info that might help us debug this:

  Chain exists of:
    btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-tree-01/1);
				 lock(btrfs-tree-01);
				 lock(btrfs-tree-01/1);
    lock(btrfs-treloc-02#2);

   *** DEADLOCK ***

  7 locks held by btrfs/752500:
   #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
   #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
   #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
   #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
   #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
   #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
   #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

  stack backtrace:
  CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:

   dump_stack_lvl+0x56/0x73
   check_noncircular+0xd6/0x100
   ? lock_is_held_type+0xe2/0x140
   __lock_acquire+0x1122/0x1e10
   lock_acquire+0xc2/0x2d0
   ? __btrfs_tree_lock+0x24/0x110
   down_write_nested+0x41/0x80
   ? __btrfs_tree_lock+0x24/0x110
   __btrfs_tree_lock+0x24/0x110
   btrfs_lock_root_node+0x31/0x50
   btrfs_search_slot+0x1cb/0xb70
   ? lock_release+0x137/0x2d0
   ? _raw_spin_unlock+0x29/0x50
   ? release_extent_buffer+0x128/0x180
   replace_path+0x541/0x9f0
   merge_reloc_root+0x1d6/0x610
   merge_reloc_roots+0xe2/0x260
   relocate_block_group+0x2c8/0x560
   btrfs_relocate_block_group+0x23e/0x400
   btrfs_relocate_chunk+0x4c/0x140
   btrfs_balance+0x755/0xe40
   btrfs_ioctl+0x1ea2/0x2c90
   ? lock_is_held_type+0xe2/0x140
   ? lock_is_held_type+0xe2/0x140
   ? __x64_sys_ioctl+0x88/0xc0
   __x64_sys_ioctl+0x88/0xc0
   do_syscall_64+0x38/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

This isn't necessarily new, it's just tricky to hit in practice.  There
are two competing things going on here.  With relocation we create a
snapshot of every fs tree with a reloc tree.  Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set.  This creates the lock dependency of

  reloc tree -> normal tree

for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.

However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root.  This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block.  We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace.  This creates the dependency of

  normal tree -> reloc tree

which is why lockdep complains.

Additionally even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has a owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.

Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merged.

This solves the problem of having mixed tree reloc keys intermixed with
normal tree keys, and then allows us to make sure in the merge case we
maintain the lock order of

  normal tree -> reloc tree

We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key.  This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.

With this patch we no longer have the lockdep splat in btrfs/187.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-17 16:19:12 +02:00
Josef Bacik 0a27a0474d btrfs: move lockdep class helpers to locking.c
These definitions exist in disk-io.c, which is not related to the
locking.  Move this over to locking.h/c where it makes more sense.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-17 16:19:10 +02:00
Zixuan Fu 85f02d6c85 btrfs: unset reloc control if transaction commit fails in prepare_to_relocate()
In btrfs_relocate_block_group(), the rc is allocated.  Then
btrfs_relocate_block_group() calls

relocate_block_group()
  prepare_to_relocate()
    set_reloc_control()

that assigns rc to the variable fs_info->reloc_ctl. When
prepare_to_relocate() returns, it calls

btrfs_commit_transaction()
  btrfs_start_dirty_block_groups()
    btrfs_alloc_path()
      kmem_cache_zalloc()

which may fail for example (or other errors could happen). When the
failure occurs, btrfs_relocate_block_group() detects the error and frees
rc and doesn't set fs_info->reloc_ctl to NULL. After that, in
btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and
then used, which may cause a use-after-free bug.

This possible bug can be triggered by calling btrfs_ioctl_balance()
before calling btrfs_ioctl_defrag().

To fix this possible bug, in prepare_to_relocate(), check if
btrfs_commit_transaction() fails. If the failure occurs,
unset_reloc_control() is called to set fs_info->reloc_ctl to NULL.

The error log in our fault-injection testing is shown as follows:

  [   58.751070] BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
  ...
  [   58.753577] Call Trace:
  ...
  [   58.755800]  kasan_report+0x45/0x60
  [   58.756066]  btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
  [   58.757304]  record_root_in_trans+0x792/0xa10 [btrfs]
  [   58.757748]  btrfs_record_root_in_trans+0x463/0x4f0 [btrfs]
  [   58.758231]  start_transaction+0x896/0x2950 [btrfs]
  [   58.758661]  btrfs_defrag_root+0x250/0xc00 [btrfs]
  [   58.759083]  btrfs_ioctl_defrag+0x467/0xa00 [btrfs]
  [   58.759513]  btrfs_ioctl+0x3c95/0x114e0 [btrfs]
  ...
  [   58.768510] Allocated by task 23683:
  [   58.768777]  ____kasan_kmalloc+0xb5/0xf0
  [   58.769069]  __kmalloc+0x227/0x3d0
  [   58.769325]  alloc_reloc_control+0x10a/0x3d0 [btrfs]
  [   58.769755]  btrfs_relocate_block_group+0x7aa/0x1e20 [btrfs]
  [   58.770228]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
  [   58.770655]  __btrfs_balance+0x1326/0x1f10 [btrfs]
  [   58.771071]  btrfs_balance+0x3150/0x3d30 [btrfs]
  [   58.771472]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
  [   58.771902]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
  ...
  [   58.773337] Freed by task 23683:
  ...
  [   58.774815]  kfree+0xda/0x2b0
  [   58.775038]  free_reloc_control+0x1d6/0x220 [btrfs]
  [   58.775465]  btrfs_relocate_block_group+0x115c/0x1e20 [btrfs]
  [   58.775944]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
  [   58.776369]  __btrfs_balance+0x1326/0x1f10 [btrfs]
  [   58.776784]  btrfs_balance+0x3150/0x3d30 [btrfs]
  [   58.777185]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
  [   58.777621]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
  ...

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-17 16:18:58 +02:00
Seth Forshee bf1ac16edf
fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
Idmapped mounts should not allow a user to map file ownsership into a
range of ids which is not under the control of that user. However, we
currently don't check whether the mounter is privileged wrt to the
target user namespace.

Currently no FS_USERNS_MOUNT filesystems support idmapped mounts, thus
this is not a problem as only CAP_SYS_ADMIN in init_user_ns is allowed
to set up idmapped mounts. But this could change in the future, so add a
check to refuse to create idmapped mounts when the mounter does not have
CAP_SYS_ADMIN in the target user namespace.

Fixes: bd303368b7 ("fs: support mapped mounts of mapped filesystems")
Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220816164752.2595240-1-sforshee@digitalocean.com
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-08-17 11:27:11 +02:00
Christian Brauner abfcf55d8b
acl: handle idmapped mounts for idmapped filesystems
Ensure that POSIX ACLs checking, getting, and setting works correctly
for filesystems mountable with a filesystem idmapping ("fs_idmapping")
that want to support idmapped mounts ("mnt_idmapping").

Note that no filesystems mountable with an fs_idmapping do yet support
idmapped mounts. This is required infrastructure work to unblock this.

As we explained in detail in [1] the fs_idmapping is irrelevant for
getxattr() and setxattr() when mapping the ACL_{GROUP,USER} {g,u}ids
stored in the uapi struct posix_acl_xattr_entry in
posix_acl_fix_xattr_{from,to}_user().

But for acl_permission_check() and posix_acl_{g,s}etxattr_idmapped_mnt()
the fs_idmapping matters.

acl_permission_check():
  During lookup POSIX ACLs are retrieved directly via i_op->get_acl() and
  are returned via the kernel internal struct posix_acl which contains
  e_{g,u}id members of type k{g,u}id_t that already take the
  fs_idmapping into acccount.

  For example, a POSIX ACL stored with u4 on the backing store is mapped
  to k10000004 in the fs_idmapping. The mnt_idmapping remaps the POSIX ACL
  to k20000004. In order to do that the fs_idmapping needs to be taken
  into account but that doesn't happen yet (Again, this is a
  counterfactual currently as fuse doesn't support idmapped mounts
  currently. It's just used as a convenient example.):

  fs_idmapping:  u0:k10000000:r65536
  mnt_idmapping: u0:v20000000:r65536
  ACL_USER:      k10000004

  acl_permission_check()
  -> check_acl()
     -> get_acl()
        -> i_op->get_acl() == fuse_get_acl()
           -> posix_acl_from_xattr(u0:k10000000:r65536 /* fs_idmapping */, ...)
              {
                      k10000004 = make_kuid(u0:k10000000:r65536 /* fs_idmapping */,
                                            u4 /* ACL_USER */);
              }
     -> posix_acl_permission()
        {
                -1 = make_vfsuid(u0:v20000000:r65536 /* mnt_idmapping */,
                                 &init_user_ns,
                                 k10000004);
                vfsuid_eq_kuid(-1, k10000004 /* caller_fsuid */)
        }

  In order to correctly map from the fs_idmapping into mnt_idmapping we
  require the relevant fs_idmaping to be passed:

  acl_permission_check()
  -> check_acl()
     -> get_acl()
        -> i_op->get_acl() == fuse_get_acl()
           -> posix_acl_from_xattr(u0:k10000000:r65536 /* fs_idmapping */, ...)
              {
                      k10000004 = make_kuid(u0:k10000000:r65536 /* fs_idmapping */,
                                            u4 /* ACL_USER */);
              }
     -> posix_acl_permission()
        {
                v20000004 = make_vfsuid(u0:v20000000:r65536 /* mnt_idmapping */,
                                        u0:k10000000:r65536 /* fs_idmapping */,
                                        k10000004);
                vfsuid_eq_kuid(v20000004, k10000004 /* caller_fsuid */)
        }

  The initial_idmapping is only correct for the current situation because
  all filesystems that currently support idmapped mounts do not support
  being mounted with an fs_idmapping.

  Note that ovl_get_acl() is used to retrieve the POSIX ACLs from the
  relevant lower layer and the lower layer's mnt_idmapping needs to be
  taken into account and so does the fs_idmapping. See 0c5fd887d2 ("acl:
  move idmapped mount fixup into vfs_{g,s}etxattr()") for more details.

For posix_acl_{g,s}etxattr_idmapped_mnt() it is not as obvious why the
fs_idmapping matters as it is for acl_permission_check(). Especially
because it doesn't matter for posix_acl_fix_xattr_{from,to}_user() (See
[1] for more context.).

Because posix_acl_{g,s}etxattr_idmapped_mnt() operate on the uapi
struct posix_acl_xattr_entry which contains {g,u}id_t values and thus
give the impression that the fs_idmapping is irrelevant as at this point
appropriate {g,u}id_t values have seemlingly been generated.

As we've stated multiple times this assumption is wrong and in fact the
uapi struct posix_acl_xattr_entry is taking idmappings into account
depending at what place it is operated on.

posix_acl_getxattr_idmapped_mnt()
  When posix_acl_getxattr_idmapped_mnt() is called the values stored in
  the uapi struct posix_acl_xattr_entry are mapped according to the
  fs_idmapping. This happened when they were read from the backing store
  and then translated from struct posix_acl into the uapi
  struct posix_acl_xattr_entry during posix_acl_to_xattr().

  In other words, the fs_idmapping matters as the values stored as
  {g,u}id_t in the uapi struct posix_acl_xattr_entry have been generated
  by it.

  So we need to take the fs_idmapping into account during make_vfsuid()
  in posix_acl_getxattr_idmapped_mnt().

posix_acl_setxattr_idmapped_mnt()
  When posix_acl_setxattr_idmapped_mnt() is called the values stored as
  {g,u}id_t in uapi struct posix_acl_xattr_entry are intended to be the
  values that ultimately get turned back into a k{g,u}id_t in
  posix_acl_from_xattr() (which turns the uapi
  struct posix_acl_xattr_entry into the kernel internal struct posix_acl).

  In other words, the fs_idmapping matters as the values stored as
  {g,u}id_t in the uapi struct posix_acl_xattr_entry are intended to be
  the values that will be undone in the fs_idmapping when writing to the
  backing store.

  So we need to take the fs_idmapping into account during from_vfsuid()
  in posix_acl_setxattr_idmapped_mnt().

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Fixes: 0c5fd887d2 ("acl: move idmapped mount fixup into vfs_{g,s}etxattr()")
Cc: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Link: https://lore.kernel.org/r/20220816113514.43304-1-brauner@kernel.org
2022-08-17 11:23:31 +02:00
Fabio M. De Francesco 3a608cfee9 exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().

There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.

With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.

Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.

As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().

Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.

Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-16 12:11:27 -07:00
Namjae Jeon 17661ecf6a ksmbd: don't remove dos attribute xattr on O_TRUNC open
When smb client open file in ksmbd share with O_TRUNC, dos attribute
xattr is removed as well as data in file. This cause the FSCTL_SET_SPARSE
request from the client fails because ksmbd can't update the dos attribute
after setting ATTR_SPARSE_FILE. And this patch fix xfstests generic/469
test also.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-15 21:07:01 -05:00
Hyunchul Lee c90b31eaf9 ksmbd: remove unnecessary generic_fillattr in smb2_open
Remove unnecessary generic_fillattr to fix wrong
AllocationSize of SMB2_CREATE response, And
Move the call of ksmbd_vfs_getattr above the place
where stat is needed because of truncate.

This patch fixes wrong AllocationSize of SMB2_CREATE
response. Because ext4 updates inode->i_blocks only
when disk space is allocated, generic_fillattr does
not set stat.blocks properly for delayed allocation.
But ext4 returns the blocks that include the delayed
allocation blocks when getattr is called.

The issue can be reproduced with commands below:

touch ${FILENAME}
xfs_io -c "pwrite -S 0xAB 0 40k" ${FILENAME}
xfs_io -c "stat" ${FILENAME}

40KB are written, but the count of blocks is 8.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-15 21:06:54 -05:00
Al Viro 3f61631d47 take care to handle NULL ->proc_lseek()
Easily done now, just by clearing FMODE_LSEEK in ->f_mode
during proc_reg_open() for such entries.

Fixes: 868941b144 "fs: remove no_llseek"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-14 15:16:18 -04:00
Linus Torvalds aea23e7c46 fix for /proc/mounts escaping
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYvg//wAKCRBZ7Krx/gZQ
 631TAP92pf2EYSUbWRjSshaxYrIJeiDpIBBikSoEOQj4GG4hUwD/U6f/LnFW0a8P
 58kXaxXI9cNBEVRhwI/dZKW9q/WxBAY=
 =3rHw
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull /proc/mounts fix from Al Viro:
 "Fix for /proc/mounts escaping - escape the '#' character too"

* tag 'pull-work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: escape hash as well
2022-08-13 17:35:58 -07:00
Linus Torvalds 332019e23a 8 cifs/smb3 fixes, mostly restructuring/cleanup, including two for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmL3/wkACgkQiiy9cAdy
 T1Glxwv/Vv6SjM+hXSGeNvSIGmp+Thxv2u19kCSEamHVoURSZoDxWtDNVw262MLF
 Jhd9PTK36ivG7suwxAALInN1bL8nXW6cENB3a0XOR93XaPCtTudXSiZPKXbgXIkl
 kib99S5N5Pm4Dxk6B4WpOCeOS/pkI5fFhR2es4ovBSQR2JacyvjMcJwRkk37lZns
 v9XnvlvQcuhqBL8SIs012AgTRnd1gyIskIf9lghA+OOD87cFt7QhnhHmpKmcdFjw
 eXYqRXncwLgCy9a/CGP0KHP251xJuhiL5iZKZ3qfRq/kvM8Z40mDtTA7M/i9UUV1
 ankjdLhZTpEdBjXHd17hm5BDcxkxIrPjQki64mo73ytvFUB7+MBTGSX579X93QKT
 R1TtzwLvw/1H6Zo03CFREDk5Bz6rGjAC12XbgSwIOWexF4SMHgyxTrQja4R8eiHb
 dNdzuxbJrdYrcTMKjH8l7sB6452Etn00Ua8LHZPcJYPF/4rhFeRd3pvrZ5UwgARL
 UQjS7pXk
 =gZTH
 -----END PGP SIGNATURE-----

Merge tag '5.20-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull more cifs updates from Steve French:

 - two fixes for stable, one for a lock length miscalculation, and
   another fixes a lease break timeout bug

 - improvement to handle leases, allows the close timeout to be
   configured more safely

 - five restructuring/cleanup patches

* tag '5.20-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Do not access tcon->cfids->cfid directly from is_path_accessible
  cifs: Add constructor/destructors for tcon->cfid
  SMB3: fix lease break timeout when multiple deferred close handles for the same file.
  smb3: allow deferred close timeout to be configurable
  cifs: Do not use tcon->cfid directly, use the cfid we get from open_cached_dir
  cifs: Move cached-dir functions into a separate file
  cifs: Remove {cifs,nfs}_fscache_release_page()
  cifs: fix lock length calculation
2022-08-13 17:31:18 -07:00
David Howells 8549a26308 afs: Enable multipage folio support
Enable multipage folio support for the afs filesystem.

Support has already been implemented in netfslib, fscache and cachefiles
and in most of afs, but I've waited for Matthew Wilcox's latest folio
changes.

Note that it does require a change to afs_write_begin() to return the
correct subpage.  This is a "temporary" change as we're working on
getting rid of the need for ->write_begin() and ->write_end()
completely, at least as far as network filesystems are concerned - but
it doesn't prevent afs from making use of the capability.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: kafs-testing@auristor.com
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/lkml/2274528.1645833226@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-13 17:20:51 -07:00
Linus Torvalds f6eb0fed6a Misc timer fixes:
- fix a potential use-after-free bug in posix timers
  - correct a prototype
  - address a build warning
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmL3epQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iPZw/+I/9GXcf3SzbG5M6Nf21SJpSjC4hAHHgb
 eyv5MUNxKvCHU5iT2SrCvgKjESl5I/E70kubeRHJnvarBPUzGnHHzGlYIYOaJPQ7
 irJpUj/6R8ps4UsMBJ8vj5f3b7163zhBJVP8egDW6roT1HUrYTFeIjIli/SOCxpY
 H1/DqHlbEALE5o5xykg3zuqAbywym+hNRleIVls4wqjZNnfqiTElSuW9xqw9xt3n
 9xYmOKZaztdv5Lp2JCm7QOu2byGzeHje72ppsDcBZ3EBvHUBLSndhfe5NQUGhtxy
 UlBqAELA653uPgPnNKLRMqt/kop8emHqvAx8T0RawPwoUS6XGDVxRX+my8+HKklg
 P8KsM/8W7+3KTHz0bf72DEHTFiXCzlswRzdOSvP5bR4xw1G4ychzvuxAiPDFR3zT
 v7uPgykxxCrEexVCBCdPmrl4WikwLJtcrSXtJ4bsisxQFlq7WWd2/osZkTffI3pN
 IIxDXuHFHC78lrUMk2OQ+ITBz01z4nCFSlgMGZ6ZY6ppS1Rndy1HG/B2NgjW1zGP
 Y/1xq/nWaql0QO7RmyoJXt1ZSMJYCyKFocRDh9nBmtBSlYm3A8aIA8b4i1VRRG1G
 8HOkdS8ef2eOWj8wqk0NvoTbiGjV7YM5pf0g1dmRLA+aGCBD1P9/iFcBv5b6Uxaq
 qZ7ZtuQzsyc=
 =Plg8
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Misc timer fixes:

   - fix a potential use-after-free bug in posix timers

   - correct a prototype

   - address a build warning"

* tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Cleanup CPU timers before freeing them during exec
  time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64
  posix-timers: Make do_clock_gettime() static
2022-08-13 14:38:22 -07:00
Linus Torvalds 9872e4a873 New code for 6.0:
- Return error codes from block device flushes to userspace.
  - Fix a deadlock between reclaim and mount time quotacheck.
  - Fix an unnecessary ENOSPC return when doing COW on a filesystem with
    severe free space fragmentation.
  - Fix a miscalculation in the transaction reservation computations for
    file removal operations.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmL1KE8ACgkQ+H93GTRK
 tOt3zBAAlJNBx8jbxGipyDtt7Lxo0Dev2eJEPU2n43CMjl2vnnVSeaGRSWHZNGP3
 3untqmcoR2bX1mpOWwrg9zrftcimYFm3fyW4kpv91+YTL7huX+nHCMBfuDSv8I1U
 FLAVVOU+te0f5kJIcUAJIfotZg8lOo5Exb/lhRyNFRpH+KgBrq/PDforKSvwLs0r
 fGbbMI/+D7CST0+O8nYvhZc/a2ebc1EjlAoPZLTqXrXaljrJwlveRZq9QlY2x2EY
 OJhdc23atDp7D5TBY7Cpv8a7QqGMxSrBLkFqdY0Ne3ui0EiFlDnkQhrWQj2e5P+A
 MFbcwu4JoHmC/hnNq6pTMtoV09YkXKb+SpmisPHQ7jC0D5pBbdPkrVoer5FULVn6
 oedirarGvARd0ymTRILUl4QIko5ITBFDqbOv1fGv4wP4dUrPLE04MP28oJDFb2V9
 CIc3RQKtMdlEbNYc3ocAC+JjE4kAWr5gA0l+rIPEG/7xrcHmoie0wNLXBdGn+u6V
 RdyO9Vx9ma0mJ1jGWJXwqe8UMoPWsr/ASlOlO+xxSQ8k3ffoyS1z20oo/N+d8kOx
 yOtLA+Vk/T1N1dyDB7hcLu97+C5gwdFW7fsFQ+rHcP88mwWpM625uLGy+yt6W8qJ
 5gSeEn192pmGz5aEy+ePChmjHTIMglOYr/bvPAH6eoVHMqrKjXI=
 =wFgu
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.20-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs updates from Darrick Wong:
 "There's not a lot this time around, just the usual bug fixes and
  corrections for missing error returns.

   - Return error codes from block device flushes to userspace

   - Fix a deadlock between reclaim and mount time quotacheck

   - Fix an unnecessary ENOSPC return when doing COW on a filesystem
     with severe free space fragmentation

   - Fix a miscalculation in the transaction reservation computations
     for file removal operations"

* tag 'xfs-5.20-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix inode reservation space for removing transaction
  xfs: Fix false ENOSPC when performing direct write on a delalloc extent in cow fork
  xfs: fix intermittent hang during quotacheck
  xfs: check return codes when flushing block devices
2022-08-13 13:50:11 -07:00
Trond Myklebust edf79efcc9 NFS: Remove a bogus flag setting in pnfs_write_done_resend_to_mds
Since pnfs_write_done_resend_to_mds() does not actually call
end_page_writeback() on the pages that are being redirected to the
metadata server, callers of fsync() do not see the I/O as complete until
the writeback to the MDS finishes. We therefore do not need to set
NFS_CONTEXT_RESEND_WRITES, since there is nothing to redrive.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-13 13:02:14 -04:00
Trond Myklebust 67f4b5dc49 NFS: Fix another fsync() issue after a server reboot
Currently, when the writeback code detects a server reboot, it redirties
any pages that were not committed to disk, and it sets the flag
NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor
that dirtied the file. While this allows the file descriptor in question
to redrive its own writes, it violates the fsync() requirement that we
should be synchronising all writes to disk.
While the problem is infrequent, we do see corner cases where an
untimely server reboot causes the fsync() call to abandon its attempt to
sync data to disk and causing data corruption issues due to missed error
conditions or similar.

In order to tighted up the client's ability to deal with this situation
without introducing livelocks, add a counter that records the number of
times pages are redirtied due to a server reboot-like condition, and use
that in fsync() to redrive the sync to disk.

Fixes: 2197e9b06c ("NFS: Fix up fsync() when the server rebooted")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-13 13:02:13 -04:00
Sun Ke 2067231a9e NFS: Fix missing unlock in nfs_unlink()
Add the missing unlock before goto.

Fixes: 3c59366c20 ("NFS: don't unhash dentry during unlink/rename")
Signed-off-by: Sun Ke <sunke32@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-13 13:02:13 -04:00
Ronnie Sahlberg 7eb59a9870 cifs: Do not access tcon->cfids->cfid directly from is_path_accessible
cfids will soon keep a list of cached fids so we should not access this
directly from outside of cached_dir.c

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-12 17:40:15 -05:00
Ronnie Sahlberg a63ec83c46 cifs: Add constructor/destructors for tcon->cfid
and move the structure definitions into cached_dir.h

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11 20:08:32 -05:00
Bharath SM 9e31678fb4 SMB3: fix lease break timeout when multiple deferred close handles for the same file.
Solution is to send lease break ack immediately even in case of
deferred close handles to avoid lease break request timing out
and let deferred closed handle gets closed as scheduled.
Later patches could optimize cases where we then close some
of these handles sooner for the cases where lease break is to 'none'

Cc: stable@kernel.org
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11 20:07:06 -05:00
Steve French 5efdd9122e smb3: allow deferred close timeout to be configurable
Deferred close can be a very useful feature for allowing
caching data for read, and for minimizing the number of
reopens needed for a file that is repeatedly opened and
close but there are workloads where its default (1 second,
similar to actimeo/acregmax) is much too small.

Allow the user to configure the amount of time we can
defer sending the final smb3 close when we have a
handle lease on the file (rather than forcing it to depend
on value of actimeo which is often unrelated, and less safe).

Adds new mount parameter "closetimeo=" which is the maximum
number of seconds we can wait before sending an SMB3
close when we have a handle lease for it.  Default value
also is set to slightly larger at 5 seconds (although some
other clients use larger default this should still help).

Suggested-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11 20:03:04 -05:00
Ronnie Sahlberg dcb45fd7f5 cifs: Do not use tcon->cfid directly, use the cfid we get from open_cached_dir
They are the same right now but tcon-> will later point to a different
type of struct containing a list of cfids.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11 20:03:04 -05:00
Linus Torvalds 8745889a7f New code for 6.0:
- Remove iomap_writepage and all callers, since the mm apparently never
    called the zonefs or gfs2 writepage functions.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmL1H7kACgkQ+H93GTRK
 tOvzfw/+JJQM3WjwCUg+11O9E+oKS3wbczr0yAd2m8j+EqapdndXzIVevcZKXoTx
 K4zOK9oDecPtRKgQkvrDt7HrMB7oYv8tuSzyfcsNVHbMA6U3twkLdr5c19/lm9uj
 rnP2Xrs0RkiiFpImmTHsviPEyzniJ+BjtRDF7FxSFELxREae4EQW3YX2MjffvqQA
 dT+xXptWiOSa3ygwfoGqVeOLOMt0DqXICiV0GLrGxD6S7TLRRIPo7ojYS4703vUL
 VFTAUvhC4CD9/vsEwPnl91Jq2s06tO3LE4V6vJDPI7/uQFPcubLmcK8GpaYB6+OQ
 q9Fhpc9cU/3JTKt6Sw9uNOqA5hfUKBdJmhWE3FqZ2arql2C9tY2o+cHvRBKZWMZ9
 FdLKSwsuDpL+pYsWOPn7wU8BHZVTDDl7CtDNTCurNkkNgaAbK8C0X7QcT16RRyDF
 SAPHlg0XFewLgJ+9HNyDv70VT1VLYiJNq/h0d/EMO1+FuT4ArBOTOSe4zNNXqD3w
 vVFtbBhjGMf1ffqiMM5GdOPh0vxacL8jfxM7xyQ4yooSkecZCEvtNnuCysNTFDbl
 53b9bjk+OSuWCb7efE6p82wU+gr617Zp2/YxALl4E0FlozeRHuRimWBtABZqi/g6
 aKJL42ASY+PLJPACDjo0LhDFuCRbd75OATUGtBva7mkYWUANlMc=
 =FuyV
 -----END PGP SIGNATURE-----

Merge tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more iomap updates from Darrick Wong:
 "In the past 10 days or so I've not heard any ZOMG STOP style
  complaints about removing ->writepage support from gfs2 or zonefs, so
  here's the pull request removing them (and the underlying fs iomap
  support) from the kernel:

   - Remove iomap_writepage and all callers, since the mm apparently
     never called the zonefs or gfs2 writepage functions"

* tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: remove iomap_writepage
  zonefs: remove ->writepage
  gfs2: remove ->writepage
  gfs2: stop using generic_writepages in gfs2_ail1_start_one
2022-08-11 13:11:49 -07:00
Linus Torvalds 786da5da56 We have a good pile of various fixes and cleanups from Xiubo, Jeff,
Luis and others, almost exclusively in the filesystem.  Several patches
 touch files outside of our normal purview to set the stage for bringing
 in Jeff's long awaited ceph+fscrypt series in the near future.  All of
 them have appropriate acks and sat in linux-next for a while.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmL1HF8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziwOuB/97JKHFuOlP1HrD6fYe5a0ul9zC9VG4
 57XPDNqG2PSmfXCjvZhyVU4n53sUlJTqzKDSTXydoPCMQjtyHvysA6gEvcgUJFPd
 PHaZDCd9TmqX8my67NiTK70RVpNR9BujJMVMbOfM+aaisl0K6WQbitO+BfhEiJcK
 QStdKm5lPyf02ESH9jF+Ga0DpokARaLbtDFH7975owxske6gWuoPBCJNrkMooKiX
 LjgEmNgH1F/sJSZXftmKdlw9DtGBFaLQBdfbfSB5oVPRb7chI7xBeraNr6Od3rls
 o4davbFkcsOr+s6LJPDH2BJobmOg+HoMoma7ezspF7ZqBF4Uipv5j3VC
 =1427
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "We have a good pile of various fixes and cleanups from Xiubo, Jeff,
  Luis and others, almost exclusively in the filesystem.

  Several patches touch files outside of our normal purview to set the
  stage for bringing in Jeff's long awaited ceph+fscrypt series in the
  near future. All of them have appropriate acks and sat in linux-next
  for a while"

* tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client: (27 commits)
  libceph: clean up ceph_osdc_start_request prototype
  libceph: fix ceph_pagelist_reserve() comment typo
  ceph: remove useless check for the folio
  ceph: don't truncate file in atomic_open
  ceph: make f_bsize always equal to f_frsize
  ceph: flush the dirty caps immediatelly when quota is approaching
  libceph: print fsid and epoch with osd id
  libceph: check pointer before assigned to "c->rules[]"
  ceph: don't get the inline data for new creating files
  ceph: update the auth cap when the async create req is forwarded
  ceph: make change_auth_cap_ses a global symbol
  ceph: fix incorrect old_size length in ceph_mds_request_args
  ceph: switch back to testing for NULL folio->private in ceph_dirty_folio
  ceph: call netfs_subreq_terminated with was_async == false
  ceph: convert to generic_file_llseek
  ceph: fix the incorrect comment for the ceph_mds_caps struct
  ceph: don't leak snap_rwsem in handle_cap_grant
  ceph: prevent a client from exceeding the MDS maximum xattr size
  ceph: choose auth MDS for getxattr with the Xs caps
  ceph: add session already open notify support
  ...
2022-08-11 12:41:07 -07:00
Ronnie Sahlberg 05b98fd2da cifs: Move cached-dir functions into a separate file
Also rename crfid to cfid to have consistent naming for this variable.

This commit does not change any logic.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11 10:33:18 -05:00
Atte Heikkilä 4963d74f8a ksmbd: request update to stale share config
ksmbd_share_config_get() retrieves the cached share config as long
as there is at least one connection to the share. This is an issue when
the user space utilities are used to update share configs. In that case
there is a need to inform ksmbd that it should not use the cached share
config for a new connection to the share. With these changes the tree
connection flag KSMBD_TREE_CONN_FLAG_UPDATE indicates this. When this
flag is set, ksmbd removes the share config from the shares hash table
meaning that ksmbd_share_config_get() ends up requesting a share config
from user space.

Signed-off-by: Atte Heikkilä <atteh.mailbox@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11 01:05:18 -05:00
Namjae Jeon fe54833dc8 ksmbd: return STATUS_BAD_NETWORK_NAME error status if share is not configured
If share is not configured in smb.conf, smb2 tree connect should return
STATUS_BAD_NETWORK_NAME instead of STATUS_BAD_NETWORK_PATH.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11 01:05:18 -05:00
David Howells cd04345598 cifs: Remove {cifs,nfs}_fscache_release_page()
Remove {cifs,nfs}_fscache_release_page() from fs/cifs/fscache.h.  This
functionality got built directly into cifs_release_folio() and will
hopefully be replaced with netfs_release_folio() at some point.

The "nfs_" version is a copy and paste error and should've been altered to
read "cifs_".  That can also be removed.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: Steve French <smfrench@gmail.com>
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-10 21:26:08 -05:00
hexiaole 031d166f96 xfs: fix inode reservation space for removing transaction
In 'fs/xfs/libxfs/xfs_trans_resv.c', the comment for transaction of removing a
directory entry writes:

/* fs/xfs/libxfs/xfs_trans_resv.c begin */
/*
 * For removing a directory entry we can modify:
 *    the parent directory inode: inode size
 *    the removed inode: inode size
...
xfs_calc_remove_reservation(
        struct xfs_mount        *mp)
{
        return XFS_DQUOT_LOGRES(mp) +
                xfs_calc_iunlink_add_reservation(mp) +
                max((xfs_calc_inode_res(mp, 1) +
...
/* fs/xfs/libxfs/xfs_trans_resv.c end */

There has 2 inode size of space to be reserverd, but the actual code
for inode reservation space writes.

There only count for 1 inode size to be reserved in
'xfs_calc_inode_res(mp, 1)', rather than 2.

Signed-off-by: hexiaole <hexiaole@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: remove redundant code citations]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-08-10 17:43:49 -07:00
Paulo Alcantara 773891ffd4 cifs: fix lock length calculation
The lock length was wrongly set to 0 when fl_end == OFFSET_MAX, thus
failing to lock the whole file when l_start=0 and l_len=0.

This fixes test 2 from cthon04.

Before patch:

$ ./cthon04/lock/tlocklfs -t 2 /mnt

Creating parent/child synchronization pipes.

Test #1 - Test regions of an unlocked file.
        Parent: 1.1  - F_TEST  [               0,               1] PASSED.
        Parent: 1.2  - F_TEST  [               0,          ENDING] PASSED.
        Parent: 1.3  - F_TEST  [               0,7fffffffffffffff] PASSED.
        Parent: 1.4  - F_TEST  [               1,               1] PASSED.
        Parent: 1.5  - F_TEST  [               1,          ENDING] PASSED.
        Parent: 1.6  - F_TEST  [               1,7fffffffffffffff] PASSED.
        Parent: 1.7  - F_TEST  [7fffffffffffffff,               1] PASSED.
        Parent: 1.8  - F_TEST  [7fffffffffffffff,          ENDING] PASSED.
        Parent: 1.9  - F_TEST  [7fffffffffffffff,7fffffffffffffff] PASSED.

Test #2 - Try to lock the whole file.
        Parent: 2.0  - F_TLOCK [               0,          ENDING] PASSED.
        Child:  2.1  - F_TEST  [               0,               1] FAILED!
        Child:  **** Expected EACCES, returned success...
        Child:  **** Probably implementation error.

**  CHILD pass 1 results: 0/0 pass, 0/0 warn, 1/1 fail (pass/total).
        Parent: Child died

** PARENT pass 1 results: 10/10 pass, 0/0 warn, 0/0 fail (pass/total).

After patch:

$ ./cthon04/lock/tlocklfs -t 2 /mnt

Creating parent/child synchronization pipes.

Test #2 - Try to lock the whole file.
        Parent: 2.0  - F_TLOCK [               0,          ENDING] PASSED.
        Child:  2.1  - F_TEST  [               0,               1] PASSED.
        Child:  2.2  - F_TEST  [               0,          ENDING] PASSED.
        Child:  2.3  - F_TEST  [               0,7fffffffffffffff] PASSED.
        Child:  2.4  - F_TEST  [               1,               1] PASSED.
        Child:  2.5  - F_TEST  [               1,          ENDING] PASSED.
        Child:  2.6  - F_TEST  [               1,7fffffffffffffff] PASSED.
        Child:  2.7  - F_TEST  [7fffffffffffffff,               1] PASSED.
        Child:  2.8  - F_TEST  [7fffffffffffffff,          ENDING] PASSED.
        Child:  2.9  - F_TEST  [7fffffffffffffff,7fffffffffffffff] PASSED.
        Parent: 2.10 - F_ULOCK [               0,          ENDING] PASSED.

** PARENT pass 1 results: 2/2 pass, 0/0 warn, 0/0 fail (pass/total).

**  CHILD pass 1 results: 9/9 pass, 0/0 warn, 0/0 fail (pass/total).

Fixes: d80c69846d ("cifs: fix signed integer overflow when fl_end is OFFSET_MAX")
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-10 16:45:37 -05:00
Linus Torvalds aeb6e6ac18 NFS client updates for Linux 5.20
Highlights include:
 
 Stable fixes:
 - pNFS/flexfiles: Fix infinite looping when the RDMA connection errors out
 
 Bugfixes:
 - NFS: fix port value parsing
 - SUNRPC: Reinitialise the backchannel request buffers before reuse
 - SUNRPC: fix expiry of auth creds
 - NFSv4: Fix races in the legacy idmapper upcall
 - NFS: O_DIRECT fixes from Jeff Layton
 - NFSv4.1: Fix OP_SEQUENCE error handling
 - SUNRPC: Fix an RPC/RDMA performance regression
 - NFS: Fix case insensitive renames
 - NFSv4/pnfs: Fix a use-after-free bug in open
 - NFSv4.1: RECLAIM_COMPLETE must handle EACCES
 
 Features:
 - NFSv4.1: session trunking enhancements
 - NFSv4.2: READ_PLUS performance optimisations
 - NFS: relax the rules for rsize/wsize mount options
 - NFS: don't unhash dentry during unlink/rename
 - SUNRPC: Fail faster on bad verifier
 - NFS/SUNRPC: Various tracing improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmLzy24ACgkQZwvnipYK
 APJJvhAAnVmv3jeOLjwDm+3X9xBevTq8PXAikWVeJgbSKCdjfqXO1J+XF49MpxXl
 N8PQZMqnwWkF3WvhvYobblwGSl6gJ3RnhVuTgdv4jiPl0ZWyS3XngJY0dCTnNgdL
 5O9AjoOMtnGVZwN5j8agymA8f2TUcel5mED6sAk10t2zDZY7VxuQqjp0m696jYjF
 0PKBmxoC4+6tXtYJS7d2PGRCTjfEUx2BLnnGuLOKsB7X8f63XmxJiu8/AIiY7spr
 M/M+BjAF3ok86dT1LjGlkvSNp23H70Wmsv98udfvxIWBm1l4972oCR/CS+kh3/mU
 dYM+oQ3JxxgKEN7Fdak+zU/+qma9q5z2rPFpSIT1fMEuaKN/7H2cbiHPi5RnEBLa
 AHWilX/lWBIMnZJZd9g3yYcGe6E/pkT6TqW5JY+2510koyfNER4IismAWMx2iYKU
 D0WSZOkmEBS/OYZxpTnqGwvS4L1szo9DN3c+yG2KXLifnmVPpjXZO25wahqSuUo3
 V6eYUCXRJmVg+IuXGsMNdrjxGYxD12xChoYzx5RlXls2lwHGeZr+iG3aL3+XayHa
 I1Kji3500UmfEOUUUr4UiQ428dOdL3QqNzVzdymN8Vh4d7v64LUL0GSseY+10Xrs
 xcbR6l/hwjBIo+I1Bi2mmv3W10tKErFy9eBIKzql3D6VHg7ESOo=
 =U00h
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - pNFS/flexfiles: Fix infinite looping when the RDMA connection
     errors out

  Bugfixes:
   - NFS: fix port value parsing
   - SUNRPC: Reinitialise the backchannel request buffers before reuse
   - SUNRPC: fix expiry of auth creds
   - NFSv4: Fix races in the legacy idmapper upcall
   - NFS: O_DIRECT fixes from Jeff Layton
   - NFSv4.1: Fix OP_SEQUENCE error handling
   - SUNRPC: Fix an RPC/RDMA performance regression
   - NFS: Fix case insensitive renames
   - NFSv4/pnfs: Fix a use-after-free bug in open
   - NFSv4.1: RECLAIM_COMPLETE must handle EACCES

  Features:
   - NFSv4.1: session trunking enhancements
   - NFSv4.2: READ_PLUS performance optimisations
   - NFS: relax the rules for rsize/wsize mount options
   - NFS: don't unhash dentry during unlink/rename
   - SUNRPC: Fail faster on bad verifier
   - NFS/SUNRPC: Various tracing improvements"

* tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
  NFS: Improve readpage/writepage tracing
  NFS: Improve O_DIRECT tracing
  NFS: Improve write error tracing
  NFS: don't unhash dentry during unlink/rename
  NFSv4/pnfs: Fix a use-after-free bug in open
  NFS: nfs_async_write_reschedule_io must not recurse into the writeback code
  SUNRPC: Don't reuse bvec on retransmission of the request
  SUNRPC: Reinitialise the backchannel request buffers before reuse
  NFSv4.1: RECLAIM_COMPLETE must handle EACCES
  NFSv4.1 probe offline transports for trunking on session creation
  SUNRPC create a function that probes only offline transports
  SUNRPC export xprt_iter_rewind function
  SUNRPC restructure rpc_clnt_setup_test_and_add_xprt
  NFSv4.1 remove xprt from xprt_switch if session trunking test fails
  SUNRPC create an rpc function that allows xprt removal from rpc_clnt
  SUNRPC enable back offline transports in trunking discovery
  SUNRPC create an iterator to list only OFFLINE xprts
  NFSv4.1 offline trunkable transports on DESTROY_SESSION
  SUNRPC add function to offline remove trunkable transports
  SUNRPC expose functions for offline remote xprt functionality
  ...
2022-08-10 14:04:32 -07:00
Linus Torvalds b1701d5e29 - hugetlb_vmemmap cleanups from Muchun Song
- hardware poisoning support for 1GB hugepages, from Naoya Horiguchi
 
 - highmem documentation fixups from Fabio De Francesco
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYvMFaQAKCRDdBJ7gKXxA
 jrM7AQCsaVsrRJgVMNqjDvLAsWFEm48s6/smQT4UZ9n/rPQUsAEA0iRpOfbjcxwK
 npAml1mp5uP2AkimTVLB6Bokt6N+wAg=
 =9Wta
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull remaining MM updates from Andrew Morton:
 "Three patch series - two that perform cleanups and one feature:

   - hugetlb_vmemmap cleanups from Muchun Song

   - hardware poisoning support for 1GB hugepages, from Naoya Horiguchi

   - highmem documentation fixups from Fabio De Francesco"

* tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
  Documentation/mm: add details about kmap_local_page() and preemption
  highmem: delete a sentence from kmap_local_page() kdocs
  Documentation/mm: rrefer kmap_local_page() and avoid kmap()
  Documentation/mm: avoid invalid use of addresses from kmap_local_page()
  Documentation/mm: don't kmap*() pages which can't come from HIGHMEM
  highmem: specify that kmap_local_page() is callable from interrupts
  highmem: remove unneeded spaces in kmap_local_page() kdocs
  mm, hwpoison: enable memory error handling on 1GB hugepage
  mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
  mm, hwpoison: make __page_handle_poison returns int
  mm, hwpoison: set PG_hwpoison for busy hugetlb pages
  mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
  mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
  mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
  mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE
  mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst
  mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  mm: hugetlb_vmemmap: replace early_param() with core_param()
  mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
  ...
2022-08-10 11:18:00 -07:00
Dan Carpenter d4073595d0
fs/ntfs3: uninitialized variable in ntfs_set_acl_ex()
The goto out calls kfree(value) on an uninitialized pointer.  Just
return directly as the other error paths do.

Fixes: 460bbf2990 ("fs/ntfs3: Do not change mode if ntfs_set_ea failed")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-10 19:12:58 +03:00
Jiapeng Chong 96964352e2
fs/ntfs3: Remove unused function wnd_bits
Since the function wnd_bits is defined but not called in any file, it is
a useless function, and we delete it in view of the brevity of the code.

Remove some warnings found by running scripts/kernel-doc, which is
caused by using 'make W=1'.

fs/ntfs3/bitmap.c:54:19: warning: unused function 'wnd_bits' [-Wunused-function].

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-10 19:12:48 +03:00
Linus Torvalds e394ff83bb NFSD 6.0 Release Notes
Work on "courteous server", which was introduced in 5.19, continues
 apace. This release introduces a more flexible limit on the number
 of NFSv4 clients that NFSD allows, now that NFSv4 clients can remain
 in courtesy state long after the lease expiration timeout. The
 client limit is adjusted based on the physical memory size of the
 server.
 
 The NFSD filecache is a cache of files held open by NFSv4 clients or
 recently touched by NFSv2 or NFSv3 clients. This cache had some
 significant scalability constraints that have been relieved in this
 release. Thanks to all who contributed to this work.
 
 A data corruption bug found during the most recent NFS bake-a-thon
 that involves NFSv3 and NFSv4 clients writing the same file has been
 addressed in this release.
 
 This release includes several improvements in CPU scalability for
 NFSv4 operations. In addition, Neil Brown provided patches that
 simplify locking during file lookup, creation, rename, and removal
 that enables subsequent work on making these operations more
 scalable. We expect to see that work materialize in the next
 release.
 
 There are also numerous single-patch fixes, clean-ups, and the
 usual improvements in observability.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmLujF0ACgkQM2qzM29m
 f5c14w/+Lsoryo5vdTXMAZBXBNvVdXQmHqLIEotJEVA3sECr+Kad2bF8rFaCWVzS
 Gf9KhTetmcDO9O73I5I/UtJ2qFT9B4I6baSGpOzIkjsM/aKeEEQpbpdzPYhKrCEv
 bQu3P54js7snH4YV8s0I39nBFOWdnahYXaw7peqE/2GOHxaR2mz88bkrQ+OybCxz
 KETqUxA6bKzkOT61S0nHcnQKd8HQzhocMDtrxtANHGsMM167ngI1dw4tUQAtfAUI
 s9R+GS6qwiKgwGz1oqhTR6LA/h4DROxPnc7AieuD9FvuAnR3kXw61bGMN5Biwv2T
 JZUTBbQvWhNasSV+7qOY9nBu+sHVC6Q7OZ5C9F/KjMyqCioDX0DnbxX9uKP20CDd
 EAAMS8n4Tdgd4KRBWdkLXPzizWYAjZQmFIJtcZne1JzGZ4IWRnikgM5qD6n1VviZ
 kcPRm5EN3DRHA+Hte4jG0EHIrE/7g5gnf+zr9dWl3uNhZtfTmumCfU16YYmKG8pP
 QN4kXBR2w7dAvp8nRaOsY6bBFLDAk/jHbpY8Q4xoUO4tsojfWayCTGVFOrecOjxv
 uSn0LhiidC5pLlkcPgwemhysVywDzr+gGXBRJXeUOHfdd05Q2gbFK8OpqDSvJ3dZ
 aC/RxFvHc8jaktUcuIjkE6Rsz6AVaAH3EZj84oMZ4hZhyGbEreg=
 =PEJ3
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "Work on 'courteous server', which was introduced in 5.19, continues
  apace. This release introduces a more flexible limit on the number of
  NFSv4 clients that NFSD allows, now that NFSv4 clients can remain in
  courtesy state long after the lease expiration timeout. The client
  limit is adjusted based on the physical memory size of the server.

  The NFSD filecache is a cache of files held open by NFSv4 clients or
  recently touched by NFSv2 or NFSv3 clients. This cache had some
  significant scalability constraints that have been relieved in this
  release. Thanks to all who contributed to this work.

  A data corruption bug found during the most recent NFS bake-a-thon
  that involves NFSv3 and NFSv4 clients writing the same file has been
  addressed in this release.

  This release includes several improvements in CPU scalability for
  NFSv4 operations. In addition, Neil Brown provided patches that
  simplify locking during file lookup, creation, rename, and removal
  that enables subsequent work on making these operations more scalable.
  We expect to see that work materialize in the next release.

  There are also numerous single-patch fixes, clean-ups, and the usual
  improvements in observability"

* tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (78 commits)
  lockd: detect and reject lock arguments that overflow
  NFSD: discard fh_locked flag and fh_lock/fh_unlock
  NFSD: use (un)lock_inode instead of fh_(un)lock for file operations
  NFSD: use explicit lock/unlock for directory ops
  NFSD: reduce locking in nfsd_lookup()
  NFSD: only call fh_unlock() once in nfsd_link()
  NFSD: always drop directory lock in nfsd_unlink()
  NFSD: change nfsd_create()/nfsd_symlink() to unlock directory before returning.
  NFSD: add posix ACLs to struct nfsd_attrs
  NFSD: add security label to struct nfsd_attrs
  NFSD: set attributes when creating symlinks
  NFSD: introduce struct nfsd_attrs
  NFSD: verify the opened dentry after setting a delegation
  NFSD: drop fh argument from alloc_init_deleg
  NFSD: Move copy offload callback arguments into a separate structure
  NFSD: Add nfsd4_send_cb_offload()
  NFSD: Remove kmalloc from nfsd4_do_async_copy()
  NFSD: Refactor nfsd4_do_copy()
  NFSD: Refactor nfsd4_cleanup_inter_ssc() (2/2)
  NFSD: Refactor nfsd4_cleanup_inter_ssc() (1/2)
  ...
2022-08-09 14:56:49 -07:00
Trond Myklebust 3fa5cbdc44 NFS: Improve readpage/writepage tracing
Switch formatting to better match that used by other NFS tracepoints.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-09 14:11:34 -04:00
Trond Myklebust b313eb9152 NFS: Improve O_DIRECT tracing
Switch the formatting to match the other NFS tracepoints.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-09 14:11:34 -04:00
Trond Myklebust af887e437b NFS: Improve write error tracing
Don't leak request pointers, but use the "device:inode" labelling that
is used by all the other trace points. Furthermore, replace use of page
indexes with an offset, again in order to align behaviour with other
NFS trace points.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-09 14:11:34 -04:00
Thadeu Lima de Souza Cascardo e362359ace posix-cpu-timers: Cleanup CPU timers before freeing them during exec
Commit 55e8c8eb2c ("posix-cpu-timers: Store a reference to a pid not a
task") started looking up tasks by PID when deleting a CPU timer.

When a non-leader thread calls execve, it will switch PIDs with the leader
process. Then, as it calls exit_itimers, posix_cpu_timer_del cannot find
the task because the timer still points out to the old PID.

That means that armed timers won't be disarmed, that is, they won't be
removed from the timerqueue_list. exit_itimers will still release their
memory, and when that list is later processed, it leads to a
use-after-free.

Clean up the timers from the de-threaded task before freeing them. This
prevents a reported use-after-free.

Fixes: 55e8c8eb2c ("posix-cpu-timers: Store a reference to a pid not a task")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220809170751.164716-1-cascardo@canonical.com
2022-08-09 20:02:13 +02:00
Linus Torvalds 15205c2829 fscache fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmLyXZ4ACgkQ+7dXa6fL
 C2tJCxAAhH4ufVMk1r8H/lflCulaBw917zgoOO/daO9FjKDtL6hReTN6OgJLLULw
 a1ZMsRxl1NJy5duGYRswUR6gS8y4LzTdILvqfvP+9YSoZzou+uTHHuKFJheA6x++
 /cRQFBQmx8JTBrEQXSHULa3FfKsU0UCwxCzy7q4o5zJogK6yiLhApxORIB5ZYi7W
 k/3iKA0isqKsSEwmRt3D9ekypb3E/8QGaQV7Bng6/1wldFFmL1g5w/ubY8TCsefV
 gnNUA3Ops3LsYj9a0vQGJ6lXKIol2nFClvmFM1qvb09u4PaVa2tofXd5FQuy/iC/
 z2+ULUtBi7gs+DjsxPBf1cnbeA3BzNu5BfdQFd3QJksKFHcIMGsKnTQehPuVPhdn
 p0IGZrf3/sncmXWVZ322w+R0TLiwB0CX9fh4WS5aYw9WHtRg+FGeFCoVLSV774M0
 OiRxefg6A4sBMP2XwHICRXWPtqnBj2PYhQ6bj6pCHJg4sGfrYnf/q7vWFI1rJ9JW
 258lhyWaaO0SqPMtdYamBQzPQAAsuG72Mrf5k3pFLWIuPPb9Ho3B7fwbtjcMQcbv
 q+apsiiPk/aqbjbI8YclOD+XgRQMUaRO2s2+JCAQvoFoWPRLYE6ZaTDdesN0nHJB
 PgTFACx/dPtTCpQxIUDldDsj8wk4iPmPZiQ31gVzKr53v2Jvkhc=
 =LWMd
 -----END PGP SIGNATURE-----

Merge tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache updates from David Howells:

 - Fix a cookie access ref leak if a cookie is invalidated a second time
   before the first invalidation is actually processed.

 - Add a tracepoint to log cookie lookup failure

* tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fscache: add tracepoint when failing cookie
  fscache: don't leak cookie access refs if invalidation is in progress or failed
2022-08-09 10:11:56 -07:00
Linus Torvalds 4b22e20741 AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmLpXWwACgkQ+7dXa6fL
 C2tpSg//WX4+1XvsehIZxykJYmQ7XjVjrg9uGIEaslUry2nqEECDeK7LppKLuCoD
 1xMBc5GRn4h5YbBCGbMWnr+r6eezcXdwr5Fv+On4fD3l5AWTk5xTaR8v1So3G9bK
 nAkWaoSrFapuifWgBkDGy9NhLW/k6d24BAKCqGhOlPcVzi4B95C4gQ5sIlMlH9xw
 jUkHB+HSLZj/yBV47sWQUs0uy5fCiOgm3qgWYl2cTHD4qJ052i0ELO30uAv9Oykq
 YjgOUryZlCP/vLSWzOeK8bgINjCLetd+2jGT96bV15BouQ4uoHSrgt7ACtp1I3NP
 WAPGdzlzXZqaYSrqX2sPP/Tmq0/SipoKCRqK5e2yetmqs7zznKZe0ATwFhjJYp7L
 IOULEaOAyrAeT9KQEwqtANXXoUxyEUqz2GKiybXWC8oUFJ0RAKpSf4LRi9cpoaRz
 VDmi0F4/ZA+nMDOPPdbSZTtDCu+68Lz2l6Z7ONwIZ1qzyXJB4BC4wb+9rx7UftA9
 QuX9TqGUtjlWiPcM0qvdFoPOIA3xOXbZVtlvqygFxzDDXGlRVZdUa3uqMjELKzK1
 v1yRZIS00zqWM4lWFyaQNCZHwFhlnixRzkAmvQcioAb/NJL3eA4n3FbL+RBTSrWP
 XEtgpI32ROobA3S+69LomYH4AisMgP9XDP04mOQqQfIxg9GFdEI=
 =fXqy
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:
 "Fix AFS refcount handling.

  The first patch converts afs to use refcount_t for its refcounts and
  the second patch fixes afs_put_call() and afs_put_server() to save the
  values they're going to log in the tracepoint before decrementing the
  refcount"

* tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix access after dec in put functions
  afs: Use refcount_t rather than atomic_t
2022-08-09 10:08:08 -07:00
Linus Torvalds 426b4ca2d6 fs.setgid.v6.0
-----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYvIYIgAKCRCRxhvAZXjc
 omE8AQDAZG2YjNJfMnUUaaWYO3+zTaHlQp7OQkQTXIHfcfViXQD4vPt3Wxh3rrF+
 J8fwNcWmXhSNei8HP6cA06QmSajnDQ==
 =GF9/
 -----END PGP SIGNATURE-----

Merge tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull setgid updates from Christian Brauner:
 "This contains the work to move setgid stripping out of individual
  filesystems and into the VFS itself.

  Creating files that have both the S_IXGRP and S_ISGID bit raised in
  directories that themselves have the S_ISGID bit set requires
  additional privileges to avoid security issues.

  When a filesystem creates a new inode it needs to take care that the
  caller is either in the group of the newly created inode or they have
  CAP_FSETID in their current user namespace and are privileged over the
  parent directory of the new inode. If any of these two conditions is
  true then the S_ISGID bit can be raised for an S_IXGRP file and if not
  it needs to be stripped.

  However, there are several key issues with the current implementation:

   - S_ISGID stripping logic is entangled with umask stripping.

     For example, if the umask removes the S_IXGRP bit from the file
     about to be created then the S_ISGID bit will be kept.

     The inode_init_owner() helper is responsible for S_ISGID stripping
     and is called before posix_acl_create(). So we can end up with two
     different orderings:

     1. FS without POSIX ACL support

        First strip umask then strip S_ISGID in inode_init_owner().

        In other words, if a filesystem doesn't support or enable POSIX
        ACLs then umask stripping is done directly in the vfs before
        calling into the filesystem:

     2. FS with POSIX ACL support

        First strip S_ISGID in inode_init_owner() then strip umask in
        posix_acl_create().

        In other words, if the filesystem does support POSIX ACLs then
        unmask stripping may be done in the filesystem itself when
        calling posix_acl_create().

     Note that technically filesystems are free to impose their own
     ordering between posix_acl_create() and inode_init_owner() meaning
     that there's additional ordering issues that influence S_ISGID
     inheritance.

     (Note that the commit message of commit 1639a49ccd ("fs: move
     S_ISGID stripping into the vfs_*() helpers") gets the ordering
     between inode_init_owner() and posix_acl_create() the wrong way
     around. I realized this too late.)

   - Filesystems that don't rely on inode_init_owner() don't get S_ISGID
     stripping logic.

     While that may be intentional (e.g. network filesystems might just
     defer setgid stripping to a server) it is often just a security
     issue.

     Note that mandating the use of inode_init_owner() was proposed as
     an alternative solution but that wouldn't fix the ordering issues
     and there are examples such as afs where the use of
     inode_init_owner() isn't possible.

     In any case, we should also try the cleaner and generalized
     solution first before resorting to this approach.

   - We still have S_ISGID inheritance bugs years after the initial
     round of S_ISGID inheritance fixes:

       e014f37db1 ("xfs: use setattr_copy to set vfs inode attributes")
       01ea173e10 ("xfs: fix up non-directory creation in SGID directories")
       fd84bfdddd ("ceph: fix up non-directory creation in SGID directories")

  All of this led us to conclude that the current state is too messy.
  While we won't be able to make it completely clean as
  posix_acl_create() is still a filesystem specific call we can improve
  the S_SIGD stripping situation quite a bit by hoisting it out of
  inode_init_owner() and into the respective vfs creation operations.

  The obvious advantage is that we don't need to rely on individual
  filesystems getting S_ISGID stripping right and instead can
  standardize the ordering between S_ISGID and umask stripping directly
  in the VFS.

  A few short implementation notes:

   - The stripping logic needs to happen in vfs_*() helpers for the sake
     of stacking filesystems such as overlayfs that rely on these
     helpers taking care of S_ISGID stripping.

   - Security hooks have never seen the mode as it is ultimately seen by
     the filesystem because of the ordering issue we mentioned. Nothing
     is changed for them. We simply continue to strip the umask before
     passing the mode down to the security hooks.

   - The following filesystems use inode_init_owner() and thus relied on
     S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs,
     hfsplus, hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs,
     overlayfs, ramfs, reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs,
     bpf, tmpfs.

     We've audited all callchains as best as we could. More details can
     be found in the commit message to 1639a49ccd ("fs: move S_ISGID
     stripping into the vfs_*() helpers")"

* tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  ceph: rely on vfs for setgid stripping
  fs: move S_ISGID stripping into the vfs_*() helpers
  fs: Add missing umask strip in vfs_tmpfile
  fs: add mode_strip_sgid() helper
2022-08-09 09:52:28 -07:00
Jeff Layton 1a1e3aca9d fscache: add tracepoint when failing cookie
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
2022-08-09 14:13:59 +01:00
Jeff Layton fb24771faf fscache: don't leak cookie access refs if invalidation is in progress or failed
It's possible for a request to invalidate a fscache_cookie will come in
while we're already processing an invalidation. If that happens we
currently take an extra access reference that will leak. Only call
__fscache_begin_cookie_access if the FSCACHE_COOKIE_DO_INVALIDATE bit
was previously clear.

Also, ensure that we attempt to clear the bit when the cookie is
"FAILED" and put the reference to avoid an access leak.

Fixes: 85e4ea1049 ("fscache: Fix invalidation/lookup race")
Suggested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
2022-08-09 14:13:55 +01:00
Linus Torvalds eb555cb5b7 12 server fixes, including for oopses, memory leaks and adding a channel lock
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmLwixsACgkQiiy9cAdy
 T1FT6Av+NY7R2f7/vDRzX/8u1fFXmyQDU2/dl3S0ysqqFEYE5hnHgbFtVP1IYwId
 /IAV8R6E/YMm8Nkbkf173EqHI8E9QNUSngTCk1bcvsiNSW4z3QOsujs9B6oTUeXM
 ARthU4eFJQW0wcTyjElVcm59I/jdtpQKaoVDQ/uXlCjPMT+0BUHS4mFRJkAx6DrX
 7wOO0vJ/WCqO2u0rSvK4unOZsvhrKHibtPMvQSiV6k3a+LVCJ5/5lwBpENKK4Mdw
 n4LeNhSDtm6HE6WDFIxSj008WPX7CF3JvX+zvZJEsB7uFNry/GVkWySPLqVYUmtp
 aSftdnWnw0Mv9AG3L4OMzyCQQP89ccoV81d6lyyJEBaGiOFhH8L1v6b7qAStCZue
 zyyFWIVUXnstQKwDY8QWLKjmi7H+I1drFExxn0vABInPcaijQE8Mct0fSt015Tc1
 EuCsO1JRyo+zn0mx7a1L80kHUH+yvHjKkfbINvPSa+qTFy21bgQZcG5CWSPpjadr
 4zibpKko
 =cDaA
 -----END PGP SIGNATURE-----

Merge tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull ksmbd updates from Steve French:

 - fixes for memory access bugs (out of bounds access, oops, leak)

 - multichannel fixes

 - session disconnect performance improvement, and session register
   improvement

 - cleanup

* tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix heap-based overflow in set_ntacl_dacl()
  ksmbd: prevent out of bound read for SMB2_TREE_CONNNECT
  ksmbd: prevent out of bound read for SMB2_WRITE
  ksmbd: fix use-after-free bug in smb2_tree_disconect
  ksmbd: fix memory leak in smb2_handle_negotiate
  ksmbd: fix racy issue while destroying session on multichannel
  ksmbd: use wait_event instead of schedule_timeout()
  ksmbd: fix kernel oops from idr_remove()
  ksmbd: add channel rwlock
  ksmbd: replace sessions list in connection with xarray
  MAINTAINERS: ksmbd: add entry for documentation
  ksmbd: remove unused ksmbd_share_configs_cleanup function
2022-08-08 20:15:13 -07:00
Linus Torvalds f30adc0d33 iov_iter stuff, part 2, rebased
* more new_sync_{read,write}() speedups - ITER_UBUF introduction
 * ITER_PIPE cleanups
 * unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
   switching them to advancing semantics
 * making ITER_PIPE take high-order pages without splitting them
 * handling copy_page_from_iter() for high-order pages properly
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYvHI8QAKCRBZ7Krx/gZQ
 62CQAPsGlbebqBeAT2pMulaGDxfLAsgz5Yf4BEaMLhPtRqFOQgD+KrZQId7Sd8O0
 3IWucpTb2c4jvLlXhGMS+XWnusQH+AQ=
 =pBux
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull more iov_iter updates from Al Viro:

 - more new_sync_{read,write}() speedups - ITER_UBUF introduction

 - ITER_PIPE cleanups

 - unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
   switching them to advancing semantics

 - making ITER_PIPE take high-order pages without splitting them

 - handling copy_page_from_iter() for high-order pages properly

* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
  fix copy_page_from_iter() for compound destinations
  hugetlbfs: copy_page_to_iter() can deal with compound pages
  copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
  expand those iov_iter_advance()...
  pipe_get_pages(): switch to append_pipe()
  get rid of non-advancing variants
  ceph: switch the last caller of iov_iter_get_pages_alloc()
  9p: convert to advancing variant of iov_iter_get_pages_alloc()
  af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
  iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
  block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
  iov_iter: saner helper for page array allocation
  fold __pipe_get_pages() into pipe_get_pages()
  ITER_XARRAY: don't open-code DIV_ROUND_UP()
  unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
  unify xarray_get_pages() and xarray_get_pages_alloc()
  unify pipe_get_pages() and pipe_get_pages_alloc()
  iov_iter_get_pages(): sanity-check arguments
  iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
  ...
2022-08-08 20:04:35 -07:00
Al Viro c7d57ab163 hugetlbfs: copy_page_to_iter() can deal with compound pages
... since April 2021

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:26 -04:00
Al Viro b53589927d ceph: switch the last caller of iov_iter_get_pages_alloc()
here nothing even looks at the iov_iter after the call, so we couldn't
care less whether it advances or not.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:24 -04:00