Граф коммитов

73558 Коммитов

Автор SHA1 Сообщение Дата
Filipe Manana dc2872247e btrfs: keep track of the last logged keys when logging a directory
After the first time we log a directory in the current transaction, for
each directory item in a changed leaf of the subvolume tree, we have to
check if we previously logged the item, in order to overwrite it in case
its data changed or skip it in case its data hasn't changed.

Checking if we have logged each item before not only wastes times, but it
also adds lock contention on the log tree. So in order to minimize the
number of times we do such checks, keep track of the offset of the last
key we logged for a directory and, on the next time we log the directory,
skip the checks for any new keys that have an offset greater than the
offset we have previously saved. This is specially effective for index
keys, because the offset for these keys comes from a monotonically
increasing counter.

This patch is part of a patchset comprised of the following 5 patches:

  btrfs: remove root argument from btrfs_log_inode() and its callees
  btrfs: remove redundant log root assignment from log_dir_items()
  btrfs: factor out the copying loop of dir items from log_dir_items()
  btrfs: insert items in batches when logging a directory when possible
  btrfs: keep track of the last logged keys when logging a directory

This is patch 5/5.

The following test was used on a non-debug kernel to measure the impact
it has on a directory fsync:

  $ cat test-dir-fsync.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1

  NUM_NEW_FILES=100000
  NUM_FILE_DELETES=1000

  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  mkdir $MNT/testdir

  for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
      echo -n > $MNT/testdir/file_$i
  done

  # fsync the directory, this will log the new dir items and the inodes
  # they point to, because these are new inodes.
  start=$(date +%s%N)
  xfs_io -c "fsync" $MNT/testdir
  end=$(date +%s%N)

  dur=$(( (end - start) / 1000000 ))
  echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"

  # sync to force transaction commit and wipeout the log.
  sync

  del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
  for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
      rm -f $MNT/testdir/file_$i
  done

  # fsync the directory, this will only log dir items, there are no
  # dentries pointing to new inodes.
  start=$(date +%s%N)
  xfs_io -c "fsync" $MNT/testdir
  end=$(date +%s%N)

  dur=$(( (end - start) / 1000000 ))
  echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"

  umount $MNT

Test results with NUM_NEW_FILES set to 100 000 and 1 000 000:

**** before patchset, 100 000 files, 1000 deletes ****

dir fsync took 848 ms after adding 100000 files
dir fsync took 175 ms after deleting 1000 files

**** after patchset, 100 000 files, 1000 deletes ****

dir fsync took 758 ms after adding 100000 files  (-11.2%)
dir fsync took 63 ms after deleting 1000 files   (-94.1%)

**** before patchset, 1 000 000 files, 1000 deletes ****

dir fsync took 9945 ms after adding 1000000 files
dir fsync took 473 ms after deleting 1000 files

**** after patchset, 1 000 000 files, 1000 deletes ****

dir fsync took 8677 ms after adding 1000000 files (-13.6%)
dir fsync took 146 ms after deleting 1000 files   (-105.6%)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:02 +02:00
Filipe Manana 086dcbfa50 btrfs: insert items in batches when logging a directory when possible
When logging a directory, we scan its directory items from the subvolume
tree and then copy one by one into the log tree. This is not efficient
since we generally are able to insert several items in a batch, using a
single btree operation for adding several items at once. The reason we
copy items one by one is that we must check if each item was previously
logged in the current transaction, and if it was we either overwrite it
or skip it in case its content did not change in the subvolume tree (this
can happen only for dir item keys, but not for dir index keys), and doing
such check makes it a bit cumbersome to attempt batch insertions.

However the chances for doing batch insertions are very frequent and
always happen when:

1) Logging the directory for the first time in the current transaction,
   as none of the items exist in the log tree yet;

2) Logging new dir index keys, because the offset for new dir index keys
   comes from a monotonically increasing counter. This means if we keep
   adding dentries to a directory, through creation of new files and
   sub-directories or by adding new links or renaming from some other
   directory into the one we are logging, all the new dir index keys
   have a new offset that is greater than the offset of any previously
   logged index keys, so we can insert them in batches into the log tree.

For dir item keys, since their offset depends on the result of an hash
function against the dentry's name, unless the directory is being logged
for the first time in the current transaction, the chances being able to
insert the items in the log using batches is pretty much random and not
predictable, as it depends on the names of the dentries, but still happens
often enough.

So change directory logging to keep track of consecutive directory items
that don't exist yet in the log and batch insert them.

This patch is part of a patchset comprised of the following 5 patches:

  btrfs: remove root argument from btrfs_log_inode() and its callees
  btrfs: remove redundant log root assignment from log_dir_items()
  btrfs: factor out the copying loop of dir items from log_dir_items()
  btrfs: insert items in batches when logging a directory when possible
  btrfs: keep track of the last logged keys when logging a directory

This is patch 4/5. The change log of the last patch (5/5) has performance
results.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:02 +02:00
Filipe Manana eb10d85ee7 btrfs: factor out the copying loop of dir items from log_dir_items()
In preparation for the next change, move the loop that processes a leaf
and copies its directory items to the log, into a separate helper
function. This makes the next change simpler and it also helps making
log_dir_items() a bit shorter (specially after the next change).

This patch is part of a patchset comprised of the following 5 patches:

  btrfs: remove root argument from btrfs_log_inode() and its callees
  btrfs: remove redundant log root assignment from log_dir_items()
  btrfs: factor out the copying loop of dir items from log_dir_items()
  btrfs: insert items in batches when logging a directory when possible
  btrfs: keep track of the last logged keys when logging a directory

This is patch 3/5. The change log of the last patch (5/5) has performance
results.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:02 +02:00
Filipe Manana d46fb845af btrfs: remove redundant log root assignment from log_dir_items()
At log_dir_items() we are assigning the exact same value to the local
variable 'log', once when it's declared and once again shortly after.
Remove the later assignment as it's pointless.

This patch is part of a patchset comprised of the following 5 patches:

  btrfs: remove root argument from btrfs_log_inode() and its callees
  btrfs: remove redundant log root assignment from log_dir_items()
  btrfs: factor out the copying loop of dir items from log_dir_items()
  btrfs: insert items in batches when logging a directory when possible
  btrfs: keep track of the last logged keys when logging a directory

This is patch 2/5. The change log of the last patch (5/5) has performance
results.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:02 +02:00
Filipe Manana 90d04510a7 btrfs: remove root argument from btrfs_log_inode() and its callees
The root argument passed to btrfs_log_inode() is unncessary, as it is
always the root of the inode we are going to log. This root also gets
unnecessarily propagated to several functions called by btrfs_log_inode(),
and all of them take the inode as an argument as well. So just remove
the root argument from these functions and have them get the root from
the inode where needed.

This patch is part of a patchset comprised of the following 5 patches:

  btrfs: remove root argument from btrfs_log_inode() and its callees
  btrfs: remove redundant log root assignment from log_dir_items()
  btrfs: factor out the copying loop of dir items from log_dir_items()
  btrfs: insert items in batches when logging a directory when possible
  btrfs: keep track of the last logged keys when logging a directory

This is patch 1/5. The change log of the last patch (5/5) has performance
results.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:02 +02:00
Johannes Thumshirn 2d81eb1c3f btrfs: zoned: let the for_treelog test in the allocator stand out
The statement which decides if an extent allocation on a zoned device is
for the dedicated tree-log block group or not and if we can use the block
group we picked for this allocation is not easy to read but an important
part of the allocator.

Rewrite into an if condition instead of a plain boolean test to make it
stand out more, like the version which tests for the dedicated
data-relocation block group.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:02 +02:00
Johannes Thumshirn 4b01c44f15 btrfs: rename setup_extent_mapping in relocation code
In btrfs code we have two functions called setup_extent_mapping, one in
the extent_map code and one in the relocation code. While both are
private to their respective implementation, this can still be confusing
for the reader.

So rename the version in relocation.c to setup_relocation_extent_mapping.
No functional changes.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Johannes Thumshirn 960a3166ae btrfs: zoned: allow preallocation for relocation inodes
Now that we use a dedicated block group and regular writes for data
relocation, we can preallocate the space needed for a relocated inode,
just like we do in regular mode.

Essentially this reverts commit 32430c6148 ("btrfs: zoned: enable
relocation on a zoned filesystem") as it is not needed anymore.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Johannes Thumshirn 2adada886b btrfs: check for relocation inodes on zoned btrfs in should_nocow
Prepare for allowing preallocation for relocation inodes.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Johannes Thumshirn e6d261e3b1 btrfs: zoned: use regular writes for relocation
Now that we have a dedicated block group for relocation, we can use
REQ_OP_WRITE instead of  REQ_OP_ZONE_APPEND for writing out the data on
relocation.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Johannes Thumshirn 35156d8527 btrfs: zoned: only allow one process to add pages to a relocation inode
Don't allow more than one process to add pages to a relocation inode on
a zoned filesystem, otherwise we cannot guarantee the sequential write
rule once we're filling preallocated extents on a zoned filesystem.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Johannes Thumshirn c2707a2556 btrfs: zoned: add a dedicated data relocation block group
Relocation in a zoned filesystem can fail with a transaction abort with
error -22 (EINVAL). This happens because the relocation code assumes that
the extents we relocated the data to have the same size the source extents
had and ensures this by preallocating the extents.

But in a zoned filesystem we currently can't preallocate the extents as
this would break the sequential write required rule. Therefore it can
happen that the writeback process kicks in while we're still adding pages
to a delalloc range and starts writing out dirty pages.

This then creates destination extents that are smaller than the source
extents, triggering the following safety check in get_new_location():

 1034         if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
 1035                 ret = -EINVAL;
 1036                 goto out;
 1037         }

Temporarily create a dedicated block group for the relocation process, so
no non-relocation data writes can interfere with the relocation writes.

This is needed that we can switch the relocation process on a zoned
filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme
like in a non-zoned filesystem using REQ_OP_WRITE and preallocation.

Fixes: 32430c6148 ("btrfs: zoned: enable relocation on a zoned filesystem")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Johannes Thumshirn 37f00a6d2e btrfs: introduce btrfs_is_data_reloc_root
There are several places in our codebase where we check if a root is the
root of the data reloc tree and subsequent patches will introduce more.

Factor out the check into a small helper function instead of open coding
it multiple times.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Qu Wenruo 38d5e541dd btrfs: unexport repair_io_failure()
Function repair_io_failure() is no longer used out of extent_io.c since
commit 8b9b6f2554 ("btrfs: scrub: cleanup the remaining nodatasum
fixup code"), which removes the last external caller.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana f6df27dd27 btrfs: do not commit delayed inode when logging a file in full sync mode
When logging a regular file in full sync mode, we are currently committing
its delayed inode item. This is to ensure that we never miss copying the
inode item, with its most up to date data, into the log tree.

However that is not necessary since commit e4545de5b0 ("Btrfs: fix fsync
data loss after append write"), because even if we don't find the leaf
with the inode item when looking for leaves that changed in the current
transaction, we end up logging the inode item later using the in-memory
content. In case we find the leaf containing the inode item, we already
end up using the in-memory inode for filling the inode item in the log
tree, and not the inode item that is in the fs/subvolume tree, as it
might be not up to date (copy_items() -> fill_inode_item()).

So don't commit the delayed inode item, which brings a couple of benefits:

1) Avoid writing the inode item to the fs/subvolume btree, saving time and
   reducing lock contention on the btree;

2) In case no other item for the inode was changed, added or deleted in
   the same leaf where the inode item is located, we ended up copying
   all the items in that leaf to the log tree - it's harmless from a
   functional point of view, but it wastes time and log tree space.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 10/10 and the following test results compare a branch with
the whole patch set applied versus a branch without any of the patches
applied.

The following script was used to test dbench with 8 and 16 jobs on a
machine with 12 cores, 64G of RAM, a NVME device and using a non-debug
kernel config (Debian's default):

  $ cat test.sh
  #!/bin/bash

  if [ $# -ne 1 ]; then
      echo "Use $0 NUM_JOBS"
      exit 1
  fi

  NUM_JOBS=$1

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 120 $NUM_JOBS

  umount $MNT

The results were the following:

8 jobs, before patchset:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4113896     0.009   238.665
 Close        3021699     0.001     0.590
 Rename        174215     0.082   238.733
 Unlink        830977     0.049   238.642
 Deltree           96     2.232     8.022
 Mkdir             48     0.003     0.005
 Qpathinfo    3729013     0.005     2.672
 Qfileinfo     653206     0.001     0.152
 Qfsinfo       683866     0.002     0.526
 Sfileinfo     335055     0.004     1.571
 Find         1441800     0.016     4.288
 WriteX       2049644     0.010     3.982
 ReadX        6449786     0.003     0.969
 LockX          13400     0.002     0.043
 UnlockX        13400     0.001     0.075
 Flush         288349     2.521   245.516

Throughput 1075.73 MB/sec  8 clients  8 procs  max_latency=245.520 ms

8 jobs, after patchset:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4154282     0.009   156.675
 Close        3051450     0.001     0.843
 Rename        175912     0.072     4.444
 Unlink        839067     0.048    66.050
 Deltree           96     2.131     5.979
 Mkdir             48     0.002     0.004
 Qpathinfo    3765575     0.005     3.079
 Qfileinfo     659582     0.001     0.099
 Qfsinfo       690474     0.002     0.155
 Sfileinfo     338366     0.004     1.419
 Find         1455816     0.016     3.423
 WriteX       2069538     0.010     4.328
 ReadX        6512429     0.003     0.840
 LockX          13530     0.002     0.078
 UnlockX        13530     0.001     0.051
 Flush         291158     2.500   163.468

Throughput 1105.45 MB/sec  8 clients  8 procs  max_latency=163.474 ms

+2.7% throughput, -40.1% max latency

16 jobs, before patchset:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5457602     0.033   337.098
 Close        4008979     0.002     2.018
 Rename        231051     0.323   254.054
 Unlink       1102209     0.202   337.243
 Deltree          160     6.521    31.720
 Mkdir             80     0.003     0.007
 Qpathinfo    4946147     0.014     6.988
 Qfileinfo     867440     0.001     1.642
 Qfsinfo       907081     0.003     1.821
 Sfileinfo     444433     0.005     2.053
 Find         1912506     0.067     7.854
 WriteX       2724852     0.018     7.428
 ReadX        8553883     0.003     2.059
 LockX          17770     0.003     0.350
 UnlockX        17770     0.002     0.627
 Flush         382533     2.810   353.691

Throughput 1413.09 MB/sec  16 clients  16 procs  max_latency=353.696 ms

16 jobs, after patchset:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5393156     0.034   303.181
 Close        3961986     0.002     1.502
 Rename        228359     0.320   253.379
 Unlink       1088920     0.206   303.409
 Deltree          160     6.419    30.088
 Mkdir             80     0.003     0.004
 Qpathinfo    4887967     0.015     7.722
 Qfileinfo     857408     0.001     1.651
 Qfsinfo       896343     0.002     2.147
 Sfileinfo     439317     0.005     4.298
 Find         1890018     0.073     8.347
 WriteX       2693356     0.018     6.373
 ReadX        8453485     0.003     3.836
 LockX          17562     0.003     0.486
 UnlockX        17562     0.002     0.635
 Flush         378023     2.802   315.904

Throughput 1454.46 MB/sec  16 clients  16 procs  max_latency=315.910 ms

+2.9% throughput, -11.3% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana 5328b2a7ff btrfs: avoid attempt to drop extents when logging inode for the first time
When logging an extent, in the fast fsync path, we always attempt do drop
or trim any existing extents with a range that match or overlap the range
of the extent we are about to log. We do that through a call to
btrfs_drop_extents().

However this is not needed when we are logging the inode for the first
time in the current transaction, since we have no inode items of the
inode in the log tree. Calling btrfs_drop_extents() does a deletion search
on the log tree, which is expensive when we have concurrent tasks
accessing the log tree because a deletion search always acquires a write
lock on the extent buffers at levels 2, 1 and 0, adding significant lock
contention, specially taking into account the height of a log tree rarely
(if ever) goes beyond 2 or 3, due to its short life.

So skip the call to btrfs_drop_extents() when the inode was not previously
logged in the current transaction.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 9/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana a5c733a4b6 btrfs: avoid search for logged i_size when logging inode if possible
If we are logging that an inode exists and the inode was not logged
before, we can avoid searching in the log tree for the inode item since we
know it does not exists. That wastes time and adds more lock contention on
the extent buffers of the log tree when there are other tasks that are
logging other inodes.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 8/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana 4934a81502 btrfs: avoid expensive search when truncating inode items from the log
Whenever we are logging a file inode in full sync mode we call
btrfs_truncate_inode_items() to delete items of the inode we may have
previously logged.

That results in doing a btree search for deletion, which is expensive
because it always acquires write locks for extent buffers at levels 2, 1
and 0, and it balances any node that is less than half full. Acquiring
the write locks can block the task if the extent buffers are already
locked by another task or block other tasks attempting to lock them,
which is specially bad in case of log trees since they are small due to
their short life, with a root node at a level typically not greater than
level 2.

If we know that we are logging the inode for the first time in the current
transaction, we can skip the call to btrfs_truncate_inode_items(), avoiding
the deletion search. This change does that.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 7/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana 8a2b3da191 btrfs: add helper to truncate inode items when logging inode
Move the call to btrfs_truncate_inode_items(), and the surrounding retry
loop, into a local helper function. This avoids some repetition and avoids
making the next change a bit awkward due to a bit of too much indentation.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 6/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana 88e221cdac btrfs: avoid expensive search when dropping inode items from log
Whenever we are logging a directory inode, logging that an inode exists or
logging an inode that has changes in its references or xattrs, we attempt
to delete items of this inode we may have previously logged (through calls
to drop_objectid_items()).

That attempt does a btree search for deletion, which is expensive because
it always acquires write locks for extent buffers at levels 2, 1 and 0,
and it balances any node that is less than half full. Acquiring the write
locks can block the task if the extent buffers are already locked or block
other tasks attempting to lock them, which is specially bad in case of log
trees since they are small due to their short life, with a root node at a
level typically not greater than level 2.

If we know that we are logging the inode for the first time in the current
transaction, we can skip the search. This change does that.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 5/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana 130341be7f btrfs: always update the logged transaction when logging new names
When we are logging a new name for an inode, due to a link or rename
operation, if the inode has ancestor inodes that are new, created in the
current transaction, we need to log that these inodes exist. To ensure
that a subsequent explicit fsync on one of these ancestor inodes does
sync the log, we don't set the logged_trans field of these inodes.
This was done in commit 75b463d2b4 ("btrfs: do not commit logs and
transactions during link and rename operations"), to avoid syncing a
log after a rename or link operation.

In order to allow for future changes to do some optimizations, change
this behaviour to always update the logged_trans of any logged inode
and don't update the last_log_commit of the inode if we are logging
that it exists. This accomplishes that same objective with simpler
logic, allowing for some optimizations in the next patches.

So just do that simplification.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 4/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana c48792c6ee btrfs: do not log new dentries when logging that a new name exists
When logging a new name for an inode, due to a link or rename operation,
we don't need to log all new dentries of the parent directories and their
subdirectories. We only want to log the names of the inode and that any
new parent directories exist. So in this case don't trigger logging of
the new dentries, that is only need when doing an explicit fsync on a
directory or on a file which requires logging its parent directories.

This avoids unnecessary work and reduces contention on the extent buffers
of a log tree.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 3/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana 289cffcb03 btrfs: remove no longer needed checks for NULL log context
Since commit 75b463d2b4 ("btrfs: do not commit logs and transactions
during link and rename operations"), we always pass a non-NULL log context
to btrfs_log_inode_parent() and therefore to all the functions that it
calls. So remove the checks we have all over the place that test for a
NULL log context, making the code shorter and easier to read, as well as
reducing the size of the generated code.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 2/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana 1e0860f3b3 btrfs: check if a log tree exists at inode_logged()
In case an inode was never logged since it was loaded from disk and was
modified in the current transaction (its ->last_trans matches the ID of
the current transaction), inode_logged() returns true even if there's no
existing log tree. In this case we can simply check if a log tree exists
and return false if it does not. This avoids a caller of inode_logged()
doing some unnecessary, but harmless, work.

For btrfs_log_new_name() it avoids it logging an inode in case it was
never logged since it was loaded from disk and there is currently no log
tree for the inode's root. For the remaining callers of inode_logged(),
btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log(), it has
no effect since they already check if a log tree exists through their
calls to join_running_log_trans().

So just add a check to inode_logged() to verify if a log tree exists, and
return false if it does not.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 1/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Anand Jain cdccc03a8a btrfs: remove stale comment about the btrfs_show_devname
There were few lockdep warnings because btrfs_show_devname() was using
device_list_mutex as recorded in the commits:

  0ccd05285e ("btrfs: fix a possible umount deadlock")
  779bf3fefa ("btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex")

And finally, commit 88c14590cd ("btrfs: use RCU in btrfs_show_devname
for device list traversal") removed the device_list_mutex from
btrfs_show_devname for performance reasons.

This patch removes a stale comment about the function
btrfs_show_devname and device_list_mutex.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Anand Jain b7cb29e666 btrfs: update latest_dev when we create a sprout device
When we add a device to the seed filesystem (sprouting) it is a new
filesystem (and fsid) on the device added. Update the latest_dev so
that /proc/self/mounts shows the correct device.

Example:

  $ btrfstune -S1 /dev/vg/seed
  $ mount /dev/vg/seed /btrfs
  mount: /btrfs: WARNING: device write-protected, mounted read-only.

  $ cat /proc/self/mounts | grep btrfs
  /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0

  $ btrfs dev add -f /dev/vg/new /btrfs

Before:

  $ cat /proc/self/mounts | grep btrfs
  /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0

After:

  $ cat /proc/self/mounts | grep btrfs
  /dev/mapper/vg-new /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0

Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Anand Jain 6605fd2f39 btrfs: use latest_dev in btrfs_show_devname
The test case btrfs/238 reports the warning below:

 WARNING: CPU: 3 PID: 481 at fs/btrfs/super.c:2509 btrfs_show_devname+0x104/0x1e8 [btrfs]
 CPU: 2 PID: 1 Comm: systemd Tainted: G        W  O 5.14.0-rc1-custom #72
 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
 Call trace:
   btrfs_show_devname+0x108/0x1b4 [btrfs]
   show_mountinfo+0x234/0x2c4
   m_show+0x28/0x34
   seq_read_iter+0x12c/0x3c4
   vfs_read+0x29c/0x2c8
   ksys_read+0x80/0xec
   __arm64_sys_read+0x28/0x34
   invoke_syscall+0x50/0xf8
   do_el0_svc+0x88/0x138
   el0_svc+0x2c/0x8c
   el0t_64_sync_handler+0x84/0xe4
   el0t_64_sync+0x198/0x19c

Reason:
While btrfs_prepare_sprout() moves the fs_devices::devices into
fs_devices::seed_list, the btrfs_show_devname() searches for the devices
and found none, leading to the warning as in above.

Fix:
latest_dev is updated according to the changes to the device list.
That means we could use the latest_dev->name to show the device name in
/proc/self/mounts, the pointer will be always valid as it's assigned
before the device is deleted from the list in remove or replace.
The RCU protection is sufficient as the device structure is freed after
synchronization.

Reported-by: Su Yue <l@damenly.su>
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Anand Jain d24fa5c1da btrfs: convert latest_bdev type to btrfs_device and rename
In preparation to fix a bug in btrfs_show_devname().

Convert fs_devices::latest_bdev type from struct block_device to struct
btrfs_device and, rename the member to fs_devices::latest_dev.
So that btrfs_show_devname() can use fs_devices::latest_dev::name.

Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Naohiro Aota 7ae9bd1803 btrfs: zoned: finish relocating block group
We will no longer write to a relocating block group. So, we can finish it
now.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Naohiro Aota be1a1d7a5d btrfs: zoned: finish fully written block group
If we have written to the zone capacity, the device automatically
deactivates the zone. Sync up block group side (the active BG list and
zone_is_active flag) with it.

We need to do it both on data BGs and metadata BGs. On data side, we add a
hook to btrfs_finish_ordered_io(). On metadata side, we use
end_extent_buffer_writeback().

To reduce excess lookup of a block group, we mark the last extent buffer in
a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for
data (ordered_extent), because the address may change due to
REQ_OP_ZONE_APPEND.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Naohiro Aota a85f05e59b btrfs: zoned: avoid chunk allocation if active block group has enough space
The current extent allocator tries to allocate a new block group when the
existing block groups do not have enough space. On a ZNS device, a new
block group means a new active zone. If the number of active zones has
already reached the max_active_zones, activating a new zone needs to finish
an existing zone, leading to wasting the free space there.

So, instead, it should reuse the existing active block groups as much as
possible when we can't activate any other zones without sacrificing an
already activated block group.

While at it, I converted find_free_extent_update_loop() to check the
found_extent() case early and made the other conditions simpler.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota a12b0dc0aa btrfs: move ffe_ctl one level up
We are passing too many variables as it is from btrfs_reserve_extent() to
find_free_extent(). The next commit will add min_alloc_size to ffe_ctl, and
that means another pass-through argument. Take this opportunity to move
ffe_ctl one level up and drop the redundant arguments.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota eb66a010d5 btrfs: zoned: activate new block group
Activate new block group at btrfs_make_block_group(). We do not check the
return value. If failed, we can try again later at the actual extent
allocation phase.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota 2e654e4bb9 btrfs: zoned: activate block group on allocation
Activate a block group when trying to allocate an extent from it. We check
read-only case and no space left case before trying to activate a block
group not to consume the number of active zones uselessly.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota 68a384b5ab btrfs: zoned: load active zone info for block group
Load activeness of underlying zones of a block group. When underlying zones
are active, we add the block group to the fs_info->zone_active_bgs list.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota afba2bc036 btrfs: zoned: implement active zone tracking
Add zone_is_active flag to btrfs_block_group. This flag indicates the
underlying zones are all active. Such zone active block groups are tracked
by fs_info->active_bg_list.

btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying
device part. They set/clear the bitmap to indicate zone activeness and
count the number of zones we can activate left.

btrfs_zone_{activate,finish}() take responsibility for the logical part and
the list management. In addition, btrfs_zone_finish() wait for any writes
on it and send REQ_OP_ZONE_FINISH to the zone.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota dafc340dbd btrfs: zoned: introduce physical_map to btrfs_block_group
We will use a block group's physical location to track active zones and
finish fully written zones in the following commits. Since the zone
activation is done in the extent allocation context which already holding
the tree locks, we can't query the chunk tree for the physical locations.
So, copy the location info into a block group and use it for activation.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota ea6f8ddcde btrfs: zoned: load active zone information from devices
The ZNS specification defines a limit on the number of zones that can be in
the implicit open, explicit open or closed conditions. Any zone with such
condition is defined as an active zone and correspond to any zone that is
being written or that has been only partially written. If the maximum
number of active zones is reached, we must either reset or finish some
active zones before being able to chose other zones for storing data.

Load queue_max_active_zones() and track the number of active zones left on
the device.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota 8376d9e1ed btrfs: zoned: finish superblock zone once no space left for new SB
If there is no more space left for a new superblock in a superblock zone,
then it is better to ZONE_FINISH the zone and frees up the active zone
count.

Since btrfs_advance_sb_log() can now issue REQ_OP_ZONE_FINISH, we also need
to convert it to return int for the error case.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota 9658b72ef3 btrfs: zoned: locate superblock position using zone capacity
sb_write_pointer() returns the write position of next superblock. For READ,
we need a previous location. When the pointer is at the head, the previous
one is the last one of the other zone. Calculate the last one's position
from zone capacity.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota 5daaf552d1 btrfs: zoned: consider zone as full when no more SB can be written
We cannot write beyond zone capacity. So, we should consider a zone as
"full" when the write pointer goes beyond capacity - the size of super
info.

Also, take this opportunity to replace a subtle duplicated code with a loop
and fix a typo in comment.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota d8da0e8567 btrfs: zoned: tweak reclaim threshold for zone capacity
With the introduction of zone capacity, the range [capacity, length] is
always zone unusable. Counting this region as a reclaim target will
cause reclaiming too early. Reclaim block groups based on bytes that can
be usable after resetting.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota 98173255bd btrfs: zoned: calculate free space from zone capacity
Now that we introduced capacity in a block group, we need to calculate free
space using the capacity instead of the length. Thus, bytes we account
capacity - alloc_pointer as free, and account bytes [capacity, length] as
zone unusable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Naohiro Aota c46c4247ab btrfs: zoned: move btrfs_free_excluded_extents out of btrfs_calc_zone_unusable
btrfs_free_excluded_extents() is not neccessary for
btrfs_calc_zone_unusable() and it makes btrfs_calc_zone_unusable()
difficult to reuse. Move it out and call btrfs_free_excluded_extents()
in proper context.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Naohiro Aota 8eae532be7 btrfs: zoned: load zone capacity information from devices
The ZNS specification introduces the concept of a Zone Capacity.  A zone
capacity is an additional per-zone attribute that indicates the number of
usable logical blocks within each zone, starting from the first logical
block of each zone. It is always smaller or equal to the zone size.

With the SINGLE profile, we can set a block group's "capacity" as the same
as the underlying zone's Zone Capacity. We will limit the allocation not
to exceed in a following commit.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Qu Wenruo c22a3572cb btrfs: defrag: enable defrag for subpage case
With the new infrastructure which has taken subpage into consideration,
now we should be safe to allow defrag to work for subpage case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Qu Wenruo c635757365 btrfs: defrag: remove the old infrastructure
Now the old infrastructure can all be removed, defrag

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Qu Wenruo 7b508037d4 btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()
The function defrag_one_cluster() is able to defrag one range well
enough, we only need to do preparation for it, including:

- Clamp and align the defrag range
- Exclude invalid cases
- Proper inode locking

The old infrastructures will not be removed in this patch, as it would
be too noisy to review.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Qu Wenruo b18c3ab234 btrfs: defrag: introduce helper to defrag one cluster
This new helper, defrag_one_cluster(), will defrag one cluster (at most
256K):

- Collect all initial targets

- Kick in readahead when possible

- Call defrag_one_range() on each initial target
  With some extra range clamping.

- Update @sectors_defragged parameter

This involves one behavior change, the defragged sectors accounting is
no longer as accurate as old behavior, as the initial targets are not
consistent.

We can have new holes punched inside the initial target, and we will
skip such holes later.
But the defragged sectors accounting doesn't need to be that accurate
anyway, thus I don't want to pass those extra accounting burden into
defrag_one_range().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:57 +02:00
Qu Wenruo e9eec72151 btrfs: defrag: introduce helper to defrag a range
A new helper, defrag_one_range(), is introduced to defrag one range.

This function will mostly prepare the needed pages and extent status for
defrag_one_locked_target().

As we can only have a consistent view of extent map with page and extent
bits locked, we need to re-check the range passed in to get a real
target list for defrag_one_locked_target().

Since defrag_collect_targets() will call defrag_lookup_extent() and lock
extent range, we also need to teach those two functions to skip extent
lock.  Thus new parameter, @locked, is introduced to skip extent lock if
the caller has already locked the range.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:57 +02:00
Qu Wenruo 22b398eeee btrfs: defrag: introduce helper to defrag a contiguous prepared range
A new helper, defrag_one_locked_target(), introduced to do the real part
of defrag.

The caller needs to ensure both page and extents bits are locked, and no
ordered extent exists for the range, and all writeback is finished.

The core defrag part is pretty straight-forward:

- Reserve space
- Set extent bits to defrag
- Update involved pages to be dirty

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:13 +02:00
Qu Wenruo eb793cf857 btrfs: defrag: introduce helper to collect target file extents
Introduce a helper, defrag_collect_targets(), to collect all possible
targets to be defragged.

This function will not consider things like max_sectors_to_defrag, thus
caller should be responsible to ensure we don't exceed the limit.

This function will be the first stage of later defrag rework.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:06:53 +02:00
Qu Wenruo 5767b50c00 btrfs: defrag: factor out page preparation into a helper
In cluster_pages_for_defrag(), we have complex code block inside one
for() loop.

The code block is to prepare one page for defrag, this will ensure:

- The page is locked and set up properly.
- No ordered extent exists in the page range.
- The page is uptodate.

This behavior is pretty common and will be reused by later defrag
rework.

So factor out the code into its own helper, defrag_prepare_one_page(),
for later usage, and cleanup the code by a little.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:06:34 +02:00
Qu Wenruo 76068cae63 btrfs: defrag: replace hard coded PAGE_SIZE with sectorsize
When testing subpage defrag support, I always find some strange inode
nbytes error, after a lot of debugging, it turns out that
defrag_lookup_extent() is using PAGE_SIZE as size for
lookup_extent_mapping().

Since lookup_extent_mapping() is calling __lookup_extent_mapping() with
@strict == 1, this means any extent map smaller than one page will be
ignored, prevent subpage defrag to grab a correct extent map.

There are quite some PAGE_SIZE usage in ioctl.c, but most of them are
correct usages, and can be one of the following cases:

- ioctl structure size check
  We want ioctl structure to be contained inside one page.

- real page operations

The remaining cases in defrag_lookup_extent() and
check_defrag_in_cache() will be addressed in this patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:06:15 +02:00
Qu Wenruo cae7968680 btrfs: defrag: also check PagePrivate for subpage cases in cluster_pages_for_defrag()
In function cluster_pages_for_defrag() we have a window where we unlock
page, either start the ordered range or read the content from disk.

When we re-lock the page, we need to make sure it still has the correct
page->private for subpage.

Thus add the extra PagePrivate check here to handle subpage cases
properly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:05:18 +02:00
Qu Wenruo 1ccc2e8a86 btrfs: defrag: pass file_ra_state instead of file to btrfs_defrag_file()
Currently btrfs_defrag_file() accepts both "struct inode" and "struct
file" as parameter.  We can easily grab "struct inode" from "struct
file" using file_inode() helper.

The reason why we need "struct file" is just to re-use its f_ra.

Change this to pass "struct file_ra_state" parameter, so that it's more
clear what we really want.  Since we're here, also add some comments on
the function btrfs_defrag_file().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:04:39 +02:00
Anand Jain a09f23c355 btrfs: rename and switch to bool btrfs_chunk_readonly
btrfs_chunk_readonly() checks if the given chunk is writeable. It
returns 1 for readonly, and 0 for writeable. So the return argument type
bool shall suffice instead of the current type int.

Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable() as we
check if the bg is writeable, and helps to keep the logic at the parent
function simpler to understand.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:03:57 +02:00
Sidong Yang 44bee215f7 btrfs: reflink: initialize return value to 0 in btrfs_extent_same()
Fix a warning reported by smatch that ret could be returned without
initialized.  The dedupe operations are supposed to to return 0 for a 0
length range but the caller does not pass olen == 0. To keep this
behaviour and also fix the warning initialize ret to 0.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:03:57 +02:00
Qu Wenruo 72a69cd030 btrfs: subpage: pack all subpage bitmaps into a larger bitmap
Currently we use u16 bitmap to make 4k sectorsize work for 64K page
size.

But this u16 bitmap is not large enough to contain larger page size like
128K, nor is space efficient for 16K page size.

To handle both cases, here we pack all subpage bitmaps into a larger
bitmap, now btrfs_subpage::bitmaps[] will be the ultimate bitmap for
subpage usage.

Each sub-bitmap will has its start bit number recorded in
btrfs_subpage_info::*_start, and its bitmap length will be recorded in
btrfs_subpage_info::bitmap_nr_bits.

All subpage bitmap operations will be converted from using direct u16
operations to bitmap operations, with above *_start calculated.

For 64K page size with 4K sectorsize, this should not cause much
difference.

While for 16K page size, we will only need 1 unsigned long (u32) to
store all the bitmaps, which saves quite some space.

Furthermore, this allows us to support larger page size like 128K and
258K.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:03:55 +02:00
Jeff Layton 482e00075d fs: remove leftover comments from mandatory locking removal
Stragglers from commit f7e33bdbd6 ("fs: remove mandatory file locking
support").

Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-10-26 12:20:50 -04:00
Eric Biggers b7e072f9b7 fscrypt: improve a few comments
Improve a few comments.  These were extracted from the patch
"fscrypt: add support for hardware-wrapped keys"
(https://lore.kernel.org/r/20211021181608.54127-4-ebiggers@kernel.org).

Link: https://lore.kernel.org/r/20211026021042.6581-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-10-25 19:11:50 -07:00
Qu Wenruo 8481dd80ab btrfs: subpage: introduce btrfs_subpage_bitmap_info
Currently we use fixed size u16 bitmap for subpage bitmap.  This is fine
for 4K sectorsize with 64K page size.

But for 4K sectorsize and larger page size, the bitmap is too small,
while for smaller page size like 16K, u16 bitmaps waste too much space.

Here we introduce a new helper structure, btrfs_subpage_bitmap_info, to
record the proper bitmap size, and where each bitmap should start at.

By this, we can later compact all subpage bitmaps into one u32 bitmap.
This patch is the first step.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Qu Wenruo 651fb41927 btrfs: subpage: make btrfs_alloc_subpage() return btrfs_subpage directly
The existing calling convention of btrfs_alloc_subpage() is pretty
awful.  Change it to a more common pattern by returning struct
btrfs_subpage directly and let the caller to determine if the call
succeeded.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Qu Wenruo fdf250db89 btrfs: subpage: only call btrfs_alloc_subpage() when sectorsize is smaller than PAGE_SIZE
There are two call sites of btrfs_alloc_subpage():

- btrfs_attach_subpage()
  We have ensured sectorsize is smaller than PAGE_SIZE

- alloc_extent_buffer()
  We call btrfs_alloc_subpage() unconditionally.

The alloc_extent_buffer() forces us to check the sectorsize size against
page size inside btrfs_alloc_subpage().

Since the function name, btrfs_alloc_subpage(), already indicates it
should only get called for subpage cases, do the check in
alloc_extent_buffer() and add an ASSERT() in btrfs_alloc_subpage().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Su Yue 9675ea8c9d btrfs: update comment for fs_devices::seed_list in btrfs_rm_device
Update it since commit 944d3f9fac ("btrfs: switch seed device to
list api") did conversion from fs_devices::seed to fs_devices::seed_list.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Anand Jain 991a3daeda btrfs: drop unnecessary ret in ioctl_quota_rescan_status
There is no need for the variable ret after d66105cfa873 ("btrfs:
allocate btrfs_ioctl_quota_rescan_args on stack"), remove it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Marcos Paulo de Souza 0e3dd5bce8 btrfs: send: simplify send_create_inode_if_needed
The out label is being overused, we can simply return if the condition
permits.

No functional changes.

Reviewed-by: Su Yue <l@damenly.su>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Nikolay Borisov f6f39f7a0a btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk
The user facing function used to allocate new chunks is
btrfs_chunk_alloc, unfortunately there is yet another similar sounding
function - btrfs_alloc_chunk. This creates confusion, especially since
the latter function can be considered "private" in the sense that it
implements the first stage of chunk creation and as such is called by
btrfs_chunk_alloc.

To avoid the awkwardness that comes with having similarly named but
distinctly different in their purpose function rename btrfs_alloc_chunk
to btrfs_create_chunk, given that the main purpose of this function is
to orchestrate the whole process of allocating a chunk - reserving space
into devices, deciding on characteristics of the stripe size and
creating the in-memory structures.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Jens Axboe 6b19b766e8 fs: get rid of the res2 iocb->ki_complete argument
The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.

Now that everybody is passing in zero, kill off the extra argument.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 10:36:24 -06:00
Pavel Begunkov fb27274a90 io_uring: clusterise ki_flags access in rw_prep
ioprio setup doesn't depend on other fields that are modified in
io_prep_rw() and we can move it down in the function without worrying
about performance. It's useful as it makes iocb->ki_flags
accesses/modifications closer together, so it's more likely the compiler
will cache it in a register and avoid extra reloads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8ee98779c06f1b59f6039b1e292db4332efd664b.1634987320.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 07:42:35 -06:00
Pavel Begunkov b9a6b8f92f io_uring: kill unused param from io_file_supports_nowait
io_file_supports_nowait() doesn't use rw argument anymore, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4bd6709fc573d70c866ea656cb7a7dbe94be8026.1634987320.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 07:42:35 -06:00
Pavel Begunkov d6a644a795 io_uring: clean up timeout async_data allocation
opcode prep functions are one of the first things that are called, we
can't have ->async_data allocated at this point and it's certainly a
bug. Reflect this assumption in io_timeout_prep() and add a WARN_ONCE
just in case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/75a28ca7dbcc5af8b6cd9092819e8384c24dedd4.1634987320.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 07:42:35 -06:00
Pavel Begunkov afb7f56fc6 io_uring: don't try io-wq polling if not supported
If an opcode doesn't support polling, just let it be executed
synchronously in iowq, otherwise it will do a nonblock attempt just to
fail in io_arm_poll_handler() and return back to blocking execution.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6401256db01b88f448f15fcd241439cb76f5b940.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 07:42:33 -06:00
Pavel Begunkov 658d0a4016 io_uring: check if opcode needs poll first on arming
->pollout or ->pollin are set only for opcodes that need a file, so if
io_arm_poll_handler() tests them first we can be sure that the request
has file set and the ->file check can be removed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9adfe4f543d984875e516fce6da35348aab48668.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 07:42:31 -06:00
Pavel Begunkov d01905db14 io_uring: clean iowq submit work cancellation
If we've got IO_WQ_WORK_CANCEL in io_wq_submit_work(), handle the error
on the same lines as the check instead of having a weird code flow. The
main loop doesn't change but goes one indention left.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ff4a09cf41f7a22bbb294b6f1faea721e21fe615.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 07:42:29 -06:00
Pavel Begunkov 255657d237 io_uring: clean io_wq_submit_work()'s main loop
Do a bit of cleaning for the main loop of io_wq_submit_work(). Get rid
of switch, just replace it with a single if as we're retrying in both
other cases. Kill issue_sqe label, Get rid of needs_poll nesting and
disambiguate a bit the comment.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ed12ce0c64e051f9a6b8a37a24f8ea554d299c29.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 07:42:20 -06:00
Tim Gardner e34e6f8133 gfs2: Fix unused value warning in do_gfs2_set_flags()
Coverity complains of an unused value:

CID 119623 (#1 of 1): Unused value (UNUSED_VALUE)
assigned_value: Assigning value -1 to error here, but that stored value is
overwritten before it can be used.
237        error = -EPERM;

Fix it by removing the assignment.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:20 +02:00
Alexander Aring 660a6126f8 gfs2: check context in gfs2_glock_put
Add a might_sleep call into gfs2_glock_put which can sleep in DLM when
the last reference is released.  This will show problems earlier, and
not only when the last reference is put.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:20 +02:00
Andreas Gruenbacher 7427f3bb49 gfs2: Fix glock_hash_walk bugs
So far, glock_hash_walk took a reference on each glock it iterated over, and it
was the examiner's responsibility to drop those references.  Dropping the final
reference to a glock can sleep and the examiners are called in a RCU critical
section with spin locks held, so examiners that didn't need the extra reference
had to drop it asynchronously via gfs2_glock_queue_put or similar.  This wasn't
done correctly in thaw_glock which did call gfs2_glock_put, and not at all in
dump_glock_func.

Change glock_hash_walk to not take glock references at all.  That way, the
examiners that don't need them won't have to bother with slow asynchronous
puts, and the examiners that do need references can take them themselves.

Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:20 +02:00
Andreas Gruenbacher 486408d690 gfs2: Cancel remote delete work asynchronously
In gfs2_inode_lookup and gfs2_create_inode, we're calling
gfs2_cancel_delete_work which currently cancels any remote delete work
(delete_work_func) synchronously.  This means that if the work is
currently running, it will wait for it to finish.  We're doing this to
pevent a previous instance of an inode from having any influence on the
next instance.

However, delete_work_func uses gfs2_inode_lookup internally, and we can
end up in a deadlock when delete_work_func gets interrupted at the wrong
time.  For example,

  (1) An inode's iopen glock has delete work queued, but the inode
      itself has been evicted from the inode cache.

  (2) The delete work is preempted before reaching gfs2_inode_lookup.

  (3) Another process recreates the inode (gfs2_create_inode).  It tries
      to cancel any outstanding delete work, which blocks waiting for
      the ongoing delete work to finish.

  (4) The delete work calls gfs2_inode_lookup, which blocks waiting for
      gfs2_create_inode to instantiate and unlock the new inode =>
      deadlock.

It turns out that when the delete work notices that its inode has been
re-instantiated, it will do nothing.  This means that it's safe to
cancel the delete work asynchronously.  This prevents the kind of
deadlock described above.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2021-10-25 08:42:20 +02:00
Bob Peterson 8793e14985 gfs2: set glock object after nq
Before this patch, function gfs2_create_inode called glock_set_object to
set the gl_object for inode and iopen glocks before the glock was locked.
That's wrong because other competing processes like evict may be
blocked waiting for the glock and still have gl_object set before the
actual eviction can take place.

This patch moves the call to glock_set_object until after the glock is
acquire in function gfs2_create_inode, so it waits for possibly
competing evicts to finish their processing first.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:19 +02:00
Bob Peterson 4b3113a257 gfs2: remove RDF_UPTODATE flag
The new GLF_INSTANTIATE_NEEDED flag obsoletes the old rgrp flag
GFS2_RDF_UPTODATE, so this patch replaces it like we did with inodes.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:19 +02:00
Bob Peterson ec1d398dd7 gfs2: Eliminate GIF_INVALID flag
With the addition of the new GLF_INSTANTIATE_NEEDED flag, the
GIF_INVALID flag is now redundant. This patch removes it.
Since inode_instantiate is only called when instantiation is needed,
the check in inode_instantiate is removed too.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:19 +02:00
Bob Peterson f2e70d8f2f gfs2: fix GL_SKIP node_scope problems
Before this patch, when a glock was locked, the very first holder on the
queue would unlock the lockref and call the go_instantiate glops function
(if one existed), unless GL_SKIP was specified. When we introduced the new
node-scope concept, we allowed multiple holders to lock glocks in EX mode
and share the lock.

But node-scope introduced a new problem: if the first holder has GL_SKIP
and the next one does NOT, since it is not the first holder on the queue,
the go_instantiate op was not called. Eventually the GL_SKIP holder may
call the instantiate sub-function (e.g. gfs2_rgrp_bh_get) but there was
still a window of time in which another non-GL_SKIP holder assumes the
instantiate function had been called by the first holder. In the case of
rgrp glocks, this led to a NULL pointer dereference on the buffer_heads.

This patch tries to fix the problem by introducing two new glock flags:

GLF_INSTANTIATE_NEEDED, which keeps track of when the instantiate function
needs to be called to "fill in" or "read in" the object before it is
referenced.

GLF_INSTANTIATE_IN_PROG which is used to determine when a process is
in the process of reading in the object. Whenever a function needs to
reference the object, it checks the GLF_INSTANTIATE_NEEDED flag, and if
set, it sets GLF_INSTANTIATE_IN_PROG and calls the glops "go_instantiate"
function.

As before, the gl_lockref spin_lock is unlocked during the IO operation,
which may take a relatively long amount of time to complete. While
unlocked, if another process determines go_instantiate is still needed,
it sees GLF_INSTANTIATE_IN_PROG is set, and waits for the go_instantiate
glop operation to be completed. Once GLF_INSTANTIATE_IN_PROG is cleared,
it needs to check GLF_INSTANTIATE_NEEDED again because the other process's
go_instantiate operation may not have been successful.

Functions that previously called the instantiate sub-functions now call
directly into gfs2_instantiate so the new bits are managed properly.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:19 +02:00
Bob Peterson e6f856008d gfs2: split glock instantiation off from do_promote
Before this patch, function do_promote had a section of code that did
the actual instantiation.  This patch splits that off into its own
function, gfs2_instantiate, which prepares us for the next patch that
will use that function.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:19 +02:00
Bob Peterson 60d8bae9d1 gfs2: further simplify do_promote
This patch further simplifies function do_promote by eliminating some
redundant code in favor of using a lock_released flag. This is just
prep work for a future patch.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:19 +02:00
Bob Peterson 17a6eceeb1 gfs2: re-factor function do_promote
This patch simply re-factors function do_promote to reduce the indents.
The logic should be unchanged. This makes future patches more readable.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:19 +02:00
Andreas Gruenbacher d74d0ce5bc gfs2: Remove 'first' trace_gfs2_promote argument
Remove the 'first' argument of trace_gfs2_promote: with GL_SKIP, the
'first' holder isn't the one that instantiates the glock
(gl_instantiate), which is what the 'first' flag was apparently supposed
to indicate.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:19 +02:00
Bob Peterson 3278b977c9 gfs2: change go_lock to go_instantiate
Before this patch, the go_lock glock operations (glops) did not do
any actual locking. They were used to instantiate objects, like reading
in dinodes and rgrps from the media.

This patch renames the functions to go_instantiate for clarity.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:18 +02:00
Bob Peterson a739765cd8 gfs2: dump glocks from gfs2_consist_OBJ_i
Before this patch, failed consistency checks printed out the object
that failed, but not the object's glock. This patch makes it also
print out the object glock so we can see the glock's holders and flags
to aid with debugging.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:18 +02:00
Bob Peterson 763766c057 gfs2: dequeue iopen holder in gfs2_inode_lookup error
Before this patch, if function gfs2_inode_lookup encountered an error
after it had locked the iopen glock, it never unlocked it, relying on
the evict code to do the cleanup.  The evict code then took the
inode glock while holding the iopen glock, which violates the locking
order.  For example,

 (1) node A does a gfs2_inode_lookup that fails, leaving the iopen glock
     locked.

 (2) node B calls delete_work_func -> gfs2_lookup_by_inum ->
     gfs2_inode_lookup.  It locks the inode glock and blocks trying to
     lock the iopen glock, which is held by node A.

 (3) node A eventually calls gfs2_evict_inode -> evict_should_delete.
     It blocks trying to lock the inode glock, which is now held by
     node B.

This patch introduces error handling to function gfs2_inode_lookup
so it properly dequeues held iopen glocks on errors.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:18 +02:00
Andreas Gruenbacher b016d9a84a gfs2: Save ip from gfs2_glock_nq_init
Before this patch, when a glock was locked by function gfs2_glock_nq_init,
it initialized the holder gh_ip (return address) as gfs2_glock_nq_init.
That made it extremely difficult to track down problems because many
functions call gfs2_glock_nq_init. This patch changes the function so
that it saves gh_ip from the caller of gfs2_glock_nq_init, which makes
it easy to backtrack which holder took the lock.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2021-10-25 08:42:18 +02:00
Bob Peterson a500bd3155 gfs2: Allow append and immutable bits to coexist
Before this patch, function do_gfs2_set_flags checked if the append
and immutable flags were being set while already set. If so, error -EPERM
was given. There's no reason why these two flags should be mutually
exclusive, and if you set them separately, you will, in essence, set
one while it is already set. For example:

chattr +a /mnt/gfs2/file1
chattr +i /mnt/gfs2/file1

The first command sets the append-only flag. Since they are additive,
the second command sets the immutable flag AND append-only flag,
since they both coexist in i_diskflags. So the second command should
not return an error. This bug caused xfstests generic/545 to fail.

This patch simply removes the invalid checks.
I also eliminated an unused parm from do_gfs2_set_flags.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:18 +02:00
Bob Peterson c98c2ca5ea gfs2: Switch some BUG_ON to GLOCK_BUG_ON for debug
In rgrp.c, there are several places where it does BUG_ON. This tells us
the call stack but nothing more, which is not very helpful.
This patch switches them to GLOCK_BUG_ON which also prints the glock,
its holders, and many of the rgrp values, which will help us debug
problems in the future.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:18 +02:00
Bob Peterson c1442f6b53 gfs2: move GL_SKIP check from glops to do_promote
Before this patch, each individual "go_lock" glock operation (glop)
checked the GL_SKIP flag, and if set, would skip further processing.

This patch changes the logic so the go_lock caller, function go_promote,
checks the GL_SKIP flag before calling the go_lock op in the first place.
This avoids having to unnecessarily unlock gl_lockref.lock only to
re-lock it again.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:17 +02:00
Bob Peterson 4c69038d90 gfs2: Add GL_SKIP holder flag to dump_holder
Somehow, the GL_SKIP flag was missed when dumping glock holders.
This patch adds it to function hflags2str. I added it at the end because
I wanted Holder and Skip flags together to read "Hs" rather than "sH"
to avoid confusion with "Shared" ("SH") holder state.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:17 +02:00
Bob Peterson 6edb6ba333 gfs2: remove redundant check in gfs2_rgrp_go_lock
Before this patch, function gfs2_rgrp_go_lock checked if GL_SKIP and
ar_rgrplvb were both true. However, GL_SKIP is only set for rgrps if
ar_rgrplvb is true (see gfs2_inplace_reserve). This patch simply removes
the redundant check.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:17 +02:00
Andreas Gruenbacher b01b2d72da gfs2: Fix mmap + page fault deadlocks for direct I/O
Also disable page faults during direct I/O requests and implement a
similar kind of retry logic as in the buffered I/O case.

The retry logic in the direct I/O case differs from the buffered I/O
case in the following way: direct I/O doesn't provide the kinds of
consistency guarantees between concurrent reads and writes that buffered
I/O provides, so once we lose the inode glock while faulting in user
pages, we always resume the operation.  We never need to return a
partial read or write.

This locking problem was originally reported by Jan Kara.  Linus came up
with the idea of disabling page faults.  Many thanks to Al Viro and
Matthew Wilcox for their feedback.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25 08:42:14 +02:00
Gao Xiang eaa9172ad9 erofs: get rid of ->lru usage
Currently, ->lru is a way to arrange non-LRU pages and has some
in-kernel users. In order to minimize noticable issues of page
reclaim and cache thrashing under high memory presure, limited
temporary pages were all chained with ->lru and can be reused
during the request. However, it seems that ->lru could be removed
when folio is landing.

Let's use page->private to chain temporary pages for now instead
and transform EROFS formally after the topic of the folio / file
page design is finalized.

Link: https://lore.kernel.org/r/20211022090120.14675-1-hsiangkao@linux.alibaba.com
Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-25 08:22:59 +08:00
Linus Torvalds b20078fd69 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull autofs fix from Al Viro:
 "Fix for a braino of mine (in getting rid of open-coded
  dentry_path_raw() in autofs a couple of cycles ago).

  Mea culpa...  Obvious -stable fodder"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  autofs: fix wait name hash calculation in autofs_wait()
2021-10-24 09:36:06 -10:00
Linus Torvalds c460e7896e Ten fixes for the ksmbd kernel server, for improved security and additional buffer overflow checks
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmFzb/4ACgkQiiy9cAdy
 T1HhOQv/bTsGX85V5QFBQyW7F7ymYaSuDpvb/nYFzzs4fMz8/lbCJ6GQHMK7G+uR
 ZYFdZw5c5A+KsfjRoRAKOc6HJbGERnKWvIlRkNWnmcpsXcR3PFhhOxXKXxCTBjga
 fHxAJBENbsyCtGz25pUtuOKVG9zVaz31lyXgnz53b0ATrb8CxIll7zdKTp7aGdK0
 UIWUEYeUwctIhvyBXyms8ni+15keop6/7Y1KhUvodL/VhU4YCkKerFFQojsqLEBk
 4N6ZrgEEvzZPSjMt3KkbAapMNvf8Jgy6hKrB10MWB2sJG3bG1A4MuWYzVz7hlLbx
 KXztSjbEiExCw7Y8ZXeA0pT/P51TVB9uxSLaDJhGVrXIxkZ4exUz2pS0Tp2E8eMH
 4BGylAV9vrZqjmF3HQHJu8c/+f833dwbmMzgDFlFgR01U8NQYQvLf6ZoxmnQxdJC
 5CVzO2rGV1bAVEbCJAN/wnKSCpEzjcN0ruz9bcje8MFF9mXXiJxATIgNahuaj17t
 AvOnAbwd
 =FhM8
 -----END PGP SIGNATURE-----

Merge tag '5.15-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull ksmbd fixes from Steve French:
 "Ten fixes for the ksmbd kernel server, for improved security and
  additional buffer overflow checks:

   - a security improvement to session establishment to reduce the
     possibility of dictionary attacks

   - fix to ensure that maximum i/o size negotiated in the protocol is
     not less than 64K and not more than 8MB to better match expected
     behavior

   - fix for crediting (flow control) important to properly verify that
     sufficient credits are available for the requested operation

   - seven additional buffer overflow, buffer validation checks"

* tag '5.15-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: add buffer validation in session setup
  ksmbd: throttle session setup failures to avoid dictionary attacks
  ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests
  ksmbd: validate credit charge after validating SMB2 PDU body size
  ksmbd: add buffer validation for smb direct
  ksmbd: limit read/write/trans buffer size not to exceed 8MB
  ksmbd: validate compound response buffer
  ksmbd: fix potencial 32bit overflow from data area check in smb2_write
  ksmbd: improve credits management
  ksmbd: add validation in smb2_ioctl
2021-10-24 06:43:59 -10:00
Andreas Gruenbacher 4fdccaa0d1 iomap: Add done_before argument to iomap_dio_rw
Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred.  When the request succeeds, we
report that done_before additional bytes were tranferred.  This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted.  The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-10-24 15:26:05 +02:00
Andreas Gruenbacher 97308f8b0d iomap: Support partial direct I/O on user copy failures
In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
return a partial result.  This allows the caller to deal with the page
fault and retry the remainder of the request.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-10-24 15:26:05 +02:00
Andreas Gruenbacher 42c498c18a iomap: Fix iomap_dio_rw return value for user copies
When a user copy fails in one of the helpers of iomap_dio_rw, fail with
-EFAULT instead of returning 0.  This matches what iomap_dio_bio_actor
returns when it gets an -EFAULT from bio_iov_iter_get_pages.  With these
changes, iomap_dio_actor now consistently fails with -EFAULT when a user
page cannot be faulted in.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-10-24 15:26:05 +02:00
Andreas Gruenbacher 00bfe02f47 gfs2: Fix mmap + page fault deadlocks for buffered I/O
In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inode glock.  There is a possibility
that the memory is mapped to the same file, in which case we'd recurse
on the same glock.

We could detect and work around this simple case of recursive locking,
but more complex scenarios exist that involve multiple glocks,
processes, and cluster nodes, and working around all of those cases
isn't practical or even possible.

Avoid these kinds of problems by disabling page faults while holding the
inode glock.  If a page fault would occur, we either end up with a
partial read or write or with -EFAULT if nothing could be read or
written.  In either case, we know that we're not done with the
operation, so we indicate that we're willing to give up the inode glock
and then we fault in the missing pages.  If that made us lose the inode
glock, we return a partial read or write.  Otherwise, we resume the
operation.

This locking problem was originally reported by Jan Kara.  Linus came up
with the idea of disabling page faults.  Many thanks to Al Viro and
Matthew Wilcox for their feedback.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-24 15:26:05 +02:00
Pavel Begunkov c907e52c72 io-wq: use helper for worker refcounting
Use io_worker_release() instead of hand coding it in io_worker_exit().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6f95f09d2cdbafcbb2e22ad0d1a2bc4d3962bf65.1634987320.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-23 08:03:46 -06:00
Linus Torvalds da4d34b669 io_uring-5.15-2021-10-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFzfyQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiO/D/9cqYHpjGDwyftzQFJFfEy9ny6nlLm6lJef
 hsrZjC0S649FnXc0YHVLDH3/nos0XsQUYvVJnAMW9EHB6x/95JRUyxzouVz1Fewp
 w8Z+lOKymIf3X1LQoB6KQXH5ayohNtUo6HA0Ye/v+iEG+bq/lo9tCMSshpJs3afq
 UWW8RxGhrMHfqfgn/8Kkz8fEqZjXz7tssZ+1AFftTxKbk97ZWPahwjvO+xLFWl/m
 NbMkHf3xeAvDL747ccrVBOerRZUPySXZElgkPzdjQ4y5HHZrpxt/ZR9Xu7XRzgkJ
 7SEmsJ80vla19u3eW/oAn3T4EEGS3qWlei8T47kKIoT1W52S3rqjwsV/30re16GW
 sGMWdFiH/GW3VnOxs0/a4/q70je3E9DicSTs4SALTwnvjQ+vrunWgG6ojtxLcieT
 Br+km8nmDPug1wxoH2gQLN/EhGcH5hQvi4ZMiMH8MWalYpEkIADOOvAwp0GDwVoE
 6DxWeYs57rdSQnSLxDah+mAqBokqswJ/ZmuBOO/iSqXCImehLs0VL1Y+TsThVbRy
 epnBdqLk5PbDpODcYTl7on3MD3hpoHjbpnAPah0py57sroiY73sNE/ms1AUsqYPs
 fAe5tjFwhGhVWRiZMGOAG6kgTtSdxG134c0Lyvy6xACTR8rJfgcnWMwFJDWK2GDn
 ReGYJcgEOA==
 =ywLV
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two fixes for the max workers limit API that was introduced this
  series: one fix for an issue with that code, and one fixing a linked
  timeout regression in this series"

* tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-block:
  io_uring: apply worker limits to previous users
  io_uring: fix ltimeout unprep
  io_uring: apply max_workers limit to all future users
  io-wq: max_worker fixes
2021-10-22 17:34:31 -10:00
Hao Xu 90fa02883f io_uring: implement async hybrid mode for pollable requests
The current logic of requests with IOSQE_ASYNC is first queueing it to
io-worker, then execute it in a synchronous way. For unbound works like
pollable requests(e.g. read/write a socketfd), the io-worker may stuck
there waiting for events for a long time. And thus other works wait in
the list for a long time too.
Let's introduce a new way for unbound works (currently pollable
requests), with this a request will first be queued to io-worker, then
executed in a nonblock try rather than a synchronous way. Failure of
that leads it to arm poll stuff and then the worker can begin to handle
other works.
The detail process of this kind of requests is:

step1: original context:
           queue it to io-worker
step2: io-worker context:
           nonblock try(the old logic is a synchronous try here)
               |
               |--fail--> arm poll
                            |
                            |--(fail/ready)-->synchronous issue
                            |
                            |--(succeed)-->worker finish it's job, tw
                                           take over the req

This works much better than the old IOSQE_ASYNC logic in cases where
unbound max_worker is relatively small. In this case, number of
io-worker eazily increments to max_worker, new worker cannot be created
and running workers stuck there handling old works in IOSQE_ASYNC mode.

In my 64-core machine, set unbound max_worker to 20, run echo-server,
turns out:
(arguments: register_file, connetion number is 1000, message size is 12
Byte)
original IOSQE_ASYNC: 76664.151 tps
after this patch: 166934.985 tps

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211018133445.103438-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 19:20:57 -06:00
Brian Foster 5ca5916b6b xfs: punch out data fork delalloc blocks on COW writeback failure
If writeback I/O to a COW extent fails, the COW fork blocks are
punched out and the data fork blocks left alone. It is possible for
COW fork blocks to overlap non-shared data fork blocks (due to
cowextsz hint prealloc), however, and writeback unconditionally maps
to the COW fork whenever blocks exist at the corresponding offset of
the page undergoing writeback. This means it's quite possible for a
COW fork extent to overlap delalloc data fork blocks, writeback to
convert and map to the COW fork blocks, writeback to fail, and
finally for ioend completion to cancel the COW fork blocks and leave
stale data fork delalloc blocks around in the inode. The blocks are
effectively stale because writeback failure also discards dirty page
state.

If this occurs, it is likely to trigger assert failures, free space
accounting corruption and failures in unrelated file operations. For
example, a subsequent reflink attempt of the affected file to a new
target file will trip over the stale delalloc in the source file and
fail. Several of these issues are occasionally reproduced by
generic/648, but are reproducible on demand with the right sequence
of operations and timely I/O error injection.

To fix this problem, update the ioend failure path to also punch out
underlying data fork delalloc blocks on I/O error. This is analogous
to the writeback submission failure path in xfs_discard_page() where
we might fail to map data fork delalloc blocks and consistent with
the successful COW writeback completion path, which is responsible
for unmapping from the data fork and remapping in COW fork blocks.

Fixes: 787eb48550 ("xfs: fix and streamline error handling in xfs_end_io")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-22 16:04:36 -07:00
Darrick J. Wong c04c51c524 xfs: remove unused parameter from refcount code
The owner info parameter is always NULL, so get rid of the parameter.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong b3b5ff412a xfs: reduce the size of struct xfs_extent_free_item
We only use EFIs to free metadata blocks -- not regular data/attr fork
extents.  Remove all the fields that we never use, for a net reduction
of 16 bytes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong c201d9ca53 xfs: rename xfs_bmap_add_free to xfs_free_extent_later
xfs_bmap_add_free isn't a block mapping function; it schedules deferred
freeing operations for a later point in a compound transaction chain.
While it's primarily used by bunmapi, its use has expanded beyond that.
Move it to xfs_alloc.c and rename the function since it's now general
freeing functionality.  Bring the slab cache bits in line with the
way we handle the other intent items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong f3c799c22c xfs: create slab caches for frequently-used deferred items
Create slab caches for the high-level structures that coordinate
deferred intent items, since they're used fairly heavily.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong 9e253954ac xfs: compact deferred intent item structures
Rearrange these structs to reduce the amount of unused padding bytes.
This saves eight bytes for each of the three structs changed here, which
means they're now all (rmap/bmap are 64 bytes, refc is 32 bytes) even
powers of two.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong 182696fb02 xfs: rename _zone variables to _cache
Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:20 -07:00
Darrick J. Wong e7720afad0 xfs: remove kmem_zone typedef
Remove these typedefs by referencing kmem_cache directly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:00:31 -07:00
Linus Torvalds 5ab2ed0a8d fuse fixes for 5.15-rc7
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYXLSYQAKCRDh3BK/laaZ
 PEfYAQCZcGVboa5uIrCYmVnEgXXf5NX0UrrM0ytvnVssGcgUOQEA8nAx3hwyvwvS
 onA14DgXIz3koEE48PWv3gbJdpL/kAM=
 =R0ip
 -----END PGP SIGNATURE-----

Merge tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:
 "Syzbot discovered a race in case of reusing the fuse sb (introduced in
  this cycle).

  Fix it by doing the s_fs_info initialization at the proper place"

* tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: clean up error exits in fuse_fill_super()
  fuse: always initialize sb->s_fs_info
  fuse: clean up fuse_mount destruction
  fuse: get rid of fuse_put_super()
  fuse: check s_root when destroying sb
2021-10-22 10:39:47 -10:00
Christoph Hellwig 1e03a36bdf block: simplify the block device syncing code
Get rid of the indirections and just provide a sync_bdevs
helper for the generic sync code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig 680e667bc2 ntfs3: use sync_blockdev_nowait
Use sync_blockdev_nowait instead of opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig cb9568ee75 fat: use sync_blockdev_nowait
Use sync_blockdev_nowait instead of opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig 1226dfff57 btrfs: use sync_blockdev
Use sync_blockdev instead of opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig 70164eb6cc block: remove __sync_blockdev
Instead offer a new sync_blockdev_nowait helper for the !wait case.
This new helper is exported as it will grow modular callers in a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig 9a208ba5c9 fs: remove __sync_filesystem
There is no clear benefit in having this helper vs just open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig 8c6aabd1c7 nfsd/blocklayout: use ->get_unique_id instead of sending SCSI commands
Call the ->get_unique_id method to query the SCSI identifiers.  This can
use the cached VPD page in the sd driver instead of sending a command
on every LAYOUTGET.  It will also allow to support NVMe based volumes
if the draft for that ever takes off.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20211021060607.264371-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:33:57 -06:00
Pavel Begunkov b22fa62a35 io_uring: apply worker limits to previous users
Another change to the API io-wq worker limitation API added in 5.15,
apply the limit to all prior users that already registered a tctx. It
may be confusing as it's now, in particular the change covers the
following 2 cases:

TASK1                   | TASK2
_________________________________________________
ring = create()         |
                        | limit_iowq_workers()
*not limited*           |

TASK1                   | TASK2
_________________________________________________
ring = create()         |
                        | issue_requests()
limit_iowq_workers()    |
                        | *not limited*

A note on locking, it's safe to traverse ->tctx_list as we hold
->uring_lock, but do that after dropping sqd->lock to avoid possible
problems. It's also safe to access tctx->io_wq there because tasks
kill it only after removing themselves from tctx_list, see
io_uring_cancel_generic() -> io_uring_clean_tctx()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6e09ecc3545e4dc56e43c906ee3d71b7ae21bed.1634818641.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 11:19:38 -06:00
Miklos Szeredi 964d32e512 fuse: clean up error exits in fuse_fill_super()
Instead of "goto err", return error directly, since there's no error
cleanup to do now.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:39 +02:00
Miklos Szeredi 80019f1138 fuse: always initialize sb->s_fs_info
Syzkaller reports a null pointer dereference in fuse_test_super() that is
caused by sb->s_fs_info being NULL.

This is due to the fact that fuse_fill_super() is initializing s_fs_info,
which is too late, it's already on the fs_supers list.  The initialization
needs to be done in sget_fc() with the sb_lock held.

Move allocation of fuse_mount and fuse_conn from fuse_fill_super() into
fuse_get_tree().

After this ->kill_sb() will always be called with non-NULL ->s_fs_info,
hence fuse_mount_destroy() can drop the test for non-NULL "fm".

Reported-by: syzbot+74a15f02ccb51f398601@syzkaller.appspotmail.com
Fixes: 5d5b74aa9c ("fuse: allow sharing existing sb")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:39 +02:00
Miklos Szeredi c191cd07ee fuse: clean up fuse_mount destruction
1. call fuse_mount_destroy() for open coded variants

2. before deactivate_locked_super() don't need fuse_mount destruction since
that will now be done (if ->s_fs_info is not cleared)

3. rearrange fuse_mount setup in fuse_get_tree_submount() so that the
regular pattern can be used

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:39 +02:00
Miklos Szeredi a27c061a49 fuse: get rid of fuse_put_super()
The ->put_super callback is called from generic_shutdown_super() in case of
a fully initialized sb.  This is called from kill_***_super(), which is
called from ->kill_sb instances.

Fuse uses ->put_super to destroy the fs specific fuse_mount and drop the
reference to the fuse_conn, while it does the same on each error case
during sb setup.

This patch moves the destruction from fuse_put_super() to
fuse_mount_destroy(), called at the end of all ->kill_sb instances.  A
follup patch will clean up the error paths.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:38 +02:00
Miklos Szeredi d534d31d6a fuse: check s_root when destroying sb
Checking "fm" works because currently sb->s_fs_info is cleared on error
paths; however, sb->s_root is what generic_shutdown_super() checks to
determine whether the sb was fully initialized or not.

This change will allow cleanup of sb setup error paths.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:38 +02:00
Ian Kent 25f54d08f1 autofs: fix wait name hash calculation in autofs_wait()
There's a mistake in commit 2be7828c9f ("get rid of autofs_getpath()")
that affects kernels from v5.13.0, basically missed because of me not
fully testing the change for Al.

The problem is that the hash calculation for the wait name qstr hasn't
been updated to account for the change to use dentry_path_raw(). This
prevents the correct matching an existing wait resulting in multiple
notifications being sent to the daemon for the same mount which must
not occur.

The problem wasn't discovered earlier because it only occurs when
multiple processes trigger a request for the same mount concurrently
so it only shows up in more aggressive testing.

Fixes: 2be7828c9f ("get rid of autofs_getpath()")
Cc: stable@vger.kernel.org
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-10-20 21:09:02 -04:00
Len Baker 6446c4fb12 aio: Prefer struct_size over open coded arithmetic
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

So, use the struct_size() helper to do the arithmetic instead of the
argument "size + size * count" in the kzalloc() function.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20 18:24:31 -05:00
Len Baker 98b160c828 writeback: prefer struct_size over open coded arithmetic
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

In this case these are not actually dynamic sizes: all the operands
involved in the calculation are constant values. However it is better to
refactor them anyway, just to keep the open-coded math idiom out of
code.

So, use the struct_size() helper to do the arithmetic instead of the
argument "size + count * size" in the kzalloc() functions.

This code was detected with the help of Coccinelle and audited and fixed
manually.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20 18:20:28 -05:00
Gustavo A. R. Silva c2e4e3b756 xfs: Use kvcalloc() instead of kvzalloc()
Use 2-factor argument multiplication form kvcalloc() instead of
kvzalloc().

Link: https://github.com/KSPP/linux/issues/162
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20 18:14:12 -05:00
Linus Torvalds 2f111a6fd5 Two important filesystem fixes, marked for stable. The blocklisted
superblocks issue was particularly annoying because for unexperienced
 users it essentially exacted a reboot to establish a new functional
 mount in that scenario.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmFwWuYTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziwdHB/wJEvFkMQrlzbgVmijhmneU+TseAMxR
 UBGnsyHdiimIDWqzb81cBuDfrocQzyhntghP2lBzcbzI+gZN1KlrYzKAbYk++cfi
 E5Zbw3U8+moa5B2CnO19QEgmJY5DoXYXb6AbO3udIIj1Ls9lx0ByUyDoSn6fZyVH
 iUQ9OH7zVTsTscoaBiEVcutmhQjIFjoYJqPpfCg6/15xcXX/L1DvxQFBWOxXqHQw
 LYfCQIu8orrA2QdZpuTRpklrMg1Ih+RmqYTdQST6tTtTKJUrHPI0r3A8c2vUoBk1
 ph4fBNsAMUqFn1fIGT88PJg81RC5RC3E6D5PqErzRFsPbAv9FHfGYvGQ
 =FadF
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two important filesystem fixes, marked for stable.

  The blocklisted superblocks issue was particularly annoying because
  for unexperienced users it essentially exacted a reboot to establish a
  new functional mount in that scenario"

* tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client:
  ceph: fix handling of "meta" errors
  ceph: skip existing superblocks that are blocklisted or shut down when mounting
2021-10-20 10:23:05 -10:00
Andreas Gruenbacher 1b223f7065 gfs2: Eliminate ip->i_gh
Now that gfs2_file_buffered_write is the only remaining user of
ip->i_gh, we can move the glock holder to the stack (or rather, use the
one we already have on the stack); there is no need for keeping the
holder in the inode anymore.

This is slightly complicated by the fact that we're using ip->i_gh for
the statfs inode in gfs2_file_buffered_write as well.  Writing to the
statfs inode isn't very common, so allocate the statfs holder
dynamically when needed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:09 +02:00
Andreas Gruenbacher b924bdab74 gfs2: Move the inode glock locking to gfs2_file_buffered_write
So far, for buffered writes, we were taking the inode glock in
gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of
not holding the inode glock while iomap_write_actor faults in user
pages.  It turns out that iomap_write_actor is called inside iomap_begin
... iomap_end, so the user pages were still faulted in while holding the
inode glock and the locking code in iomap_begin / iomap_end was
completely pointless.

Move the locking into gfs2_file_buffered_write instead.  We'll take care
of the potential deadlocks due to faulting in user pages while holding a
glock in a subsequent patch.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:09 +02:00
Bob Peterson dc732906c2 gfs2: Introduce flag for glock holder auto-demotion
This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
will allow glocks to be demoted automatically on locking conflicts.
When a locking request comes in that isn't compatible with the locking
state of an active holder and that holder has the HIF_MAY_DEMOTE flag
set, the holder will be demoted before the incoming locking request is
granted.

Note that this mechanism demotes active holders (with the HIF_HOLDER
flag set), while before we were only demoting glocks without any active
holders.  This allows processes to keep hold of locks that may form a
cyclic locking dependency; the core glock logic will then break those
dependencies in case a conflicting locking request occurs.  We'll use
this to avoid giving up the inode glock proactively before faulting in
pages.

Processes that allow a glock holder to be taken away indicate this by
calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag.
Later, they call gfs2_holder_disallow_demote() to clear the flag again,
and then they check if their holder is still queued: if it is, they are
still holding the glock; if it isn't, they can re-acquire the glock (or
abort).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:08 +02:00
Andreas Gruenbacher 6144464937 gfs2: Clean up function may_grant
Pass the first current glock holder into function may_grant and
deobfuscate the logic there.

While at it, switch from BUG_ON to GLOCK_BUG_ON in may_grant.  To make
that build cleanly, de-constify the may_grant arguments.

We're now using function find_first_holder in do_promote, so move the
function's definition above do_promote.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:08 +02:00
Andreas Gruenbacher 2eb7509a05 gfs2: Add wrapper for iomap_file_buffered_write
Add a wrapper around iomap_file_buffered_write.  We'll add code for when
the operation needs to be retried here later.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20 19:33:08 +02:00
Christoph Hellwig 008f75a20e block: cleanup the flush plug helpers
Consolidate the various helpers into a single blk_flush_plug helper that
takes a plk_plug and the from_scheduler bool and switch all callsites to
call it directly.  Checks that the plug is non-NULL must be performed by
the caller, something that most already do anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 09:56:11 -06:00
Pavel Begunkov 4ea672ab69 io_uring: fix ltimeout unprep
io_unprep_linked_timeout() is broken, first it needs to return back
REQ_F_ARM_LTIMEOUT, so the linked timeout is enqueued and disarmed. But
now we refcounted it, and linked timeouts may get not executed at all,
leaking a request.

Just kill the unprep optimisation.

Fixes: 906c6caaf5 ("io_uring: optimise io_prep_linked_timeout()")
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51b8e2bfc4bea8ee625cf2ba62b2a350cc9be031.1634719585.git.asml.silence@gmail.com
Link: https://github.com/axboe/liburing/issues/460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 09:54:16 -06:00
Pavel Begunkov e139a1ec92 io_uring: apply max_workers limit to all future users
Currently, IORING_REGISTER_IOWQ_MAX_WORKERS applies only to the task
that issued it, it's unexpected for users. If one task creates a ring,
limits workers and then passes it to another task the limit won't be
applied to the other task.

Another pitfall is that a task should either create a ring or submit at
least one request for IORING_REGISTER_IOWQ_MAX_WORKERS to work at all,
furher complicating the picture.

Change the API, save the limits and apply to all future users. Note, it
should be done first before giving away the ring or submitting new
requests otherwise the result is not guaranteed.

Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Link: https://github.com/axboe/liburing/issues/460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51d0bae97180e08ab722c0d5c93e7439cfb6f697.1634683237.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 09:54:06 -06:00
Changcheng Deng 898df2447b io_uring: Use ERR_CAST() instead of ERR_PTR(PTR_ERR())
Use ERR_CAST() instead of ERR_PTR(PTR_ERR()).
This makes it more readable and also fix this warning detected by
err_cast.cocci:
./fs/io_uring.c: WARNING: 3208: 11-18: ERR_CAST can be used with buf

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Link: https://lore.kernel.org/r/20211020084948.1038420-1-deng.changcheng@zte.com.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:02:35 -06:00
Vivek Goyal 15bf32398a security: Return xattr name from security_dentry_init_security()
Right now security_dentry_init_security() only supports single security
label and is used by SELinux only. There are two users of this hook,
namely ceph and nfs.

NFS does not care about xattr name. Ceph hardcodes the xattr name to
security.selinux (XATTR_NAME_SELINUX).

I am making changes to fuse/virtiofs to send security label to virtiofsd
and I need to send xattr name as well. I also hardcoded the name of
xattr to security.selinux.

Stephen Smalley suggested that it probably is a good idea to modify
security_dentry_init_security() to also return name of xattr so that
we can avoid this hardcoding in the callers.

This patch adds a new parameter "const char **xattr_name" to
security_dentry_init_security() and LSM puts the name of xattr
too if caller asked for it (xattr_name != NULL).

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
[PM: fixed typos in the commit description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-20 08:17:08 -04:00
Marios Makassikis 0d994cd482 ksmbd: add buffer validation in session setup
Make sure the security buffer's length/offset are valid with regards to
the packet length.

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-10-20 00:07:10 -05:00
Namjae Jeon 621be84a9d ksmbd: throttle session setup failures to avoid dictionary attacks
To avoid dictionary attacks (repeated session setups rapidly sent) to
connect to server, ksmbd make a delay of a 5 seconds on session setup
failure to make it harder to send enough random connection requests
to break into a server if a user insert the wrong password 10 times
in a row.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-10-20 00:07:10 -05:00
Hyunchul Lee 34061d6b76 ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests
Validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests and
check the free size of response buffer for these requests.

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-10-20 00:07:10 -05:00
Hao Xu 3b44b3712c io_uring: split logic of force_nonblock
Currently force_nonblock stands for three meanings:
 - nowait or not
 - in an io-worker or not(hold uring_lock or not)

Let's split the logic to two flags, IO_URING_F_NONBLOCK and
IO_URING_F_UNLOCKED for convenience of the next patch.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20211018133431.103298-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:21:42 -06:00
Pavel Begunkov bc369921d6 io-wq: max_worker fixes
First, fix nr_workers checks against max_workers, with max_worker
registration, it may pretty easily happen that nr_workers > max_workers.

Also, synchronise writing to acct->max_worker with wqe->lock. It's not
an actual problem, but as we don't care about io_wqe_create_worker(),
it's better than WRITE_ONCE()/READ_ONCE().

Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/11f90e6b49410b7d1a88f5d04fb8d95bb86b8cf3.1634671835.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 17:09:34 -06:00
Darrick J. Wong 9fa47bdcd3 xfs: use separate btree cursor cache for each btree type
Now that we have the infrastructure to track the max possible height of
each btree type, we can create a separate slab cache for cursors of each
type of btree.  For smaller indices like the free space btrees, this
means that we can pack more cursors into a slab page, improving slab
utilization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong 0ed5f7356d xfs: compute absolute maximum nlevels for each btree type
Add code for all five btree types so that we can compute the absolute
maximum possible btree height for each btree type.  This is a setup for
the next patch, which makes every btree type have its own cursor cache.

The functions are exported so that we can have xfs_db report the
absolute maximum btree heights for each btree type, rather than making
everyone run their own ad-hoc computations.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong bc8883eb77 xfs: kill XFS_BTREE_MAXLEVELS
Nobody uses this symbol anymore, so kill it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong 9ec691205e xfs: compute the maximum height of the rmap btree when reflink enabled
Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
enough to handle the maximally tall rmap btree when all blocks are in
use and maximally shared, let's compute the maximum height assuming the
rmapbt consumes as many blocks as possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong 1b236ad7ba xfs: clean up xfs_btree_{calc_size,compute_maxlevels}
During review of the next patch, Dave remarked that he found these two
btree geometry calculation functions lacking in documentation and that
they performed more work than was really necessary.

These functions take the same parameters and have nearly the same logic;
the only real difference is in the return values.  Reword the function
comment to make it clearer what each function does, and move them to be
adjacent to reinforce their relation.

Clean up both of them to stop opencoding the howmany functions, stop
using the uint typedefs, and make them both support computations for
more than 2^32 leaf records, since we're going to need all of the above
for files with large data forks and large rmap btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong b74e15d720 xfs: compute maximum AG btree height for critical reservation calculation
Compute the actual maximum AG btree height for deciding if a per-AG
block reservation is critically low.  This only affects the sanity check
condition, since we /generally/ will trigger on the 10% threshold.  This
is a long-winded way of saying that we're removing one more usage of
XFS_BTREE_MAXLEVELS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:15 -07:00
Darrick J. Wong 7cb3efb4cf xfs: rename m_ag_maxlevels to m_allocbt_maxlevels
Years ago when XFS was thought to be much more simple, we introduced
m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
a given filesystem mount.  Then we observed that inode btrees don't
actually have the same height and split that off; and now we have rmap
and refcount btrees with much different geometries and separate
maxlevels variables.

The 'ag' part of the name doesn't make much sense anymore, so rename
this to m_alloc_maxlevels to reinforce that this is the maximum height
of the *free space* btrees.  This sets us up for the next patch, which
will add a variable to track the maximum height of all AG btrees.

(Also take the opportunity to improve adjacent comments and fix minor
style problems.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:15 -07:00
Darrick J. Wong c940a0c54a xfs: dynamically allocate cursors based on maxlevels
To support future btree code, we need to be able to size btree cursors
dynamically for very large btrees.  Switch the maxlevels computation to
use the precomputed values in the superblock, and create cursors that
can handle a certain height.  For now, we retain the btree cursor cache
that can handle up to 9-level btrees, though a subsequent patch
introduces separate caches for each btree type, where each cache's
objects will be exactly tall enough to handle the specific btree type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:15 -07:00
Darrick J. Wong c0643f6fdd xfs: encode the max btree height in the cursor
Encode the maximum btree height in the cursor, since we're soon going to
allow smaller cursors for AG btrees and larger cursors for file btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:15 -07:00
Darrick J. Wong 56370ea6e5 xfs: refactor btree cursor allocation function
Refactor btree allocation to a common helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:15 -07:00
Darrick J. Wong 69724d920e xfs: rearrange xfs_btree_cur fields for better packing
Reduce the size of the btree cursor structure some more by rearranging
fields to eliminate unused space.  While we're at it, fix the ragged
indentation and a spelling error.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:14 -07:00
Darrick J. Wong 6ca444cfd6 xfs: prepare xfs_btree_cur for dynamic cursor heights
Split out the btree level information into a separate struct and put it
at the end of the cursor structure as a VLA.  Files with huge data forks
(and in the future, the realtime rmap btree) will require the ability to
support many more levels than a per-AG btree cursor, which means that
we're going to create per-btree type cursor caches to conserve memory
for the more common case.

Note that a subsequent patch actually introduces dynamic cursor heights.
This one merely rearranges the structure to prepare for that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:14 -07:00
Darrick J. Wong eae5db476f xfs: dynamically allocate btree scrub context structure
Reorganize struct xchk_btree so that we can dynamically size the context
structure to fit the type of btree cursor that we have.  This will
enable us to use memory more efficiently once we start adding very tall
btree types.  Right-size the lastkey array to match the number of *node*
levels in the tree so that we stop wasting space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:14 -07:00
Darrick J. Wong d47fef9342 xfs: don't track firstrec/firstkey separately in xchk_btree
The btree scrubbing code checks that the records (or keys) that it finds
in a btree block are all in order by calling the btree cursor's
->recs_inorder function.  This of course makes no sense for the first
item in the block, so we switch that off with a separate variable in
struct xchk_btree.

Christoph helped me figure out that the variable is unnecessary, since
we just accessed bc_ptrs[level] and can compare that against zero.  Use
that, and save ourselves some memory space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:14 -07:00
Darrick J. Wong efb79ea310 xfs: reduce the size of nr_ops for refcount btree cursors
We're never going to run more than 4 billion btree operations on a
refcount cursor, so shrink the field to an unsigned int to reduce the
structure size.  Fix whitespace alignment too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:14 -07:00
Darrick J. Wong cc41174047 xfs: remove xfs_btree_cur.bc_blocklog
This field isn't used by anyone, so get rid of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:13 -07:00
Darrick J. Wong 94a14cfd3b xfs: fix incorrect decoding in xchk_btree_cur_fsbno
During review of subsequent patches, Dave and I noticed that this
function doesn't work quite right -- accessing cur->bc_ino depends on
the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
because block 0 of an AG is never part of a btree.

Note: This changes the btree scrubber tracepoints behavior -- if the
cursor has no buffer for a certain level, it will always report
NULLFSBLOCK.  It is assumed that anyone tracing the online fsck code
will also be tracing xchk_start/xchk_done or otherwise be aware of what
exactly is being scrubbed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:13 -07:00
Brian Foster 892a666faf xfs: fix perag reference leak on iteration race with growfs
The for_each_perag*() set of macros are hacky in that some (i.e.
those based on sb_agcount) rely on the assumption that perag
iteration terminates naturally with a NULL perag at the specified
end_agno. Others allow for the final AG to have a valid perag and
require the calling function to clean up any potential leftover
xfs_perag reference on termination of the loop.

Aside from providing a subtly inconsistent interface, the former
variant is racy with growfs because growfs can create discoverable
post-eofs perags before the final superblock update that completes
the grow operation and increases sb_agcount. This leads to the
following assert failure (reproduced by xfs/104) in the perag free
path during unmount:

 XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/libxfs/xfs_ag.c, line: 195

This occurs because one of the many for_each_perag() loops in the
code that is expected to terminate with a NULL pag (and thus has no
post-loop xfs_perag_put() check) raced with a growfs and found a
non-NULL post-EOFS perag, but terminated naturally based on the
end_agno check without releasing the post-EOFS perag.

Rework the iteration logic to lift the agno check from the main for
loop conditional to the iteration helper function. The for loop now
purely terminates on a NULL pag and xfs_perag_next() avoids taking a
reference to any perag beyond end_agno in the first place.

Fixes: f250eedcf7 ("xfs: make for_each_perag... a first class citizen")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-19 11:45:13 -07:00
Brian Foster 8ed004eb9d xfs: terminate perag iteration reliably on agcount
The for_each_perag_from() iteration macro relies on sb_agcount to
process every perag currently within EOFS from a given starting
point. It's perfectly valid to have perag structures beyond
sb_agcount, however, such as if a growfs is in progress. If a perag
loop happens to race with growfs in this manner, it will actually
attempt to process the post-EOFS perag where ->pag_agno ==
sb_agcount. This is reproduced by xfs/104 and manifests as the
following assert failure in superblock write verifier context:

 XFS: Assertion failed: agno < mp->m_sb.sb_agcount, file: fs/xfs/libxfs/xfs_types.c, line: 22

Update the corresponding macro to only process perags that are
within the current sb_agcount.

Fixes: 58d43a7e32 ("xfs: pass perags around in fsmap data dev functions")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-19 11:45:13 -07:00
Brian Foster f1788b5e5e xfs: rename the next_agno perag iteration variable
Rename the next_agno variable to be consistent across the several
iteration macros and shorten line length.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-19 11:45:13 -07:00
Brian Foster bf2307b195 xfs: fold perag loop iteration logic into helper function
Fold the loop iteration logic into a helper in preparation for
further fixups. No functional change in this patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-19 11:45:12 -07:00
Qing Wang 53eb47b491 xfs: replace snprintf in show functions with sysfs_emit
coccicheck complains about the use of snprintf() in sysfs show functions.

Fix the coccicheck warning:
WARNING: use scnprintf or sprintf.

Use sysfs_emit instead of scnprintf or sprintf makes more sense.

Signed-off-by: Qing Wang <wangqing@vivo.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-19 11:45:12 -07:00
J. Bruce Fields e9728cc72d locks: remove changelog comments
This is only of historical interest, and anyone interested in the
history can dig out an old version of locks.c from from git.

Triggered by the observation that it references the now-removed
Documentation/filesystems/mandatory-locking.rst.

Reported-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-10-19 14:11:39 -04:00
Arnd Bergmann 00169246e6 io_uring: warning about unused-but-set parameter
When enabling -Wunused warnings by building with W=1, I get an
instance of the -Wunused-but-set-parameter warning in the io_uring code:

fs/io_uring.c: In function 'io_queue_async_work':
fs/io_uring.c:1445:61: error: parameter 'locked' set but not used [-Werror=unused-but-set-parameter]
 1445 | static void io_queue_async_work(struct io_kiocb *req, bool *locked)
      |                                                       ~~~~~~^~~~~~

There are very few warnings of this type, so it would be nice to enable
this by default and fix all the existing instances. As the assignment
serves no purpose by itself other than to prevent developers from using
the variable, an easy workaround is to remove the assignment and just
rename the argument to "dont_use".

Fixes: f237c30a56 ("io_uring: batch task work locking")
Link: https://lore.kernel.org/lkml/20210920121352.93063-1-arnd@kernel.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20211019153507.348480-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 09:50:18 -06:00
Gao Xiang 622ceaddb7 erofs: lzma compression support
Add MicroLZMA support in order to maximize compression ratios for
specific scenarios. For example, it's useful for low-end embedded
boards and as a secondary algorithm in a file for specific access
patterns.

MicroLZMA is a new container format for raw LZMA1, which was created
by Lasse Collin aiming to minimize old LZMA headers and get rid of
unnecessary EOPM (end of payload marker) as well as to enable
fixed-sized output compression, especially for 4KiB pclusters.

Similar to LZ4, inplace I/O approach is used to minimize runtime
memory footprint when dealing with I/O. Overlapped decompression is
handled with 1) bounced buffer for data under processing or 2) extra
short-lived pages from the on-stack pagepool which will be shared in
the same read request (128KiB for example).

Link: https://lore.kernel.org/r/20211010213145.17462-8-xiang@kernel.org
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19 23:44:30 +08:00
Gao Xiang 966edfb0a3 erofs: rename some generic methods in decompressor
Previously, some LZ4 methods were named with `generic'. However, while
evaluating the effective LZMA approach, it seems they aren't quite
generic at all (e.g. no need preparing dstpages for most LZMA cases.)

Avoid such naming instead.

Link: https://lore.kernel.org/r/20211010213145.17462-7-xiang@kernel.org
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19 23:44:30 +08:00
Gao Xiang 386292919c erofs: introduce readmore decompression strategy
Previously, the readahead window was strictly followed by EROFS
decompression strategy in order to minimize extra memory footprint.
However, it could become inefficient if just reading the partial
requested data for much big LZ4 pclusters and the upcoming LZMA
implementation.

Let's try to request the leading data in a pcluster without
triggering memory reclaiming instead for the LZ4 approach first
to boost up 100% randread of large big pclusters, and it has no real
impact on low memory scenarios.

It also introduces a way to expand read lengths in order to decompress
the whole pcluster, which is useful for LZMA since the algorithm
itself is relatively slow and causes CPU bound, but LZ4 is not.

Link: https://lore.kernel.org/r/20211008200839.24541-4-xiang@kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19 23:44:30 +08:00
Gao Xiang 72bb52620f erofs: introduce the secondary compression head
Previously, for each HEAD lcluster, it can be either HEAD or PLAIN
lcluster to indicate whether the whole pcluster is compressed or not.

In this patch, a new HEAD2 head type is introduced to specify another
compression algorithm other than the primary algorithm for each
compressed file, which can be used for upcoming LZMA compression and
LZ4 range dictionary compression for various data patterns.

It has been stayed in the EROFS roadmap for years. Complete it now!

Link: https://lore.kernel.org/r/20211017165721.2442-1-xiang@kernel.org
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19 23:44:19 +08:00
Jens Axboe 5ca7a8b3f6 io_uring: inform block layer of how many requests we are submitting
The block layer can use this knowledge to make smarter decisions on
how to handle the request, if it knows that N more may be coming. Switch
to using blk_start_plug_nr_ios() to pass in that information.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:56 -06:00
Pavel Begunkov 88459b50b4 io_uring: simplify io_file_supports_nowait()
Make sure that REQ_F_SUPPORT_NOWAIT is always set io_prep_rw(), and so
we can stop caring about setting it down the line simplifying
io_file_supports_nowait().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60c8f1f5e2cb45e00f4897b2cec10c5b3669da91.1634425438.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:56 -06:00
Pavel Begunkov 35645ac3c1 io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags
Merge REQ_F_NOWAIT_READ and REQ_F_NOWAIT_WRITE into one flag, i.e.
REQ_F_SUPPORT_NOWAIT. First it gets rid of dependence on CONFIG_64BIT
but also simplifies the code.

One thing to consider is when we don't have ->{read,write}_iter and go
through loop_rw_iter(). Just fail it with -EAGAIN if we expect nowait
behaviour but not sure whether it supports it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f832a20e5186c2e79c6519280c238f559a1d2bbc.1634425438.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:56 -06:00
Pavel Begunkov e74ead135b io_uring: arm poll for non-nowait files
Don't check if we can do nowait before arming apoll, there are several
reasons for that. First, we don't care much about files that don't
support nowait. Second, it may be useful -- we don't want to be taking
away extra workers from io-wq when it can go in some async. Even if it
will go through io-wq eventually, it make difference in the numbers of
workers actually used. And the last one, it's needed to clean nowait in
future commits.

[kernel test robot: fix unused-var]

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9d06f3cb2c8b686d970269a87986f154edb83043.1634425438.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:56 -06:00
Noah Goldstein b10841c98c fs/io_uring: Prioritise checking faster conditions first in io_write
This commit reorders the conditions in a branch in io_write. The
reorder to check 'ret2 == -EAGAIN' first as checking
'(req->ctx->flags & IORING_SETUP_IOPOLL)' will likely be more
expensive due to 2x memory derefences.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Link: https://lore.kernel.org/r/20211017013229.4124279-1-goldstein.w.n@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:56 -06:00
Pavel Begunkov 5cb03d6342 io_uring: clean io_prep_rw()
We already store req->file in a variable in io_prep_rw(), just use it
instead of a couple of left references to kicob->ki_filp.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f5889fc7ab670daefd5ccaedd99416d8355f0ad.1634314022.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:56 -06:00
Pavel Begunkov 578c0ee234 io_uring: optimise fixed rw rsrc node setting
Move fixed rw io_req_set_rsrc_node() from rw prep into
io_import_fixed(), if we're using fixed buffers it will always be called
during submission as we save the state in advance,

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/68c06f66d5aa9661f1e4b88d08c52d23528297ec.1634314022.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:56 -06:00
Pavel Begunkov caa8fe6e86 io_uring: return iovec from __io_import_iovec
We pass iovec** into __io_import_iovec(), which should keep it,
initialise and modify accordingly. It's expensive, return it directly
from __io_import_iovec encoding errors with ERR_PTR if needed.

io_import_iovec keeps the old interface, but it's inline and so
everything is optimised nicely.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6230e9769982f03a8f86fa58df24666088c44d3e.1634314022.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:56 -06:00
Pavel Begunkov d1d681b084 io_uring: optimise io_import_iovec fixed path
Delay loading req->rw.{addr,len} in io_import_iovec until it's really
needed, so removing extra loads for the fixed path, which doesn't use
them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3cc48dd0c4f1a37c4ce9aab5784281a2d83ad8be.1634314022.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 9882131cd9 io_uring: kill io_wq_current_is_worker() in iopoll
Don't decide about locking based on io_wq_current_is_worker(), it's not
consistent with all other code and is expensive, use issue_flags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7546d5a58efa4360173541c6fe02ee6b8c7b4ea7.1634314022.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 9983028e76 io_uring: optimise req->ctx reloads
Don't load req->ctx in advance, it takes an extra register and the field
stays valid even after opcode handlers. It also optimises out req->ctx
load in io_iopoll_req_issued() once it's inlined.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e45ff671c44be0eb904f2e448a211734893fa0b.1634314022.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 607b6fb801 io_uring: rearrange io_read()/write()
Combine force_nonblock branches (which is already optimised by
compiler), flip branches so the most hot/common path is the first, e.g.
as with non on-stack iov setup, and add extra likely/unlikely
attributions for errror paths.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c2536c5896d70994de76e387ea09a0402173a3f.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 5e49c973fc io_uring: clean up io_import_iovec
Make io_import_iovec taking struct io_rw_state instead of an iter
pointer. First it takes care of initialising iovec pointer, which can be
forgotten. Even more, we can not init it if not needed, e.g. in case of
IORING_OP_READ_FIXED or IORING_OP_READ. Also hide saving iter_state
inside of it by splitting out an inline function of it to avoid extra
ifs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b1bbc213a95e5272d4da5867bb977d9acb6f2109.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 51aac424ae io_uring: optimise io_import_iovec nonblock passing
First, change IO_URING_F_NONBLOCK to take sign bit of the int, so
checking for it can be turned into test + sign-based-jump, makes the
binary smaller and may be faster.

Then, instead of passing need_lock boolean into io_import_iovec() just
give it issue_flags, which is already stored somewhere. Saves some space
on stack, a couple of test + cmov operations and other conversions.

note: we still leave
force_nonblock = issue_flags & IO_URING_F_NONBLOCK
variable, but it's optimised out by the compiler into testing
issue_flags directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ee96547e692f6c975c229cd82fc721679571a734.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov c88598a92a io_uring: optimise read/write iov state storing
Currently io_read() and io_write() keep separate pointers to an iter and
to struct iov_iter_state, which is not great for register spilling and
requires more on-stack copies. They are both either on-stack or in
req->async_data at the same time, so use struct io_rw_state and keep a
pointer only to it, so having all the state with just one pointer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5c5e7ffd7dc25fc35075c70411ba99df72f237fa.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 538941e268 io_uring: encapsulate rw state
Add a new struct io_rw_state storing all iov related bits: fast iov,
iterator and iterator state. Not much changes here, simply convert
struct io_async_rw to use it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e8245ffcb568b228a009ec1eb79c993c813679f1.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 258f3a7f84 io_uring: optimise rw comletion handlers
Don't override req->result in io_complete_rw_iopoll() when it's already
of the same value, we have an if just above it, so move the assignment
there. Also, add one simle unlikely() in __io_complete_rw_common().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8dfeb4f84026a20172bcf82c05010abe955874ae.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov f80a50a632 io_uring: prioritise read success path over fails
Rearrange io_read return handling so first we expect it completing
successfully and only then checking for errors, which is a colder path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c91c7c2da11815ec8b04b5d872f60dc4cde662c5.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 04f34081c5 io_uring: consistent typing for issue_flags
Some of the functions keep issue_flags as int, change those to unsigned.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/04ad43797783bc9cc7567f287ab545518f8e8cf2.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov ab40940247 io_uring: optimise rsrc referencing
Apparently, percpu_ref_put/get() are expensive enough if done per
request, get them in a batch and cache on the submission side to avoid
getting it over and over again. Also, if we're completing under
uring_lock, return refs back into the cache instead of
perfcpu_ref_put(). Pretty similar to how we do tctx->cached_refs
accounting, but fall back to normal putting when we already changed a
rsrc node by the time of free.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b40d8c5bc77d3c9550df8a319117a374ac85f8f4.1633817310.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov a46be971ed io_uring: optimise io_req_set_rsrc_node()
io_req_set_rsrc_node() reloads loads req->ctx, however it's already in
registers in all use cases, so better to pass it as a parameter.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/67a25557b8a51e90bfd578447a6f1671911b05ae.1633817310.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov def77acf43 io_uring: fix io_free_batch_list races
[  158.514382] WARNING: CPU: 5 PID: 15251 at fs/io_uring.c:1141 io_free_batch_list+0x269/0x360
[  158.514426] RIP: 0010:io_free_batch_list+0x269/0x360
[  158.514437] Call Trace:
[  158.514440]  __io_submit_flush_completions+0xde/0x180
[  158.514444]  tctx_task_work+0x14a/0x220
[  158.514447]  task_work_run+0x64/0xa0
[  158.514448]  __do_sys_io_uring_enter+0x7c/0x970
[  158.514450]  __x64_sys_io_uring_enter+0x22/0x30
[  158.514451]  do_syscall_64+0x43/0x90
[  158.514453]  entry_SYSCALL_64_after_hwframe+0x44/0xae

We should not touch request internals including req->comp_list.next
after putting our ref if it's not final, e.g. we can start freeing
requests from the free cache.

Fixed: 62ca9cb93e7f8 ("io_uring: optimise io_free_batch_list()")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b1f4df38fbb8f111f52911a02fd418d0283a4e6f.1634047298.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 0cd3e3ddb4 io_uring: remove extra io_ring_exit_work wake up
task_work_add() takes care of waking up the thread, remove useless
wake_up_process().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/de9a71ee255112dcaed3b5d426be24934e74722c.1633532552.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 4a04d1d148 io_uring: optimise out req->opcode reloading
Looking at the assembly, the compiler decided to reload req->opcode in
io_op_defs[opcode].needs_file instead of one it had in a register, so
store it in a temp variable so it can be optimised out. Also move the
personality block later, it's better for spilling/etc. as it only
depends on @sqe, which we're keeping anyway.

By the way, zero req->opcode if it over IORING_OP_LAST, not a problem,
at the moment but is safer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6ba869f5f8b7b0f991c87fdf089f0abf87cbe06b.1633532552.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 5a158c6b0d io_uring: reshuffle io_submit_state bits
struct io_submit_state's ->free_list and ->link are hotter and smaller
than ->plug, place them first.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6ad3c15849f50b27ad012c042c73e6e069d22df7.1633532552.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 756ab7c0ec io_uring: safer fallback_work free
Add extra wq flushing for fallback_work, that's not necessary but safer
if invariants of io_fallback_req_func() change.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/24179419d6748516299600bc914f50b9e0b02275.1633532552.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 6d63416dc5 io_uring: optimise plugging
Plugging is only needed with requests that also need a file, so hide
plugging under a ->needs_file check. Also, place ->needs_file and ->plug
bits into the same byte of io_op_defs, it may matter for compilers, e.g.
only with the change a tested one decided to optimise two memory testb
into a more with two register testb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1600d1287bb7d16451d4ef3343252787a5314927.1633532552.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 54daa9b2d8 io_uring: correct fill events helpers types
CQE result is a 32-bit integer, so the functions generating CQEs are
better to accept not long but ints. Convert io_cqring_fill_event() and
other helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7ca6f15255e9117eae28adcac272744cae29b113.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov eb6e6f0690 io_uring: inline io_poll_complete
Inline io_poll_complete(), it's simple and doesn't have any particular
purpose.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/933d7ee3e4450749a2d892235462c8f18d030293.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:55 -06:00
Pavel Begunkov 867f8fa5ae io_uring: inline io_req_needs_clean()
There is only a single user of io_req_needs_clean() inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6111d0221ef4b439cad401e135dd6a5f990a0501.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov d17e56eb49 io_uring: remove struct io_completion
We keep struct io_completion only as a temporal storage of cflags, Place
it in io_kiocb, it's cleaner, removes extra bits and even might be used
for future optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5299bd5c223204065464bd87a515d0e405316086.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov d886e185a1 io_uring: control ->async_data with a REQ_F flag
->async_data is a slow path, so it won't matter much if we do the clean
up inside io_clean_op(). Moreover, in many cases it's allocated together
with setting one or more of IO_REQ_CLEAN_FLAGS flags, so it'd go through
io_clean_op() anyway.

Control ->async_data allocation with a new flag REQ_F_ASYNC_DATA, so we
can do all the maintainence under io_req_needs_clean() fast check.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6892cf5883c459f36bda26f30ceb16742b20b84b.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov c1e53a6988 io_uring: optimise io_free_batch_list()
Delay reading the next node in io_free_batch_list(), allows the compiler
to load the value a bit later improving register spilling in some cases.
With gcc 11.1 it helped to move @task_refs variable from the stack to a
register and optimises out a couple of per request instructions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cc9fdfb6f72a4e8bc9918a5e9f2d97869a263ae4.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov c072481ded io_uring: mark cold functions
Attribute cold functions so compilers can optimise them for size. It
shrinks the binary by 2.5-3%

   text    data     bss     dec     hex filename
  90670   14002       8  104680   198e8 ./fs/io_uring.o
  88053   14002       8  102063   18eaf ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b53d385f91dca45170b67d7f11c7abd787e821f6.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov 37f0e767e1 io_uring: optimise ctx referencing by requests
Currenlty, we allocate one ctx reference per request at submission time
and put them at free. It's batched and not so expensive but it still
bloats the kernel, adds 2 function calls for rcu and adds some overhead
for request counting in io_free_batch_list().

Always keep one reference with a request, even when it's freed and in
io_uring request caches. There is extra work at ring exit / quiesce
paths, which now need to put all cached requests. io_ring_exit_work() is
already looping, so it's not a problem. Add hybrid-busy waiting to
io_ctx_quiesce() as well for now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/99613fbe396e80777228cde39bbda1aa8938554e.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov d60aa65ba2 io_uring: merge CQ and poll waitqueues
->cq_wait and ->poll_wait and waken up in the same manner, use a single
waitqueue for both of them. CQ waiters are queued exclusively, so wake
up should first go over all pollers and that's what we need.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/00fe603e50000365774cf8435ef5fe03f049c1c9.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov aede728aae io_uring: don't wake sqpoll in io_cqring_ev_posted
io_cqring_ev_posted() doesn't need to wake SQPOLL, it's either done by
userspace or with task_work, but no action is required on request
completion. Rip off bits waking it up in io_cqring_ev_posted().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b49dab27b64cf11f4c50f2f90dcaac123430e05d.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov 765ff496c7 io_uring: optimise INIT_WQ_LIST
The invariant of io_wq_work_list is that it's empty IFF ->first is NULL,
so no need to initially set ->last. With now having more users of the
list it may play a role, i.e. used in each tw iteration and on every
completion flushing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c464ab5cab6e46a858c6d39c107e92b3b5291f13.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov a33ae9ce16 io_uring: optimise request allocation
Even after fully inlining io_alloc_req() my compiler does a NULL check
in the path of successful allocation, no hacks like an empty dereference
help it. Restructure io_alloc_req() by splitting out refilling part, so
the compiler generate a slightly better binary.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/eda17571bdc7248d8e617b23e7132a5416e4680b.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov fff4e40e30 io_uring: delay req queueing into compl-batch list
io_req_complete_state() is inlined and used in lots of places, so we
want to keep it concise. Move adding a request into a completion batch
list from io_req_complete_state() into the consumer, i.e.
__io_queue_sqe().

before vs after
   text    data     bss     dec     hex filename
  91894   14002       8  105904   19db0 ./fs/io_uring.o
  91046   14002       8  105056   19a60 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4afca4e11abfd4cc8e99777fdcaf4d34cf4d022d.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov 51d48dab62 io_uring: add more likely/unlikely() annotations
Add two extra unlikely() in io_submit_sqes() and one around
io_req_needs_clean() to help the compiler to avoid extra jumps
in hot paths.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88e087afe657e7660194353aada9b00f11d480f9.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov 7e3709d576 io_uring: optimise kiocb layout
We want ->comp_list in the second cacheline, which is hotter comparing
to the 3rd. Swap the field with ->link, which is not as hot and
controlled by flags and so not accessed unless there is a link.

By the way add a couple of comments for io_kiocb fields.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9d9dde31f8f62279a5f48c575bbc27b8290edc0c.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov 6224590d24 io_uring: add flag to not fail link after timeout
For some reason non-off IORING_OP_TIMEOUT always fails links, it's
pretty inconvenient and unnecessary limits chaining after it to hard
linking, which is far from ideal, e.g. doesn't pair well with timeout
cancellation. Add a flag forcing it to not fail links on -ETIME.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/17c7ec0fb7a6113cc6be8cdaedcada0ba836ac0e.1633199723.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov 30d51dd4ad io_uring: clean up buffer select
Hiding a pointer to a struct io_buffer in rw.addr is error prone. We
have some place in io_kiocb, so keep kbuf's in a separate field
without aliasing and risks of it being misused.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3e63a6a953b04cad81d9ea827b12344dd57b37b4.1633107393.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov fc0ae0244b io_uring: init opcode in io_init_req()
Move io_req_prep() call inside of io_init_req(), it simplifies a bit
error handling for callers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a0f59291fd52da4672c323542fd56fd899e23f8f.1633107393.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov e0eb71dcfc io_uring: don't return from io_drain_req()
Never return from io_drain_req() but punt to tw if we've got there but
it's a false positive and we shouldn't actually drain.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93583cee51b8783706b76c73196c155b28d9e762.1633107393.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov 22b2ca310a io_uring: extra a helper for drain init
Add a helper io_init_req_drain for initialising requests with
IOSQE_DRAIN set. Also move bits from preambule of io_drain_req() in
there, because we already modify all the bits needed inside the helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dcb412825b35b1cb8891245a387d7d69f8d14cef.1633107393.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov 5e371265ea io_uring: disable draining earlier
Clear ->drain_active in two more cases where we check for a need of
draining. It's not a bug, but still may lead to some extra requests
being punted to io-wq, and that may be not desirable.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d20b265f77bb4e8860b15b9987252c7c711dfcba.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov a1cdbb4cb5 io_uring: comment why inline complete calls io_clean_op()
io_req_complete_state() calls io_clean_op() and it may be not entirely
obvious, leave a comment.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/21806f862151e223fdf439e5e8ed7178a8d66979.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov ef05d9ebcc io_uring: kill off ->inflight_entry field
->inflight_entry is not used anymore after converting everything to
single linked lists, remove it. Also adjust io_kiocb layout, so all hot
bits are in first 3 cachelines.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fd8d68087ede26c4e1707ce6b175aa1eb2381f2b.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov 6962980947 io_uring: restructure submit sqes to_submit checks
Put an explicit check for number of requests to submit. First,
we can turn while into do-while and it generates better code, and second
that if can be cheaper, e.g. by using CPU flags after sub in
io_sqring_entries().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5926baadd20c28feab7a5e1725fedf32e4553ff7.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov d9f9d2842c io_uring: reshuffle queue_sqe completion handling
If a request completed inline the result should only be zero, it's a
grave error otherwise. So, when we see REQ_F_COMPLETE_INLINE it's not
even necessary to check the return code, and the flag check can be moved
earlier.

It's one "if" less for inline completions, and same two checks for it
normally completing (ret == 0). Those are two cases we care about the
most.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ebd4e397a9c26d96c99b24447acc309741041a83.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov d475a9a622 io_uring: inline hot path of __io_queue_sqe()
Extract slow paths from __io_queue_sqe() into a function and inline the
hot path. With that we have everything completely inlined on the
submission path up until io_issue_sqe().

-> io_submit_sqes()
  -> io_submit_sqe() (inlined)
    -> io_queue_sqe() (inlined)
       -> __io_queue_sqe() (inlined)
         -> io_issue_sqe()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f1606864d95d7f26dc28c7eec3dc6ed6ec32618a.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 4652fe3f10 io_uring: split slow path from io_queue_sqe
We don't want the slow path of io_queue_sqe to be inlined, so extract a
function from it.

   text    data     bss     dec     hex filename
  91950   13986       8  105944   19dd8 ./fs/io_uring.o
  91758   13986       8  105752   19d18 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb01253911f8fb374268f65b1ba939b54ca6583f.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 2a56a9bd64 io_uring: remove drain_active check from hot path
req->ctx->active_drain is a bit too expensive, partially because of two
dereferences. Do a trick, if we see it set in io_init_req(), set
REQ_F_FORCE_ASYNC and it automatically goes through a slower path where
we can catch it. It's nearly free to do in io_init_req() because there
is already ->restricted check and it's in the same byte of a bitmask.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7e7ddc63c15e8a300833132abb3eb8fd3918aef.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov f15a343177 io_uring: deduplicate io_queue_sqe() call sites
There are two call sites of io_queue_sqe() in io_submit_sqe(), combine
them into one, because io_queue_sqe() is inline and we don't want to
bloat binary, and will become even bigger

   text    data     bss     dec     hex filename
  92126   13986       8  106120   19e88 ./fs/io_uring.o
  91966   13986       8  105960   19de8 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/506124b8e767f0a4576f7a459f6aea3d13fb4dda.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 553deffd09 io_uring: don't pass state to io_submit_state_end
Submission state and ctx and coupled together, no need to passs

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e22d77a5786ef77e0c49b933ad74bae55cfb6ca6.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 1cce17aca6 io_uring: don't pass tail into io_free_batch_list
io_free_batch_list() iterates all requests in the passed in list,
so we don't really need to know the tail but can keep iterating until
meet NULL. Just pass the first node into it and it will be enough.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4a12c84b6d887d980e05f417ba4172d04c64acae.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov d4b7a5ef2b io_uring: inline completion batching helpers
We now have a single function for batched put of requests, just inline
struct req_batch and all related helpers into it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/595a2917f80dd94288cd7203052c7934f5446580.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov f5ed3bcd5b io_uring: optimise batch completion
First, convert rest of iopoll bits to single linked lists, and also
replace per-request list_add_tail() with splicing a part of slist.

With that, use io_free_batch_list() to put/free requests. The main
advantage of it is that it's now the only user of struct req_batch and
friends, and so they can be inlined. The main overhead there was
per-request call to not-inlined io_req_free_batch(), which is expensive
enough.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b37fc6d5954b241e025eead7ab92c6f44a42f229.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov b3fa03fd1b io_uring: convert iopoll_completed to store_release
Convert explicit barrier around iopoll_completed to smp_load_acquire()
and smp_store_release(). Similar on the callback side, but replaces a
single smp_rmb() with per-request smp_load_acquire(), neither imply any
extra CPU ordering for x86. Use READ_ONCE as usual where it doesn't
matter.

Use it to move filling CQEs by iopoll earlier, that will be necessary
to avoid traversing the list one extra time in the future.

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8bd663cb15efdc72d6247c38ee810964e744a450.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 3aa83bfb6e io_uring: add a helper for batch free
Add a helper io_free_batch_list(), which takes a single linked list and
puts/frees all requests from it in an efficient manner. Will be reused
later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4fc8306b542c6b1dd1d08e8021ef3bdb0ad15010.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 5eef4e87eb io_uring: use single linked list for iopoll
Use single linked lists for keeping iopoll requests, takes less space,
may be faster, but mostly will be of benefit for further patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/314033676b100cd485518c3bc55e1b95a0dcd71f.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov e3f721e6f6 io_uring: split iopoll loop
The main loop of io_do_iopoll() iterates and does ->iopoll() until it
meets a first completed request, then it continues from that position
and splices requests to pass them through io_iopoll_complete().

Split the loop in two for clearness, iopolling and reaping completed
requests from the list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7f6fd27a94845e5dc925a47a4a9765a92e514fb.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov c2b6c6bc4e io_uring: replace list with stack for req caches
Replace struct list_head free_list serving for caching requests with
singly linked stack, which is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bc942b82422fb2624b8353bd93aca183a022846.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 0d9521b9b5 io-wq: add io_wq_work_node based stack
Apart from just using lists (i.e. io_wq_work_list), we also want to have
stacks, which are a bit faster, and have some interoperability between
them. Add a stack implementation based on io_wq_work_node and some
helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5d3a412a5ac0d47e0f0499d70d2207d70a68925e.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 3ab665b74e io_uring: remove allocation cache array
We have several of request allocation layers, remove the last one, which
is the submit->reqs array, and always use submit->free_reqs instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8547095c35f7a87bab14f6447ecd30a273ed7500.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 6f33b0bc4e io_uring: use slist for completion batching
Currently we collect requests for completion batching in an array.
Replace them with a singly linked list. It's as fast as arrays but
doesn't take some much space in ctx, and will be used in future patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a666826f2854d17e9fb9417fb302edfeb750f425.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 5ba3c874eb io_uring: make io_do_iopoll return number of reqs
Don't pass nr_events pointer around but return directly, it's less
expensive than pointer increments.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f771a8153a86f16f12ff4272524e9e549c5de40b.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 87a115fb71 io_uring: force_nonspin
We don't really need to pass the number of requests to complete into
io_do_iopoll(), a flag whether to enforce non-spin mode is enough.

Should be straightforward, maybe except io_iopoll_check(). We pass !min
there, because we do never enter with the number of already reaped
requests is larger than the specified @min, apart from the first
iteration, where nr_events is 0 and so the final check should be
identical.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/782b39d1d8ec584eae15bca0a1feb6f0571fe5b8.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov 6878b40e7b io_uring: mark having different creds unlikely
Hint the compiler that it's not as likely to have creds different from
current attached to a request. The current code generation is far from
ideal, hopefully it can help to some compilers to remove duplicated jump
tables and so.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e7815251ac4bf5a4a23d298c752f029ae19f3837.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Hao Xu 8d4af6857c io_uring: return boolean value for io_alloc_async_data
boolean value is good enough for io_alloc_async_data.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101522.9179-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00