Some options only apply during mount time and are cleared at the end
of mount. For now, the example is USEBACKUPROOT, but CLEAR_CACHE also
fits the bill, and this is a preparation patch for also clearing that
option.
One subtlety is that the current code only resets USEBACKUPROOT on rw
mounts, but the option is meaningfully "consumed" by a ro mount, so it
feels appropriate to clear in that case as well. A subsequent read-write
remount would not go through open_ctree, which is the only place that
checks the option, so the change should be benign.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a user attempts to remount a btrfs filesystem with
'mount -o remount,space_cache=v2', that operation silently succeeds.
Unfortunately, this is misleading, because the remount does not create
the free space tree. /proc/mounts will incorrectly show space_cache=v2,
but on the next mount, the file system will revert to the old
space_cache.
For now, we handle only the easier case, where the existing mount is
read-only and the new mount is read-write. In that case, we can create
the free space tree without contending with the block groups changing
as we go.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we attempt to create a free space tree while any block groups have
needs_free_space set, we will double add the new free space item
and hit EEXIST. Previously, we only created the free space tree on a new
mount, so we never hit the case, but if we try to create it on a
remount, such block groups could exist and trip us up.
We don't do anything with this field unless the free space tree is
enabled, so there is no harm in not setting it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
When we mount a rw filesystem, we start the orphan cleanup process in
tree root and filesystem tree. However, when we remount a ro file system
rw, we only clean the former. Move the calls to btrfs_orphan_cleanup()
on tree_root and fs_root to the shared rw mount routine to effectively
add them on ro->rw remount.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mounting rw and remounting from ro to rw naturally share invariants and
functionality which result in a correctly setup rw filesystem. Luckily,
there is even a strong unity in the code which implements them. In
mount's open_ctree, these operations mostly happen after an early return
for ro file systems, and in remount, they happen in a section devoted to
remounting ro->rw, after some remount specific validation passes.
However, there are unfortunately a few differences. There are small
deviations in the order of some of the operations, remount does not
start orphan cleanup in root_tree or fs_tree, remount does not create
the free space tree, and remount does not handle "one-shot" mount
options like clear_cache and uuid tree rescan.
Since we want to add building the free space tree to remount, and also
to start the same orphan cleanup process on a filesystem mounted as ro
then remounted rw, we would benefit from unifying the logic between the
two code paths.
This patch only lifts the existing common functionality, and leaves a
natural path for fixing the discrepancies.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.
This means that after that and before writing the super blocks, new
transactions can start. However if any transaction wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).
Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:
1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers
They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.
So change start_log_trans() to use the root tree's log_mutex to serialize
for the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows for inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, as by the time when we try to sync the log the previous
transaction already wrote its super blocks.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
The following script that uses dbench was used to measure the impact of
the whole patchset:
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
The test was run on a machine with 12 cores, 64G of ram, using a NVMe
device and a non-debug kernel configuration (Debian's default).
Before patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11277211 0.250 85.340
Close 8283172 0.002 6.479
Rename 477515 1.935 86.026
Unlink 2277936 0.770 87.071
Deltree 256 15.732 81.379
Mkdir 128 0.003 0.009
Qpathinfo 10221180 0.056 44.404
Qfileinfo 1789967 0.002 4.066
Qfsinfo 1874399 0.003 9.176
Sfileinfo 918589 0.061 10.247
Find 3951758 0.341 54.040
WriteX 5616547 0.047 85.079
ReadX 17676028 0.005 9.704
LockX 36704 0.003 1.800
UnlockX 36704 0.002 0.687
Flush 790541 14.115 676.236
Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms
After patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 12687926 0.171 86.526
Close 9320780 0.002 8.063
Rename 537253 1.444 78.576
Unlink 2561827 0.559 87.228
Deltree 374 11.499 73.549
Mkdir 187 0.003 0.005
Qpathinfo 11500300 0.061 36.801
Qfileinfo 2017118 0.002 7.189
Qfsinfo 2108641 0.003 4.825
Sfileinfo 1033574 0.008 8.065
Find 4446553 0.408 47.835
WriteX 6335667 0.045 84.388
ReadX 19887312 0.003 9.215
LockX 41312 0.003 1.394
UnlockX 41312 0.002 1.425
Flush 889233 13.014 623.259
Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms
+12.7% throughput, -8.2% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode we may often have to fallback to a full transaction
commit, either because a new block group was allocated, there is some case
we can not deal with without a transaction commit or some error like an
ENOMEM happened. However after we fallback to a transaction commit, we
have a time window where we can make the next attempt to log any inode
commit the next transaction unnecessarily, adding additional overhead and
increasing latency.
A sequence of steps that leads to this issue is the following:
1) The current open transaction has a generation of 1000;
2) A new block group is allocated, and as a consequence we must make sure
any attempts to commit a log fallback to a transaction commit, so
btrfs_set_log_full_commit() is called from btrfs_make_block_group().
This sets fs_info->last_trans_log_full_commit to 1000;
3) Task A is holding a handle on transaction 1000 and tries to log inode X.
Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
which returns true, since fs_info->last_trans_log_full_commit has a
value of 1000. So we end up returning EAGAIN and propagating it up to
btrfs_sync_file(), where we commit transaction 1000;
4) The transaction commit task (task A) sets the transaction state to
unblocked (TRANS_STATE_UNBLOCKED);
5) Some other task, task B, starts a new transaction with a generation of
1001;
6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;
7) Transaction 1000 has not fully committed yet, we are still writing all
the extent buffers it created;
8) Some new task, task C, starts an fsync of inode Y, gets a handle for
transaction 1001, and it gets to btrfs_log_inode_parent() which does
the following check:
if (fs_info->last_trans_log_full_commit > last_committed) {
ret = 1;
goto end_no_trans;
}
At that point last_trans_log_full_commit has a value of 1000 and
last_committed (value of fs_info->last_trans_committed) has a value of
999, since transaction 1000 has not yet committed - it is either still
writing out dirty extent buffers, its super blocks or unpinning
extents.
As a consequence we return 1, which gets propagated up to
btrfs_sync_file(), which will then call btrfs_commit_transaction()
for transaction 1001.
As a consequence we have an unnecessary second transaction commit, we
previously committed transaction 1000 and now commit transaction 1001
as well, resulting in more overhead and increased latency.
So fix this double transaction commit issue simply by removing that check,
because all we need to do is wait for the previous transaction to finish
its commit, which we already do later when starting the log transaction at
start_log_trans(), because there we acquire the tree_log_mutex lock, which
is held by a transaction commit and only released after the transaction
commits its super blocks.
Another issue that check has is that it reads last_trans_log_full_commit
without using READ_ONCE(), which is incorrect since that member of
struct btrfs_fs_info is always updated with WRITE_ONCE() through the
helper btrfs_set_log_full_commit().
This double transaction commit issue can actually be triggered quite often
in long runs of dbench, since besides the creation of new block groups
that force inode logging to fallback to a transaction commit, there are
cases where dbench asks to fsync a directory which had files in it that
were previously renamed or subdirectories that were removed, resulting in
the inode logging to fallback to a full transaction commit.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode and the previous transaction is still committing, we
have a time window where we can end up incorrectly think an inode has its
last_unlink_trans field with a value greater than the last transaction
committed, which results in the logging to fallback to a full transaction
commit, which is usually much more expensive than doing a log commit.
The race is described by the following steps:
1) We are at transaction 1000;
2) We modify an inode X (a directory) using transaction 1000 and set its
last_unlink_trans field to 1000, because for example we removed one
of its subdirectories;
3) We create a new inode Y with a dentry in inode X using transaction 1000,
so its generation field is set to 1000;
4) The commit for transaction 1000 is started by task A;
5) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
6) Some task starts a new transaction with a generation of 1001;
7) We do some modification to inode Y (using transaction 1001);
8) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
9) Task B starts an fsync on inode Y, and gets a handle for transaction
1001. When it gets to check_parent_dirs_for_sync() it does the checking
of the ancestor dentries because the following check does not evaluate
to true:
if (S_ISREG(inode->vfs_inode.i_mode) &&
inode->generation <= last_committed &&
inode->last_unlink_trans <= last_committed)
goto out;
The generation value for inode Y is 1000 and last_committed, which has
the value read from fs_info->last_trans_committed, has a value of 999,
so that check evaluates to false and we proceed to check the ancestor
inodes.
Once we get to the first ancestor, inode X, we call
btrfs_must_commit_transaction() on it, which evaluates to true:
static bool btrfs_must_commit_transaction(...)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
bool ret = false;
mutex_lock(&inode->log_mutex);
if (inode->last_unlink_trans > fs_info->last_trans_committed) {
/*
* Make sure any commits to the log are forced to be full
* commits.
*/
btrfs_set_log_full_commit(trans);
ret = true;
}
(...)
because inode's X last_unlink_trans has a value of 1000 and
fs_info->last_trans_committed still has a value of 999, it returns
true to check_parent_dirs_for_sync(), making it return 1 which is
propagated up to btrfs_sync_file(), causing it to fallback to a full
transaction commit of transaction 1001.
We should have not fallen back to commit transaction 1001, since inode
X had last_unlink_trans set to 1000 and the super blocks for
transaction 1000 were already written. So while not resulting in a
functional problem, it leads to a lot more work and higher latencies
for a fsync since committing a transaction is usually more expensive
than committing a log (if other filesystem changes happened under that
transaction).
Similar problem happens when logging directories, for the same reason as
btrfs_must_commit_transaction() returns true on an inode with its
last_unlink_trans having the generation of the previous transaction and
that transaction is still committing, unpinning its freed extents.
So fix this by comparing last_unlink_trans with the id of the current
transaction instead of fs_info->last_trans_committed.
This case is often hit when running dbench for a long enough duration, as
it does lots of rename and rmdir operations (both update the field
last_unlink_trans of an inode) and fsyncs of files and directories.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode and we are checking if we need to log ancestors that
are new, if the previous transaction is still committing we have a time
window where we can unnecessarily log ancestor inodes that were created in
the previous transaction.
The race is described by the following steps:
1) We are at transaction 1000;
2) Directory inode X is created, its generation is set to 1000;
3) The commit for transaction 1000 is started by task A;
4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
5) Inode Y, a regular file, is created under directory inode X, this
results in starting a new transaction with a generation of 1001;
6) The transaction 1000 commit is unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
7) Task B calls fsync on inode Y and gets a handle for transaction 1001;
8) Task B ends up at log_all_new_ancestors() and then because inode Y has
only one hard link, ends up at log_new_ancestors_fast(). There it reads
a value of 999 from fs_info->last_trans_committed, and sees that the
parent inode X has a generation of 1000, so we end up logging inode X:
if (inode->generation > fs_info->last_trans_committed) {
ret = btrfs_log_inode(trans, root, inode,
LOG_INODE_EXISTS, ctx);
(...)
which is not necessary since it was created in the past transaction,
with a generation of 1000, and that transaction has already committed
its super blocks - it's still unpinning extents so it has not yet
updated fs_info->last_trans_committed from 999 to 1000.
So this just causes us to spend more time logging and allocating and
writing more tree blocks for the log tree.
So fix this by comparing an inode's generation with the generation of the
transaction our transaction handle refers to - if the inode's generation
matches the generation of the current transaction than we know it is a
new inode we need to log, otherwise don't log it.
This case is often hit when running dbench for a long enough duration.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging the extents of an inode during a fast fsync, we have a time
window where we can log extents that are from the previous transaction and
already persisted. This only makes us waste time unnecessarily.
The following sequence of steps shows how this can happen:
1) We are at transaction 1000;
2) An ordered extent E from inode I completes, that is it has gone through
btrfs_finish_ordered_io(), and it set the extent maps' generation to
1000 when we unpin the extent, which is the generation of the current
transaction;
3) The commit for transaction 1000 starts by task A;
4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
5) Some change is made to inode I, resulting in creation of a new
transaction with a generation of 1001;
6) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
7) Task B starts an fsync on inode I, and when it gets to
btrfs_log_changed_extents() sees the extent map for extent E in the
list of modified extents. It sees the extent map has a generation of
1000 and fs_info->last_trans_committed has a value of 999, so it
proceeds to logging the respective file extent item and all the
checksums covering its range.
So we end up wasting time since the extent was already persisted and
is reachable through the trees pointed to by the super block committed
by transaction 1000.
So just fix this by comparing the extent maps generation against the
generation of the transaction handle - if it is smaller then the id in the
handle, we know the extent was already persisted and we do not need to log
it.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are doing a rename or a link operation for an inode that was logged
in the previous transaction and that transaction is still committing, we
have a time window where we incorrectly consider that the inode was logged
previously in the current transaction and therefore decide to log it to
update it in the log. The following steps give an example on how this
happens during a link operation:
1) Inode X is logged in transaction 1000, so its logged_trans field is set
to 1000;
2) Task A starts to commit transaction 1000;
3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED;
4) Task B starts a link operation for inode X, and as a consequence it
starts transaction 1001;
5) Task A is still committing transaction 1000, therefore the value stored
at fs_info->last_trans_committed is still 999;
6) Task B calls btrfs_log_new_name(), it reads a value of 999 from
fs_info->last_trans_committed and because the logged_trans field of
inode X has a value of 1000, the function does not return immediately,
instead it proceeds to logging the inode, which should not happen
because the inode was logged in the previous transaction (1000) and
not in the current one (1001).
This is not a functional problem, just wasted time and space logging an
inode that does not need to be logged, contributing to higher latency
for link and rename operations.
So fix this by comparing the inodes' logged_trans field with the
generation of the current transaction instead of comparing with the value
stored in fs_info->last_trans_committed.
This case is often hit when running dbench for a long enough duration, as
it does lots of rename operations.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After removing the inode number cache that was using the free space
cache code, we can remove at least the recalc_thresholds callback from
the ops. Both code and tests use the same callback function. It's moved
before its first use.
The use_bitmaps callback is still needed by tests to create some
extents/bitmap setup.
Signed-off-by: David Sterba <dsterba@suse.com>
Since it's being used solely for the freespace cache unconditionally
set the flags required for it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Following removal of the ino cache io_ctl_init will be called only on
behalf of the freespace inode. In this case we always want to check
CRCs so conditional code that depended on io_ctl::check_crc can be
removed.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's been deprecated since commit b547a88ea5 ("btrfs: start
deprecation of mount option inode_cache") which enumerates the reasons.
A filesystem that uses the feature (mount -o inode_cache) tracks the
inode numbers in bitmaps, that data stay on the filesystem after this
patch. The size is roughly 5MiB for 1M inodes [1], which is considered
small enough to be left there. Removal of the change can be implemented
in btrfs-progs if needed.
[1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The former is going away as part of the inode map removal so switch
callers to btrfs_find_free_objectid. No functional changes since with
INODE_MAP disabled (default) find_free_objectid was called anyway.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those functions are going to be used even after inode cache is removed
so moved them to a more appropriate place.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 72deb455b5 ("block: remove CONFIG_LBDAF") (5.2) the
sector_t type is u64 on all arches and configs so we don't need to
typecast it. It used to be unsigned long and the result of sector size
shifts were not guaranteed to fit in the type.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Placing both data and metadata in a block group is impossible in ZONED
mode. For data, we can allocate a space for it and write it immediately
after the allocation. For metadata, however, we cannot do that, because
the logical addresses are recorded in other metadata buffers to build up
the trees. As a result, a data buffer can be placed after a metadata
buffer, which is not written yet. Writing out the data buffer will break
the sequential write rule.
Check and disallow MIXED_BG with ZONED mode.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fallocate() is implemented by reserving actual extent instead of
reservations. This can result in exposing the sequential write
constraint of host-managed zoned block devices to the application, which
would break the POSIX semantic for the fallocated file. To avoid this,
report fallocate() as not supported when in ZONED mode for now.
In the future, we may be able to implement "in-memory" fallocate() in
ZONED mode by utilizing space_info->bytes_may_use or similar, so this
returns EOPNOTSUPP.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
NODATACOW implies overwriting the file data on a device, which is
impossible in sequential required zones. Disable NODATACOW globally with
mount option and per-file NODATACOW attribute by masking FS_NOCOW_FL.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As updates to the space cache v1 are in-place, the space cache cannot be
located over sequential zones and there is no guarantees that the device
will have enough conventional zones to store this cache. Resolve this
problem by disabling completely the space cache v1. This does not
introduce any problems with sequential block groups: all the free space
is located after the allocation pointer and no free space before the
pointer. There is no need to have such cache.
Note: we can technically use free-space-tree (space cache v2) on ZONED
mode. But, since ZONED mode now always allocates extents in a block
group sequentially regardless of underlying device zone type, it's no
use to enable and maintain the tree.
For the same reason, NODATACOW is also disabled.
In summary, ZONED will disable:
| Disabled features | Reason |
|-------------------+-----------------------------------------------------|
| RAID/DUP | Cannot handle two zone append writes to different |
| | zones |
|-------------------+-----------------------------------------------------|
| space_cache (v1) | In-place updating |
| NODATACOW | In-place updating |
|-------------------+-----------------------------------------------------|
| fallocate | Reserved extent will be a write hole |
|-------------------+-----------------------------------------------------|
| MIXED_BG | Allocated metadata region will be write holes for |
| | data writes |
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The zone append write command has a maximum IO size restriction it
accepts. This is because a zone append write command cannot be split, as
we ask the device to place the data into a specific target zone and the
device responds with the actual written location of the data.
Introduce max_zone_append_size to zone_info and fs_info to track the
value, so we can limit all I/O to a zoned block device that we want to
write using the zone append command to the device's limits.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
enabled on the file system and if the file system consists of zoned
devices with equal zone size.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a zoned block device is found, get its zone information (number of
zones and zone size). To avoid costly run-time zone report
commands to test the device zones type during block allocation, attach
the seq_zones bitmap to the device structure to indicate if a zone is
sequential or accept random writes. Also it attaches the empty_zones
bitmap to indicate if a zone is empty or not.
This patch also introduces the helper function btrfs_dev_is_sequential()
to test if the zone storing a block is a sequential write required zone
and btrfs_dev_is_empty_zone() to test if the zone is a empty zone.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces the ZONED incompat flag. The flag indicates that
the volume management will satisfy the constraints imposed by
host-managed zoned block devices (aligned chunk allocation, append-only
updates, reset zone after filled).
As the zoned support will happen incrementally due to enhancing some
core infrastructure like super block writes, tree-log, raid support, the
feature will appear in sysfs only on debug builds. It will be enabled
once the support is feature complete and applications can reliably check
whether zoned support is present or not.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It simply gets assigned to 'ret' in case of errors. The flow of the
while loop is not changed by this commit since the few call sites
that 'goto next' will simply break from the loop.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In most cases when an error is returned from a function 'ret' is simply
assigned to 'err'. There is only one case where walk_up_reloc_tree can
return a positive value - in this case the code breaks from the loop and
ret is going to get its return value from btrfs_cow_block - either 0 or
negative. This retains the old logic of how 'err' used to be set at
this call site.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use only a single 'ret' to control whether we should abort the
transaction or not. That's fine, because if we abort a transaction then
btrfs_end_transaction will return the same value as passed to
btrfs_abort_transaction. No semantic changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are attempting to start writeback for an existing extent in NOCOW
mode, at run_delalloc_nocow(), we must check if the extent is shared, and
if it is, fallback to a COW write. However we do such check while still
holding a read lock on the leaf that contains the file extent item, and
that check, the call to btrfs_cross_ref_exist(), can take some time
because:
1) It needs to do a search on the extent tree, which obviously takes some
time, specially if delayed references are being run at the moment, as
we can block when trying to lock currently write locked btree nodes;
2) It needs to check the delayed references for any existing reference
for our data extent, this requires acquiring the delayed references'
spinlock and maybe block on the mutex of a delayed reference head in the
case where there is a delayed reference for our data extent, in the
worst case it makes us release the path on the extent tree and retry
the whole process again (going back to step 1).
There are other operations we do while holding the leaf locked that can
take some significant time as well (specially all together):
* btrfs_extent_readonly() - to check if the block group containing the
extent is currently in RO mode. This requires taking a spinlock and
searching for the block group in a rbtree that can be big on large
filesystems;
* csum_exist_in_range() - to search if there are any checksums in the
csum tree for the extent. Like before, this can take some time if we are
in a filesystem that has both COW and NOCOW files, in which case the
csum tree is not empty;
* btrfs_inc_nocow_writers() - increment the number of nocow writers in the
block group that contains the data extent. Needs to acquire a spinlock
and search for the block group in a rbtree that can be big on large
filesystems.
So just unlock the leaf (release the path) before doing all those checks,
since we do not need it anymore. In case we can not do a NOCOW write for
the extent, due to any of those checks failing, and the writeback range
goes beyond that extents' length, we will do another btree search for the
next file extent item.
The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian):
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-m single -d single"
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
Before this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9326331 0.317 399.957
Close 6851198 0.002 6.402
Rename 394894 2.621 402.819
Unlink 1883131 0.931 398.082
Deltree 256 19.160 303.580
Mkdir 128 0.003 0.016
Qpathinfo 8452314 0.068 116.133
Qfileinfo 1481921 0.001 5.081
Qfsinfo 1549963 0.002 4.444
Sfileinfo 759679 0.084 17.079
Find 3268168 0.396 118.196
WriteX 4653310 0.056 110.993
ReadX 14618818 0.005 23.314
LockX 30364 0.003 0.497
UnlockX 30364 0.002 1.720
Flush 653619 16.954 569.299
Throughput 966.651 MB/sec 64 clients 64 procs max_latency=569.377 ms
After this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9710433 0.302 232.449
Close 7132948 0.002 11.496
Rename 411144 2.452 131.805
Unlink 1960961 0.893 230.383
Deltree 256 14.858 198.646
Mkdir 128 0.002 0.005
Qpathinfo 8800890 0.066 111.588
Qfileinfo 1542556 0.001 3.852
Qfsinfo 1613835 0.002 5.483
Sfileinfo 790871 0.081 19.492
Find 3402743 0.386 120.185
WriteX 4842918 0.054 179.312
ReadX 15220407 0.005 32.435
LockX 31612 0.003 1.533
UnlockX 31612 0.002 1.047
Flush 680567 16.320 463.323
Throughput 1016.59 MB/sec 64 clients 64 procs max_latency=463.327 ms
+5.0% throughput, -20.5% max latency
Also, the following test using fio was run:
$ cat test-fio.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 4 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo
echo "Using fio config:"
echo
cat /tmp/fio-job.ini
echo
echo "mount options: $MOUNT_OPTIONS"
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo "Creating nodatacow files before fio runs..."
for ((i = 0; i < $NUM_JOBS; i++)); do
xfs_io -f -c "pwrite -b 128M 0 $FILE_SIZE" "$MNT/writers.$i.0"
done
sync
fio /tmp/fio-job.ini
umount $MNT
Before this change:
$ ./test-fio.sh 16 512M 2 4K
(...)
WRITE: bw=28.3MiB/s (29.6MB/s), 28.3MiB/s-28.3MiB/s (29.6MB/s-29.6MB/s), io=8192MiB (8590MB), run=289800-289800msec
After this change:
$ ./test-fio.sh 16 512M 2 4K
(...)
WRITE: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=8192MiB (8590MB), run=262845-262845msec
+9.7% throughput, -9.8% runtime
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree checker is called many times as it verifies metadata at
read/write time. The checks follow a simple pattern:
if (error_condition) {
report_error();
return -EUCLEAN;
}
All the error reporting functions are annotated as __cold that is
supposed to hint the compiler to move the statement block out of the hot
path. This does not seem to happen that often.
As the error condition is expected to be false almost always, we can
annotate it with 'unlikely' as this satisfies one of the few use cases
for the annotation. The expected outcome is a stronger hint to compiler
to reorder the checks
test
jump to exit
test
jump to exit
...
which can be observed in asm of eg. check_dir_item,
btrfs_check_chunk_valid, check_root_item or check_leaf.
There's a measurable run time improvement reported by Josef, the testing
workload went from 655 MiB/s to 677 MiB/s, which is about +3%.
There should be no functional changes but some of the conditions have
been rewritten to produce more readable result, some lines are longer
than 80, for the sake of readability.
Signed-off-by: David Sterba <dsterba@suse.com>
Without a NULL fs_info the helpers will print something like
BTRFS error (device <unknown>): ...
This can happen in contexts where fs_info is not available at all or
it's potentially unsafe due to object lifetime. The <unknown> stub does
not bring much information and with the prefix makes the message
unnecessarily longer.
Remove it for the NULL fs_info case.
BTRFS error: ...
Callers can add the device information to the message itself if needed.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In alloc_extent_buffer(), after we got a page from btree inode, we check
if that page has private pointer attached.
If attached, we check if the existing extent buffer has proper refs.
If not (the eb is being freed), we will detach that private eb pointer.
The point here is, we are detaching that eb pointer by calling:
- ClearPagePrivate()
- put_page()
The put_page() here is especially confusing, as it's decreasing the ref
from attach_page_private(). Without knowing that, it looks like the
put_page() is for the find_or_create_page() call, confusing the reader.
Since we're always modifying page private with attach_page_private() and
detach_page_private(), the only open-coded detach_page_private() here is
really confusing.
Fix it by calling detach_page_private().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
start readahead in the csum tree.
However the threshold is an immediate number, (PAGE_SIZE * 8), from the
initial btrfs merge.
The meaning of the value is pretty hard to guess, especially when the
immediate number is from the times when 4K sectorsize was the default
and only CRC32C was supported.
For the most common btrfs setup, CRC32 csum and 4K sectorsize,
it means just 32K read would kick readahead, while the csum itself is
only 32 bytes in size.
Now let's be more reasonable by taking both csum size and node size into
consideration.
If the csum size for the bio is larger than one leaf, then we kick the
readahead. This means for current default btrfs, the threshold will be
16M.
This change should not change performance observably, thus this is
mostly a readability enhancement.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_invalidatepage() will try to clear all possible bits since it's
calling clear_extent_bit() with delete == 1.
This is currently fine, since for btree io tree, it only utilizes
EXTENT_LOCK bit. But this could be a problem for later subpage support,
which will utilize extra io tree bit to represent additional info.
This patch will just convert that clear_extent_bit() to
unlock_extent_cached().
For current code since only EXTENT_LOCKED bit is utilized, this doesn't
change the behavior, but provides a much cleaner basis for incoming
subpage support.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector.
@phy_offset is mostly for data io to lookup the csum in btrfs_io_bio.
But for metadata, it's completely useless as metadata stores their own
csum in its header, so we can remove it.
Note: parameters @start and @end, they are not utilized at all for
current sectorsize == PAGE_SIZE case, as we can grab eb directly from
page.
But those two parameters are very important for later subpage support,
thus @start/@len are not touched here.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That anonymous structure serve no special purpose, just replace it with
regular members.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the type is unsigned int which could change its width
depending on the architecture. We need up to 32 bits so make it
explicit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new helper to handle update page status in
end_bio_extent_readpage(). This will be later used for subpage support
where the page status update can be more complex than now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In end_bio_extent_readpage() we had a strange dance around
extent_start/extent_len.
Hidden behind the strange dance is, it's just calling
endio_readpage_release_extent() on each bvec range.
Here is an example to explain the original work flow:
Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
end_bio_extent_extent_readpage() entered
|- extent_start = 0;
|- extent_end = 0;
|- bio_for_each_segment_all() {
| |- /* Got the 1st bvec */
| |- start = SZ_1M;
| |- end = SZ_1M + SZ_4K - 1;
| |- update = 1;
| |- if (extent_len == 0) {
| | |- extent_start = start; /* SZ_1M */
| | |- extent_len = end + 1 - start; /* SZ_1M */
| | }
| |
| |- /* Got the 2nd bvec */
| |- start = SZ_1M + 4K;
| |- end = SZ_1M + 4K - 1;
| |- update = 1;
| |- if (extent_start + extent_len == start) {
| | |- extent_len += end + 1 - start; /* SZ_8K */
| | }
| } /* All bio vec iterated */
|
|- if (extent_len) {
|- endio_readpage_release_extent(tree, extent_start, extent_len,
update);
/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
As the above flow shows, the existing code in end_bio_extent_readpage()
is accumulates extent_start/extent_len, and when the contiguous range
stops, calls endio_readpage_release_extent() for the range.
However current behavior has something not really considered:
- The inode can change
For bio, its pages don't need to have contiguous page_offset.
This means, even pages from different inodes can be packed into one
bio.
- bvec cross page boundary
There is a feature called multi-page bvec, where bvec->bv_len can go
beyond bvec->bv_page boundary.
- Poor readability
This patch will address the problem:
- Introduce a proper structure, processed_extent, to record processed
extent range
- Integrate inode/start/end/uptodate check into
endio_readpage_release_extent()
- Add more comment on each step.
This should greatly improve the readability, now in
end_bio_extent_readpage() there are only two
endio_readpage_release_extent() calls.
- Add inode check for contiguity
Now we also ensure the inode is the same one before checking if the
range is contiguous.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In extent-io-test, there are two invalid tests:
- Invalid nodesize for test_eb_bitmaps()
Instead of the sectorsize and nodesize combination passed in, we're
always using hand-crafted nodesize, e.g:
len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE)
? sectorsize * 4 : sectorsize;
In above case, if we have 32K page size, then we will get a length of
128K, which is beyond max node size, and obviously invalid.
The common page size goes up to 64K so we haven't hit that
- Invalid extent buffer bytenr
For 64K page size, the only combination we're going to test is
sectorsize = nodesize = 64K.
However, in that case we will try to test an eb which bytenr is not
sectorsize aligned:
/* Do it over again with an extent buffer which isn't page-aligned. */
eb = __alloc_dummy_extent_buffer(fs_info, nodesize / 2, len);
Sector alignment is a hard requirement for any sector size.
The only exception is superblock. But anything else should follow
sector size alignment.
This is definitely an invalid test case.
This patch will fix both problems by:
- Honor the sectorsize/nodesize combination
Now we won't bother to hand-craft the length and use it as nodesize.
- Use sectorsize as the 2nd run extent buffer start
This would test the case where extent buffer is aligned to sectorsize
but not always aligned to nodesize.
Please note that, later subpage related cleanup will reduce
extent_buffer::pages[] to exactly what we need, making the sector
unaligned extent buffer operations cause problems.
Since only extent_io self tests utilize this, this patch is required for
all later cleanup/refactoring.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A semicolon is not needed after a switch statement.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is needlessly convoluted. Fix that by:
* removing redundant sret variable definition in both if arms
* replace the again/done labels with direct return statements, the
function is short enough and doesn't do anything special upon exit
* remove BUG_ON on split_node returning a positive number - it can't
happen as split_node returns either 0 or a negative error code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the point when we set 'ret = 0' it's guaranteed that the function is
going to return 0 so directly return 0. No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At inode.c:cow_file_range_inline(), after we insert the inline extent
in the fs/subvolume btree, we call btrfs_drop_extent_cache() to drop
all extent maps in the file range, however that is not necessary because
we have already done it in the call to btrfs_drop_extents(), which calls
btrfs_drop_extent_cache() for us, and since at this point we have the file
range locked in the inode's iotree (we are in the writeback path), we know
no other task can come in and read stale file extent items or find none
and therefore create either stale extent maps or an extent map that
represents a hole.
So just remove that unnecessary call to btrfs_drop_extent_cache(), as it's
doing nothing and only wasting time. This call has been around since 2008,
introduced in commit c8b978188c ("Btrfs: Add zlib compression support"),
but even back then it seems it was not necessary, since we had the range
locked in the inode's iotree and the call to btrfs_drop_extents() already
used to always call btrfs_drop_extent_cache().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When joining a log transaction we acquire the root's log mutex, then
increment the root's log batch and log writers counters while holding
the mutex. However we don't need to increment the log batch there,
because we are holding the mutex and incremented the log writers counter
as well, so any other task trying to sync log will wait for the current
task to finish its logging and still achieve the desired log batching.
Since the log batch counter is an atomic counter and is incremented twice
at the very beginning of the fsync callback (btrfs_sync_file()), once
before flushing delalloc and once again after waiting for writeback to
complete, eliminating its increment when joining the log transaction
may provide some performance gains in case we have multiple concurrent
tasks doing fsyncs against different files in the same subvolume, as it
reduces contention on the atomic (locking the cacheline and bouncing it).
When testing fio with 32 jobs, on a 8 cores VM, doing fsyncs against
different files of the same subvolume, on top of a zram device, I could
consistently see gains (higher throughput) between 1% to 2%, which is a
very low value and possibly hard to be observed with a real device (I
couldn't observe consistent gains with my low/mid end NVMe device).
So this change is mostly motivated to just simplify the logic, as updating
the log batch counter is only relevant when an fsync starts and while not
holding the root's log mutex.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Every time we log an inode we lookup in the fs/subvol tree for xattrs and
if we have any, log them into the log tree. However it is very common to
have inodes without any xattrs, so doing the search wastes times, but more
importantly it adds contention on the fs/subvol tree locks, either making
the logging code block and wait for tree locks or making the logging code
making other concurrent operations block and wait.
The most typical use cases where xattrs are used are when capabilities or
ACLs are defined for an inode, or when SELinux is enabled.
This change makes the logging code detect when an inode does not have
xattrs and skip the xattrs search the next time the inode is logged,
unless the inode is evicted and loaded again or a xattr is added to the
inode. Therefore skipping the search for xattrs on inodes that don't ever
have xattrs and are fsynced with some frequency.
The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian distributions):
$ cat test.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 200 40
umount $MNT
The results before this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5761605 0.172 312.057
Close 4232452 0.002 10.927
Rename 243937 1.406 277.344
Unlink 1163456 0.631 298.402
Deltree 160 11.581 221.107
Mkdir 80 0.003 0.005
Qpathinfo 5221410 0.065 122.309
Qfileinfo 915432 0.001 3.333
Qfsinfo 957555 0.003 3.992
Sfileinfo 469244 0.023 20.494
Find 2018865 0.448 123.659
WriteX 2874851 0.049 118.529
ReadX 9030579 0.004 21.654
LockX 18754 0.003 4.423
UnlockX 18754 0.002 0.331
Flush 403792 10.944 359.494
Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms
The results after this change:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 6442521 0.159 230.693
Close 4732357 0.002 10.972
Rename 272809 1.293 227.398
Unlink 1301059 0.563 218.500
Deltree 160 7.796 54.887
Mkdir 80 0.008 0.478
Qpathinfo 5839452 0.047 124.330
Qfileinfo 1023199 0.001 4.996
Qfsinfo 1070760 0.003 5.709
Sfileinfo 524790 0.033 21.765
Find 2257658 0.314 125.611
WriteX 3211520 0.040 232.135
ReadX 10098969 0.004 25.340
LockX 20974 0.003 1.569
UnlockX 20974 0.002 3.475
Flush 451553 10.287 331.037
Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms
+10.8% throughput, -8.2% max latency
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>