WSL2-Linux-Kernel/fs/btrfs
Filipe Manana 40e046acbd Btrfs: fix missing data checksums after replaying a log tree
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.

Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:

   [ bytenr 13893632, offset 64Kb, len 64Kb  ]
   0                                         64Kb

   [ bytenr 13631488, offset 64Kb, len 192Kb ]
   64Kb                                      256Kb

   [ bytenr 13893632, offset 0, len 256Kb    ]
   256Kb                                     512Kb

When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
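
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the range computation, not code from tree-log.c; the helper name csum_range is made up for illustration.

```python
KIB = 1024

def csum_range(bytenr, extent_offset, length):
    """Return the [start, end) disk byte range referenced by a file extent item."""
    start = bytenr + extent_offset
    return start, start + length

# File extent item at file offset 0: bytenr 13893632, offset 64Kb, len 64Kb.
first = csum_range(13893632, 64 * KIB, 64 * KIB)
print(first)  # (13959168, 14024704)
```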

Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:

   (...)
   item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
           range start 13631488 end 14155776 length 524288
   item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
           range start 13959168 end 14024704 length 65536

The range of the first item fully contains the range of the second one,
so the two items overlap.
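
As a rough check of the dump above, the sketch below derives each item's range from its key offset and item size and tests for overlap. It assumes 4-byte (crc32c) checksums and a 4Kb sector size, which is consistent with the sizes shown but is not stated in the dump; the helper names are hypothetical.

```python
SECTORSIZE = 4096  # assumed sector size
CSUM_SIZE = 4      # assumed checksum size (crc32c)

def item_range(key_offset, itemsize):
    """Return the [start, end) byte range covered by a checksum item."""
    length = itemsize // CSUM_SIZE * SECTORSIZE
    return key_offset, key_offset + length

def overlaps(a, b):
    """True if the two half-open ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]

item6 = item_range(13631488, 512)
item7 = item_range(13959168, 64)

print(item6)                   # (13631488, 14155776)
print(item7)                   # (13959168, 14024704)
print(overlaps(item6, item7))  # True
```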

So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for a checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.

However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.

After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:

   [ bytenr 13893632, offset 64Kb, len 64Kb  ]
   0                                         64Kb

   [ bytenr 13631488, offset 64Kb, len 192Kb ]
   64Kb                                      256Kb

   [ bytenr 14155776, offset 0, len 64Kb     ]
   256Kb                                     320Kb

   [ bytenr 13893632, offset 64Kb, len 192Kb ]
   320Kb                                     512Kb
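
The new layout follows from splitting the extent item that the write lands on. The sketch below models that split in Python; it is an illustration only, the function name overwrite and the tuple layout are assumptions, and the new extent's bytenr (14155776) is simply taken from the layout above.

```python
KIB = 1024

def overwrite(extents, file_off, length, new_bytenr):
    """Model a write of new data over [file_off, file_off + length).

    extents: list of (file_offset, bytenr, extent_offset, length) tuples.
    """
    end = file_off + length
    out = []
    for f, b, o, l in extents:
        fe = f + l
        if fe <= file_off or f >= end:
            out.append((f, b, o, l))  # untouched by the write
            continue
        if f < file_off:
            out.append((f, b, o, file_off - f))  # head survives
        if fe > end:
            # Tail survives; its offset into the extent moves forward.
            out.append((end, b, o + (end - f), fe - end))
    out.append((file_off, new_bytenr, 0, length))
    return sorted(out)

before = [
    (0, 13893632, 64 * KIB, 64 * KIB),
    (64 * KIB, 13631488, 64 * KIB, 192 * KIB),
    (256 * KIB, 13893632, 0, 256 * KIB),
]
after = overwrite(before, 256 * KIB, 64 * KIB, 14155776)
for f, b, o, l in after:
    print(f // KIB, b, o // KIB, l // KIB)
```

The last tuple printed is the extent item at file offset 320Kb, which still references bytenr 13893632 but now at offset 64Kb with a length of 192Kb.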

After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:

1) When replaying the file extent item for file offset 320Kb, we
   look up the checksums for the extent range from 13959168
   (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
   to btrfs_lookup_csums_range();

2) btrfs_lookup_csums_range() finds the checksum item that starts
   precisely at offset 13959168 (item 7 in the log tree, shown before);

3) However that checksum item only covers 64Kb of data, and not 192Kb
   of data;

4) As a result, only the checksums for the first 64Kb of data referenced
   by the file extent item are found and copied to the fs/subvolume tree.
   The checksums for the remaining 128Kb of data, file range 384Kb to
   512Kb, are never found and never copied to the fs/subvolume tree.

5) After replaying the log, userspace is not able to read the file
   range from 384Kb to 512Kb because the checksums are missing, and
   reads fail with an -EIO error.
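
The steps above can be modelled with a simplified lookup over the two log tree items. This is only a Python sketch of the search behaviour described here, not the kernel's btrfs_lookup_csums_range(); the function name lookup_csums and the list-of-ranges representation are assumptions made for illustration.

```python
import bisect

KIB = 1024

def lookup_csums(items, start, end):
    """Collect the parts of [start, end) covered by checksum items.

    items: (start, end) ranges sorted by start, like items sorted by key.
    """
    keys = [s for s, _ in items]
    i = bisect.bisect_left(keys, start)
    exact = i < len(items) and keys[i] == start
    # Without an exact key match, fall back to the previous item if it
    # covers `start`.
    if not exact and i > 0 and items[i - 1][1] > start:
        i -= 1
    found = []
    while i < len(items) and items[i][0] < end:
        s, e = items[i]
        if e > start:
            hi = min(e, end)
            found.append((max(s, start), hi))
            start = hi
        i += 1
    return found

log_items = [(13631488, 14155776), (13959168, 14024704)]  # items 6 and 7

# Before the 64Kb write: the whole extent at bytenr 13893632 is looked up.
before = lookup_csums(log_items, 13893632, 13893632 + 256 * KIB)
# After the write: the lookup starts exactly at item 7's key offset.
after = lookup_csums(log_items, 13959168, 13959168 + 192 * KIB)

covered = sum(e - s for s, e in after)
print(before)                          # [(13893632, 14155776)]
print(after)                           # [(13959168, 14024704)]
print((192 * KIB - covered) // KIB)    # 128
```

In this model the exact key match on item 7 hides the larger item 6, which leaves 128Kb of the extent without checksums.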

The following steps reproduce this scenario:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar
  $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar

  $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar

  $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar

  <power failure>

  $ mount /dev/sdc /mnt/sdc
  $ md5sum /mnt/sdc/foobar
  md5sum: /mnt/sdc/foobar: Input/output error

  $ dmesg | tail
  [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
  [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
  [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
  [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
  [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
  [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
  [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
  [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
  [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
  [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
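
As a quick sanity check, the file offsets reported by dmesg above can be compared against the file range that lost its checksums (384Kb to 512Kb). The snippet below is just that arithmetic in Python; the offsets are copied from the log lines, and the 4Kb alignment check assumes a 4Kb sector size.

```python
KIB = 1024

missing = range(384 * KIB, 512 * KIB)  # file range without checksums
reported = [401408, 405504, 409600, 413696, 417792,
            421888, 425984, 430080, 393216]

in_range = all(off in missing for off in reported)
aligned = all(off % 4096 == 0 for off in reported)
print(in_range, aligned)  # True True
```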

Fix this by first deleting, from the log tree, any checksums for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
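
The effect of deleting before inserting can be illustrated with a small model of the log tree's checksum ranges. This is a Python sketch under simplifying assumptions, not the actual change: ranges are plain (start, end) tuples, adjacent ranges are always merged (real items have a maximum size), and the helper names delete_range and log_csums are hypothetical.

```python
KIB = 1024

def delete_range(items, start, end):
    """Remove [start, end) from the ranges, trimming partial overlaps."""
    kept = []
    for s, e in items:
        if e <= start or s >= end:
            kept.append((s, e))
            continue
        if s < start:
            kept.append((s, start))
        if e > end:
            kept.append((end, e))
    return kept

def log_csums(items, start, end):
    """Delete logged checksums for [start, end), then insert the range."""
    items = sorted(delete_range(items, start, end) + [(start, end)])
    merged = [items[0]]
    for s, e in items[1:]:
        if s == merged[-1][1]:
            merged[-1] = (merged[-1][0], e)  # extend the adjacent range
        else:
            merged.append((s, e))
    return merged

log = [(13631488, 13893632)]  # checksums logged by the first fsync
for bytenr, off, length in [
    (13893632, 64 * KIB, 64 * KIB),   # extent item at file offset 0
    (13631488, 64 * KIB, 192 * KIB),  # extent item at file offset 64Kb
    (13893632, 0, 256 * KIB),         # extent item at file offset 256Kb
]:
    log = log_csums(log, bytenr + off, bytenr + off + length)

print(log)  # [(13631488, 14155776)]
```

With the delete step in place, the model ends up with a single range and no separate item keyed at 13959168, so a later lookup starting at that offset cannot stop early on a short item.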

This is a long-standing issue that has been present since we have had the
clone (and deduplication) ioctl, and can happen both when an extent is
shared between different files and within the same file.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:09:24 +01:00
tests btrfs: return error pointer from alloc_test_extent_buffer 2019-12-13 14:09:24 +01:00
Kconfig btrfs: add sha256 to checksumming algorithm 2019-11-18 17:51:43 +01:00
Makefile btrfs: migrate the block group lookup code 2019-09-09 14:59:04 +02:00
acl.c
async-thread.c btrfs: add __pure attribute to functions 2019-11-18 12:46:52 +01:00
async-thread.h btrfs: add __pure attribute to functions 2019-11-18 12:46:52 +01:00
backref.c Btrfs: fix deadlock between fiemap and transaction commits 2019-07-30 18:25:12 +02:00
backref.h btrfs: fiemap: preallocate ulists for btrfs_check_shared 2019-07-01 13:34:53 +02:00
block-group.c btrfs: scrub: Don't check free space before marking a block group RO 2019-11-18 18:07:55 +01:00
block-group.h btrfs: scrub: Don't check free space before marking a block group RO 2019-11-18 18:07:55 +01:00
block-rsv.c btrfs: use btrfs_try_granting_tickets in update_global_rsv 2019-09-09 14:59:19 +02:00
block-rsv.h btrfs: migrate the global_block_rsv helpers to block-rsv.c 2019-07-02 12:30:55 +02:00
btrfs_inode.h Btrfs: remove unnecessary delalloc mutex for inodes 2019-11-18 17:51:46 +01:00
check-integrity.c btrfs: reduce stack usage for btrfsic_process_written_block 2019-09-09 14:58:58 +02:00
check-integrity.h
compression.c btrfs: drop bio_set_dev where not needed 2019-11-18 23:39:30 +01:00
compression.h btrfs: compression: remove ops pointer from workspace_manager 2019-11-18 12:46:59 +01:00
ctree.c btrfs: add blake2b to checksumming algorithms 2019-11-18 17:51:44 +01:00
ctree.h Btrfs: fix missing data checksums after replaying a log tree 2019-12-13 14:09:24 +01:00
delalloc-space.c Btrfs: remove unnecessary delalloc mutex for inodes 2019-11-18 17:51:46 +01:00
delalloc-space.h btrfs: migrate the delalloc space stuff to it's own home 2019-07-04 17:26:17 +02:00
delayed-inode.c btrfs: use refcount_inc_not_zero in kill_all_nodes 2019-11-18 12:46:51 +01:00
delayed-inode.h
delayed-ref.c btrfs: rename btrfs_space_info_add_old_bytes 2019-09-09 14:59:18 +02:00
delayed-ref.h btrfs: migrate the delayed refs rsv code 2019-07-04 17:26:17 +02:00
dev-replace.c btrfs: add __pure attribute to functions 2019-11-18 12:46:52 +01:00
dev-replace.h btrfs: add __pure attribute to functions 2019-11-18 12:46:52 +01:00
dir-item.c
disk-io.c btrfs: remove extent_map::bdev 2019-11-18 23:43:44 +01:00
disk-io.h btrfs: add __cold attribute to more functions 2019-11-18 12:46:52 +01:00
export.c btrfs: drop unused parameter is_new from btrfs_iget 2019-11-18 12:46:52 +01:00
export.h
extent-io-tree.h btrfs: move the failrec tree stuff into extent-io-tree.h 2019-11-18 12:46:47 +01:00
extent-tree.c Btrfs: fix missing data checksums after replaying a log tree 2019-12-13 14:09:24 +01:00
extent_io.c btrfs: return error pointer from alloc_test_extent_buffer 2019-12-13 14:09:24 +01:00
extent_io.h btrfs: opencode extent_buffer_get 2019-11-18 12:46:54 +01:00
extent_map.c btrfs: remove extent_map::bdev 2019-11-18 23:43:44 +01:00
extent_map.h btrfs: remove extent_map::bdev 2019-11-18 23:43:44 +01:00
file-item.c Btrfs: fix missing data checksums after replaying a log tree 2019-12-13 14:09:24 +01:00
file.c Btrfs: fix cloning range with a hole when using the NO_HOLES feature 2019-12-13 13:29:22 +01:00
free-space-cache.c btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
free-space-cache.h btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
free-space-tree.c btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
free-space-tree.h btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
inode-item.c btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extref 2019-09-09 14:59:16 +02:00
inode-map.c btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() 2019-10-15 18:50:07 +02:00
inode-map.h
inode.c btrfs: don't double lock the subvol_sem for rename exchange 2019-12-13 14:09:23 +01:00
ioctl.c btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
locking.c btrfs: document extent buffer locking 2019-11-18 17:51:50 +01:00
locking.h btrfs: move btrfs_unlock_up_safe to other locking functions 2019-11-18 12:46:49 +01:00
lzo.c btrfs: compression: inline free_workspace 2019-11-18 12:46:59 +01:00
misc.h btrfs: add 64bit safe helper for power of two checks 2019-11-18 12:46:50 +01:00
ordered-data.c Btrfs: fix block group remaining RO forever after error during device replace 2019-11-18 18:07:55 +01:00
ordered-data.h Btrfs: fix block group remaining RO forever after error during device replace 2019-11-18 18:07:55 +01:00
orphan.c
print-tree.c btrfs: rename extent buffer block group item accessors 2019-11-18 17:51:45 +01:00
print-tree.h
props.c btrfs: props: remove unnecessary hash_init() 2019-11-18 12:46:55 +01:00
props.h
qgroup.c btrfs: Fix error messages in qgroup_rescan_init 2019-12-13 13:29:12 +01:00
qgroup.h btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
raid56.c btrfs: remove pointless local variable in lock_stripe_add() 2019-11-18 12:47:00 +01:00
raid56.h btrfs: constify map parameter for nr_parity_stripes and nr_data_stripes 2019-07-01 13:34:58 +02:00
rcu-string.h
reada.c btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
ref-verify.c btrfs: fix uninitialized ret in ref-verify 2019-10-03 15:00:56 +02:00
ref-verify.h
relocation.c btrfs: remove extent_map::bdev 2019-11-18 23:43:44 +01:00
root-tree.c btrfs: rename the btrfs_calc_*_metadata_size helpers 2019-09-09 14:59:13 +02:00
scrub.c Btrfs: fix block group remaining RO forever after error during device replace 2019-11-18 18:07:55 +01:00
send.c Btrfs: send, skip backreference walking for extents with many references 2019-11-18 17:51:48 +01:00
send.h
space-info.c btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
space-info.h Btrfs: remove wait queue from space_info structure 2019-11-18 17:51:46 +01:00
struct-funcs.c btrfs: tie extent buffer and it's token together 2019-09-09 14:59:16 +02:00
super.c btrfs: add support for 4-copy replication (raid1c4) 2019-11-18 17:51:49 +01:00
sysfs.c btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
sysfs.h btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
transaction.c btrfs: rename btrfs_block_group_cache 2019-11-18 17:51:51 +01:00
transaction.h btrfs: Rename btrfs_join_transaction_nolock 2019-11-18 12:46:54 +01:00
tree-checker.c btrfs: tree-checker: Fix error format string for size_t 2019-12-13 14:09:23 +01:00
tree-checker.h
tree-defrag.c
tree-log.c Btrfs: fix missing data checksums after replaying a log tree 2019-12-13 14:09:24 +01:00
tree-log.h
ulist.c
ulist.h
uuid-tree.c
volumes.c btrfs: fix devs_max constraints for raid1c3 and raid1c4 2019-12-13 14:09:23 +01:00
volumes.h btrfs: change btrfs_fs_devices::rotating to bool 2019-11-18 17:51:51 +01:00
xattr.c Btrfs: fix failure to persist compression property xattr deletion on fsync 2019-06-17 16:37:17 +02:00
xattr.h
zlib.c btrfs: compression: inline free_workspace 2019-11-18 12:46:59 +01:00
zstd.c btrfs: compression: inline free_workspace 2019-11-18 12:46:59 +01:00