If a bio for a direct IO request fails, we were not setting the error in
the parent bio (the main DIO bio), making us not return the error to
user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
return the number of bytes issued for IO and not the error a bio created
and submitted by btrfs_submit_direct() got from the block layer.
This essentially happens because when we call:
dio_end_io(dio_bio, bio->bi_error);
It does not set dio_bio->bi_error to the value of the second argument.
So just add this missing assignment in endio callbacks, just as we do in
the error path at btrfs_submit_direct() when we fail to clone the dio bio
or allocate its private object. This follows the convention of what is
done with other similar APIs such as bio_endio() where the caller is
responsible for setting the bi_error field in the bio it passes as an
argument to bio_endio().
This was detected by the new generic test cases in xfstests: 271, 272,
276 and 278. Which essentially setup a dm error target, then load the
error table, do a direct IO write and unload the error table. They
expect the write to fail with -EIO, which was not getting reported
when testing against btrfs.
Cc: stable@vger.kernel.org # 4.3+
Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Pull btrfs fixes from Chris Mason:
"This has a few fixes from Filipe, along with a readdir fix from Dave
that we've been testing for some time"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: properly set the termination value of ctx->pos in readdir
Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
Btrfs: remove no longer used function extent_read_full_page_nolock()
Btrfs: fix page reading in extent_same ioctl leading to csum errors
Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
Introduce new mount option alias "norecovery" for nologreplay, to keep
"norecovery" behavior the same with other filesystems.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new mount option "nologreplay" to co-operate with "ro" mount
option to get real readonly mount, like "norecovery" in ext* and xfs.
Since the new parse_options() need to check new flags at remount time,
so add a new parameter for parse_options().
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current "recovery" mount option will only try to use backup root.
However the word "recovery" is too generic and may be confusing for some
users.
Here introduce a new and more specific mount option, "usebackuproot" to
replace "recovery" mount option.
"Recovery" will be kept for compatibility reason, but will be
deprecated.
Also, since "usebackuproot" will only affect mount behavior and after
open_ctree() it has nothing to do with the filesystem, so clear the flag
after mount succeeded.
This provides the basis for later unified "norecovery" mount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ dropped usebackuproot from show_mount, added note about 'recovery' to
docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree.
Similar to the temporary items, we'll introduce a new name for an
existing key value and use the objectid for further extension. The
victim is the BTRFS_DEV_STATS_KEY (248).
The device stats are an example of a permanent item.
Signed-off-by: David Sterba <dsterba@suse.com>
The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree. We'll introduce a new
name for an existing key value and use the objectid for further
extension. The victim is the BTRFS_BALANCE_ITEM_KEY (248).
The nature of the balance status item is a good example of the temporary
item. It exists from beginning of the balance, keeps the status until it
finishes.
Signed-off-by: David Sterba <dsterba@suse.com>
The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
larger value, then it's LLONG_MAX.
There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
the increment.
We can get to that situation like that:
* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.
The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:
Overflow: e
Overflow: 7fffffff
Overflow: 7fffffffffffffff
PAX: size overflow detected in function btrfs_real_readdir
fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
context: dir_context;
CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
Call Trace:
[<ffffffff81742f0f>] dump_stack+0x4c/0x7f
[<ffffffff811cb706>] report_size_overflow+0x36/0x40
[<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
[<ffffffff811dafc8>] iterate_dir+0xa8/0x150
[<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
[<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
Overflow: 1a
[<ffffffff811db070>] ? iterate_dir+0x150/0x150
[<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.
The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.
References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
Reported-by: Victor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can safely use GFP_KERNEL in the functions called from the ioctl
handlers. Here we can allocate up to 32k so less pressure to the
allocator could help.
Signed-off-by: David Sterba <dsterba@suse.com>
Readdir is initiated from userspace and is not on the critical
writeback path, we don't need to use GFP_NOFS for allocations.
Signed-off-by: David Sterba <dsterba@suse.com>
Fallocate is initiated from userspace and is not on the critical
writeback path, we don't need to use GFP_NOFS for allocations.
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to use GFP_NOFS in all contexts, eg. during mount or for
dummy root tree, but we might for the the log tree creation.
Signed-off-by: David Sterba <dsterba@suse.com>
Scrub is not on the critical writeback path we don't need to use
GFP_NOFS for all allocations. The failures are handled and stats passed
back to userspace.
Let's use GFP_KERNEL on the paths where everything is ok, ie. setup the
global structures and the IO submission paths.
Functions that do the repair and fixups still use GFP_NOFS as we might
want to skip any other filesystem activity if we encounter an error.
This could turn out to be unnecessary, but requires more review compared
to the easy cases in this patch.
Signed-off-by: David Sterba <dsterba@suse.com>
The readahead framework is not on the critical writeback path we don't
need to use GFP_NOFS for allocations. All error paths are handled and
the readahead failures are not fatal. The actual users (scrub,
dev-replace) will trigger reads if the blocks are not found in cache.
Signed-off-by: David Sterba <dsterba@suse.com>
The send operation is not on the critical writeback path we don't need
to use GFP_NOFS for allocations. All error paths are handled and the
whole operation is restartable.
Signed-off-by: David Sterba <dsterba@suse.com>
Not needed after the previous patch named
"Btrfs: fix page reading in extent_same ioctl leading to csum errors".
Signed-off-by: Filipe Manana <fdmanana@suse.com>
In the extent_same ioctl, we were grabbing the pages (locked) and
attempting to read them without bothering about any concurrent IO
against them. That is, we were not checking for any ongoing ordered
extents nor waiting for them to complete, which leads to a race where
the extent_same() code gets a checksum verification error when it
reads the pages, producing a message like the following in dmesg
and making the operation fail to user space with -ENOMEM:
[18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868
Fix this by using btrfs_readpage() for reading the pages instead of
extent_read_full_page_nolock(), which waits for any concurrent ordered
extents to complete and locks the io range. Also do better error handling
and don't treat all failures as -ENOMEM, as that's clearly misleasing,
becoming identical to the checks and operation of prepare_uptodate_page().
The use of extent_read_full_page_nolock() was required before
commit f441460202 ("btrfs: fix deadlock with extent-same and readpage"),
as we had the range locked in an inode's io tree before attempting to
read the pages.
Fixes: f441460202 ("btrfs: fix deadlock with extent-same and readpage")
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
In the extent_same ioctl we are getting the pages for the source and
target ranges and unlocking them immediately after, which is incorrect
because later we attempt to map them (with kmap_atomic) and access their
contents at btrfs_cmp_data(). When we do such access the pages might have
been relocated or removed from memory, which leads to an invalid memory
access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y
which produces a trace like the following:
186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64
parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4
crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last
unloaded: btrfs]
[186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G W 4.4.0-rc6-btrfs-next-18+ #1
[186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000
[186736.681319] RIP: 0010:[<ffffffff81264d00>] [<ffffffff81264d00>] memcmp+0xb/0x22
[186736.681319] RSP: 0018:ffff880362287d70 EFLAGS: 00010287
[186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000
[186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000
[186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000
[186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000
[186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40
[186736.681319] FS: 00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000
[186736.681319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0
[186736.681319] Stack:
[186736.681319] ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000
[186736.681319] 0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400
[186736.681319] 00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038
[186736.681319] Call Trace:
[186736.681319] [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs]
[186736.681319] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[186736.681319] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[186736.681319] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[186736.681319] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[186736.681319] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[186736.681319] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6
04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0
(gdb) list *(btrfs_ioctl+0x24cb)
0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972).
2967 dst_addr = kmap_atomic(dst_page);
2968
2969 flush_dcache_page(src_page);
2970 flush_dcache_page(dst_page);
2971
2972 if (memcmp(addr, dst_addr, cmp_len))
2973 ret = BTRFS_SAME_DATA_DIFFERS;
2974
2975 kunmap_atomic(addr);
2976 kunmap_atomic(dst_addr);
So fix this by making sure we keep the pages locked and respect the same
locking order as everywhere else: get and lock the pages first and then
lock the range in the inode's io tree (like for example at
__btrfs_buffered_write() and extent_readpages()). If an ordered extent
is found after locking the range in the io tree, unlock the range,
unlock the pages, wait for the ordered extent to complete and repeat the
entire locking process until no overlapping ordered extents are found.
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
In subpagesize-blocksize scenario, the "destination offset" argument passed to
the btrfs_ioctl_clone() can be aligned to sectorsize but may not be
necessarily aligned to the machine's page size. In such cases,
truncate_inode_pages_range() ends up zeroing out the partial page and future
read operations will return incorrect data. Hence this commit explicitly
rounds down the "destination offset" to the machine's page size.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When extending a file by either "truncate up" or by writing beyond i_size, the
page which had i_size needs to be marked "read only" so that future writes to
the page via mmap interface causes btrfs_page_mkwrite() to be invoked. If not,
a write performed after extending the file via the mmap interface will find
the page to be writaeable and continue writing to the page without invoking
btrfs_page_mkwrite() i.e. we end up writing to a file without reserving disk
space.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_getattr() returns PAGE_CACHE_SIZE as the block size. Since
generic_fillattr() already does the right thing (by obtaining block size
from inode->i_blkbits), just remove the statement from btrfs_getattr.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
cow_file_range_inline() limits the size of an inline extent to
PAGE_CACHE_SIZE. This breaks in subpagesize-blocksize scenarios. Fix this by
comparing against root->sectorsize.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In subpagesize-blocksize scenario, map_length can be less than the length of a
bio vector. Such a condition may cause btrfs_submit_direct_hook() to submit a
zero length bio. Fix this by comparing map_length against block size rather
than with bv_len.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In subpagesize-blocksize a page can map multiple extent buffers and hence
using (page index, seq) as the search key is incorrect. For example, searching
through tree modification log tree can return an entry associated with the
first extent buffer mapped by the page (if such an entry exists), when we are
actually searching for entries associated with extent buffers that are mapped
at position 2 or more in the page.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In subpagesize-blocksize scenario it is not sufficient to search using the
first byte of the page to make sure that there are no ordered extents
present across the page. Fix this.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In subpagesize-blocksize scenario, if i_size occurs in a block which is not
the last block in the page, then the space to be reserved should be calculated
appropriately.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While at it, this commit changes btrfs_truncate_page() to truncate sectorsized
blocks instead of pages. Hence the function has been renamed to
btrfs_truncate_block().
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The direct I/O read's endio and corresponding repair functions work on
page sized blocks. This commit adds the ability for direct I/O read to work on
subpagesized blocks.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Checksums are applicable to sectorsize units. The current code uses
bio->bv_len units to compute and look up checksums. This works on machines
where sectorsize == PAGE_SIZE. This patch makes the checksum computation and
look up code to work with sectorsize units.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
units. Fix this by doing reservation/releases in block size units.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
So the old one didn't work properly before alternatives had run.
And it was supposed to provide an optimized JMP because the
assumption was that the offset it is jumping to is within a
signed byte and thus a two-byte JMP.
So I did an x86_64 allyesconfig build and dumped all possible
sites where static_cpu_has() was used. The optimization amounted
to all in all 12(!) places where static_cpu_has() had generated
a 2-byte JMP. Which has saved us a whopping 36 bytes!
This clearly is not worth the trouble so we can remove it. The
only place where the optimization might count - in __switch_to()
- we will handle differently. But that's not subject of this
patch.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull btrfs fixes from Chris Mason:
"Dave had a small collection of fixes to the new free space tree code,
one of which was keeping our sysfs files more up to date with feature
bits as different things get enabled (lzo, raid5/6, etc).
I should have kept the sysfs stuff for rc3, since we always manage to
trip over something. This time it was GFP_KERNEL from somewhere that
is NOFS only. Instead of rebasing it out I've put a revert in, and
we'll fix it properly for rc3.
Otherwise, Filipe fixed a btrfs DIO race and Qu Wenruo fixed up a
use-after-free in our tracepoints that Dave Jones reported"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "btrfs: synchronize incompat feature bits with sysfs files"
btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc
btrfs: sysfs: check initialization state before updating features
Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
btrfs: async-thread: Fix a use-after-free error for trace
Btrfs: fix race between fsync and lockless direct IO writes
btrfs: add free space tree to the cow-only list
btrfs: add free space tree to lockdep classes
btrfs: tweak free space tree bitmap allocation
btrfs: tests: switch to GFP_KERNEL
btrfs: synchronize incompat feature bits with sysfs files
btrfs: sysfs: introduce helper for syncing bits with sysfs files
btrfs: sysfs: add free-space-tree bit attribute
btrfs: sysfs: fix typo in compat_ro attribute definition
This reverts commit 14e46e0495.
This ends up doing sysfs operations from deep in balance (where we
should be GFP_NOFS) and under heavy balance load, we're making races
against sysfs internals.
Revert it for now while we figure things out.
Signed-off-by: Chris Mason <clm@fb.com>
If the mount phase is not finished, we can't update the sysfs files.
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Parameter of trace_btrfs_work_queued() can be freed in its workqueue.
So no one use use that pointer after queue_work().
Fix the user-after-free bug by move the trace line before queue_work().
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
An fsync, using the fast path, can race with a concurrent lockless direct
IO write and end up logging a file extent item that points to an extent
that wasn't written to yet. This is because the fast fsync path collects
ordered extents into a local list and then collects all the new extent
maps to log file extent items based on them, while the direct IO write
path creates the new extent map before it creates the corresponding
ordered extent (and submitting the respective bio(s)).
So fix this by making the direct IO write path create ordered extents
before the extent maps and make the fast fsync path collect any new
ordered extents after it collects the extent maps.
Note that making the fsync handler call inode_dio_wait() (after acquiring
the inode's i_mutex) would not work and lead to a deadlock when doing
AIO, as through AIO we end up in a path where the fsync handler is called
(through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
before the inode's dio counter is decremented (inode_dio_wait() waits
for this counter to have a value of zero).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull more btrfs updates from Chris Mason:
"These are mostly fixes that we've been testing, but also we grabbed
and tested a few small cleanups that had been on the list for a while.
Zhao Lei's patchset also fixes some early ENOSPC buglets"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (21 commits)
btrfs: raid56: Use raid_write_end_io for scrub
btrfs: Remove unnecessary ClearPageUptodate for raid56
btrfs: use rbio->nr_pages to reduce calculation
btrfs: Use unified stripe_page's index calculation
btrfs: Fix calculation of rbio->dbitmap's size calculation
btrfs: Fix no_space in write and rm loop
btrfs: merge functions for wait snapshot creation
btrfs: delete unused argument in btrfs_copy_from_user
btrfs: Use direct way to determine raid56 write/recover mode
btrfs: Small cleanup for get index_srcdev loop
btrfs: Enhance chunk validation check
btrfs: Enhance super validation check
Btrfs: fix deadlock running delayed iputs at transaction commit time
Btrfs: fix typo in log message when starting a balance
btrfs: remove duplicate const specifier
btrfs: initialize the seq counter in struct btrfs_device
Btrfs: clean up an error code in btrfs_init_space_info()
btrfs: fix iterator with update error in backref.c
Btrfs: fix output of compression message in btrfs_parse_options()
Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
...
The requested bitmap size varies, observed numbers were < 4K up to 16K.
Using vmalloc unconditionally would be too heavy, we'll try contiguous
allocations first and fall back to vmalloc if there's no contig memory.
Signed-off-by: David Sterba <dsterba@suse.com>
There's no reason to do GFP_NOFS in tests, it's not data-heavy and
memory allocation failures would affect only developers or testers.
Signed-off-by: David Sterba <dsterba@suse.com>
The files under /sys/fs/UUID/features get out of sync with the actual
incompat bits set for the filesystem if they change after mount (eg. the
LZO compression).
Synchronize the feature bits with the sysfs files representing them
right after we set/clear them.
Signed-off-by: David Sterba <dsterba@suse.com>
The files under /sys/fs/UUID/features get out of sync with the actual
incompat bits set for the filesystem if they change after mount. We're
going to sync them and need a helper to do that.
Signed-off-by: David Sterba <dsterba@suse.com>
The incompat bit representing the newly added free space tree feature is
missing. Right now it will be listed only among features supported by
the module, not per-fs.
Signed-off-by: David Sterba <dsterba@suse.com>
No need to create additional end_io function for scrub, it increased
code size and introduced some un-unified lines, as:
raid_write_parity_end_io():
int err = bio->bi_error;
if (bio->bi_error)
raid_write_end_io():
int err = bio->bi_error;
if (err)
This patch combines them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
PageUptodate flag already initialized to 0 for new page,
no need to set it again.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can use rbio->stripe_npages to reduce unnecessary calculation in
many code place.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We are using different index calculation method for stripe_page in
current code:
1: (rbio->stripe_len / PAGE_CACHE_SIZE) * stripe_index + page_index
2: DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE) * stripe_index + page_index
3: DIV_ROUND_UP(rbio->stripe_len * stripe_index, PAGE_CACHE_SIZE) + page_index
...
They can get same result when stripe_len align to PAGE_CACHE_SIZE,
this is why current code can work, intruduce and use a common function
for calculation is a better choose.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Current code is trying to calculate rbio->dbitmap's size to make it
align to sizeof(long), but implement haven't achived this object,
it is align to sizeof(char) instead.
This patch fixed above calculation, and use sizeof(long) instead of
fixed "8" to increate compatibility.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
I see no_space in v4.4-rc1 again in xfstests generic/102.
It happened randomly in some node only.
(one of 4 phy-node, and a kvm with non-virtio block driver)
By bisect, we can found the first-bad is:
commit bdced438ac ("block: setup bi_phys_segments after splitting")'
But above patch only triggered the bug by making bio operation
faster(or slower).
Main reason is in our space_allocating code, we need to commit
page writeback before wait it complish, this patch fixed above
bug.
BTW, there is another reason for generic/102 fail, caused by
disable default mixed-blockgroup, I'll fix it in xfstests.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
wait_for_snapshot_creation() is in same group with oher two:
btrfs_start_write_no_snapshoting()
btrfs_end_write_no_snapshoting()
Rename wait_for_snapshot_creation() and move it into same place
with other two.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
size_t write_bytes is not necessary for btrfs_copy_from_user(),
delete it.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Old code used bbio->raid_map to determine whether in raid56
write/recover operation, because we didn't't have bbio->map_type.
Now we have direct way for this condition, rid of using
the function-relative data, and make the code more readable.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
1: Adjust condition in loop to make less TAB
2: Move btrfs_put_bbio()'s line for combine, and makes logic clean.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Enhance chunk validation:
1) Num_stripes
We already have such check but it's only in super block sys chunk
array.
Now check all on-disk chunks.
2) Chunk logical
It should be aligned to sector size.
This behavior should be *DOUBLE CHECKED* for 64K sector size like
PPC64 or AArch64.
Maybe we can found some hidden bugs.
3) Chunk length
Same as chunk logical, should be aligned to sector size.
4) Stripe length
It should be power of 2.
5) Chunk type
Any bit out of TYPE_MAS | PROFILE_MASK is invalid.
With all these much restrict rules, several fuzzed image reported in
mail list should no longer cause kernel panic.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Enhance btrfs_check_super_valid() function by the following points:
1) Restrict sector/node size check
Not the old max/min valid check, but also check if it's a power of 2.
So some bogus number like 12K node size won't pass now.
2) Super flag check
For now, there is still some inconsistency between kernel and
btrfs-progs super flags.
And considering btrfs-progs may add new flags for super block, this
check will only output warning.
3) Better root alignment check
Now root bytenr is checked against sector size.
4) Move some check into btrfs_check_super_valid().
Like node size vs leaf size check, and PAGESIZE vs sectorsize check.
And magic number check.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The recent change titled "Btrfs: Check metadata redundancy on balance"
(already in linux-next) left a typo in a message for users:
metatdata -> metadata.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs updates from Chris Mason:
"This has our usual assortment of fixes and cleanups, but the biggest
change included is Omar Sandoval's free space tree. It's not the
default yet, mounting -o space_cache=v2 enables it and sets a readonly
compat bit. The tree can actually be deleted and regenerated if there
are any problems, but it has held up really well in testing so far.
For very large filesystems (30T+) our existing free space caching code
can end up taking a huge amount of time during commits. The new tree
based code is faster and less work overall to update as the commit
progresses.
Omar worked on this during the summer and we'll hammer on it in
production here at FB over the next few months"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits)
Btrfs: fix fitrim discarding device area reserved for boot loader's use
Btrfs: Check metadata redundancy on balance
btrfs: statfs: report zero available if metadata are exhausted
btrfs: preallocate path for snapshot creation at ioctl time
btrfs: allocate root item at snapshot ioctl time
btrfs: do an allocation earlier during snapshot creation
btrfs: use smaller type for btrfs_path locks
btrfs: use smaller type for btrfs_path lowest_level
btrfs: use smaller type for btrfs_path reada
btrfs: cleanup, use enum values for btrfs_path reada
btrfs: constify static arrays
btrfs: constify remaining structs with function pointers
btrfs tests: replace whole ops structure for free space tests
btrfs: use list_for_each_entry* in backref.c
btrfs: use list_for_each_entry_safe in free-space-cache.c
btrfs: use list_for_each_entry* in check-integrity.c
Btrfs: use linux/sizes.h to represent constants
btrfs: cleanup, remove stray return statements
btrfs: zero out delayed node upon allocation
btrfs: pass proper enum type to start_transaction()
...
I managed to trigger this:
| INFO: trying to register non-static key.
| the code is fine but needs lockdep annotation.
| turning off the locking correctness validator.
| CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14
| Hardware name: ARM-Versatile Express
| [<80307cec>] (dump_stack)
| [<80070e98>] (__lock_acquire)
| [<8007184c>] (lock_acquire)
| [<80287800>] (btrfs_ioctl)
| [<8012a8d4>] (do_vfs_ioctl)
| [<8012ac14>] (SyS_ioctl)
so I think that btrfs_device_data_ordered_init() is not invoked behind
a macro somewhere.
Fixes: 7cc8e58d53 ("Btrfs: fix unprotected device's variants on 32bits machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we return 1 here, then the caller treats it as an error and returns
-EINVAL. It causes a static checker warning to treat positive returns
as an error.
Fixes: 1aba86d67f ('Btrfs: fix easily get into ENOSPC in mixed case')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the following error:
fs/btrfs/backref.c:565:1-20: iterator with update on line 577
Fixes: a7ca422('btrfs: use list_for_each_entry* in backref.c')
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression message might not be correctly output.
Fix it.
[[before fix]]
# mount -o compress /dev/sdb3 /test3
[ 996.874264] BTRFS info (device sdb3): disk space caching is enabled
[ 996.874268] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress-force /dev/sdb3 /test3
[ 1035.075017] BTRFS info (device sdb3): force zlib compression
[ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress /dev/sdb3 /test3
[ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
[[after fix]]
# mount -o compress /dev/sdb3 /test3
[ 401.021753] BTRFS info (device sdb3): use zlib compression
[ 401.021758] BTRFS info (device sdb3): disk space caching is enabled
[ 401.021760] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress-force /dev/sdb3 /test3
[ 439.824624] BTRFS info (device sdb3): force zlib compression
[ 439.824629] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress /dev/sdb3 /test3
[ 459.918430] BTRFS info (device sdb3): use zlib compression
[ 459.918434] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following call trace is seen when btrfs/031 test is executed in a loop,
[ 158.661848] ------------[ cut here ]------------
[ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea()
[ 158.664102] BTRFS: Transaction aborted (error -2)
[ 158.664774] Modules linked in:
[ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711a #2
[ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 158.667392] ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930
[ 158.668515] ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000
[ 158.669647] ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980
[ 158.670769] Call Trace:
[ 158.671153] [<ffffffff81431fc8>] dump_stack+0x44/0x5c
[ 158.671884] [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0
[ 158.672769] [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50
[ 158.673620] [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea
[ 158.674440] [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520
[ 158.675376] [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50
[ 158.676235] [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180
[ 158.677268] [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70
[ 158.678183] [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90
[ 158.678975] [<ffffffff81144b8e>] ? vma_merge+0xee/0x300
[ 158.679751] [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170
[ 158.680599] [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70
[ 158.681686] [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0
[ 158.682581] [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490
[ 158.683399] [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60
[ 158.684297] [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80
[ 158.685051] [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 158.685958] ---[ end trace 4b63312de5a2cb76 ]---
[ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry
[ 158.709508] BTRFS info (device loop0): forced readonly
[ 158.737113] BTRFS info (device loop0): disk space caching is enabled
[ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed
[ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30
This occurs because,
Mount filesystem
Create subvol with ID 257
Unmount filesystem
Mount filesystem
Delete subvol with ID 257
btrfs_drop_snapshot()
Add root corresponding to subvol 257 into
btrfs_transaction->dropped_roots list
Create new subvol (i.e. create_subvol())
257 is returned as the next free objectid
btrfs_read_fs_root_no_name()
Finds the btrfs_root instance corresponding to the old subvol with ID 257
in btrfs_fs_info->fs_roots_radix.
Returns error since btrfs_root_item->refs has the value of 0.
To fix the issue the commit initializes tree root's and subvolume root's
highest_objectid when loading the roots from disk.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Overloading extent_map->bdev to struct map_lookup * might have started out
as a means to an end, but it's a pattern that's used all over the place
now. Let's get rid of the casting and just add a union instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:
- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.
The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.
Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.
One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.
There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:
-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----
I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
Pull vfs copy_file_range updates from Al Viro:
"Several series around copy_file_range/CLONE"
* 'work.copy_file_range' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
btrfs: use new dedupe data function pointer
vfs: hoist the btrfs deduplication ioctl to the vfs
vfs: wire up compat ioctl for CLONE/CLONE_RANGE
cifs: avoid unused variable and label
nfsd: implement the NFSv4.2 CLONE operation
nfsd: Pass filehandle to nfs4_preprocess_stateid_op()
vfs: pull btrfs clone API to vfs layer
locks: new locks_mandatory_area calling convention
vfs: Add vfs_copy_file_range() support for pagecache copies
btrfs: add .copy_file_range file operation
x86: add sys_copy_file_range to syscall tables
vfs: add copy_file_range syscall and vfs helper
Pull vfs xattr updates from Al Viro:
"Andreas' xattr cleanup series.
It's a followup to his xattr work that went in last cycle; -0.5KLoC"
* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xattr handlers: Simplify list operation
ocfs2: Replace list xattr handler operations
nfs: Move call to security_inode_listsecurity into nfs_listxattr
xfs: Change how listxattr generates synthetic attributes
tmpfs: listxattr should include POSIX ACL xattrs
tmpfs: Use xattr handler infrastructure
btrfs: Use xattr handler infrastructure
vfs: Distinguish between full xattr names and proper prefixes
posix acls: Remove duplicate xattr name definitions
gfs2: Remove gfs2_xattr_acl_chmod
vfs: Remove vfs_xattr_cmp
Pull vfs RCU symlink updates from Al Viro:
"Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
even if the symlink is not an embedded one.
No changes since the mailbomb on Jan 1"
* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->get_link() to delayed_call, kill ->put_link()
kill free_page_put_link()
teach nfs_get_link() to work in RCU mode
teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
teach shmem_get_link() to work in RCU mode
teach page_get_link() to work in RCU mode
replace ->follow_link() with new method that could stay in RCU mode
don't put symlink bodies in pagecache into highmem
namei: page_getlink() and page_follow_link_light() are the same thing
ufs: get rid of ->setattr() for symlinks
udf: don't duplicate page_symlink_inode_operations
logfs: don't duplicate page_symlink_inode_operations
switch befs long symlinks to page_symlink_operations
As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).
Fix this by not allowing to trim/discard any region in the device starting
with an offset not greater than min(alloc_start_mount_option, 1Mb), just
as it was not possible before 4.3.
A reproducer test case for xfstests follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
# Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
# reserved for a boot loader to use (GRUB for example) and btrfs should never
# use them - neither for allocating metadata/data nor should trim/discard them.
# The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
$XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
$XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
# Now mount the filesystem and perform a fitrim against it.
_scratch_mount
_require_batched_discard $SCRATCH_MNT
$FSTRIM_PROG $SCRATCH_MNT
# Now unmount the filesystem and verify the content of the ranges was not
# modified (no trim/discard happened on them).
_scratch_unmount
echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
status=0
exit
Reported-by: Vincent Petry <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
When converting a filesystem via balance check that metadata mode
is at least as redundant as the data mode. For example give warning
when:
-dconvert=raid1 -mconvert=single
Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk>
[ minor message reformatting ]
Signed-off-by: David Sterba <dsterba@suse.com>
There is one ENOSPC case that's very confusing. There's Available
greater than zero but no file operation succeds (besides removing
files). This happens when the metadata are exhausted and there's no
possibility to allocate another chunk.
In this scenario it's normal that there's still some space in the data
chunk and the calculation in df reflects that in the Avail value.
To at least give some clue about the ENOSPC situation, let statfs report
zero value in Avail, even if there's still data space available.
Current:
/dev/sdb1 4.0G 3.3G 719M 83% /mnt/test
New:
/dev/sdb1 4.0G 3.3G 0 100% /mnt/test
We calculate the remaining metadata space minus global reserve. If this
is (supposedly) smaller than zero, there's no space. But this does not
hold in practice, the exhausted state happens where's still some
positive delta. So we apply some guesswork and compare the delta to a 4M
threshold. (Practically observed delta was 2M.)
We probably cannot calculate the exact threshold value because this
depends on the internal reservations requested by various operations, so
some operations that consume a few metadata will succeed even if the
Avail is zero. But this is better than the other way around.
Signed-off-by: David Sterba <dsterba@suse.com>
We can also preallocate btrfs_path that's used during pending snapshot
creation and avoid another late ENOMEM failure.
Signed-off-by: David Sterba <dsterba@suse.com>
The actual snapshot creation is delayed until transaction commit. If we
cannot get enough memory for the root item there, we have to fail the
whole transaction commit which is bad. So we'll allocate the memory at
the ioctl call and pass it along with the pending_snapshot struct. The
potential ENOMEM will be returned to the caller of snapshot ioctl.
Signed-off-by: David Sterba <dsterba@suse.com>
The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see:
* overall size of btrfs_path drops down from 136 to 112 (-24 bytes),
* better packing in a slab page +6 objects
* the whole structure now fits to 2 cachelines
* slight decrease in code size:
text data bss dec hex filename
938731 43670 23144 1005545 f57e9 fs/btrfs/btrfs.ko.before
938203 43670 23144 1005017 f55d9 fs/btrfs/btrfs.ko.after
(and the generated assembly does not change much)
The main purpose is to decrease the size of the structure without
affecting performance. The byte access is usually well behaving accross
arches, the locks are not accessed frequently and sometimes just
compared to zero.
Note for further size reduction attempts: the slots could be made u16
but this might generate worse code on some arches (non-byte and non-int
access). Also the range of operations on slots is wider compared to
locks and the potential performance drop should be evaluated first.
Signed-off-by: David Sterba <dsterba@suse.com>
The level is 0..7, we can use smaller type. The size of btrfs_path is now
136 bytes from 144, which is +2 objects that fit into a 4k slab.
Signed-off-by: David Sterba <dsterba@suse.com>
The possible values for reada are all positive and bounded, we can later
save some bytes by storing it in u8.
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the integers by enums for better readability. The value 2 does
not have any meaning since a717531942
"Btrfs: do less aggressive btree readahead" (2009-01-22).
Signed-off-by: David Sterba <dsterba@suse.com>
There are a few statically initialized arrays that can be made const.
The remaining (like file_system_type, sysfs attributes or prop handlers)
do not allow that due to type mismatch when passed to the APIs or
because the structures are modified through other members.
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory work for making btrfs_free_space_op constant. In
test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with
own version thus preventing constification. We can rework it so we
replace the whole structure with the correct function pointers.
Signed-off-by: David Sterba <dsterba@suse.com>
Use list_for_each_entry*() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.
So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's slightly cleaner to zero-out the delayed node upon allocation
than to do it by hand in btrfs_init_delayed_node() for a few members
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly
returning boolean not int.
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode argument is never used from the beginning, so remove it.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.
The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.
The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.
Signed-off-by: David Sterba <dsterba@suse.com>
We can handle the special case of num_stripes == 0 directly inside
btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
catch other unhandled cases where we fail to validate external data.
A crafted or corrupted image crashes at mount time:
BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
BTRFS info (device loop0): disk space caching is enabled
BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
Stack:
637af890 60062489 602aeb2e 604192ba
60387961 00000011 637af8a0 6038a835
637af9c0 6038776b 634ef32b 00000000
Call Trace:
[<6001c86d>] show_stack+0xfe/0x15b
[<6038a835>] dump_stack+0x2a/0x2c
[<6038776b>] panic+0x13e/0x2b3
[<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
[<601cfbbe>] open_ctree+0x192d/0x27af
[<6019c2c1>] btrfs_mount+0x8f5/0xb9a
[<600bc9a7>] mount_fs+0x11/0xf3
[<600d5167>] vfs_kern_mount+0x75/0x11a
[<6019bcb0>] btrfs_mount+0x2e4/0xb9a
[<600bc9a7>] mount_fs+0x11/0xf3
[<600d5167>] vfs_kern_mount+0x75/0x11a
[<600d710b>] do_mount+0xa35/0xbc9
[<600d7557>] SyS_mount+0x95/0xc8
[<6001e884>] handle_syscall+0x6b/0x8e
Reported-by: Jiri Slaby <jslaby@suse.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
and has 8 unused bytes. Reducing the level type to u8 makes it possible
to squeeze it to the padding byte after key. The bitfields were switched
to bool as there's space to store the full byte without increasing the
whole structure, besides that the generated assembly is smaller.
struct btrfs_delayed_extent_op {
struct btrfs_disk_key key; /* 0 17 */
u8 level; /* 17 1 */
bool update_key; /* 18 1 */
bool update_flags; /* 19 1 */
bool is_data; /* 20 1 */
/* XXX 3 bytes hole, try to pack */
u64 flags_to_set; /* 24 8 */
/* size: 32, cachelines: 1, members: 6 */
/* sum members: 29, holes: 1, sum holes: 3 */
/* last cacheline: 32 bytes */
};
The final size is 32 bytes which gives +26 object per slab page.
text data bss dec hex filename
938811 43670 23144 1005625 f5839 fs/btrfs/btrfs.ko.before
938747 43670 23144 1005561 f57f9 fs/btrfs/btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.com>
Inodes for delayed iput allocate a trivial helper structure, let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.
The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use the list_splice to
process a bunch of inodes because we'd lost track of the count if the
inode is put into the delayed iputs again while it's processed.
Signed-off-by: David Sterba <dsterba@suse.com>
Since we will add support for -d dup for non-mixed filesystem,
kernel need to support converting to this raid-type.
This patch remove limitation of above case.
Tested by following script:
(combination of dup conversion with fsck):
export TEST_DEV='/dev/vdc'
export TEST_DIR='/var/ltf/tester/mnt'
do_dup_test()
{
local m_from="$1"
local d_from="$2"
local m_to="$3"
local d_to="$4"
echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"
umount "$TEST_DIR" &>/dev/null
./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
mount "$TEST_DEV" "$TEST_DIR" || return 1
cp -a /sbin/* "$TEST_DIR"
[[ "$m_from" != "$m_to" ]] && {
./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
}
[[ "$d_from" != "$d_to" ]] && {
local opt=()
[[ "$d_to" == single ]] && opt+=("-f")
./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
}
umount "$TEST_DIR" || return 1
./btrfsck "$TEST_DEV" || return 1
echo
return 0
}
test_all()
{
for m_from in single dup; do
for d_from in single dup; do
for m_to in single dup; do
for d_to in single dup; do
do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
done
done
done
done
}
test_all
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We hit this panic on a few of our boxes this week where we have an
ordered_extent with an NULL inode. We do an igrab() of the inode in writepages,
but weren't doing it in writepage which can be called directly from the VM on
dirty pages. If the inode has been unlinked then we could have I_FREEING set
which means igrab() would return NULL and we get this panic. Fix this by trying
to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looks like oversight, call brelse() when checksum fails. Further down the
code, in the non error path, we do call brelse() and so we don't see
brelse() in the goto error paths.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we failed to create a hard link we were not always releasing the
the transaction handle we got before, resulting in a memory leak and
preventing any other tasks from being able to commit the current
transaction.
Fix this by always releasing our transaction handle.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We weren't accounting for the insertion of an inline extent item for the
symlink inode nor that we need to update the parent inode item (through
the call to btrfs_add_nondir()). So fix this by including two more
transaction units.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
When we are creating a symlink we might fail with an error after we
created its inode and added the corresponding directory indexes to its
parent inode. In this case we end up never removing the directory indexes
because the inode eviction handler, called for our symlink inode on the
final iput(), only removes items associated with the symlink inode and
not with the parent inode.
Example:
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt
$ touch /mnt/foo
$ ln -s /mnt/foo /mnt/bar
ln: failed to create symbolic link ‘bar’: Cannot allocate memory
$ umount /mnt
$ btrfsck /dev/sdi
Checking filesystem on /dev/sdi
UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
checking extents
checking free space cache
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
found 131073 bytes used err is 1
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 124305
file data blocks allocated: 262144
referenced 262144
btrfs-progs v4.2.3
So fix this by adding the directory index entries as the very last
step of symlink creation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
When a symlink is successfully created it always has an inline extent
containing the source path. However if an error happens when creating
the symlink, we can leave in the subvolume's tree a symlink inode without
any such inline extent item - this happens if after btrfs_symlink() calls
btrfs_end_transaction() and before it calls the inode eviction handler
(through the final iput() call), the transaction gets committed and a
crash happens before the eviction handler gets called, or if a snapshot
of the subvolume is made before the eviction handler gets called. Sadly
we can't just avoid this by making btrfs_symlink() call
btrfs_end_transaction() after it calls the eviction handler, because the
later can commit the current transaction before it removes any items from
the subvolume tree (if it encounters ENOSPC errors while reserving space
for removing all the items).
So make send fail more gracefully, with an -EIO error, and print a
message to dmesg/syslog informing that there's an empty symlink inode,
so that the user can delete the empty symlink or do something else
about it.
Reported-by: Stephen R. van den Berg <srb@cuci.nl>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
While running a stress test I ran into the following trace/transaction
abort:
[471626.672243] ------------[ cut here ]------------
[471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
[471626.675492] BTRFS: Transaction aborted (error -2)
[471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix
[471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[471626.691901] 0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38
[471626.695009] ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe
[471626.697490] ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90
[471626.699201] Call Trace:
[471626.699804] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[471626.701049] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[471626.702542] [<ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.704326] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[471626.705636] [<ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
[471626.707048] [<ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.708616] [<ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs]
[471626.709950] [<ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs]
[471626.711286] [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[471626.712611] [<ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[471626.715610] [<ffffffff811962a2>] ? SyS_tee+0x226/0x226
[471626.716718] [<ffffffff811962c2>] sync_fs_one_sb+0x20/0x22
[471626.717672] [<ffffffff8116fc01>] iterate_supers+0x75/0xc2
[471626.718800] [<ffffffff8119669a>] sys_sync+0x52/0x80
[471626.719990] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[471626.721835] ---[ end trace baf57f43d76693f4 ]---
[471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry
This is a very rare situation and it happened due to a race between a free
space endio worker and writing the space caches for dirty block groups at
a transaction's commit critical section. The steps leading to this are:
1) A task calls btrfs_commit_transaction() and starts the writeout of the
space caches for all currently dirty block groups (i.e. it calls
btrfs_start_dirty_block_groups());
2) The previous step starts writeback for space caches;
3) When the writeback finishes it queues jobs for free space endio work
queue (fs_info->endio_freespace_worker) that execute
btrfs_finish_ordered_io();
4) The task committing the transaction sets the transaction's state
to TRANS_STATE_COMMIT_DOING and shortly after calls
btrfs_write_dirty_block_groups();
5) A free space endio job joins the transaction, through
btrfs_join_transaction_nolock(), and updates a free space inode item
in the root tree through btrfs_update_inode_fallback();
6) Updating the free space inode item resulted in COWing one or more
nodes/leaves of the root tree, and that resulted in creating a new
metadata block group, which gets added to the transaction's list
of dirty block groups (this is a very rare case);
7) The free space endio job has not released yet its transaction handle
at this point, so the new metadata block group was not yet fully
created (didn't go through btrfs_create_pending_block_groups() yet);
8) The transaction commit task sees the new metadata block group in
the transaction's list of dirty block groups and processes it.
When it attempts to update the block group's block group item in
the extent tree, through write_one_cache_group(), it isn't able
to find it and aborts the transaction with error -ENOENT - this
is because the free space endio job hasn't yet released its
transaction handle (which calls btrfs_create_pending_block_groups())
and therefore the block group item was not yet added to the extent
tree.
Fix this waiting for free space endio jobs if we fail to find a block
group item in the extent tree and then retry once updating the block
group item.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
This is a short term solution to make sure btrfs_run_delayed_refs()
doesn't change the extent tree while we are scanning it to create the
free space tree.
Longer term we need to synchronize scanning the block groups one by one,
similar to what happens during a balance.
Signed-off-by: Chris Mason <clm@fb.com>
We call btrfs_write_dirty_block_groups() in the critical section of a
transaction's commit, when no other tasks can join the transaction and
add more block groups to the transaction's list of dirty block groups,
so we not taking the dirty block groups spinlock when checking for the
list's emptyness, grabbing its first element or deleting elements from
it.
However there's a special and rare case where we can have a concurrent
task adding elements to this list. We trigger writeback for space
caches before at btrfs_start_dirty_block_groups() and in past iterations
of the loop at btrfs_write_dirty_block_groups(), this means that when
the writeback finishes (which happens asynchronously) it creates a
task for the endio free space work queue that executes
btrfs_finish_ordered_io() - this function is able to join the transaction,
through btrfs_join_transaction_nolock(), and update the free space cache's
inode item in the root tree, which can result in COWing nodes of this tree
and therefore allocation of a new block group can happen, which gets added
to the transaction's list of dirty block groups while the transaction
commit task is operating on it concurrently.
So fix this by taking the dirty block groups spinlock before doing
operations on the dirty block groups list at
btrfs_write_dirty_block_groups().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Those are stupid and code should use static_cpu_has_safe() or
boot_cpu_has() instead. Kill the least used and unused ones.
The remaining ones need more careful inspection before a conversion can
happen. On the TODO.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1449481182-27541-4-git-send-email-bp@alien8.de
Cc: David Sterba <dsterba@suse.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull btrfs fixes from Chris Mason:
"A couple of small fixes"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: check prepare_uptodate_page() error code earlier
Btrfs: check for empty bitmap list in setup_cluster_bitmaps
btrfs: fix misleading warning when space cache failed to load
Btrfs: fix transaction handle leak in balance
Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
When running fstests btrfs/070, with a higher number of fsstress
operations, I ran frequently into two different locking bugs when
defragging directories.
The first bug produced the following traces:
[133860.229792] ------------[ cut here ]------------
[133860.251062] WARNING: CPU: 2 PID: 26057 at fs/btrfs/locking.c:46 btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]()
[133860.253576] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport
[133860.282566] CPU: 2 PID: 26057 Comm: btrfs Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[133860.284393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[133860.286827] 0000000000000000 ffff880207697b78 ffffffff812566f4 0000000000000000
[133860.288341] ffff880207697bb0 ffffffff8104d0a6 ffffffffa052d4c1 ffff880178f60e00
[133860.294219] ffff880178f60e00 0000000000000000 00000000000000f6 ffff880207697bc0
[133860.295831] Call Trace:
[133860.306518] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[133860.307473] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[133860.308619] [<ffffffffa052d4c1>] ? btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]
[133860.310068] [<ffffffff8104d172>] warn_slowpath_null+0x1a/0x1c
[133860.312552] [<ffffffffa052d4c1>] btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]
[133860.314630] [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs]
[133860.323596] [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs]
[133860.325233] [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs]
[133860.332427] [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs]
[133860.337259] [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs]
[133860.340147] [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs]
[133860.344833] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[133860.346343] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[133860.353248] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[133860.354242] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
[133860.355232] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
[133860.356237] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
[133860.358587] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[133860.360195] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[133860.361380] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[133860.363578] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[133860.366217] ---[ end trace 2cadb2f653437e49 ]---
[133860.367399] ------------[ cut here ]------------
[133860.368162] kernel BUG at fs/btrfs/locking.c:307!
[133860.369430] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[133860.370205] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport
[133860.370205] CPU: 2 PID: 26057 Comm: btrfs Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[133860.370205] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[133860.370205] task: ffff8800aec6db40 ti: ffff880207694000 task.ti: ffff880207694000
[133860.370205] RIP: 0010:[<ffffffffa052d466>] [<ffffffffa052d466>] btrfs_assert_tree_locked+0x10/0x14 [btrfs]
[133860.370205] RSP: 0018:ffff880207697bc0 EFLAGS: 00010246
[133860.370205] RAX: 0000000000000000 RBX: ffff880178f60e00 RCX: 0000000000000000
[133860.370205] RDX: ffff88023ec4fb50 RSI: 00000000ffffffff RDI: ffff880178f60e00
[133860.370205] RBP: ffff880207697bc0 R08: 0000000000000001 R09: 0000000000000000
[133860.370205] R10: 0000160000000000 R11: ffffffff81651000 R12: ffff880178f60e00
[133860.370205] R13: 0000000000000000 R14: 00000000000000f6 R15: ffff8801ff409000
[133860.370205] FS: 00007f763efd48c0(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
[133860.370205] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[133860.370205] CR2: 0000000002158048 CR3: 000000003fd6c000 CR4: 00000000000006e0
[133860.370205] Stack:
[133860.370205] ffff880207697bd8 ffffffffa052d4d0 0000000000000000 ffff880207697be8
[133860.370205] ffffffffa04d5787 ffff880207697c80 ffffffffa04d99cb ffff8801ff409590
[133860.370205] ffff880207697ca8 000000f507697c80 ffff880183c11bb8 0000000000000000
[133860.370205] Call Trace:
[133860.370205] [<ffffffffa052d4d0>] btrfs_set_lock_blocking_rw+0x66/0xbd [btrfs]
[133860.370205] [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs]
[133860.370205] [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs]
[133860.370205] [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs]
[133860.370205] [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs]
[133860.370205] [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs]
[133860.370205] [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs]
[133860.370205] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[133860.370205] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[133860.370205] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[133860.370205] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
[133860.370205] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
[133860.370205] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
[133860.370205] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[133860.370205] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[133860.370205] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[133860.370205] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
This bug happened because we assumed that by setting keep_locks to 1 in
our search path, our path after a call to btrfs_search_slot() would have
all nodes locked, which is not always true because unlock_up() (called by
btrfs_search_slot()) will unlock a node in a path if the slot of the node
below it doesn't point to the last item or beyond the last item. For
example, when the tree has a heigth of 2 and path->slots[0] has a value
smaller than btrfs_header_nritems(path->nodes[0]) - 1, the node at level 2
will be unlocked (also because lowest_unlock is set to 1 due to the fact
that the value passed as ins_len to btrfs_search_slot is 0).
This resulted in btrfs_find_next_key(), called before btrfs_realloc_node(),
to release out path and call again btrfs_search_slot(), but this time with
the cow parameter set to 0, meaning the resulting path got only read locks.
Therefore when we called btrfs_realloc_node(), with path->nodes[1] having
a read lock, it resulted in the warning and BUG_ON when calling
btrfs_set_lock_blocking() against the node, as that function expects the
node to have a write lock.
The second bug happened often when the first bug didn't happen, and made
us hang and hitting the following warning at fs/btrfs/locking.c:
251 void btrfs_tree_lock(struct extent_buffer *eb)
252 {
253 WARN_ON(eb->lock_owner == current->pid);
This happened because the tree search we made at btrfs_defrag_leaves()
before calling btrfs_find_next_key() locked a leaf and all the other
nodes in the path, so btrfs_find_next_key() had no need to release the
path and make a new search (with path->lowest_level set to 1). This
made btrfs_realloc_node() attempt to write lock the same leaf again,
resulting in a hang/deadlock.
So fix these issues by calling btrfs_find_next_key() after calling
btrfs_realloc_node() and setting the search path's lowest_level to 1
to avoid the hang/deadlock when attempting to write lock the leaves
at btrfs_realloc_node().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Now we can finally hook up everything so we can actually use free space
tree. The free space tree is enabled by passing the space_cache=v2 mount
option. On the first mount with the this option set, the free space tree
will be created and the FREE_SPACE_TREE read-only compat bit will be
set. Any time the filesystem is mounted from then on, we must use the
free space tree. The clear_cache option will also clear the free space
tree.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The free space tree is updated in tandem with the extent tree. There are
only a handful of places where we need to hook in:
1. Block group creation
2. Block group deletion
3. Delayed refs (extent creation and deletion)
4. Block group caching
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This tests the operations on the free space tree trying to excercise all
of the main cases for both formats. Between this and xfstests, the free
space tree should have pretty good coverage.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The free space cache has turned out to be a scalability bottleneck on
large, busy filesystems. When the cache for a lot of block groups needs
to be written out, we can get extremely long commit times; if this
happens in the critical section, things are especially bad because we
block new transactions from happening.
The main problem with the free space cache is that it has to be written
out in its entirety and is managed in an ad hoc fashion. Using a B-tree
to store free space fixes this: updates can be done as needed and we get
all of the benefits of using a B-tree: checksumming, RAID handling,
well-understood behavior.
With the free space tree, we get commit times that are about the same as
the no cache case with load times slower than the free space cache case
but still much faster than the no cache case. Free space is represented
with extents until it becomes more space-efficient to use bitmaps,
giving us similar space overhead to the free space cache.
The operations on the free space tree are: adding and removing free
space, handling the creation and deletion of block groups, and loading
the free space for a block group. We can also create the free space tree
by walking the extent tree and clear the free space tree.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The on-disk format for the free space tree is straightforward. Each
block group is represented in the free space tree by a free space info
item that stores accounting information: whether the free space for this
block group is stored as bitmaps or extents and how many extents of free
space exist for this block group (regardless of which format is being
used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
with no corresponding item, and bitmaps instead have the
FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an
array of bytes.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We're also going to load the free space tree from caching_thread(), so
we should refactor some of the common code.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We're finally going to add one of these for the free space tree, so
let's add the same nice helpers that we have for the incompat bits.
While we're add it, also add helpers to clear the bits.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Sanity test the extent buffer bitmap operations (test, set, and clear)
against the equivalent standard kernel operations.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When doing a direct IO write, __blockdev_direct_IO() can call the
btrfs_get_blocks_direct() callback one or more times before it calls the
btrfs_submit_direct() callback. However it can fail after calling the
first callback and before calling the second callback, which is a problem
because the first one creates ordered extents and the second one is the
one that submits bios that cover the ordered extents created by the first
one. That means the ordered extents will never complete nor have any of
the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, resulting in
subsequent operations (such as other direct IO writes, buffered writes or
hole punching) that lock the same IO range and lookup for ordered extents
in the range to hang forever waiting for those ordered extents because
they can not complete ever, since no bio was submitted.
Fix this by tracking a range of created ordered extents that don't have
yet corresponding bios submitted and completing the ordered extents in
the range if __blockdev_direct_IO() fails with an error.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
If readpages() (triggered by defrag or buffered reads) is called while a
direct IO write is in progress, we have a small time window where we can
deadlock, resulting in traces like the following being generated:
[84723.212993] INFO: task fio:2849 blocked for more than 120 seconds.
[84723.214310] Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.217313] fio D ffff88023ec75218 0 2849 2835 0x00000000
[84723.218778] ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200
[84723.220458] ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff
[84723.230597] 0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a
[84723.232085] Call Trace:
[84723.232625] [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c
[84723.233529] [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.234398] [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b
[84723.235384] [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28
[84723.236426] [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.237502] [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d
[84723.238807] [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab
[84723.242012] [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.243064] [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33
[84723.244116] [<ffffffff810afa2e>] ? ktime_get+0x41/0x52
[84723.245029] [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b
[84723.245942] [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b
[84723.246596] [<ffffffff81478953>] bit_wait_io+0x39/0x45
[84723.247503] [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d
[84723.248540] [<ffffffff8111684f>] __lock_page+0x66/0x68
[84723.249558] [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a
[84723.250844] [<ffffffff81124a04>] lock_page+0x2c/0x2f
[84723.251871] [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa
[84723.253274] [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146
[84723.254757] [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15
[84723.256378] [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs]
[84723.258556] [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111
[84723.260064] [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb
[84723.261479] [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs]
[84723.262961] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.264449] [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33
[84723.265614] [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33
[84723.266769] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.268264] [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs]
[84723.270954] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.272465] [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128
[84723.273734] [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs]
[84723.275101] [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5
[84723.276200] [<ffffffff8116cfab>] vfs_write+0xa0/0xe4
[84723.277298] [<ffffffff8116d79d>] SyS_write+0x50/0x7e
[84723.278327] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[84723.279595] INFO: lockdep is turned off.
[84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds.
[84723.380323] Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.383003] btrfs D ffff88023ed75218 0 2923 2859 0x00000000
[84723.384277] ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200
[84723.385748] ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30
[84723.387152] ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a
[84723.388620] Call Trace:
[84723.389105] [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.391882] [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs]
[84723.393718] [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[84723.395659] [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs]
[84723.397383] [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.398852] [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs]
[84723.400561] [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72
[84723.401787] [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs]
[84723.403121] [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.404583] [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs]
[84723.406007] [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4
[84723.407502] [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e
[84723.408937] [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e
[84723.410487] [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f
[84723.411710] [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs]
[84723.413007] [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs]
[84723.414085] [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs]
[84723.415307] [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs]
[84723.416532] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[84723.417731] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.418699] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.421532] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
[84723.422629] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
[84723.423712] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
[84723.424801] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[84723.425968] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[84723.427063] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[84723.428138] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
Consider the following logical and physical file layout:
logical: ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ...
4K 8K 16K
physical: ... 12853248 12857344 1103101952 ...
(= 12853248 + 4K)
Extents A and B are physically adjacent. The following diagram shows a
sequence of events that lead to the deadlock when we attempt to do a
direct IO write against the file range [4K, 16K[ and a defrag is triggered
simultaneously.
CPU 1 CPU 2
btrfs_direct_IO()
btrfs_get_blocks_direct()
creates ordered extent A, covering
the 4k prealloc extent A (range [4K, 8K[)
btrfs_defrag_file()
page_cache_sync_readahead([0K, 1M[)
btrfs_readpages()
extent_readpages()
locks all pages in the file
range [0K, 128K[ through calls
to add_to_page_cache_lru()
__do_contiguous_readpages()
finds ordered extent A
waits for it to complete
btrfs_get_blocks_direct() called again
lock_extent_direct(range [8K, 16K[)
finds a page in range [8K, 16K[ through
btrfs_page_exists_in_range()
invalidate_inode_pages2_range([8K, 16K[)
--> tries to lock pages that are already
locked by the task at CPU 2
--> our task, running __blockdev_direct_IO(),
hangs waiting to lock the pages and the
submit bio callback, btrfs_submit_direct(),
ends up never being called, resulting in the
ordered extent A never completing (because a
corresponding bio is never submitted) and
CPU 2 will wait for it forever while holding
the pages locked
---> deadlock!
Fix this by removing the page invalidation approach when attempting to
lock the range for IO from the callback btrfs_get_blocks_direct() and
falling back buffered IO. This was a rare case anyway and well behaved
applications do not mix concurrent direct IO writes with buffered reads
anyway, being a concurrent defrag the only normal case that could lead
to the deadlock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Commit 61de718fce ("Btrfs: fix memory corruption on failure to submit
bio for direct IO") fixed problems with the error handling code after we
fail to submit a bio for direct IO. However there were 2 problems that it
did not address when the failure is due to memory allocation failures for
direct IO writes:
1) We considered that there could be only one ordered extent for the whole
IO range, which is not always true, as we can have multiple;
2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent,
which can make other tasks running btrfs_wait_logged_extents() hang
forever, since they wait for that bit to be set. The general assumption
is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set
and it precedes setting the bit BTRFS_ORDERED_COMPLETE.
Fix these issues by moving part of the btrfs_endio_direct_write() handler
into a new helper function and having that new helper function called when
we fail to allocate memory to submit the bio (and its private object) for
a direct IO write.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
When a transaction is aborted, or its commit fails before writing the new
superblock and calling btrfs_finish_extent_commit(), we leak reference
counts on the block groups attached to the transaction's delete_bgs list,
because btrfs_finish_extent_commit() is never called for those two cases.
Fix this by dropping their references at btrfs_put_transaction(), which
is called when transactions are aborted (by making the transaction kthread
commit the transaction) or if their commits fail.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
During the final phase of a device replace operation, I ran into a
transaction abort that resulted in the following trace:
[23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
[23919.664742] BTRFS: Transaction aborted (error -2)
[23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
[23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
[23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
[23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
[23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
[23919.696716] Call Trace:
[23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
[23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
[23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
[23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
[23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
[23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
[23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
[23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
[23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
[23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[23919.733330] ---[ end trace 166ef301a335832a ]---
This is due to a race between device replace and chunk allocation, which
the following diagram illustrates:
CPU 1 CPU 2
btrfs_dev_replace_finishing()
at this point
dev_replace->tgtdev->devid ==
BTRFS_DEV_REPLACE_DEVID (0ULL)
...
btrfs_start_transaction()
btrfs_commit_transaction()
btrfs_fallocate()
btrfs_alloc_data_chunk_ondemand()
btrfs_join_transaction()
--> starts a new transaction
do_chunk_alloc()
lock fs_info->chunk_mutex
btrfs_alloc_chunk()
--> creates extent map for
the new chunk with
em->bdev->map->stripes[i]->dev->devid
== X (X > 0)
--> extent map is added to
fs_info->mapping_tree
--> initial phase of bg A
allocation completes
unlock fs_info->chunk_mutex
lock fs_info->chunk_mutex
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates fs_info->mapping_tree and
replaces the device in every extent
map's map->stripes[] with
dev_replace->tgtdev, which still has
an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
btrfs_end_transaction()
btrfs_create_pending_block_groups()
--> starts final phase of
bg A creation (update device,
extent, and chunk trees, etc)
btrfs_finish_chunk_alloc()
btrfs_update_device()
--> attempts to update a device
item with ID == 0ULL
(BTRFS_DEV_REPLACE_DEVID)
which is the current ID of
bg A's
em->bdev->map->stripes[i]->dev->devid
--> doesn't find such item
returns -ENOENT
--> the device id should have been X
and not 0ULL
got -ENOENT from
btrfs_finish_chunk_alloc()
and aborts current transaction
finishes setting up the target device,
namely it sets tgtdev->devid to the value
of srcdev->devid, which is X (and X > 0)
frees the srcdev
unlock fs_info->chunk_mutex
So fix this by taking the device list mutex when processing the chunk's
extent map stripes to update the device items. This avoids getting the
wrong device id and use-after-free problems if the task finishing a
chunk allocation grabs the replaced device, which is freed while the
dev replace task is holding the device list mutex.
This happened while running fstest btrfs/071.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
prepare_pages() may end up calling prepare_uptodate_page() twice if our
write only spans a single page. But if the first call returns an error,
our page will be unlocked and its not safe to call it again.
This bug goes all the way back to 2011, and it's not something commonly
hit.
While we're here, add a more explicit check for the page being truncated
away. The bare lock_page() alone is protected only by good thoughts and
i_mutex, which we're sure to regret eventually.
Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Dave Jones found a warning from kasan in setup_cluster_bitmaps()
==================================================================
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
addr ffff88039bef6828
Read of size 8 by task nfsd/1009
page:ffffea000e6fbd80 count:0 mapcount:0 mapping: (null)
index:0x0
flags: 0x8000000000000000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: G W
4.4.0-rc3-backup-debug+ #1
ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
Call Trace:
[<ffffffffa680a43e>] dump_stack+0x4b/0x6d
[<ffffffffa62638d1>] kasan_report_error+0x501/0x520
[<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0
[<ffffffffa6263948>] kasan_report+0x58/0x60
[<ffffffffa6814b00>] ? rb_last+0x10/0x40
[<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0
[<ffffffffa6262ead>] __asan_load8+0x5d/0x70
[<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0
[<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400
[<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640
[<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
[<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
[<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40
[<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520
Andrey noticed this was because we were doing list_first_entry on a list
that might be empty. Rework the tests a bit so we don't do that.
Signed-off-by: Chris Mason <clm@fb.com>
Reprorted-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by: Dave Jones <dsj@fb.com>
When an inconsistent space cache is detected during loading we log a
warning that users frequently mistake as instruction to invalidate the
cache manually, even though this is not required. Fix the message to
indicate that the cache will be rebuilt automatically.
Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Acked-by: Filipe Manana <fdmanana@suse.com>
If we fail to allocate a new data chunk, we were jumping to the error path
without release the transaction handle we got before. Fix this by always
releasing it before doing the jump.
Fixes: 2c9fe83552 ("btrfs: Fix lost-data-profile caused by balance bg")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
As of my previous change titled "Btrfs: fix scrub preventing unused block
groups from being deleted", the following warning at
extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
filesysten with "-o discard":
10263 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
10264 {
(...)
10405 if (trimming) {
10406 WARN_ON(!list_empty(&block_group->bg_list));
10407 spin_lock(&trans->transaction->deleted_bgs_lock);
10408 list_move(&block_group->bg_list,
10409 &trans->transaction->deleted_bgs);
10410 spin_unlock(&trans->transaction->deleted_bgs_lock);
10411 btrfs_get_block_group(block_group);
10412 }
(...)
This happens because scrub can now add back the block group to the list of
unused block groups (fs_info->unused_bgs). This is dangerous because we
are moving the block group from the unused block groups list to the list
of deleted block groups without holding the lock that protects the source
list (fs_info->unused_bgs_lock).
The following diagram illustrates how this happens:
CPU 1 CPU 2
cleaner_kthread()
btrfs_delete_unused_bgs()
sees bg X in list
fs_info->unused_bgs
deletes bg X from list
fs_info->unused_bgs
scrub_enumerate_chunks()
searches device tree using
its commit root
finds device extent for
block group X
gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())
sets bg X to RO (again)
scrub_chunk(bg X)
sets bg X back to RW mode
adds bg X to the list
fs_info->unused_bgs again,
since it's still unused and
currently not in that list
sets bg X to RO mode
btrfs_remove_chunk(bg X)
--> discard is enabled and bg X
is in the fs_info->unused_bgs
list again so the warning is
triggered
--> we move it from that list into
the transaction's delete_bgs
list, but we can have another
task currently manipulating
the first list (fs_info->unused_bgs)
Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
the list of unused block groups and the list of deleted block groups. This
makes it safe and there's not much worry for more lock contention, as this
lock is seldom used and only the cleaner kthread adds elements to the list
of deleted block groups. The warning goes away too, as this was previously
an impossible case (and would have been better a BUG_ON/ASSERT) but it's
not impossible anymore.
Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.
It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.
new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases. page_follow_link_light()
instrumented to yell about anything missed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The btrfs clone ioctls are now adopted by other file systems, with NFS
and CIFS already having support for them, and XFS being under active
development. To avoid growth of various slightly incompatible
implementations, add one to the VFS. Note that clones are different from
file copies in several ways:
- they are atomic vs other writers
- they support whole file clones
- they support 64-bit legth clones
- they do not allow partial success (aka short writes)
- clones are expected to be a fast metadata operation
Because of that it would be rather cumbersome to try to piggyback them on
top of the recent clone_file_range infrastructure. The converse isn't
true and the clone_file_range system call could try clone file range as
a first attempt to copy, something that further patches will enable.
Based on earlier work from Peng Tao.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.
Signed-off-by: David Sterba <dsterba@suse.com>
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.
Signed-off-by: David Sterba <dsterba@suse.com>
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.
Signed-off-by: David Sterba <dsterba@suse.com>
Does not return any errors, nor anything from the callgraph. The branch
in end_bio_extent_writepage has been skipped since
5fd0204355 ("Btrfs: finish ordered extents in their own thread").
Signed-off-by: David Sterba <dsterba@suse.com>
Use the VFS xattr handler infrastructure and get rid of similar code in
the filesystem.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove POSIX_ACL_XATTR_{ACCESS,DEFAULT} and GFS2_POSIX_ACL_{ACCESS,DEFAULT}
and replace them with the definitions in <include/uapi/linux/xattr.h>.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We don't have to use GFP_NOFS in context of ACL or XATTR actions, not
possible to loop through the allocator and it's safe to fail with
ENOMEM.
Signed-off-by: David Sterba <dsterba@suse.com>
We don't have to use GFP_NOFS to allocate workqueue structures, this is
done from mount context or potentially scrub start context, safe to fail
in both cases.
Signed-off-by: David Sterba <dsterba@suse.com>
We don't have to use GFP_NOFS in the ioctl handlers because there's no
risk of looping through the allocators back to the filesystem. This
patch covers only allocations that are directly in the ioctl handlers.
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller and single value, we can propagate it down to
the callee and remove the unused parameter.
Signed-off-by: David Sterba <dsterba@suse.com>
The funcions just wrap the clear_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
decreases:
text data bss dec hex filename
938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko.before
939651 43670 23144 1006465 f5b81 fs/btrfs/btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.com>
The funcions just wrap the set_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
increases:
text data bss dec hex filename
938427 43670 23144 1005241 f56b9 fs/btrfs/btrfs.ko.before
938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko
Signed-off-by: David Sterba <dsterba@suse.com>
This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.
The extent copying loop is lifted up into its own function. It retains
the core btrfs error checks that should be shared.
Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Make flags an unsigned int,
Check for COPY_FR_REFLINK]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs fixes from Chris Mason:
"This has Mark Fasheh's patches to fix quota accounting during subvol
deletion, which we've been working on for a while now. The patch is
pretty small but it's a key fix.
Otherwise it's a random assortment"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix balance range usage filters in 4.4-rc
btrfs: qgroup: account shared subtree during snapshot delete
Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
btrfs: qgroup: fix quota disable during rescan
Btrfs: fix race between cleaner kthread and space cache writeout
Btrfs: fix scrub preventing unused block groups from being deleted
Btrfs: fix race between scrub and block group deletion
btrfs: fix rcu warning during device replace
btrfs: Continue replace when set_block_ro failed
btrfs: fix clashing number of the enhanced balance usage filter
Btrfs: fix the number of transaction units needed to remove a block group
Btrfs: use global reserve when deleting unused block group after ENOSPC
Btrfs: tests: checking for NULL instead of IS_ERR()
btrfs: fix signed overflows in btrfs_sync_file
There's a regression in 4.4-rc since commit bc3094673f
(btrfs: extend balance filter usage to take minimum and maximum) in that
existing (non-ranged) balance with -dusage=x no longer works; all chunks
are skipped.
After staring at the code for a while and wondering why a non-ranged
balance would even need min and max thresholds (..which then were not
set correctly, leading to the bug) I realized that the only problem
was the fact that the filter functions were named wrong, thanks to
patching copypasta. Simply renaming both functions lets the existing
btrfs-progs call balance with -dusage=x and now the non-ranged filter
function is invoked, properly using only a single chunk limit.
Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Fixes: bc3094673f ("btrfs: extend balance filter usage to take minimum and maximum")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during
btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
going bad shortly after a snapshot is removed.
Fix this by adding a dirty extent record when we encounter extents during
our shared subtree walk. This effectively restores the functionality we had
with the original shared subtree walking code in 1152651 (btrfs: qgroup:
account shared subtrees during snapshot delete).
The idea with the original patch (and this one) is that shared subtrees can
get skipped during drop_snapshot. The shared subtree walk then allows us a
chance to visit those extents and add them to the qgroup work for later
processing. This ultimately makes the accounting for drop snapshot work.
The new qgroup code nicely handles all the other extents during the tree
walk via the ref dec/inc functions so we don't have to add actions beyond
what we had originally.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
The backref code will look up the fs_root we're trying to resolve our indirect
refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT
if the ref is 0. This isn't helpful for the qgroup stuff with snapshot delete
as it won't be able to search down the snapshot we are deleting, which will
cause us to miss roots. So use btrfs_get_fs_root and send false for check_ref
so we can always get the root we're looking for. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
There's a race condition that leads to a NULL pointer dereference if you
disable quotas while a quota rescan is running. To fix this, we just need
to wait for the quota rescan worker to actually exit before tearing down
the quota structures.
Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Chris Mason <clm@fb.com>
When a block group becomes unused and the cleaner kthread is currently
running, we can end up getting the current transaction aborted with error
-ENOENT when we try to commit the transaction, leading to the following
trace:
[59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
[59779.272594] BTRFS: Transaction aborted (error -2)
(...)
[59779.291137] Call Trace:
[59779.291621] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[59779.292543] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[59779.293435] [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[59779.295000] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[59779.296138] [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
[59779.297663] [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[59779.299141] [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs]
[59779.300359] [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs]
[59779.301805] [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[59779.302893] [<ffffffff81196634>] sync_filesystem+0x7f/0x93
(...)
[59779.318186] ---[ end trace 577e2daff90da33a ]---
The following diagram illustrates a sequence of steps leading to this
problem:
CPU 1 CPU 2
<at transaction N>
adds bg A to list
fs_info->unused_bgs
adds bg B to list
fs_info->unused_bgs
<transaction kthread
commits transaction N
and wakes up the
cleaner kthread>
cleaner kthread
delete_unused_bgs()
sees bg A in list
fs_info->unused_bgs
btrfs_start_transaction()
<transaction N + 1 starts>
deletes bg A
update_block_group(bg C)
--> adds bg C to list
fs_info->unused_bgs
deletes bg B
sees bg C in the list
fs_info->unused_bgs
btrfs_remove_chunk(bg C)
btrfs_remove_block_group(bg C)
--> checks if the block group
is in a dirty list, and
because it isn't now, it
does nothing
--> the block group item
is deleted from the
extent tree
--> adds bg C to list
transaction->dirty_bgs
some task calls
btrfs_commit_transaction(t N + 1)
commit_cowonly_roots()
btrfs_write_dirty_block_groups()
--> sees bg C in cur_trans->dirty_bgs
--> calls write_one_cache_group()
which returns -ENOENT because
it did not find the block group
item in the extent tree
--> transaction aborte with -ENOENT
because write_one_cache_group()
returned that error
So fix this by adding a block group to the list of dirty block groups
before adding it to the list of unused block groups.
This happened on a stress test using fsstress plus concurrent calls to
fallocate 20G and truncate (releasing part of the space allocated with
fallocate).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently scrub can race with the cleaner kthread when the later attempts
to delete an unused block group, and the result is preventing the cleaner
kthread from ever deleting later the block group - unless the block group
becomes used and unused again. The following diagram illustrates that
race:
CPU 1 CPU 2
cleaner kthread
btrfs_delete_unused_bgs()
gets block group X from
fs_info->unused_bgs and
removes it from that list
scrub_enumerate_chunks()
searches device tree using
its commit root
finds device extent for
block group X
gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())
sets bg X to RO
sees the block group is
already RO and therefore
doesn't delete it nor adds
it back to unused list
So fix this by making scrub add the block group again to the list of
unused block groups if the block group is still unused when it finished
scrubbing it and it hasn't been removed already.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Scrub can race with the cleaner kthread deleting block groups that are
unused (and with relocation too) leading to a failure with error -EINVAL
that gets returned to user space.
The following diagram illustrates how it happens:
CPU 1 CPU 2
cleaner kthread
btrfs_delete_unused_bgs()
gets block group X from
fs_info->unused_bgs
sets block group to RO
btrfs_remove_chunk(bg X)
deletes device extents
scrub_enumerate_chunks()
searches device tree using
its commit root
finds device extent for
block group X
gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())
sets bg X to RO (again)
btrfs_remove_block_group(bg X)
deletes block group from
fs_info->block_group_cache_tree
removes extent map from
fs_info->mapping_tree
scrub_chunk(offset X)
searches fs_info->mapping_tree
for extent map starting at
offset X
--> doesn't find any such
extent map
--> returns -EINVAL and scrub
errors out to userspace
with -EINVAL
Fix this by dealing with an extent map lookup failure as an indicator of
block group deletion.
Issue reproduced with fstest btrfs/071.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The test btrfs/011 triggers a rcu warning
Reviewed-by: Anand Jain <anand.jain@oracle.com>
===============================
[ INFO: suspicious RCU usage. ]
4.4.0-rc1-default+ #286 Tainted: G W
-------------------------------
fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
4 locks held by btrfs/28786:
0: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs]
1: (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs]
2: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs]
3: (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs]
stack backtrace:
CPU: 0 PID: 28786 Comm: btrfs Tainted: G W 4.4.0-rc1-default+ #286
Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010
0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001
0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78
ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220
Call Trace:
[<ffffffff8141d47b>] dump_stack+0x4f/0x74
[<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140
[<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs]
[<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80
[<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs]
[<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs]
[<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs]
[<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs]
[<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs]
[<ffffffff81196aa5>] ? __might_fault+0x95/0xa0
[<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs]
[<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
[<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs]
[<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs]
[<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
[<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0
[<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0
[<ffffffff811e3336>] ? __fget_light+0x86/0xb0
[<ffffffff811e3369>] ? __fdget+0x9/0x20
[<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80
[<ffffffff811d7483>] SyS_ioctl+0x53/0x80
[<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
This is because of unprotected use of rcu_dereference in
btrfs_scratch_superblocks. We can't add rcu locks around the whole
function because we read the superblock.
The fix will use the rcu string buffer directly without the rcu locking.
Thi is safe as the device will not go away in the meantime. We're
holding the device list mutexes.
Restructuring the code to narrow down the rcu section turned out to be
impossible, we need to call filp_open (through update_dev_time) on the
buffer and this could call kmalloc/__might_sleep. We could call kstrdup
with GFP_ATOMIC but it's not absolutely necessary.
Fixes: 12b1c2637b (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks)
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
I've accidentally picked an already used number for the enhanced usage
filter represented by BTRFS_BALANCE_ARGS_USAGE_RANGE, clashing with
BTRFS_BALANCE_ARGS_CONVERT. Introduced during the development phase,
no backward compatibility issues.
Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: bc3094673f ("btrfs: extend balance filter usage to take minimum and maximum")
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We were using only 1 transaction unit when attempting to delete an unused
block group but in reality we need 3 + N units, where N corresponds to the
number of stripes. We were accounting only for the addition of the orphan
item (for the block group's free space cache inode) but we were not
accounting that we need to delete one block group item from the extent
tree, one free space item from the tree of tree roots and N device extent
items from the device tree.
While one unit is not enough, it worked most of the time because for each
single unit we are too pessimistic and assume an entire tree path, with
the highest possible heigth (8), needs to be COWed with eventual node
splits at every possible level in the tree, so there was usually enough
reserved space for removing all the items and adding the orphan item.
However after adding the orphan item, writepages() can by called by the VM
subsystem against the btree inode when we are under memory pressure, which
causes writeback to start for the nodes we COWed before, this forces the
operation to remove the free space item to COW again some (or all of) the
same nodes (in the tree of tree roots). Even without writepages() being
called, we could fail with ENOSPC because these items are located in
multiple trees and one of them might have a higher heigth and require
node/leaf splits at many levels, exhausting all the reserved space before
removing all the items and adding the orphan.
In the kernel 4.0 release, commit 3d84be7991 ("Btrfs: fix BUG_ON in
btrfs_orphan_add() when delete unused block group"), we attempted to fix
a BUG_ON due to ENOSPC when trying to add the orphan item by making the
cleaner kthread reserve one transaction unit before attempting to remove
the block group, but this was not enough. We had a couple user reports
still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
a 4.2-rc6 kernel for example:
http://www.spinics.net/lists/linux-btrfs/msg46070.html
So fix this by reserving all the necessary units of metadata.
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: 3d84be7991 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
It's possible to reach a state where the cleaner kthread isn't able to
start a transaction to delete an unused block group due to lack of enough
free metadata space and due to lack of unallocated device space to allocate
a new metadata block group as well. If this happens try to use space from
the global block group reserve just like we do for unlink operations, so
that we don't reach a permanent state where starting a transaction for
filesystem operations (file creation, renames, etc) keeps failing with
-ENOSPC. Such an unfortunate state was observed on a machine where over
a dozen unused data block groups existed and the cleaner kthread was
failing to delete them due to ENOSPC error when attempting to start a
transaction, and even running balance with a -dusage=0 filter failed with
ENOSPC as well. Also unmounting and mounting again the filesystem didn't
help. Allowing the cleaner kthread to use the global block reserve to
delete the unused data block groups fixed the problem.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_alloc_dummy_root() return an error pointer on failure, it never
returns NULL.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The calculation of range length in btrfs_sync_file leads to signed
overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin.
https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
The fsync call passes 0 and LLONG_MAX, the range length does not fit to
loff_t and overflows, but the value is converted to u64 so it silently
works as expected.
The minimal fix is a typecast to u64, switching functions to take
(start, end) instead of (start, len) would be more intrusive.
Coccinelle script found that there's one more opencoded calculation of
the length.
<smpl>
@@
loff_t start, end;
@@
* end - start
</smpl>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes and cleanups from Chris Mason:
"Some of this got cherry-picked from a github repo this week, but I
verified the patches.
We have three small scrub cleanups and a collection of fixes"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: Use fs_info directly in btrfs_delete_unused_bgs
btrfs: Fix lost-data-profile caused by balance bg
btrfs: Fix lost-data-profile caused by auto removing bg
btrfs: Remove len argument from scrub_find_csum
btrfs: Reduce unnecessary arguments in scrub_recheck_block
btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum
btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum
btrfs: scrub: setup all fields for sblock_to_check
btrfs: scrub: set error stats when tree block spanning stripes
Btrfs: fix race when listing an inode's xattrs
Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow
Btrfs: fix race leading to incorrect item deletion when dropping extents
Btrfs: fix sleeping inside atomic context in qgroup rescan worker
Btrfs: fix race waiting for qgroup rescan worker
btrfs: qgroup: exit the rescan worker during umount
Btrfs: fix extent accounting for partial direct IO writes
No need to use root->fs_info in btrfs_delete_unused_bgs(),
use fs_info directly instead.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
btrfs balance delete last data chunk in case of no data in
the filesystem, then we can see "no data chunk" by "fi usage"
command.
And when we do write operation to fs, the only available data
profile is 0x0, result is all new chunks are allocated single type.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
umount "$TEST_DEV"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs filesystem usage $TEST_DIR
We can see the data chunk changed from raid1 to single:
# btrfs filesystem usage $TEST_DIR
Data,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
#
Reason:
When a empty filesystem mount with -o nospace_cache, the last
data blockgroup will be auto-removed in umount.
Then if we mount it again, there is no data chunk in the
filesystem, so the only available data profile is 0x0, result
is all new chunks are created as single type.
Fix:
Don't auto-delete last blockgroup for a raid type.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We don't need pass so many arguments for recheck sblock now,
this patch cleans them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can use existing scrub_checksum_data() and scrub_checksum_tree_block()
for scrub_recheck_block_checksum(), instead of write duplicated code.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should reset sblock->xxx_error stats before calling
scrub_recheck_block_checksum().
Current code run correctly because all sblock are allocated by
k[cz]alloc(), and the error stats are not got changed.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
scrub_setup_recheck_block() isn't setup all necessary fields for
sblock_to_check because history reason.
So current code need more arguments in severial functions,
and more local variables, just to passing these lacked values to
necessary place.
This patch setup above fields to sblock_to_check in
scrub_setup_recheck_block(), for:
1: more cleanup for function arg, local variable
2: to make sblock_to_check complete, then we can use sblock_to_check
without concern about some uninitialized member.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is better to show error stats to user when we found tree block
spanning stripes.
On a btrfs created by old version of btrfs-convert:
Before patch:
# btrfs scrub start -B /dev/vdh
scrub done for 8b342d35-2904-41ab-b3cb-2f929709cf47
scrub started at Tue Aug 25 21:19:09 2015 and finished after 00:00:00
total bytes scrubbed: 53.54MiB with 0 errors
# dmesg
...
[ 128.711434] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
[ 128.712744] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
...
After patch:
# btrfs scrub start -B /dev/vdh
scrub done for ff7f844b-7a4e-4b1a-88a9-8252ab25be1b
scrub started at Tue Aug 25 21:42:29 2015 and finished after 00:00:00
total bytes scrubbed: 53.60MiB with 2 errors
error details:
corrected errors: 0, uncorrectable errors: 2, unverified errors: 0
ERROR: There are uncorrectable errors.
# dmesg
...omit...
#
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
new_valid_dev() always returns 1, so the !new_valid_dev() check is not
needed. Remove it.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When listing a inode's xattrs we have a time window where we race against
a concurrent operation for adding a new hard link for our inode that makes
us not return any xattr to user space. In order for this to happen, the
first xattr of our inode needs to be at slot 0 of a leaf and the previous
leaf must still have room for an inode ref (or extref) item, and this can
happen because an inode's listxattrs callback does not lock the inode's
i_mutex (nor does the VFS does it for us), but adding a hard link to an
inode makes the VFS lock the inode's i_mutex before calling the inode's
link callback.
If we have the following leafs:
Leaf X (has N items) Leaf Y
[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 XATTR_ITEM 12345), ... ]
slot N - 2 slot N - 1 slot 0
The race illustrated by the following sequence diagram is possible:
CPU 1 CPU 2
btrfs_listxattr()
searches for key (257 XATTR_ITEM 0)
gets path with path->nodes[0] == leaf X
and path->slots[0] == N
because path->slots[0] is >=
btrfs_header_nritems(leaf X), it calls
btrfs_next_leaf()
btrfs_next_leaf()
releases the path
adds key (257 INODE_REF 666)
to the end of leaf X (slot N),
and leaf X now has N + 1 items
searches for the key (257 INODE_REF 256),
with path->keep_locks == 1, because that
is the last key it saw in leaf X before
releasing the path
ends up at leaf X again and it verifies
that the key (257 INODE_REF 256) is no
longer the last key in leaf X, so it
returns with path->nodes[0] == leaf X
and path->slots[0] == N, pointing to
the new item with key (257 INODE_REF 666)
btrfs_listxattr's loop iteration sees that
the type of the key pointed by the path is
different from the type BTRFS_XATTR_ITEM_KEY
and so it breaks the loop and stops looking
for more xattr items
--> the application doesn't get any xattr
listed for our inode
So fix this by breaking the loop only if the key's type is greater than
BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.
This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).
The following sequence diagram shows how the race happens when running a
no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:
Leaf X (has N items) Leaf Y
[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ]
slot N - 2 slot N - 1 slot 0
(Note the implicit hole for inode 257 regarding the [0, 8K[ range)
CPU 1 CPU 2
run_dealloc_nocow()
btrfs_lookup_file_extent()
--> searches for a key with value
(257 EXTENT_DATA 4096) in the
fs/subvol tree
--> returns us a path with
path->nodes[0] == leaf X and
path->slots[0] == N
because path->slots[0] is >=
btrfs_header_nritems(leaf X), it
calls btrfs_next_leaf()
btrfs_next_leaf()
--> releases the path
hard link added to our inode,
with key (257 INODE_REF 500)
added to the end of leaf X,
so leaf X now has N + 1 keys
--> searches for the key
(257 INODE_REF 256), because
it was the last key in leaf X
before it released the path,
with path->keep_locks set to 1
--> ends up at leaf X again and
it verifies that the key
(257 INODE_REF 256) is no longer
the last key in the leaf, so it
returns with path->nodes[0] ==
leaf X and path->slots[0] == N,
pointing to the new item with
key (257 INODE_REF 500)
the loop iteration of run_dealloc_nocow()
does not break out the loop and continues
because the key referenced in the path
at path->nodes[0] and path->slots[0] is
for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
and its offset (500) is less then our delalloc
range's end (8192)
the item pointed by the path, an inode reference item,
is (incorrectly) interpreted as a file extent item and
we get an invalid extent type, leading to the BUG_ON(1):
if (extent_type == BTRFS_FILE_EXTENT_REG ||
extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
(...)
} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
(...)
} else {
BUG_ON(1)
}
The same can happen if a xattr is added concurrently and ends up having
a key with an offset smaller then the delalloc's range end.
So fix this by skipping keys with a type smaller than
BTRFS_EXTENT_DATA_KEY.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
While running a stress test I got the following warning triggered:
[191627.672810] ------------[ cut here ]------------
[191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
(...)
[191627.701485] Call Trace:
[191627.702037] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[191627.702992] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[191627.704091] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[191627.705380] [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
[191627.706637] [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
[191627.707789] [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
[191627.709155] [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
[191627.712444] [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
[191627.714162] [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
[191627.715887] [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
[191627.717287] [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
[191627.728865] [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
[191627.730045] [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
[191627.731256] [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
[191627.732661] [<ffffffff81061119>] process_one_work+0x24c/0x4ae
[191627.733822] [<ffffffff810615b0>] worker_thread+0x206/0x2c2
[191627.734857] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
[191627.736052] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
[191627.737349] [<ffffffff810669a6>] kthread+0xef/0xf7
[191627.738267] [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
[191627.739330] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
[191627.741976] [<ffffffff81465592>] ret_from_fork+0x42/0x70
[191627.743080] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
[191627.744206] ---[ end trace bbfddacb7aaada8d ]---
$ cat -n fs/btrfs/file.c
691 int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
(...)
758 btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
759 if (key.objectid > ino ||
760 key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
761 break;
762
763 fi = btrfs_item_ptr(leaf, path->slots[0],
764 struct btrfs_file_extent_item);
765 extent_type = btrfs_file_extent_type(leaf, fi);
766
767 if (extent_type == BTRFS_FILE_EXTENT_REG ||
768 extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
(...)
774 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
(...)
778 } else {
779 WARN_ON(1);
780 extent_end = search_start;
781 }
(...)
This happened because the item we were processing did not match a file
extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
case we cast the item to a struct btrfs_file_extent_item pointer and
then find a type field value that does not match any of the expected
values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
due to a tiny time window where a race can happen as exemplified below.
For example, consider the following scenario where we're using the
NO_HOLES feature and we have the following two neighbour leafs:
Leaf X (has N items) Leaf Y
[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ]
slot N - 2 slot N - 1 slot 0
Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
than explicit because NO_HOLES is enabled). Now if our inode has an
ordered extent for the range [4K, 8K[ that is finishing, the following
can happen:
CPU 1 CPU 2
btrfs_finish_ordered_io()
insert_reserved_file_extent()
__btrfs_drop_extents()
Searches for the key
(257 EXTENT_DATA 4096) through
btrfs_lookup_file_extent()
Key not found and we get a path where
path->nodes[0] == leaf X and
path->slots[0] == N
Because path->slots[0] is >=
btrfs_header_nritems(leaf X), we call
btrfs_next_leaf()
btrfs_next_leaf() releases the path
inserts key
(257 INODE_REF 4096)
at the end of leaf X,
leaf X now has N + 1 keys,
and the new key is at
slot N
btrfs_next_leaf() searches for
key (257 INODE_REF 256), with
path->keep_locks set to 1,
because it was the last key it
saw in leaf X
finds it in leaf X again and
notices it's no longer the last
key of the leaf, so it returns 0
with path->nodes[0] == leaf X and
path->slots[0] == N (which is now
< btrfs_header_nritems(leaf X)),
pointing to the new key
(257 INODE_REF 4096)
__btrfs_drop_extents() casts the
item at path->nodes[0], slot
path->slots[0], to a struct
btrfs_file_extent_item - it does
not skip keys for the target
inode with a type less than
BTRFS_EXTENT_DATA_KEY
(BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
sees a bogus value for the type
field triggering the WARN_ON in
the trace shown above, and sets
extent_end = search_start (4096)
does the if-then-else logic to
fixup 0 length extent items created
by a past bug from hole punching:
if (extent_end == key.offset &&
extent_end >= search_start)
goto delete_extent_item;
that evaluates to true and it ends
up deleting the key pointed to by
path->slots[0], (257 INODE_REF 4096),
from leaf X
The same could happen for example for a xattr that ends up having a key
with an offset value that matches search_start (very unlikely but not
impossible).
So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
skipped, never casted to struct btrfs_file_extent_item and never deleted
by accident. Also protect against the unexpected case of getting a key
for a lower inode number by skipping that key and issuing a warning.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.
Let's introduce a helper function which makes the restriction explicit and
easier to track. This patch doesn't introduce any functional changes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs updates from Chris Mason:
"We have a lot of subvolume quota improvements in here, along with big
piles of cleanups from Dave Sterba and Anand Jain and others.
Josef pitched in a batch of allocator fixes based on production use
here at FB. We found that mount -o ssd_spread greatly improved our
performance on hardware raid5/6, but it exposed some CPU bottlenecks
in the allocator. These patches make a huge difference"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (100 commits)
Btrfs: fix hole punching when using the no-holes feature
Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state
btrfs: Fix a data space underflow warning
btrfs: qgroup: Fix a rebase bug which will cause qgroup double free
btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans
btrfs: clear PF_NOFREEZE in cleaner_kthread()
btrfs: qgroup: Don't copy extent buffer to do qgroup rescan
btrfs: add balance filters limits, stripes and usage to supported mask
btrfs: extend balance filter usage to take minimum and maximum
btrfs: add balance filter for stripes
btrfs: extend balance filter limit to take minimum and maximum
btrfs: fix use after free iterating extrefs
btrfs: check unsupported filters in balance arguments
Btrfs: fix regression running delayed references when using qgroups
Btrfs: fix regression when running delayed references
Btrfs: don't do extra bitmap search in one bit case
Btrfs: keep track of largest extent in bitmaps
Btrfs: don't keep trying to build clusters if we are fragmented
Btrfs: cut down on loops through the allocator
Btrfs: don't continue setting up space cache when enospc
...
We were initializing the completion (fs_info->qgroup_rescan_completion)
object after releasing the qgroup rescan lock, which gives a small time
window for a rescan waiter to not actually wait for the rescan worker
to finish. Example:
CPU 1 CPU 2
fs_info->qgroup_rescan_completion->done is 0
btrfs_qgroup_rescan_worker()
complete_all(&fs_info->qgroup_rescan_completion)
sets fs_info->qgroup_rescan_completion->done
to UINT_MAX / 2
... do some other stuff ....
qgroup_rescan_init()
mutex_lock(&fs_info->qgroup_rescan_lock)
set flag BTRFS_QGROUP_STATUS_FLAG_RESCAN
in fs_info->qgroup_flags
mutex_unlock(&fs_info->qgroup_rescan_lock)
btrfs_qgroup_wait_for_completion()
mutex_lock(&fs_info->qgroup_rescan_lock)
sees flag BTRFS_QGROUP_STATUS_FLAG_RESCAN
in fs_info->qgroup_flags
mutex_unlock(&fs_info->qgroup_rescan_lock)
wait_for_completion_interruptible(
&fs_info->qgroup_rescan_completion)
fs_info->qgroup_rescan_completion->done
is > 0 so it returns immediately
init_completion(&fs_info->qgroup_rescan_completion)
sets fs_info->qgroup_rescan_completion->done to 0
So fix this by initializing the completion object while holding the mutex
fs_info->qgroup_rescan_lock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
I was hitting a consistent NULL pointer dereference during shutdown that
showed the trace running through end_workqueue_bio(). I traced it back to
the endio_meta_workers workqueue being poked after it had already been
destroyed.
Eventually I found that the root cause was a qgroup rescan that was still
in progress while we were stopping all the btrfs workers.
Currently we explicitly pause balance and scrub operations in
close_ctree(), but we do nothing to stop the qgroup rescan. We should
probably be doing the same for qgroup rescan, but that's a much larger
change. This small change is good enough to allow me to unmount without
crashing.
Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
When doing a write using direct IO we can end up not doing the whole write
operation using the direct IO path, in that case we fallback to a buffered
write to do the remaining IO. This happens for example if the range we are
writing to contains a compressed extent.
When we do a partial write and fallback to buffered IO, due to the
existence of a compressed extent for example, we end up not adjusting the
outstanding extents counter of our inode which ends up getting decremented
twice, once by the DIO ordered extent for the partial write and once again
by btrfs_direct_IO(), resulting in an arithmetic underflow at
extent-tree.c:drop_outstanding_extent(). For example if we have:
extents [ prealloc extent ] [ compressed extent ]
offsets A B C D E
and at the moment our inode's outstanding extents counter is 0, if we do a
direct IO write against the range [B, D[ (which has a length smaller than
128Mb), we end up bumping our inode's outstanding extents counter to 1, we
create a DIO ordered extent for the range [B, C[ and then fallback to a
buffered write for the range [C, D[. The direct IO handler
(inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
1, leaving it with a value of 0, through a call to
btrfs_delalloc_release_space() and then shortly after the DIO ordered
extent finishes and calls btrfs_delalloc_release_metadata() which ends
up to attempt to decrement the inode's outstanding extents counter by 1,
resulting in an assertion failure at drop_outstanding_extent() because
the operation would result in an arithmetic underflow (0 - 1). This
produces the following trace:
[125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526
[125471.338844] ------------[ cut here ]------------
[125471.340745] kernel BUG at fs/btrfs/ctree.h:4173!
[125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs]
[125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000
[125471.340745] RIP: 0010:[<ffffffffa0550da1>] [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
[125471.340745] RSP: 0018:ffff88040a11bc78 EFLAGS: 00010296
[125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000
[125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff
[125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000
[125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000
[125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88
[125471.340745] FS: 0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000
[125471.340745] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0
[125471.340745] Stack:
[125471.340745] ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1
[125471.340745] ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000
[125471.340745] 0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa
[125471.340745] Call Trace:
[125471.340745] [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs]
[125471.340745] [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs]
[125471.340745] [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs]
[125471.340745] [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs]
[125471.340745] [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs]
[125471.340745] [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
[125471.340745] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
[125471.340745] [<ffffffff81064285>] worker_thread+0x206/0x2c2
[125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[125471.340745] [<ffffffff8106904d>] kthread+0xef/0xf7
[125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[125471.340745] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
[125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41
[125471.340745] RIP [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
[125471.340745] RSP <ffff88040a11bc78>
[125471.539620] ---[ end trace 144259f7838b4aa4 ]---
So fix this by ensuring we adjust the outstanding extents counter when we
do the fallback just like we do for the case where the whole write can be
done through the direct IO path.
We were also adjusting the outstanding extents counter by a constant value
of 1, which is incorrect because we were ignorning that we account extents
in BTRFS_MAX_EXTENT_SIZE units, o fix that as well.
The following test case for fstests reproduces this issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_xfs_io_command "falloc"
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create a compressed extent covering the range [700K, 800K[.
$XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Create prealloc extent covering the range [600K, 700K[.
$XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo
# Write 80K of data to the range [640K, 720K[ using direct IO. This
# range covers both the prealloc extent and the compressed extent.
# Because there's a compressed extent in the range we are writing to,
# the DIO write code path ends up only writing the first 60k of data,
# which goes to the prealloc extent, and then falls back to buffered IO
# for writing the remaining 20K of data - because that remaining data
# maps to a file range containing a compressed extent.
# When falling back to buffered IO, we used to trigger an assertion when
# releasing reserved space due to bad accounting of the inode's
# outstanding extents counter, which was set to 1 but we ended up
# decrementing it by 1 twice, once through the ordered extent for the
# 60K of data we wrote using direct IO, and once through the main direct
# IO handler (inode.cbtrfs_direct_IO()) because the direct IO write
# wrote less than 80K of data (60K).
$XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now similar test as above but for very large write operations. This
# triggers special cases for an inode's outstanding extents accounting,
# as internally btrfs logically splits extents into 128Mb units.
$XFS_IO_PROG -f -s \
-c "pwrite -S 0xaa -b 128M 258M 128M" \
-c "falloc 0 258M" \
$SCRATCH_MNT/bar | _filter_xfs_io
$XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
| _filter_xfs_io
# Now verify the file contents are correct and that they are the same
# even after unmounting and mounting the fs again (or evicting the page
# cache).
#
# For file foo, all bytes in the range [0, 640K[ must have a value of
# 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb
# and all bytes in the range [720K, 800K[ must have a value of 0xaa.
#
# For file bar, all bytes in the range [0, 3M[ must havea value of 0x00,
# all bytes in the range [3M, 259M[ must have a value of 0xbb and all
# bytes in the range [259M, 386M[ must have a value of 0xaa.
#
echo "File digests before remounting the file system:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
_scratch_remount
echo "File digests after remounting the file system:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
status=0
exit
Fixes: e1cbbfa5f5 ("Btrfs: fix outstanding_extents accounting in DIO")
Fixes: 3e05bde8c3 ("Btrfs: only adjust outstanding_extents when we do a short write")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
When we are using the no-holes feature, if we punch a hole into a file
range that already contains a hole which overlaps the range we are passing
to fallocate(), we end up removing the extent map that represents the
existing hole without adding a new one. This happens because with the
no-holes feature we do not have explicit extent items to represent holes
and therefore the call to __btrfs_drop_extents(), made from
btrfs_punch_hole(), returns an end offset to the variable drop_end that
is smaller than the end of the range passed to fallocate(), while it
drops all existing extent maps in that range.
Normally having a missing extent map is not a problem, for example for
a readpages() operation we just end up building the extent map by
looking at the fs/subvol tree for a matching extent item (or a lack of
one for implicit holes). However for an fsync that uses the fast path,
which needs to look at the list of modified extent maps, this means
the fsync will not record information about the complete hole we had
before the fallocate() call into the log tree, resulting in a file with
content/layout that does not match what we had neither before nor after
the hole punch operation.
The following test case for fstests reproduces the issue. It fails without
this change because we get a file with a different digest after the fsync
log replay and also with a different extent/hole layout.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/punch
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_xfs_io_command "fpunch"
_require_xfs_io_command "fiemap"
_require_dm_target flakey
_require_metadata_journaling $SCRATCH_DEV
# This test was motivated by an issue found in btrfs when the btrfs
# no-holes feature is enabled (introduced in kernel 3.14). So enable
# the feature if the fs being tested is btrfs.
if [ $FSTYP == "btrfs" ]; then
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
fi
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create out test file with some data and then fsync it.
# We do the fsync only to make sure the last fsync we do in this test
# triggers the fast code path of btrfs' fsync implementation, a
# condition necessary to trigger the bug btrfs had.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \
-c "fsync" \
$SCRATCH_MNT/foobar | _filter_xfs_io
# Now punch a hole against the range [96K, 128K[.
$XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar
# Punch another hole against a range that overlaps the previous range
# and ends beyond eof.
$XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar
# Punch another hole against a range that overlaps the first range
# ([96K, 128K[) and ends at eof.
$XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar
# Fsync our file. We want to verify that, after a power failure and
# mounting the filesystem again, the file content reflects all the hole
# punch operations.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
echo "File digest before power failure:"
md5sum $SCRATCH_MNT/foobar | _filter_scratch
echo "Fiemap before power failure:"
$XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
# Silently drop all writes and umount to simulate a crash/power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again, mount to trigger log replay and validate file
# contents.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File digest after log replay:"
# Must match the same digest we got before the power failure.
md5sum $SCRATCH_MNT/foobar | _filter_scratch
echo "Fiemap after log replay:"
# Must match the same extent listing we got before the power failure.
$XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
_unmount_flakey
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
and nodesize set to 64k), the following call trace is observed,
WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
Modules linked in:
CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d #54
task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
REGS: c0000000f600a820 TRAP: 0700 Not tainted (4.3.0-rc5-13676-ga5e681d)
MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI> CR: 24444884 XER: 00000000
CFAR: c0000000004a3b30 SOFTE: 1
GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
Call Trace:
[c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
[c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
[c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
[c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
[c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
[c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
[c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
[c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
[c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
[c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
[c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
[c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
[c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
[c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
[c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
[c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
[c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
[c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
[c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
[c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
[c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
[c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
[c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
[c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
[c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
Instruction dump:
fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c
The above call trace is seen even on x86_64; albeit very rarely and that too
with nodesize set to 64k and with nospace_cache mount option being used.
The reason for the above call trace is,
btrfs_remove_chunk
check_system_chunk
Allocate chunk if required
For each physical stripe on underlying device,
btrfs_free_dev_extent
...
Take lock on Device tree's root node
btrfs_cow_block("dev tree's root node");
btrfs_reserve_extent
find_free_extent
index = BTRFS_RAID_DUP;
have_caching_bg = false;
When in LOOP_CACHING_NOWAIT state, Assume we find a block group
which is being cached; Hence have_caching_bg is set to true
When repeating the search for the next RAID index, we set
have_caching_bg to false.
Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
allocate a chunk and try to add entries corresponding to the chunk's physical
stripe into the device tree. When doing so the task deadlocks itself waiting
for the blocking lock on the root node of the device tree.
This commit fixes the issue by introducing a new local variable to help
indicate as to whether a block group of any RAID type is being cached.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Even with quota disabled, generic/127 will trigger a kernel warning by
underflow data space info.
The bug is caused by buffered write, which in case of short copy, the
start parameter for btrfs_delalloc_release_space() is wrong, and
round_up/down() in btrfs_delalloc_release() extents the range to page
aligned, decreasing one more page than expected.
This patch will fix it by passing correct start.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When rebasing my patchset, I forgot to pick up a cleanup patch to remove
old hotfix in 4.2 release.
Witouth the cleanup, it will screw up new qgroup reserve framework and
always cause minus reserved number.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Between btrfs_allocerved_file_extent() and
btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs
are run and delayed ref head maybe freed before
btrfs_add_delayed_qgroup_reserve().
This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT,
and cause transaction to be aborted.
This patch will record qgroup reserve space info into delayed_ref_head
at btrfs_add_delayed_ref(), to eliminate the race window.
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
cleaner_kthread() kthread calls try_to_freeze() at the beginning of every
cleanup attempt. This operation can't ever succeed though, as the kthread
hasn't marked itself as freezable.
Before (hopefully eventually) kthread freezing gets converted to fileystem
freezing, we'd rather mark cleaner_kthread() freezable (as my
understanding is that it can generate filesystem I/O during suspend).
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Ancient qgroup code call memcpy() on a extent buffer and use it for leaf
iteration.
As extent buffer contains lock, pointers to pages, it's never sane to do
such copy.
The following bug may be caused by this insane operation:
[92098.841309] general protection fault: 0000 [#1] SMP
[92098.841338] Modules linked in: ...
[92098.841814] CPU: 1 PID: 24655 Comm: kworker/u4:12 Not tainted
4.3.0-rc1 #1
[92098.841868] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper
[btrfs]
[92098.842261] Call Trace:
[92098.842277] [<ffffffffc035a5d8>] ? read_extent_buffer+0xb8/0x110
[btrfs]
[92098.842304] [<ffffffffc0396d00>] ? btrfs_find_all_roots+0x60/0x70
[btrfs]
[92098.842329] [<ffffffffc039af3d>]
btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs]
Where btrfs_qgroup_rescan_worker+0x28d is btrfs_disk_key_to_cpu(),
called in reading key from the copied extent_buffer.
This patch will use btrfs_clone_extent_buffer() to a better copy of
extent buffer to deal such case.
Reported-by: Stephane Lesimple <stephane_btrfs@lesimple.fr>
Suggested-by: Filipe Manana <fdmanana@kernel.org>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Enable the extended 'limit' syntax (a range), the new 'stripes' and
extended 'usage' syntax (a range) filters in the filters mask. The patch
comes separate and not within the series that introduced the new filters
because the patch adding the mask was merged in a late rc. The
integration branch was based on an older rc and could not merge the
patch due to the missing changes.
Prerequisities:
* btrfs: check unsupported filters in balance arguments
* btrfs: extend balance filter limit to take minimum and maximum
* btrfs: add balance filter for stripes
* btrfs: extend balance filter usage to take minimum and maximum
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Similar to the 'limit' filter, we can enhance the 'usage' filter to
accept a range. The change is backward compatible, the range is applied
only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag.
We don't have a usecase yet, the current syntax has been sufficient. The
enhancement should provide parity with other range-like filters.
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Balance block groups which have the given number of stripes, defined by
a range min..max. This is useful to selectively rebalance only chunks
that do not span enough devices, applies to RAID0/10/5/6.
Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is>
[ renamed bargs members, added to the UAPI, wrote the changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The 'limit' filter is underdesigned, it should have been a range for
[min,max], with some relaxed semantics when one of the bounds is
missing. Besides that, using a full u64 for a single value is a waste of
bytes.
Let's fix both by extending the use of the u64 bytes for the [min,max]
range. This can be done in a backward compatible way, the range will be
interpreted only if the appropriate flag is set
(BTRFS_BALANCE_ARGS_LIMIT_RANGE).
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs. It was trying to
get the leaf out of a path after freeing the path:
btrfs_release_path(path);
leaf = path->nodes[0];
item_size = btrfs_item_size_nr(leaf, slot);
The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.7+
cc: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
We don't verify that all the balance filter arguments supplemented by
the flags are actually known to the kernel. Thus we let it silently pass
and do nothing.
At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. Also in older stable
kernels so that it works with newer userspace tools.
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
In the kernel 4.2 merge window we had a big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:
commit 0ed4792af0 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn cause the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
static int run_delayed_tree_ref(...)
{
(...)
BUG_ON(node->ref_mod != 1);
(...)
}
This happens on a scenario like the following:
1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref2 is incompatible
due to Ref2->no_quota != Ref3->no_quota.
4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref3 is incompatible
due to Ref3->no_quota != Ref4->no_quota.
5) We run delayed references, trigger merging of delayed references,
through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
all other conditions are satisfied too. So Ref1 gets a ref_mod
value of 2.
7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
all other conditions are satisfied too. So Ref2 gets a ref_mod
value of 2.
8) Ref1 and Ref2 aren't merged, because they have different values
for their no_quota field.
9) Delayed reference Ref1 is picked for running (select_delayed_ref()
always prefers references with an action == BTRFS_ADD_DELAYED_REF).
So run_delayed_tree_ref() is called for Ref1 which triggers the
BUG_ON because Ref1->red_mod != 1 (equals 2).
So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").
The use of no_quota was also buggy in at least two places:
1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
no_quota to 0 instead of 1 when the following condition was true:
is_fstree(ref_root) || !fs_info->quota_enabled
2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
reset a node's no_quota when the condition "!is_fstree(root_objectid)
|| !root->fs_info->quota_enabled" was true but we did it only in
an unused local stack variable, that is, we never reset the no_quota
value in the node itself.
This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.
Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Also, this fixes deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst in the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:
INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
(...)
Call Trace:
[<ffffffff86999201>] schedule+0x74/0x83
[<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
[<ffffffff86137ed7>] ? wait_woken+0x74/0x74
[<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
[<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
[<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
[<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
[<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
[<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
[<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
[<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
[<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
[<ffffffff863e852e>] check_ref+0x64/0xc4
[<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
[<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
[<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
[<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
(...)
The problem goes away by eleminating check_ref(), which no longer is
needed as its purpose was to get a value for the no_quota field of
a delayed reference (this patch removes the no_quota field as mentioned
earlier).
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Elias Probst <mail@eliasprobst.eu>
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
In the kernel 4.2 merge window we had a refactoring/rework of the delayed
references implementation in order to fix certain problems with qgroups.
However that rework introduced one more regression that leads to the
following trace when running delayed references for metadata:
[35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
[35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
[35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
[35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293
[35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
[35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
[35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
[35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
[35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
[35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
[35908.065201] Stack:
[35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
[35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
[35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
[35908.065201] Call Trace:
[35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
[35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
[35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
[35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
[35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
[35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
[35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
[35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
[35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
[35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
[35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2
[35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7
[35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
[35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
[35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201] RSP <ffff88010c4cbb08>
[35908.310885] ---[ end trace fe4299baf0666457 ]---
This happens because the new delayed references code no longer merges
delayed references that have different sequence values. The following
steps are an example sequence leading to this issue:
1) Transaction N starts, fs_info->tree_mod_seq has value 0;
2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
bytenr A is created, with a value of 1 and a seq value of 0;
3) fs_info->tree_mod_seq is incremented to 1;
4) Extent buffer A is deleted through btrfs_del_items(), which calls
btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
later returns the metadata extent associated to extent buffer A to
the free space cache (the range is not pinned), because the extent
buffer was created in the current transaction (N) and writeback never
happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
in the extent buffer).
This creates the delayed reference Ref2 for bytenr A, with a value
of -1 and a seq value of 1;
5) Delayed reference Ref2 is not merged with Ref1 when we create it,
because they have different sequence numbers (decided at
add_delayed_ref_tail_merge());
6) fs_info->tree_mod_seq is incremented to 2;
7) Some task attempts to allocate a new extent buffer (done at
extent-tree.c:find_free_extent()), but due to heavy fragmentation
and running low on metadata space the clustered allocation fails
and we fall back to unclustered allocation, which finds the
extent at offset A, so a new extent buffer at offset A is allocated.
This creates delayed reference Ref3 for bytenr A, with a value of 1
and a seq value of 2;
8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
all have different seq values;
9) We start running the delayed references (__btrfs_run_delayed_refs());
10) The delayed Ref1 is the first one being applied, which ends up
creating an inline extent backref in the extent tree;
10) Next the delayed reference Ref3 is selected for execution, and not
Ref2, because select_delayed_ref() always gives a preference for
positive references (that have an action of BTRFS_ADD_DELAYED_REF);
11) When running Ref3 we encounter alreay the inline extent backref
in the extent tree at insert_inline_extent_backref(), which makes
us hit the following BUG_ON:
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
This is always true because owner corresponds to the level of the
extent buffer/btree node in the btree.
For the scenario described above we hit the BUG_ON because we never merge
references that have different seq values.
We used to do the merging before the 4.2 kernel, more specifically, before
the commmits:
c6fc245499 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
c43d160fcd ("btrfs: delayed-ref: Cleanup the unneeded functions.")
This issue became more exposed after the following change that was added
to 4.2 as well:
cffc3374e5 ("Btrfs: fix order by which delayed references are run")
Which in turn fixed another regression by the two commits previously
mentioned.
So fix this by bringing back the delayed reference merge code, with the
proper adaptations so that it operates against the new data structure
(linked list vs old red black tree implementation).
This issue was hit running fstest btrfs/063 in a loop. Several people have
reported this issue in the mailing list when running on kernels 4.2+.
Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Fixes: c6fc245499 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Pull btrfs fixes from Chris Mason:
"I have two more small fixes this week:
Qu's fix avoids unneeded COW during fallocate, and Christian found a
memory leak in the error handling of an earlier fix"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix possible leak in btrfs_ioctl_balance()
btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
When we make ctl->unit allocations from a bitmap there is no point in searching
for the next 0 in the bitmap. If we've found a bit we're done and can just exit
the loop. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can waste a lot of time searching through bitmaps when we are heavily
fragmented trying to find large contiguous areas that don't exist in the bitmap.
So keep track of the max extent size when we do a full search of a bitmap so
that next time around we can just skip the expensive searching if our max size
is less than what we are looking for. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we are extremely fragmented then we won't be able to create a free_cluster.
So if this happens set last_ptr->fragmented so that all future allcations will
give up trying to create a cluster. When we unpin extents we will unset
->fragmented if we free up a sufficient amount of space in a block group.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We try really really hard to make allocations, but sometimes it is just not
going to happen, especially when free space is extremely fragmented. So add a
few short cuts through the looping states. For example if we couldn't allocate
a chunk, just go straight to the NO_EMPTY_SIZE loop. If there are no uncached
block groups and we've done a full search, go straight to the ALLOC_CHUNK stage.
And finally if we already have empty_size and empty_cluster set to 0 go ahead
and return -ENOSPC. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we hit ENOSPC when setting up a space cache don't bother setting up any of
the other space cache's in this transaction, it'll just induce unnecessary
latency. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we are heavily fragmented we can induce a lot of latency trying to make an
allocation happen that is simply not going to happen. Thankfully we keep track
of our max_extent_size when going through the allocator, so if we get to the
point where we are exiting find_free_extent with ENOSPC then set our
space_info->max_extent_size so we can keep future allocations from having to pay
this cost. We reset the max_extent_size whenever we release pinned bytes back
into this space info so we can redo all the work. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The space cache needs to have contiguous allocations, and the allocator tries to
make allocations by reducing the amount of bytes requested and re-searching.
But this just makes us waste time when we are very fragmented, so if we can't
find our space just exit, don't bother trying to search again. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I want to set some per transaction flags, so instead of adding yet another int
lets just convert the current two int indicators to flags and add a flags field
for future use. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we are heavily fragmented we will continually try to prealloc the largest
extent size we can every time we call btrfs_reserve_extent. This can be very
expensive when we are heavily fragmented, burning lots of CPU cycles and loops
through the allocator. So instead notice when we get a smaller chunk from the
allocator than what we specified and use this as the new maximum size we try to
allocate. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
In tracking down these weird bitmap problems it was helpful to artificially
create an extremely fragmented file system. These mount options let us either
fragment data or metadata or both. With these options I could reproduce all
sorts of weird latencies and hangs that occur under extreme fragmentation and
get them fixed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>