WSL2-Linux-Kernel

Commit graph

Author	SHA1	Message	Date
Johannes Thumshirn	6fbceb9fa4	btrfs: reduce scope of btrfs_scratch_superblocks() btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so remove it from the header file and mark it as static. Also move it above its callers so we don't need a forward declaration. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:39 +01:00
Johannes Thumshirn	c514c9b10b	btrfs: don't kmap() pages from block devices Block device mappings are never in highmem so kmap() / kunmap() calls for pages from block devices are unneeded. Use page_address() instead of kmap() to get to the virtual addresses. While we're at it, read_cache_page_gfp() doesn't return NULL on error, only an ERR_PTR, so use IS_ERR() to check for errors. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:39 +01:00
Nikolay Borisov	f6d9abbc1f	btrfs: Export btrfs_release_disk_super Preparatory patch for removal of buffer_head usage in btrfs. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:38 +01:00
Anand Jain	f3cd2c5811	btrfs: sysfs, rename device_link add/remove functions Since commit `668e48af7a` ("btrfs: sysfs, add devid/dev_state kobject and device attributes"), the functions btrfs_sysfs_add_device_link() and btrfs_sysfs_rm_device_link() do more than just adding and removing the device link as their names indicated. Rename them to be more specific that it's about the directory with the attributes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:35 +01:00
Josef Bacik	0024652895	btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root We are now using these for all roots, rename them to btrfs_put_root() and btrfs_grab_root(); Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:33 +01:00
Josef Bacik	bc44d7c4b2	btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root Now that all callers of btrfs_get_fs_root are subsequently calling btrfs_grab_fs_root and handling dropping the ref when they are done appropriately, go ahead and push btrfs_grab_fs_root up into btrfs_get_fs_root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:32 +01:00
Josef Bacik	fbb0ce40d6	btrfs: hold a ref on the root in btrfs_check_uuid_tree_entry We lookup the uuid of arbitrary subvolumes, hold a ref on the root while we're doing this. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:31 +01:00
Josef Bacik	3619c94f07	btrfs: open code btrfs_read_fs_root_no_name All this does is call btrfs_get_fs_root() with check_ref == true. Just use btrfs_get_fs_root() so we don't have a bunch of different helpers that do the same thing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:26 +01:00
Nikolay Borisov	1362089d2a	btrfs: Fix split-brain handling when changing FSID to metadata uuid Current code doesn't correctly handle the situation which arises when a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID changed to the one in metadata uuid. This causes the incompat flag to disappear. In case of a power failure we could end up in a situation where part of the disks in a multi-disk filesystem are correctly reverted to METADATA_UUID_INCOMPAT flag unset state, while others have METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS. This patch corrects the behavior required to handle the case where a disk of the second type is scanned first, creating the necessary btrfs_fs_devices. Subsequently, when a disk which has already completed the transition is scanned it should overwrite the data in btrfs_fs_devices. Reported-by: Su Yue <Damenly_Su@gmx.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:39 +01:00
Nikolay Borisov	0584071014	btrfs: Handle another split brain scenario with metadata uuid feature There is one more case which isn't handled by the original metadata uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and the user decides to change the FSID to the original one e.g. have metadata_uuid and fsid match. In case of power failure while this operation is in progress we could end up in a situation where some of the disks have the incompat bit removed and the other half have both METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags. This patch handles the case where a disk that has successfully changed its FSID such that it equals METADATA_UUID is scanned first. Subsequently when a disk with both METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned find_fsid_changed won't be able to find an appropriate btrfs_fs_devices. This is done by extending find_fsid_changed to correctly find btrfs_fs_devices whose metadata_uuid/fsid are the same and they match the metadata_uuid of the currently scanned device. Fixes: `cc5de4e702` ("btrfs: Handle final split-brain possibility during fsid change") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reported-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:38 +01:00
Su Yue	c6730a0e57	btrfs: Factor out metadata_uuid code from find_fsid. find_fsid became rather hairy with the introduction of metadata uuid changing feature. Alleviate this by factoring out the metadata uuid specific code in a dedicated function which deals with finding correct fsid for a device with changed uuid. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:38 +01:00
Su Yue	c0d81c7cb2	btrfs: Call find_fsid from find_fsid_inprogress Since find_fsid_inprogress should also handle the case in which an fs didn't change its FSID make it call find_fsid directly. This makes the code in device_list_add simpler by eliminating a conditional call of find_fsid. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:37 +01:00
Nikolay Borisov	96a14336bd	btrfs: Move and unexport btrfs_rmap_block It's used only during initial block group reading to map physical address of super block to a list of logical ones. Make it private to block-group.c, add proper kernel doc and ensure it's exported only for tests. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:34 +01:00
Anand Jain	a69976bc69	btrfs: device stats, log when stats are zeroed We had a report indicating that some read errors aren't reported by the device stats in the userland. It is important to have the errors reported in the device stat as user land scripts might depend on it to take the reasonable corrective actions. But to debug these issue we need to be really sure that request to reset the device stat did not come from the userland itself. So log an info message when device error reset happens. For example: BTRFS info (device sdc): device stats zeroed by btrfs(9223) Reported-by: philip@philip-seeger.de Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:41:02 +01:00
Dennis Zhou	b0643e59cf	btrfs: add the beginning of async discard, discard workqueue When discard is enabled, every time a pinned extent is released back to the block_group's free space cache, a discard is issued for the extent. This is an overeager approach when it comes to discarding and helping the SSD maintain enough free space to prevent severe garbage collection situations. This adds the beginning of async discard. Instead of issuing a discard prior to returning it to the free space, it is just marked as untrimmed. The block_group is then added to a LRU which then feeds into a workqueue to issue discards at a much slower rate. Full discarding of unused block groups is still done and will be addressed in a future patch of the series. For now, we don't persist the discard state of extents and bitmaps. Therefore, our failure recovery mode will be to consider extents untrimmed. This lets us handle failure and unmounting as one and the same. On a number of Facebook webservers, I collected data every minute accounting the time we spent in btrfs_finish_extent_commit() (col. 1) and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is where we discard extents synchronously before returning them to the free space cache. discard=sync: p99 total per minute p99 total per minute Drive \| extent_commit() (ms) \| commit_trans() (ms) --------------------------------------------------------------- Drive A \| 434 \| 1170 Drive B \| 880 \| 2330 Drive C \| 2943 \| 3920 Drive D \| 4763 \| 5701 discard=async: p99 total per minute p99 total per minute Drive \| extent_commit() (ms) \| commit_trans() (ms) -------------------------------------------------------------- Drive A \| 134 \| 956 Drive B \| 64 \| 1972 Drive C \| 59 \| 1032 Drive D \| 62 \| 1200 While it's not great that the stats are cumulative over 1m, all of these servers are running the same workload and the delta between the two is substantial. We are spending significantly less time in btrfs_finish_extent_commit() which is responsible for discarding. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:57 +01:00
Johannes Thumshirn	321f69f86a	btrfs: reset device back to allocation state when removing When closing a device, btrfs_close_one_device() first allocates a new device, copies the device to close's name, replaces it in the dev_list with the copy and then finally frees it. This involves two memory allocations, which can potentially fail. As this code path is tricky to unwind, the allocation failures were handled by BUG_ON()s. But this copying isn't strictly needed, all that is needed is resetting the device in question to the state it had after the allocation. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:53 +01:00
Johannes Thumshirn	3fff3975a7	btrfs: decrement number of open devices after closing the device not before In btrfs_close_one_device we're decrementing the number of open devices before we're calling btrfs_close_bdev(). As there is no intermediate exit between these points in this function it is technically OK to do so, but it makes the code a bit harder to understand. Move both operations closer together and move the decrement step after btrfs_close_bdev(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:53 +01:00
Johannes Thumshirn	b38f4cbd65	btrfs: remove impossible WARN_ON in btrfs_destroy_dev_replace_tgtdev() We have a user report, that cppcheck is complaining about a possible NULL-pointer dereference in btrfs_destroy_dev_replace_tgtdev(). We're first dereferencing the 'tgtdev' variable and then later check for the validity of the pointer with a WARN_ON(!tgtdev); But all callers of btrfs_destroy_dev_replace_tgtdev() either explicitly check if 'tgtdev' is non-NULL or directly allocate 'tgtdev', so the WARN_ON() is impossible to hit. Just remove it to silence the checker's complaints. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003 Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:53 +01:00
David Sterba	db26a02449	btrfs: fill ncopies for all raid table entries Make the number of copies explicit even for entries that use the default 0 value for consistency. Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:49 +01:00
David Sterba	e4f6c6be81	btrfs: use raid_attr table in calc_stripe_length for nparity The table is already used for ncopies, replace open coding of stripes with the raid_attr value. Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-20 16:40:49 +01:00
Josef Bacik	b35cf1f0bf	btrfs: check rw_devices, not num_devices for balance The fstest btrfs/154 reports [ 8675.381709] BTRFS: Transaction aborted (error -28) [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935 [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286 [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001 [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971 [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000 [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4 [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000 [ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000 [ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0 [ 8675.419801] Call Trace: [ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs] [ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs] [ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs] [ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs] [ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0 [ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs] [ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs] [ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0 [ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400 [ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0 [ 8675.436618] ? _raw_spin_unlock+0x1f/0x30 [ 8675.438093] ? __handle_mm_fault+0x499/0x740 [ 8675.439619] ? do_vfs_ioctl+0x56e/0x770 [ 8675.441034] do_vfs_ioctl+0x56e/0x770 [ 8675.442411] ksys_ioctl+0x3a/0x70 [ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 8675.445333] __x64_sys_ioctl+0x16/0x20 [ 8675.446705] do_syscall_64+0x50/0x210 [ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left We now use btrfs_can_overcommit() to see if we can flip a block group read only. Before this would fail because we weren't taking into account the usable un-allocated space for allocating chunks. With my patches we were allowed to do the balance, which is technically correct. The test is trying to start balance on degraded mount. So now we're trying to allocate a chunk and cannot because we want to allocate a RAID1 chunk, but there's only 1 device that's available for usage. This results in an ENOSPC. But we shouldn't even be making it this far, we don't have enough devices to restripe. The problem is we're using btrfs_num_devices(), that also includes missing devices. That's not actually what we want, we need to use rw_devices. The chunk_mutex is not needed here, rw_devices changes only in device add, remove or replace, all are excluded by EXCL_OP mechanism. Fixes: `e4d8ec0f65` ("Btrfs: implement online profile changing") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add stacktrace, update changelog, drop chunk_mutex ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-17 15:40:54 +01:00
David Sterba	cf93e15eca	btrfs: fix devs_max constraints for raid1c3 and raid1c4 The value 0 for devs_max means to spread the allocated chunks over all available devices, eg. stripe for RAID0 or RAID5. This got mistakenly copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and 4 copies respectively. Fixes: `47e6f7423b` ("btrfs: add support for 3-copy replication (raid1c3)") Fixes: `8d6fac0087` ("btrfs: add support for 4-copy replication (raid1c4)") Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:23 +01:00
Johannes Thumshirn	7f0432d0d8	btrfs: change btrfs_fs_devices::rotating to bool struct btrfs_fs_devices::rotating currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update the code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:51 +01:00
Johannes Thumshirn	0395d84f8e	btrfs: change btrfs_fs_devices::seeding to bool struct btrfs_fs_devices::seeding currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update the code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:51 +01:00
David Sterba	32da5386d9	btrfs: rename btrfs_block_group_cache The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:51 +01:00
David Sterba	cfbb825c76	btrfs: add incompat for raid1 with 3, 4 copies The new raid1c3 and raid1c4 profiles are backward incompatible and the name shall be 'raid1c34', the status can be found in the global supported features in /sys/fs/btrfs/features or in the per-filesystem directory. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:49 +01:00
David Sterba	8d6fac0087	btrfs: add support for 4-copy replication (raid1c4) Add new block group profile to store 4 copies in a similar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2- and 3-copy RAID1. The minimum number of devices is 4, the maximum number of devices/chunks that can be lost/damaged is 3. There is no comparable traditional RAID level, the profile is added for future needs to accompany triple-parity and beyond. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:49 +01:00
David Sterba	47e6f7423b	btrfs: add support for 3-copy replication (raid1c3) Add new block group profile to store 3 copies in a similar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2-copy RAID1. The minimum number of devices is 3, the maximum number of devices/chunks that can be lost/damaged is 2. Like RAID6 but with 33% space utilization. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:49 +01:00
Qu Wenruo	6b7faadd98	btrfs: Ensure we trim ranges across block group boundary [BUG] When deleting large files (which cross block group boundary) with discard mount option, we find some btrfs_discard_extent() calls only trimmed part of its space, not the whole range: btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50% type: bbio->map_type, in above case, it's SINGLE DATA. start: Logical address of this trim len: Logical length of this trim trimmed: Physically trimmed bytes ratio: trimmed / len Thus leaving some unused space not discarded. [CAUSE] When discard mount option is specified, after a transaction is fully committed (super block written to disk), we begin to cleanup pinned extents in the following call chain: btrfs_commit_transaction() \|- btrfs_finish_extent_commit() \|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY); \|- btrfs_discard_extent() However, pinned extents are recorded in an extent_io_tree, which can merge adjacent extent states. When a large file gets deleted and it has adjacent file extents across block group boundary, we will get a large merged range like this: \|<--- BG1 --->\|<--- BG2 --->\| \|//////\|<-- Range to discard --->\|/////\| To discard that range, we have the following calls: btrfs_discard_extent() \|- btrfs_map_block() \| Returned bbio will end at BG1's end. As btrfs_map_block() \| never returns result across block group boundary. \|- btrfs_issue_discard() Issue discard for each stripe. So we will only discard the range in BG1, not the remaining part in BG2. Furthermore, this bug is not that reliably observed, for above case, if there is no other extent in BG2, BG2 will be empty and btrfs will trim all space of BG2, covering up the bug. [FIX] - Allow __btrfs_map_block_for_discard() to modify @length parameter btrfs_map_block() uses its @length parameter to notify the caller how many bytes are mapped in current call. With __btrfs_map_block_for_discard() also modifying the @length, btrfs_discard_extent() now understands when to do extra trim. - Call btrfs_map_block() in a loop until we hit the range end Since we now know how many bytes are mapped each time, we can iterate through each block group boundary and issue correct trim for each range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:48 +01:00
Qu Wenruo	2d974619a7	btrfs: volumes: Use more straightforward way to calculate map length The old code goes: offset = logical - em->start; length = min_t(u64, em->len - offset, length); Where @length calculation is dependent on offset, it can take reader several more seconds to find it's just the same code as: offset = logical - em->start; length = min_t(u64, em->start + em->len - logical, length); Use above code to make the length calculate independent from other variable, thus slightly increase the readability. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:48 +01:00
David Sterba	b3470b5dbe	btrfs: add dedicated members for start and length of a block group The on-disk format of block group item makes use of the key that stores the offset and length. This is further used in the code, although this makes things harder to understand. The key is also packed so the offset/length is not properly aligned as u64. Add start (key.objectid) and length (key.offset) members to block group and remove the embedded key. When the item is searched or written, a local variable for key is used. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:45 +01:00
David Sterba	bf38be65f3	btrfs: move block_group_item::used to block group For unknown reasons, the member 'used' in the block group struct is stored in the b-tree item and accessed everywhere using the special accessor helper. Let's unify it and make it a regular member and only update the item before writing it to the tree. The item is still being used for flags and chunk_objectid, there's some duplication until the item is removed in following patches. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:44 +01:00
Johannes Thumshirn	32ab3d1b4d	btrfs: remove pointless indentation in btrfs_read_sys_array() Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and then process it we could just bail out early if the read disk key wasn't a BTRFS_CHUNK_ITEM_KEY. This removes a level of indentation and makes the code nicer to read. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:47:01 +01:00
Johannes Thumshirn	5ae2169290	btrfs: reduce indentation in btrfs_may_alloc_data_chunk In btrfs_may_alloc_data_chunk() we're checking if the chunk type is of type BTRFS_BLOCK_GROUP_DATA and if it is we process it. Instead of checking if the chunk type is a BTRFS_BLOCK_GROUP_DATA chunk we can negate the check and bail out early if it isn't. This makes the code a bit more readable. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:47:00 +01:00
Chris Mason	ba8a9d0795	Btrfs: delete the entire async bio submission framework Now that we're not using btrfs_schedule_bio() anymore, delete all the code that supported it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:53 +01:00
Chris Mason	08635bae0b	Btrfs: stop using btrfs_schedule_bio() btrfs_schedule_bio() hands IO off to a helper thread to do the actual submit_bio() call. This has been used to make sure async crc and compression helpers don't get stuck on IO submission. To maintain good performance, over time the IO submission threads duplicated some IO scheduler characteristics such as high and low priority IOs and they also made some ugly assumptions about request allocation batch sizes. All of this cost at least one extra context switch during IO submission, and doesn't fit well with the modern blkmq IO stack. So, this commit stops using btrfs_schedule_bio(). We may need to adjust the number of async helper threads for crcs and compression, but long term it's a better path. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:52 +01:00
David Sterba	4143cb8b6f	btrfs: add const function attribute For some reason the attribute is called __attribute_const__ and not __const, marks functions that have no observable effects on program state, IOW not reading pointers, just the arguments and calculating a value. Allows the compiler to do some optimizations, based on -Wsuggest-attribute=const . The effects are rather small, though, about 60 bytes decrease of btrfs.ko. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:52 +01:00
David Sterba	b105e92755	btrfs: add __cold attribute to more functions The attribute can mark functions supposed to be called rarely if at all and the text can be moved to sections far from the other code. The attribute has been added to several functions already, this patch is based on hints given by gcc -Wsuggest-attribute=cold. The net effect of this patch is decrease of btrfs.ko by 1000-1300, depending on the config options. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:52 +01:00
Anand Jain	aa6c0df73e	btrfs: print process name and pid that calls device scanning It's very helpful if we had logged the device scanner process name to debug the race condition between the systemd-udevd scan and the user initiated device forget command. This patch adds process name and pid to the scan message. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add pid to the message ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:51 +01:00
David Sterba	c1499166d1	btrfs: use has_single_bit_set for clarity Replace is_power_of_2 with the helper that is self-documenting and remove the open coded call in alloc_profile_is_valid. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:50 +01:00
Anand Jain	e62869be1e	btrfs: balance: use term redundancy instead of integrity in message When balance reduces the number of copies of metadata, it reduces the redundancy, use the term redundancy instead of integrity. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:50 +01:00
Omar Sandoval	a0cac0ec96	btrfs: get rid of unique workqueue helper functions Commit `9e0af23764` ("Btrfs: fix task hang under heavy compressed write") worked around the issue that a recycled work item could get a false dependency on the original work item due to how the workqueue code guarantees non-reentrancy. It did so by giving different work functions to different types of work. However, the fixes in the previous few patches are more complete, as they prevent a work item from being recycled at all (except for a tiny window that the kernel workqueue code handles for us). This obsoletes the previous fix, so we don't need the unique helpers for correctness. The only other reason to keep them would be so they show up in stack traces, but they always seem to be optimized to a tail call, so they don't show up anyways. So, let's just get rid of the extra indirection. While we're here, rename normal_work_helper() to the more informative btrfs_work_helper(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:48 +01:00
Qu Wenruo	c17add7a1c	btrfs: Consider system chunk array size for new SYSTEM chunks For SYSTEM chunks, despite the regular chunk item size limit, there is another limit due to system chunk array size. The extra limit was removed in a refactoring, so add it back. Fixes: `e3ecdb3fde` ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk") CC: stable@vger.kernel.org # 5.3+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-25 19:11:34 +02:00
Zygo Blaxell	7a54789074	btrfs: fix balance convert to single on 32-bit host CPUs Currently, the command: btrfs balance start -dconvert=single,soft . on a Raspberry Pi produces the following kernel message: BTRFS error (device mmcblk0p2): balance: invalid convert data profile single This fails because we use is_power_of_2(unsigned long) to validate the new data profile, the constant for 'single' profile uses bit 48, and there are only 32 bits in a long on ARM. Fix by open-coding the check using u64 variables. Tested by completing the original balance command on several Raspberry Pis. Fixes: `818255feec` ("btrfs: use common helper instead of open coding a bit test") CC: stable@vger.kernel.org # 4.20+ Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-01 19:37:29 +02:00
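The truncation described in this commit can be reproduced in user space. The sketch below is an illustration, not the kernel code: `is_power_of_2_trunc32()` is a hypothetical stand-in that models `is_power_of_2(unsigned long)` on a host where `long` is 32 bits, `has_single_bit_u64()` is a hypothetical open-coded 64-bit check, and `SKETCH_SINGLE_BIT` assumes the bit-48 value quoted in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the in-memory 'single' profile constant uses bit 48. */
#define SKETCH_SINGLE_BIT (UINT64_C(1) << 48)

/*
 * Hypothetical model of a power-of-two test that takes a 32-bit long:
 * the u64 flags value is truncated before the test runs.
 */
static bool is_power_of_2_trunc32(uint64_t flags)
{
    uint32_t n = (uint32_t)flags; /* bits 32..63 are lost here */

    return n != 0 && (n & (n - 1)) == 0;
}

/* Open-coded check on the full 64-bit value: exactly one bit set. */
static bool has_single_bit_u64(uint64_t flags)
{
    return flags != 0 && (flags & (flags - 1)) == 0;
}
```

With these helpers, `is_power_of_2_trunc32(SKETCH_SINGLE_BIT)` sees a truncated value of 0 and rejects the profile, while `has_single_bit_u64(SKETCH_SINGLE_BIT)` accepts it.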
Qu Wenruo	fab2735955	btrfs: Fix a regression which we can't convert to SINGLE profile [BUG] With v5.3 kernel, we can't convert to SINGLE profile: # btrfs balance start -f -dconvert=single $mnt ERROR: error during balancing '/mnt/btrfs': Invalid argument # dmesg -t \| tail validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1 BTRFS error (device dm-3): balance: invalid convert data profile single [CAUSE] With the extra debug output added, it shows that the @allowed mask is lacking the special in-memory only SINGLE profile bit. Thus we fail at the (profile & ~allowed) check. This regression is caused by commit `081db89b13` ("btrfs: use raid_attr to get allowed profiles for balance conversion") and the fact that we don't use any bit to indicate the SINGLE profile on-disk, but use a special in-memory only bit to help distinguish different profiles. [FIX] Add BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code behaves as it did before and the regression is fixed. Reported-by: Chris Murphy <lists@colorremedies.com> Fixes: `081db89b13` ("btrfs: use raid_attr to get allowed profiles for balance conversion") CC: stable@vger.kernel.org # 5.3+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-25 16:00:37 +02:00
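The failing mask test from this commit can be illustrated with the values printed in the debug output. This is a minimal user-space sketch, not the kernel function: `profile_is_allowed()`, `SKETCH_AVAIL_ALLOC_BIT_SINGLE`, and `SKETCH_ALLOWED_BEFORE_FIX` are hypothetical names, and the constants are taken from `profile=0x1000000000000` and `allowed=0x20` in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: value of the in-memory only SINGLE bit, from the debug output. */
#define SKETCH_AVAIL_ALLOC_BIT_SINGLE UINT64_C(0x1000000000000)

/* Assumption: the @allowed mask printed before the fix. */
#define SKETCH_ALLOWED_BEFORE_FIX UINT64_C(0x20)

/* A profile passes only if it sets no bit outside the allowed mask. */
static bool profile_is_allowed(uint64_t profile, uint64_t allowed)
{
    return (profile & ~allowed) == 0;
}
```

Calling `profile_is_allowed(SKETCH_AVAIL_ALLOC_BIT_SINGLE, SKETCH_ALLOWED_BEFORE_FIX)` fails because the SINGLE bit is outside the mask; OR-ing the SINGLE bit into the mask, as the fix describes, makes the same call succeed.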
David Sterba	1dc990dfd3	btrfs: move dev_stats helpers to volumes.c The other dev stats functions are already there and the helpers are not used by anything else. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:16 +02:00
David Sterba	784352fe0b	btrfs: move math functions to misc.h Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:15 +02:00
Anand Jain	d2979aa25f	btrfs: use proper error values on allocation failure in clone_fs_devices Fix the fake ENOMEM return error code to the actual error in clone_fs_devices(). Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:14 +02:00
Anand Jain	a06dee4d7e	btrfs: proper error handling when invalid device is found in find_next_devid In a corrupted tree, if search for next devid finds the device with devid = -1, then report the error -EUCLEAN back to the parent function to fail gracefully. The tree checker will not catch this in case the devids are created using the following script: umount /btrfs dev1=/dev/sdb dev2=/dev/sdc mkfs.btrfs -fq -dsingle -msingle $dev1 mount $dev1 /btrfs _fail() { echo $1 exit 1 } while true; do btrfs dev add -f $dev2 /btrfs \|\| _fail "add failed" btrfs dev del $dev1 /btrfs \|\| _fail "del failed" dev_tmp=$dev1 dev1=$dev2 dev2=$dev_tmp done With output: BTRFS critical (device sdb): corrupt leaf: root=3 block=313739198464 slot=1 devid=1 invalid devid: has=507 expect=[0, 506] BTRFS error (device sdb): block=313739198464 write time tree block corruption detected BTRFS: error (device sdb) in btrfs_commit_transaction:2268: errno=-5 IO failure (Error while writing out transaction) BTRFS warning (device sdb): Skipping commit of aborted transaction. BTRFS: error (device sdb) in cleanup_transaction:1827: errno=-5 IO failure Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> [ add script and messages ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:14 +02:00
Dan Carpenter	f11369897e	btrfs: fix error pointer check in __btrfs_map_block() The btrfs_get_chunk_map() never returns NULL, it returns error pointers. Fixes: `89b798ad1b` ("btrfs: Use btrfs_get_io_geometry appropriately") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:12 +02:00
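The difference between a NULL check and an error-pointer check can be shown with a user-space imitation of the kernel's error-pointer convention. Everything below is a sketch under assumptions: `sketch_err_ptr()`, `sketch_is_err()`, `sketch_ptr_err()`, and `sketch_get_chunk_map()` are hypothetical stand-ins, not the kernel's `ERR_PTR()`, `IS_ERR()`, `PTR_ERR()`, or `btrfs_get_chunk_map()`.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: errno values are encoded in the top 4095 addresses. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *sketch_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True if the pointer falls in the range reserved for encoded errors. */
static bool sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Decode the errno value from an error pointer. */
static long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Hypothetical lookup that never returns NULL, only a valid or error pointer. */
static void *sketch_get_chunk_map(bool found)
{
    static int chunk_map;

    return found ? (void *)&chunk_map : sketch_err_ptr(-ENOENT);
}
```

A caller that tests `sketch_get_chunk_map(false)` against NULL would treat the encoded error as a valid pointer; testing it with `sketch_is_err()` catches the failure and `sketch_ptr_err()` recovers `-ENOENT`.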
Anand Jain	3b80a984d2	btrfs: dev stat drop useless goto In the function btrfs_init_dev_stats() goto out is not needed, because the alloc has failed. So just return -ENOMEM. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:12 +02:00
Anand Jain	440630ea7c	btrfs: dev stats item key conversion per cpu type is not needed %found_key is not used, drop it since it hasn't been used since the beginning in `733f4fbbc1` ("Btrfs: read device stats on mount, write modified ones during commit"). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:12 +02:00
David Sterba	f93c39970b	btrfs: factor out sysfs code for updating sprout fsid Wrap the fsid renaming code and move it to sysfs.c. Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:07 +02:00
David Sterba	5b28692e0c	btrfs: factor out sysfs code for sending device uevent The device uevent belongs to the sysfs API. Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:07 +02:00
Anand Jain	ae4b9b4c7d	btrfs: opencode reset of all device stats __btrfs_reset_dev_stats() is a small helper function to reset device stat values, and is used only once, so just open code it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:06 +02:00
Anand Jain	4e411a7d04	btrfs: reset device stat using btrfs_dev_stat_set btrfs_dev_stat_reset() is overkill in terms of wrapping. So this patch open codes btrfs_dev_stat_reset(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:06 +02:00
Josef Bacik	aac0023c21	btrfs: move basic block_group definitions to their own header This is prep work for moving all of the block group cache code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:03 +02:00
Qu Wenruo	112974d406	btrfs: volumes: Remove ENOSPC-prone btrfs_can_relocate() [BUG] Test case btrfs/156 fails since commit `302167c50b` ("btrfs: don't end the transaction for delayed refs in throttle") with ENOSPC. [CAUSE] The ENOSPC is reported from btrfs_can_relocate(). This function will check: - If this block group is empty, we can relocate - If we have enough free space, we can relocate The above checks are valid but the following check is vague due to its implementation: - If and only if we can allocate a new block group to contain all the used space, we can relocate This design itself is OK, but the way to determine if we can allocate a new block group is problematic. btrfs_can_relocate() uses find_free_dev_extent() to find free space on a device. However, find_free_dev_extent() only searches the commit root and excludes dev extents allocated in the current transaction; this makes it unable to use dev extents just freed in the current transaction. So for the following example, btrfs_can_relocate() will report ENOSPC: The example block group layout: 1M 129M 257M 385M 513M 550M \|///////\|///////////\|//////////\| \| \| // = Used bg, consider all bgs to be 100% used for easy calculation. And all block groups are SINGLE, on-disk bytenr is the same as the logical bytenr. 1) Bg in [129M, 257M) gets relocated to [385M, 513M), transid=100 1M 129M 257M 385M 513M 550M \|///////\| \|//////////\|/////////\| In transid 100, bg in [129M, 257M) gets relocated to [385M, 513M). However transid 100 is not committed yet, so in the dev commit tree, we still have the old dev extents layout: 1M 129M 257M 385M 513M 550M \|///////\|///////////\|//////////\| \| \| 2) Try to relocate bg [257M, 385M) We go into btrfs_can_relocate(), there is no free space in current bgs, so we check if we can find large enough free dev extents. The first slot is [385M, 513M), but that is already used by the new bg at [385M, 513M), so we continue the search. The remaining slot is [513M, 550M), smaller than the bg's length 128M. 
So btrfs_can_relocate() reports ENOSPC. However this is overkill: in fact if we just skip the btrfs_can_relocate() check and go into the regular relocation routine, at extent reservation time, if we can't find a free extent, then we fall back to committing the transaction, which will free up the dev extents and allow a new block group to be created. [FIX] The fix here is to remove btrfs_can_relocate() completely. If we hit the false ENOSPC case just like btrfs/156, the extent allocator will push harder by committing the transaction and we will have space for a new block group, avoiding the false ENOSPC. If we really ran out of space, we will hit ENOSPC at relocate_block_group(), and btrfs will just report the ENOSPC error as usual. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:01 +02:00
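The false ENOSPC in the example above comes down to a first-fit hole search over a stale list of used device extents. The sketch below models only that arithmetic, with offsets in MiB; it is not the kernel's `find_free_dev_extent()`. `struct sketch_extent`, `sketch_find_hole()`, and the two extent tables are hypothetical and built from the layout in the message: the stale view is the committed dev extents plus the block group newly allocated at [385M, 513M), and the committed view is what the search would see after transid 100 commits.

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open range [start, end) of a used device extent, in MiB. */
struct sketch_extent {
    uint64_t start;
    uint64_t end;
};

/* Stale view: [129M, 257M) still looks used, [385M, 513M) is already taken. */
static const struct sketch_extent sketch_stale_view[] = {
    { 1, 129 }, { 129, 257 }, { 257, 385 }, { 385, 513 },
};

/* View after the transaction commits: [129M, 257M) has been freed. */
static const struct sketch_extent sketch_committed_view[] = {
    { 1, 129 }, { 257, 385 }, { 385, 513 },
};

/*
 * First-fit search for a hole of at least 'need' MiB between dev_start and
 * dev_end. 'used' must be sorted by start and non-overlapping.
 * Returns the start of the hole, or -1 if none is large enough.
 */
static int64_t sketch_find_hole(const struct sketch_extent *used, size_t n,
                                uint64_t dev_start, uint64_t dev_end,
                                uint64_t need)
{
    uint64_t cursor = dev_start;

    for (size_t i = 0; i < n; i++) {
        if (used[i].start > cursor && used[i].start - cursor >= need)
            return (int64_t)cursor;
        if (used[i].end > cursor)
            cursor = used[i].end;
    }
    if (dev_end > cursor && dev_end - cursor >= need)
        return (int64_t)cursor;
    return -1;
}
```

Searching the stale view for 128 MiB on a device spanning 1M to 550M finds only the 37 MiB tail and fails, while the same search on the committed view finds the hole at 129M, which is why falling back to a transaction commit resolves the case.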
Qu Wenruo	135da9766e	btrfs: volumes: Add comment for find_free_dev_extent_start() Since commit `6df9a95e63` ("Btrfs: make the chunk allocator completely tree lockless") we search the commit root of the device tree to avoid deadlock. This introduced a safety feature: find_free_dev_extent_start() won't use dev extents that were just freed in the current transaction. This safety feature makes sure we won't allocate a new block group using just-freed dev extents and break CoW. However, this feature also makes find_free_dev_extent_start() unreliable at reporting free device space. Add a comment so that later readers are aware of this behavior. This behavior makes one caller, btrfs_can_relocate(), unreliable at determining the device free space. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:00 +02:00
Qu Wenruo	9e3246a5f6	btrfs: volumes: Unexport find_free_dev_extent_start() This function is only used locally in find_free_dev_extent(), no external callers. So unexport it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:59:00 +02:00
YueHaibing	99fccf33c2	btrfs: remove set but not used variable 'offset' Fixes gcc '-Wunused-but-set-variable' warning: fs/btrfs/volumes.c: In function __btrfs_map_block: fs/btrfs/volumes.c:6023:6: warning: variable offset set but not used [-Wunused-but-set-variable] It is not used any more since commit 343abd1c0ca9 ("btrfs: Use btrfs_get_io_geometry appropriately") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-09-09 14:58:58 +02:00
Filipe Manana	d7cd4dd907	Btrfs: fix sysfs warning and missing raid sysfs directories In the 5.3 merge window, commit `7c7e301406` ("btrfs: sysfs: Replace default_attrs in ktypes with groups"), we started using the member "defaults_groups" for the kobject type "btrfs_raid_ktype". That leads to a series of warnings when running some test cases of fstests, such as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those warnings are like the following: [116648.059212] kernfs: can not remove 'total_bytes', no directory [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000 [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001 [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.080585] Call Trace: [116648.081262] remove_files+0x31/0x70 [116648.081929] sysfs_remove_group+0x38/0x80 [116648.082596] sysfs_remove_groups+0x34/0x70 [116648.083258] kobject_del+0x20/0x60 [116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.084608] close_ctree+0x19a/0x380 [btrfs] [116648.085278] 
generic_shutdown_super+0x6c/0x110 [116648.085951] kill_anon_super+0xe/0x30 [116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.087289] deactivate_locked_super+0x3a/0x70 [116648.087956] cleanup_mnt+0xb4/0x160 [116648.088620] task_work_run+0x7e/0xc0 [116648.089285] exit_to_usermode_loop+0xfa/0x100 [116648.089933] do_syscall_64+0x1cb/0x220 [116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.091197] RIP: 0033:0x7f9cdc073b37 (...) [116648.100046] ---[ end trace 22e24db328ccadf8 ]--- [116648.100618] ------------[ cut here ]------------ [116648.101175] kernfs: can not remove 'used_bytes', no directory [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000 [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001 [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.116649] Call Trace: [116648.117326] remove_files+0x31/0x70 [116648.117997] sysfs_remove_group+0x38/0x80 [116648.118671] sysfs_remove_groups+0x34/0x70 [116648.119342] kobject_del+0x20/0x60 [116648.120022] 
btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.120707] close_ctree+0x19a/0x380 [btrfs] [116648.121396] generic_shutdown_super+0x6c/0x110 [116648.122057] kill_anon_super+0xe/0x30 [116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.123335] deactivate_locked_super+0x3a/0x70 [116648.123961] cleanup_mnt+0xb4/0x160 [116648.124586] task_work_run+0x7e/0xc0 [116648.125210] exit_to_usermode_loop+0xfa/0x100 [116648.125830] do_syscall_64+0x1cb/0x220 [116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.127080] RIP: 0033:0x7f9cdc073b37 (...) [116648.135923] ---[ end trace 22e24db328ccadf9 ]--- These happen because, during the unmount path, we call kobject_del() for raid kobjects that are not fully initialized, meaning that we set their ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set their parent kobject, which is done through btrfs_add_raid_kobjects(). We have this split raid kobject setup since commit `75cb379d26` ("btrfs: defer adding raid type kobject until after chunk relocation") in order to avoid triggering reclaim during contexts where we can not (either we are holding a transaction handle or some lock required by the transaction commit path), so that we do the calls to kobject_add(), which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects() in contexts where it is safe to trigger reclaim. That change expected that a new raid kobject can only be created either when mounting the filesystem or after raid profile conversion through the relocation path. However, we can have a new raid kobject created in at least two other cases: 1) During device replace (or scrub) after adding a device to the filesystem. The replace procedure (and scrub) do calls to btrfs_inc_block_group_ro() which can allocate a new block group with a new raid profile (because we now have more devices). This can be triggered by test cases btrfs/027 and btrfs/176. 2) During a degraded mount through any write path. 
This can be triggered by test case btrfs/124. Fixing this by adding extra calls to btrfs_add_raid_kobjects() not only makes things more complex and fragile, but can also introduce deadlocks with reclaim in the following way: 1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or anywhere in the replace/scrub path will cause a deadlock with reclaim because if reclaim happens and a transaction commit is triggered, the transaction commit path will block at btrfs_scrub_pause(). 2) During degraded mounts it is essentially impossible to figure out where to add extra calls to btrfs_add_raid_kobjects(), because allocation of a block group with a new raid profile can happen anywhere, which means we can't safely figure out which contexts are safe for reclaim, as we can either hold a transaction handle or some lock needed by the transaction commit path. So it is too complex and error prone to have this split setup of raid kobjects. So fix the issue by consolidating the setup of the kobjects in a single place, at link_block_group(), and set up a nofs context there in order to prevent reclaim being triggered by the memory allocations done through the call chain of kobject_add(). Besides fixing the sysfs warnings during kobject_del(), this also ensures the sysfs directories for the new raid profiles end up created and visible to users (a bug that existed before the 5.3 commit `7c7e301406` ("btrfs: sysfs: Replace default_attrs in ktypes with groups")). Fixes: `75cb379d26` ("btrfs: defer adding raid type kobject until after chunk relocation") Fixes: `7c7e301406` ("btrfs: sysfs: Replace default_attrs in ktypes with groups") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-08-07 16:25:44 +02:00
Johannes Thumshirn	373c3b80e4	btrfs: don't leak extent_map in btrfs_get_io_geometry() btrfs_get_io_geometry() calls btrfs_get_chunk_map() to acquire a reference on a extent_map, but on normal operation it does not drop this reference anymore. This leads to excessive kmemleak reports. Always call free_extent_map(), not just in the error case. Fixes: `5f1411265e` ("btrfs: Introduce btrfs_io_geometry infrastructure") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-17 17:03:36 +02:00
Josef Bacik	8719aaae8d	btrfs: move space_info to space-info.h Migrate the struct definition and the one helper that's in ctree.h into space-info.h Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-02 12:30:51 +02:00
Nikolay Borisov	89b798ad1b	btrfs: Use btrfs_get_io_geometry appropriately Presently btrfs_map_block is used not only to do everything necessary to map a bio to the underlying allocation profile but it's also used to identify how much data could be written based on btrfs' stripe logic without actually submitting anything. This is achieved by passing NULL for 'bbio_ret' parameter. This patch refactors all callers that require just the mapping length by switching them to using btrfs_io_geometry instead of calling btrfs_map_block with a special NULL value for 'bbio_ret'. No functional change. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-02 12:30:50 +02:00
Nikolay Borisov	5f1411265e	btrfs: Introduce btrfs_io_geometry infrastructure Add a structure that holds various parameters for IO calculations and a helper that fills the values. This will help further refactoring and reduction of functions that in some way open-coded the calculations. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-02 12:30:49 +02:00
Filipe Manana	9e967495e0	Btrfs: prevent send failures and crashes due to concurrent relocation Send always operates on read-only trees and always expected that while it is in progress, nothing changes in those trees. Due to that expectation and the fact that send is a read-only operation, it operates on commit roots and does not hold transaction handles. However relocation can COW nodes and leafs from read-only trees, which can cause unexpected failures and crashes (hitting BUG_ONs). This can happen as follows: while send is using a node/leaf, it gets COWed, the transaction used to COW it is committed, a new transaction starts, the extent previously used for that node/leaf gets allocated, possibly for another tree, and the respective extent buffer's content changes while send is still using it. When this happens send normally fails with EIO being returned to user space and messages like the following are found in dmesg/syslog: [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253 [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984 Other times, less often, we hit a BUG_ON() because an extent buffer that send is using used to be a node, and while send is still using it, it got COWed and reused as a leaf, producing the following trace: [ 3478.466280] ------------[ cut here ]------------ [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806! [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1 [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) 
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246 [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002 [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000 [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708 [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000 [ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0 [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3478.479003] Call Trace: [ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0 [ 3478.480202] tree_advance+0x173/0x1d0 [btrfs] [ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs] [ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs] [ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs] [ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs] [ 3478.483581] ? rq_clock_task+0x2e/0x60 [ 3478.484086] ? wake_up_new_task+0x1f3/0x370 [ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0 [ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 3478.485552] do_vfs_ioctl+0xa2/0x6f0 [ 3478.486016] ? __fget+0x113/0x200 [ 3478.486467] ksys_ioctl+0x70/0x80 [ 3478.486911] __x64_sys_ioctl+0x16/0x20 [ 3478.487337] do_syscall_64+0x60/0x1b0 [ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7 (...) 
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7 [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005 [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700 [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005 [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001 (...) [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]--- Another possibility, much less likely to happen, is that send will not fail but the contents of the stream it produces may not be correct. To avoid this, do not allow send and relocation (balance) to run in parallel. In the long term the goal is to allow for both to be able to run concurrently without any problems, but that will take a significant effort in development and testing. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-02 12:30:49 +02:00
David Sterba	c7369b3fae	btrfs: add mask for all RAID1 types Preparatory patch for additional RAID1 profiles with more copies. The mask will contain 3-copy and 4-copy, most of the checks for plain RAID1 work the same for the other profiles. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-02 12:30:48 +02:00
Nikolay Borisov	0b6f5d408b	btrfs: Add comments on locking of several device-related fields Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:59 +02:00
David Sterba	cff8267228	btrfs: read number of data stripes from map only once There are several places that call nr_data_stripes, but this value does not change. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:58 +02:00
David Sterba	158da513b1	btrfs: refactor helper for bg flags to name conversion The helper lacks the btrfs_ prefix and the parameter is the raw blockgroup type, so none of the callers has to do the flags -> index conversion. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:58 +02:00
David Sterba	e3ecdb3fde	btrfs: factor out devs_max setting in __btrfs_alloc_chunk Merge the repeated code before the if-else block. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:57 +02:00
David Sterba	946c9256c6	btrfs: factor out helper for counting data stripes Factor the sequence of ifs to a helper, the 'data stripes' here means the number of stripes without redundancy and parity. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:57 +02:00
David Sterba	44b28adafd	btrfs: use raid_attr table for btrfs_bg_type_to_factor The factor is the number of copies. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:57 +02:00
David Sterba	6079e12cdb	btrfs: use raid_attr table to find profiles for integrity lowering Replace open coded list of the profiles by selecting them from the raid_attr table. The criteria are now more explicit, we need profiles that have more than 1 copy of the data or can reconstruct the data with a missing device. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:57 +02:00
David Sterba	081db89b13	btrfs: use raid_attr to get allowed profiles for balance conversion Iterate over the table and gather all allowed profiles for a given number of devices, instead of open coding. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:56 +02:00
David Sterba	fc9a2ac77c	btrfs: use raid_attr in btrfs_chunk_max_errors The number of tolerated failures is stored in the raid_attr table, use it. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:56 +02:00
David Sterba	c8bf1b6703	btrfs: remove mapping tree structures indirection fs_info::mapping_tree is the physical<->logical mapping tree and uses the same underlying structure as extents, but is embedded to another structure. There are no other members and this indirection is useless. No functional change. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:56 +02:00
David Sterba	49cc180ca9	btrfs: raid56: allow the exact minimum number of devices for balance convert The minimum number of devices for RAID5 is 2, though this is only a somewhat expensive RAID1, and for RAID6 it's 3, which is a triple copy that works only with 3 devices. mkfs.btrfs allows that and mounting such a filesystem also works, so the conversion via balance filters is inconsistent with the others and we should not prevent it. Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:56 +02:00
David Sterba	0ee5f8ae08	btrfs: fix minimum number of chunk errors for DUP The list of profiles in btrfs_chunk_max_errors lists DUP as a profile able to tolerate 1 device missing. Though this profile is special with 2 copies, it still needs the device, unlike the others. Looking at the history of changes, there's no clear reason why DUP is there, functions were refactored and blocks of code merged to one helper. `d20983b40e` Btrfs: fix writing data into the seed filesystem - factor code to a helper `de11cc12df` Btrfs: don't pre-allocate btrfs bio - unrelated change, DUP still in the list with max errors 1 `a236aed14c` Btrfs: Deal with failed writes in mirrored configurations - introduced the max errors, leaves DUP and RAID1 in the same group Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-07-01 13:34:55 +02:00
David Sterba	65237ee3b6	btrfs: get fs_info from device in btrfs_rm_dev_replace_free_srcdev We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:48 +02:00
David Sterba	163e97ee0d	btrfs: get fs_info from device in btrfs_scrub_cancel_dev We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:47 +02:00
David Sterba	f331a9525f	btrfs: get fs_info from device in btrfs_rm_dev_item We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:47 +02:00
David Sterba	196c9d8de8	btrfs: get fs_info from trans in btrfs_run_dev_stats We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:43 +02:00
David Sterba	5c466629e2	btrfs: get fs_info from trans in btrfs_finish_sprout We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:42 +02:00
David Sterba	6f8e0fc77c	btrfs: get fs_info from trans in init_first_rw_device We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:42 +02:00
David Sterba	9b7a2440ae	btrfs: get fs_info from trans in btrfs_create_tree We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:41 +02:00
David Sterba	178507595c	btrfs: get fs_info from eb in read_one_dev We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
David Sterba	9690ac0987	btrfs: get fs_info from eb in read_one_chunk We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
David Sterba	ddaf1d5aef	btrfs: get fs_info from eb in btrfs_check_chunk_valid We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
David Sterba	6ec0896c4c	btrfs: get fs_info from eb in should_balance_chunk We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
Nikolay Borisov	e74e3993bc	btrfs: Factor out in_range macro This is used in more than one place, so let's factor it out into ctree.h. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:37 +02:00
Nikolay Borisov	60dfdf25bd	btrfs: Remove 'trans' argument from find_free_dev_extent(_start) Now that these functions no longer require a handle to transaction to inspect pending/pinned chunks the argument can be removed. At the same time also remove any surrounding code which acquired the handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:37 +02:00
Jeff Mahoney	1c11b63eff	btrfs: replace pending/pinned chunks lists with io tree The pending chunks list contains chunks that are allocated in the current transaction but haven't been created yet. The pinned chunks list contains chunks that are being released in the current transaction. Both describe chunks that are not reflected on disk as in use but are unavailable just the same. The pending chunks list is anchored by the transaction handle, which means that we need to hold a reference to a transaction when working with the list. The way we use them is by iterating over both lists to perform comparisons on the stripes they describe for each device. This is backwards and requires that we keep a transaction handle open while we're trimming. This patchset adds an extent_io_tree to btrfs_device that maintains the allocation state of the device. Extents are set dirty when chunks are first allocated -- when the extent maps are added to the mapping tree. They're cleared when last removed -- when the extent maps are removed from the mapping tree. This matches the lifespan of the pending and pinned chunks list and allows us to do trims on unallocated space safely without pinning the transaction for what may be a lengthy operation. We can also use this io tree to mark which chunks have already been trimmed so we don't repeat the operation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:37 +02:00
Nikolay Borisov	8e75fd893b	btrfs: Stop using call_rcu for device freeing btrfs_device structs are freed from RCU context since device iteration is protected by RCU. Currently this is achieved by using call_rcu since no blocking functions are called within btrfs_free_device. Future refactoring of pending/pinned chunks will require calling sleeping functions. This patch is in preparation for these changes by simply switching from RCU callbacks to explicit calls of synchronize_rcu and calling btrfs_free_device directly. This is functionally equivalent, making sure that there are no readers at that time. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:37 +02:00
Nikolay Borisov	39e264a40d	btrfs: Populate ->orig_block_len during read_one_chunk Chunks read from disk currently don't get their ->orig_block_len member set. In contrast, when a new chunk is allocated, the respective extent_map's ->orig_block_len is assigned the size of the stripe of this chunk. Let's apply the same strategy for chunks which are read from disk. Not only does this codify the invariant that ->orig_block_len always contains the size of the stripe for a chunk (when the em belongs to the mapping tree), but it's also a preparatory patch for further work around tracking chunk allocation in an extent tree rather than pinned/pending lists. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:36 +02:00
Nikolay Borisov	61d0d0d2cb	btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink During device shrink pinned/pending chunks (i.e. those which have been deleted/created respectively, in the current transaction and haven't touched disk) need to be accounted when doing device shrink. Presently this happens after the main relocation loop in btrfs_shrink_device, which could lead to making another go in the body of the function. Since there is no hard requirement to perform pinned/pending chunks handling after the relocation loop, move the code before it. This leads to simplifying the code flow around - i.e. no need to use 'goto again'. A notable side effect of this change is that modification of the device's size requires a transaction to be started and committed before the relocation loop starts. This is necessary to ensure that relocation process sees the shrunk device size. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:36 +02:00
Nikolay Borisov	bbbf7243d6	btrfs: combine device update operations during transaction commit We currently overload the pending_chunks list to handle updating btrfs_device->commit_bytes_used. We don't actually care about the extent mapping or even the device mapping for the chunk - we just need the device, and we can end up processing it multiple times. The fs_devices->resized_list does more or less the same thing, but with the disk size. They are called consecutively during commit and have more or less the same purpose. We can combine the two lists into a single list that attaches to the transaction and contains a list of devices that need updating. Since we always add the device to a list when we change bytes_used or disk_total_size, there's no harm in copying both values at once. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:36 +02:00
Qu Wenruo	ab4ba2e133	btrfs: tree-checker: Verify dev item [BUG] For a fuzzed image whose DEV_ITEM has an invalid total_bytes of 0, the kernel will just panic: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 #PF error: [normal kernel read fault] PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9 RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0 Call Trace: open_ctree+0x160d/0x2149 btrfs_mount_root+0x5b2/0x680 [CAUSE] If device extent verification finds a device with 0 total_bytes, it assumes it's a seed dummy and then searches for seed devices. But in this case, there is no seed device at all, causing a NULL pointer dereference. [FIX] Since this is caused by a fuzzed image, let's go the tree-checker way and just add a new verification for the device item. Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:32 +02:00
Qu Wenruo	075cb3c78f	btrfs: tree-checker: Check chunk item at tree block read time Since we have btrfs_check_chunk_valid() in tree-checker, let's do chunk item verification in tree-checker too. Since the tree-checker is run at endio time, if one chunk leaf fails chunk verification, we can still retry the other copy, making btrfs more robust to fuzzed image as we may still get a good chunk item. Also since we have done chunk verification in tree block read time, skip the btrfs_check_chunk_valid() call in read_one_chunk() if we're reading chunk items from leaf. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:32 +02:00
Qu Wenruo	82fc28fbed	btrfs: Move btrfs_check_chunk_valid() to tree-check.[ch] and export it By function, chunk item verification is more suitable to be done inside tree-checker. So move btrfs_check_chunk_valid() to tree-checker.c and export it. And since it's now moved to tree-checker, also add a better comment for what this function is doing. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:31 +02:00
Arnd Bergmann	290342f661	btrfs: use BUG() instead of BUG_ON(1) BUG_ON(1) leads to bogus warnings from clang when CONFIG_PROFILE_ANNOTATED_BRANCHES is set: fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] BUG_ON(1); ^~~~~~~~~ include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^~~~~~~~~~~~~~~~~~~ include/linux/compiler.h:48:23: note: expanded from macro 'unlikely' # define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x))) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here max_chunk_size); ^~~~~~~~~~~~~~ include/linux/kernel.h:860:36: note: expanded from macro 'min' #define min(x, y) __careful_cmp(x, y, <) ^ include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp' __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op)) ^ include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once' typeof(y) unique_y = (y); \ ^ fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true BUG_ON(1); ^ include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^ fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning u64 max_chunk_size; ^ = 0 Change it to BUG() so clang can see that this code path can never continue. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:28 +02:00
David Sterba	0cc068e6ee	btrfs: don't report readahead errors and don't update statistics As readahead is an optimization, all errors are usually filtered out, but still properly handled when the real read call is done. The commit `5e9d398240` ("btrfs: readpages() should submit IO as read-ahead") added REQ_RAHEAD to readpages() because that's only used for readahead (despite what one would expect from the callback name). This causes a flood of messages and inflated read error stats, so skip reporting in case it's readahead. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403 Reported-by: LimeTech <tomm@lime-technology.com> Fixes: `5e9d398240` ("btrfs: readpages() should submit IO as read-ahead") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: David Sterba <dsterba@suse.com>	2019-03-13 17:14:27 +01:00
Johannes Thumshirn	349ae63f40	btrfs: ensure that a DUP or RAID1 block group has exactly two stripes We recently had a customer issue with a corrupted filesystem. When trying to mount this image btrfs panicked with a division by zero in calc_stripe_length(). The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length() takes this value and divides it by the number of copies the RAID profile is expected to have to calculate the amount of data stripes. As a DUP profile is expected to have 2 copies this division resulted in 1/2 = 0. Later then the 'data_stripes' variable is used as a divisor in the stripe length calculation which results in a division by 0 and thus a kernel panic. When encountering a filesystem with a DUP block group and a 'num_stripes' value unequal to 2, refuse mounting as the image is corrupted and will lead to unexpected behaviour. Code inspection showed a RAID1 block group has the same issues. Fixes: `e06cd3dd7c` ("Btrfs: add validadtion checks for chunk loading") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-02-25 14:13:41 +01:00
Anand Jain	7faad6e25c	btrfs: fix comment its device list mutex not volume lock We have killed the volume mutex (commit: `dccdb07bc9` btrfs: kill btrfs_fs_info::volume_mutex). This trivial one seems to have escaped. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-02-25 14:13:37 +01:00
Anand Jain	228a73abde	btrfs: introduce new ioctl to unregister a btrfs device Support for a new command that can be used e.g. as '$ btrfs device scan --forget [dev]' (the final name may change though) to undo the effects of 'btrfs device scan [dev]'. For this purpose this patch proposes to use ioctl #5 as it was empty and is next to the SCAN ioctl. The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device (/dev/btrfs-control) to unregister one or all devices that are not mounted. The argument is struct btrfs_ioctl_vol_args, ::name specifies the device path. To unregister all devices, the path is an empty string. Again, the devices are removed only if they aren't part of a mounted filesystem. This new ioctl provides: - release of unwanted btrfs_fs_devices and btrfs_devices structures from memory if the device is not going to be mounted - ability to mount a filesystem in degraded mode, when one device is corrupted like in split brain raid1 - running test cases which would require reloading the kernel module but this is not possible e.g. due to a mounted filesystem or built-in Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-02-25 14:13:30 +01:00
Anand Jain	09ba3bc9dd	btrfs: merge btrfs_find_device and find_device Both btrfs_find_device() and find_device() do the same thing except that the latter does not take the seed device into account in the device scanning context. We can merge them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-02-25 14:13:24 +01:00
Anand Jain	70bc7088aa	btrfs: refactor btrfs_free_stale_devices() to get return value Preparatory patch to add an ioctl that allows forgetting a device (i.e. the reverse of scan). Refactors btrfs_free_stale_devices() to obtain a return status, as this function can fail if it can't find the given path (returns -ENOENT) or when trying to delete a mounted device (returns -EBUSY). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-02-25 14:13:23 +01:00
Anand Jain	e4319cd9ca	btrfs: refactor btrfs_find_device() take fs_devices as argument btrfs_find_device() accepts fs_info as an argument and retrieves fs_devices from fs_info. Instead use fs_devices, so that this function can be used in non-mount (during device scanning) context as well. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-02-25 14:13:23 +01:00
Anand Jain	6e927cebe2	btrfs: cleanup btrfs_find_device_by_devspec() btrfs_find_device_by_devspec() finds the device by @devid or by @device_path. This patch makes code flow easy to read by open coding the else part and renames devpath to device_path. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-02-25 14:13:23 +01:00
Anand Jain	d95a830c78	btrfs: merge btrfs_find_device_missing_or_by_path() into parent btrfs_find_device_missing_or_by_path() is relatively small function, and its only parent btrfs_find_device_by_devspec() is small as well. Besides there are a number of find_device functions. Merge btrfs_find_device_missing_or_by_path() into its parent. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-02-25 14:13:22 +01:00
Al Viro	92900e5160	btrfs: fix potential oops in device_list_add alloc_fs_devices() can return ERR_PTR(-ENOMEM), so dereferencing its result before the check for IS_ERR() is a bad idea. Fixes: `d1a6300282` ("btrfs: add members to fs_devices to track fsid changes") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-01-30 18:16:40 +01:00
Qu Wenruo	1b3922a8bc	btrfs: Use real device structure to verify dev extent [BUG] Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel message: BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0 BTRFS error (device dm-6): failed to verify dev extents against chunks: -117 BTRFS error (device dm-6): open_ctree failed [CAUSE] Commit `cf90d884b3` ("btrfs: Introduce mount time chunk <-> dev extent mapping check") introduced a strict check on dev extents. We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and depend only on @devid to find the real device. For seed devices, we call clone_fs_devices() in open_seed_devices() to allow us to search seed devices directly. However clone_fs_devices() just populates devices with devid and dev uuid, without populating other essential members, like disk_total_bytes. This means any device returned by btrfs_find_device(fs_info, devid, NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev extents on the seed device will not pass the device boundary check. [FIX] This patch will try to verify the device returned by btrfs_find_device() and if it's a dummy then re-search in seed devices. Fixes: `cf90d884b3` ("btrfs: Introduce mount time chunk <-> dev extent mapping check") CC: stable@vger.kernel.org # 4.19+ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-01-10 17:13:00 +01:00
Andrea Gelmini	52042d8e82	btrfs: Fix typos in comments and strings The typos accumulate over time so once in a while they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:50 +01:00
Nikolay Borisov	15c8276302	btrfs: Remove 1st shrink/grow phase from balance The first step of the rebalance process ensures there is 1MiB free on each device. This number seems rather small. And in fact when talking to the original authors their opinions were: "man that's a little bonkers" "i don't think we even need that code anymore" "I think it was there to make sure we had room for the blank 1M at the beginning. I bet it goes all the way back to v0" "we just don't need any of that tho, i say we just delete it" Clearly, this piece of code has lost its original intent throughout the years. It doesn't really bring any real practical benefits to the relocation process. Additionally, this patch makes the balance process more lightweight by removing a pair of shrink/grow operations which are rather expensive for heavily populated filesystems. This is mainly due to shrink requiring relocating block groups, involving heavy use of the btree. The intermediate shrink/grow can fail and leave the filesystem in a middle state that would need to be changed back by the user. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:49 +01:00
Johannes Thumshirn	7073017aeb	btrfs: use offset_in_page instead of open-coding it Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an offset into a page. So replace them by the offset_in_page() macro instead of open-coding it if they're not used as an alignment check. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:45 +01:00
David Sterba	cb5583dd52	btrfs: dev-replace: open code trivial locking helpers The dev-replace locking functions are now trivial wrappers around rw semaphore that can be used directly everywhere. No functional change. Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:45 +01:00
David Sterba	53176dde0a	btrfs: dev-replace: remove custom read/write blocking scheme After the rw semaphore has been added, the custom blocking using ::blocking_readers and ::read_lock_wq is redundant. The blocking logic in __btrfs_map_block is replaced by extending the time the semaphore is held, that has the same blocking effect on writes as the previous custom scheme that waited until ::blocking_readers was zero. Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:45 +01:00
Nikolay Borisov	da12fe5414	btrfs: Refactor btrfs_merge_bio_hook This function really checks whether adding more data to the bio will straddle a stripe/chunk. So first let's give it a more appropriate name - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never used, so just remove it. Thirdly, pages are submitted to either btree or data inodes so it's guaranteed that tree->ops is set, so replace the check with an ASSERT. Finally, document the parameters of the function. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:42 +01:00
Anand Jain	7333bd02dc	btrfs: balance: print to system log when balance ends or is paused Print a kernel log message when the balance ends, either for cancel or completed or if it is paused. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:39 +01:00
Anand Jain	56fc37d936	btrfs: balance: print args during start and resume The information about balance arguments is important for system audit, this patch prints the textual representation when balance starts or is resumed. Example command: $ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs Example kernel log output: BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1 Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog, simplify code ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:39 +01:00
Anand Jain	f89e09cf45	btrfs: add helper to describe block group flags Factor out helper that describes block group flags from describe_relocation. The result will not be longer than the given size. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:39 +01:00
Filipe Manana	5a8067c0d1	Btrfs: fix access to available allocation bits when starting balance The available allocation bits members from struct btrfs_fs_info are protected by a sequence lock, and when starting balance we access them incorrectly in two different ways: 1) In the read sequence lock loop at btrfs_balance() we use the values we read from fs_info->avail_*_alloc_bits and we can immediately do actions that have side effects and can not be undone (printing a message and jumping to a label). This is wrong because a retry might be needed, so our actions must not have side effects and must be repeatable as long as read_seqretry() returns a non-zero value. In other words, we were essentially ignoring the sequence lock; 2) Right below the read sequence lock loop, we were reading the values from avail_metadata_alloc_bits and avail_data_alloc_bits without any protection from concurrent writers, that is, reading them outside of the read sequence lock critical section. So fix this by making sure we only read the available allocation bits while in a read sequence lock critical section and that what we do in the critical section is repeatable (has nothing that can not be undone) so that any eventual retry that is needed is handled properly. Fixes: `de98ced9e7` ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits") Fixes: `1450612797` ("btrfs: fix a bogus warning when converting only data or metadata") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:39 +01:00
Nikolay Borisov	cc5de4e702	btrfs: Handle final split-brain possibility during fsid change This patch lands the last case which needs to be handled by the fsid change code. Namely, this is the case where a multidisk filesystem has already undergone at least one successful fsid change, i.e. all disks have the METADATA_UUID incompat bit, and power failure occurs as another fsid change is in progress. When such an event occurs, disks could be split in 2 groups. One of the groups will have both METADATA_UUID and CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs. The other group of disks will have only METADATA_UUID bit set and their fsid will be different than the one in disks in the first group. Here we look at the following cases: a) A disk from the first group is scanned first, so fs_devices is created with stale fsid/metadata_uuid. Then when a disk from the second group is scanned it needs to first check whether there exists such an fs_devices that has fsid_change set to true (because it was created with a disk having the CHANGING_FSID_V2 flag), the metadata_uuid and fsid of the fs_devices will be different (since it was created by a disk which already has had at least 1 successful fsid change) and finally the metadata_uuid of the fs_devices will equal that of the currently scanned disk (because metadata_uuid never really changes). When the correct fs_devices is found the information from the scanned disk will replace the current one in fs_devices since the scanned disk will have higher generation number. b) A disk from the second group is scanned so fs_devices is created as usual with differing fsid/metadata_uuid. Then when a disk from the first group is scanned the code detects that it has both CHANGING_FSID_V2 and METADATA_UUID flags set and will search for fs_devices that has differing metadata_uuid/fsid and whose metadata_uuid is the same as that of the scanned device. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:38 +01:00
Nikolay Borisov	7a62d0f073	btrfs: Handle one more split-brain scenario during fsid change This commit continues hardening the scanning code to handle cases where power loss could have caused disks in a multi-disk filesystem to be in inconsistent state. Namely handle the situation that can occur when some of the disks in multi-disk fs have completed their fsid change, i.e. they have METADATA_UUID incompat flag set, have cleared the CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At the same time the other half of the disks will have their fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag. This is handled by introducing code in the scan path which: a) Handles the case when a device with CHANGING_FSID_V2 flag is scanned and as a result btrfs_fs_devices is created with matching fsid/metadata_uuid. Subsequently, when a device with completed fsid change is scanned it will detect this via the new code in find_fsid, i.e. that an fs_devices exists whose fsid_change flag is set to true, whose metadata_uuid/fsid match, and whose metadata_uuid matches that of the scanned device. In this case, it's important to note that the device which has completed its fsid change will have a higher generation number than the device with CHANGING_FSID_V2 flag set, so its superblock will be used during mount. To prevent an assertion triggering because the sb used for mounting will have differing fsid/metadata_uuid than the ones in the fs_devices struct also add code in device_list_add which overwrites the values in fs_devices. b) Alternatively we can end up with a device that completed its fsid change being scanned first, which will create the respective btrfs_fs_devices struct with differing fsid/metadata_uuid. In this case when a device with CHANGING_FSID_V2 flag set is scanned it will call the newly added find_fsid_inprogress function which will return the correct fs_devices. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:38 +01:00
Nikolay Borisov	d1a6300282	btrfs: add members to fs_devices to track fsid changes In order to gracefully handle split-brain scenarios during fsid change (which are very unlikely, yet possible), two more pieces of information will be necessary: 1. The highest generation number among all devices registered to a particular btrfs_fs_devices 2. A boolean flag whether a given btrfs_fs_devices was created by a device which had the CHANGING_FSID_V2 flag set. This is a preparatory patch and just introduces the variables as well as code which sets them, their actual use is going to happen in a later patch. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:38 +01:00
Nikolay Borisov	de37aa5131	btrfs: Remove fsid/metadata_fsid fields from btrfs_info Currently btrfs_fs_info structure contains a copy of the fsid/metadata_uuid fields. Same values are also contained in the btrfs_fs_devices structure which fs_info has a reference to. Let's reduce duplication by removing the fields from fs_info and always refer to the ones in fs_devices. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:37 +01:00
Nikolay Borisov	7239ff4b2b	btrfs: Introduce support for FSID change without metadata rewrite This field is going to be used when the user wants to change the UUID of the filesystem without having to rewrite all metadata blocks. This field adds another level of indirection such that when the FSID is changed what really happens is the current UUID (the one with which the fs was created) is copied to the 'metadata_uuid' field in the superblock and a new incompat flag, METADATA_UUID, is set. When the kernel detects this flag is set it knows that the superblock in fact has 2 UUIDs: 1. Is the UUID which is user-visible, currently known as FSID. 2. Metadata UUID - this is the UUID which is stamped into all on-disk datastructures belonging to this file system. When the new incompat flag is present device scanning checks whether both fsid/metadata_uuid of the scanned device match any of the registered filesystems. When the flag is not set then both UUIDs are equal and only the FSID is retained on disk, metadata_uuid is set only in-memory during mount. Additionally a new metadata_uuid field is also added to the fs_info struct. It's initialised either with the FSID in case METADATA_UUID incompat flag is not set or with the metadata_uuid of the superblock otherwise. This commit introduces the new fields as well as the new incompat flag and switches all users of the fsid to the new logic. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor updates in comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:37 +01:00
Nikolay Borisov	64bc6c2a34	btrfs: Remove superfluous check from btrfs_remove_chunk It's unnecessary to check map->stripes[i].dev for NULL given its value is already set and dereferenced above the check. No functional changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:36 +01:00
Anand Jain	a9261d4125	btrfs: harden against duplicate fsid on scanned devices It's not that impossible to imagine that a device OR a btrfs image is copied just by using the dd or the cp command. In that case both copies of the btrfs filesystem will have the same fsid. If automount is enabled on the system, the copied FS gets scanned. We have a known bug in btrfs, that we let the device path be changed after the device has been mounted. So using this loophole the newly copied device would appear as if it's mounted immediately after it's been copied. For example: Initially.. /dev/mmcblk0p4 is mounted as / $ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT mmcblk0 179:0 0 29.2G 0 disk |-mmcblk0p4 179:4 0 4G 0 part / |-mmcblk0p2 179:2 0 500M 0 part /boot |-mmcblk0p3 179:3 0 256M 0 part [SWAP] `-mmcblk0p1 179:1 0 256M 0 part /boot/efi $ btrfs fi show Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid 1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4 Copy mmcblk0 to sda $ dd if=/dev/mmcblk0 of=/dev/sda And immediately after the copy completes the change in the device superblock is notified which the automount scans using btrfs device scan and the new device sda becomes the mounted root device. $ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 1 14.9G 0 disk |-sda4 8:4 1 4G 0 part / |-sda2 8:2 1 500M 0 part |-sda3 8:3 1 256M 0 part `-sda1 8:1 1 256M 0 part mmcblk0 179:0 0 29.2G 0 disk |-mmcblk0p4 179:4 0 4G 0 part |-mmcblk0p2 179:2 0 500M 0 part /boot |-mmcblk0p3 179:3 0 256M 0 part [SWAP] `-mmcblk0p1 179:1 0 256M 0 part /boot/efi $ btrfs fi show / Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid 1 size 4.00GiB used 3.00GiB path /dev/sda4 The bug is quite nasty in that you can unmount neither /dev/sda4 nor /dev/mmcblk0p4. And the problem does not get solved until you move sda to another system to change its fsid using the 'btrfstune -u' command. 
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:34 +01:00
Hans van Kranenburg	b50836edf9	btrfs: introduce nparity raid_attr Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an nparity field in raid_attr. Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:34 +01:00
Hans van Kranenburg	da612e31ae	btrfs: fix ncopies raid_attr for RAID56 RAID5 and RAID6 profile store one copy of the data, not 2 or 3. These values are not yet used anywhere so there's no change. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:33 +01:00
Hans van Kranenburg	baf92114c7	btrfs: alloc_chunk: fix more DUP stripe size handling Commit `92e222df7b` "btrfs: alloc_chunk: fix DUP stripe size handling" fixed calculating the stripe_size for a new DUP chunk. However, the same calculation reappears a bit later, and that one was not changed yet. The resulting bug that is exposed is that the newly allocated device extents ('stripes') can have a few MiB overlap with the next thing stored after them, which is another device extent or the end of the disk. The scenario in which this can happen is: * The block device for the filesystem is less than 10GiB in size. * The amount of contiguous free unallocated disk space chosen to use for chunk allocation is 20% of the total device size, or a few MiB more or less. An example: - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB) - There's 1578MiB unallocated raw disk space left in one contiguous piece. In this case stripe_size is first calculated as 789MiB, (half of 1578MiB). Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we enter the if block. Now stripe_size value is immediately overwritten while calculating an adjusted value based on max_chunk_size, which ends up as 788MiB. Next, the value is rounded up to a 16MiB boundary, 800MiB, which is actually more than the value we had before. However, the last comparison fails to detect this, because it's comparing the value with the total amount of free space, which is about twice the size of stripe_size. In the example above, this means that the resulting raw disk space being allocated is 1600MiB, while only a gap of 1578MiB has been found. The second device extent object for this DUP chunk will overlap for 22MiB with whatever comes next. The underlying problem here is that the stripe_size is reused all the time for different things. So, when entering the code in the if block, stripe_size is immediately overwritten with something else. 
If later we decide we want to have the previous value back, then the logic to compute it was copy pasted in again. With this change, the value in stripe_size is not unnecessarily destroyed, so the duplicated calculation is not needed any more. Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:33 +01:00
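The arithmetic in the example above can be replayed in a few lines of C. This is a minimal userspace sketch, not the kernel's chunk allocator: the function names are hypothetical, and the "capped" variant is only one way of keeping the per-device value around (it limits the rounded result to the space available per device extent), assumed here for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Round value up to the next multiple of align. */
static uint64_t round_up_u64(uint64_t value, uint64_t align)
{
	assert(align > 0);
	return (value + align - 1) / align * align;
}

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical replay of the flawed sequence: the rounded value is only
 * compared against the whole free gap, not against the gap per device extent.
 */
static uint64_t stripe_size_uncapped(uint64_t max_avail, uint64_t max_chunk_size,
				     uint64_t dev_stripes, uint64_t data_stripes)
{
	uint64_t stripe_size = max_avail / dev_stripes;

	if (stripe_size * data_stripes > max_chunk_size) {
		stripe_size = max_chunk_size / data_stripes;
		stripe_size = round_up_u64(stripe_size, 16 * MIB);
		stripe_size = min_u64(max_avail, stripe_size);
	}
	return stripe_size;
}

/*
 * Hypothetical variant that keeps the first value and never exceeds it:
 * the rounded size is limited to max_avail / dev_stripes.
 */
static uint64_t stripe_size_capped(uint64_t max_avail, uint64_t max_chunk_size,
				   uint64_t dev_stripes, uint64_t data_stripes)
{
	uint64_t per_extent = max_avail / dev_stripes;
	uint64_t stripe_size = per_extent;

	if (stripe_size * data_stripes > max_chunk_size) {
		stripe_size = round_up_u64(max_chunk_size / data_stripes, 16 * MIB);
		stripe_size = min_u64(per_extent, stripe_size);
	}
	return stripe_size;
}
```

With the numbers from the message (1578MiB free, max_chunk_size of 788MiB, DUP with dev_stripes = 2 and data_stripes = 1), the first function returns 800MiB, so two device extents need 1600MiB and overrun the gap by 22MiB; the second returns 789MiB, which fits exactly.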
Hans van Kranenburg	23f0ff1ec4	btrfs: alloc_chunk: improve chunk size variable name The variable num_bytes is really a way too generic name for a variable in this function. There are a dozen other variables that hold a number of bytes as value. Give it a name that actually describes what it does, which is holding the size of the chunk that we're allocating. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:33 +01:00
Hans van Kranenburg	2f29df4fc2	btrfs: alloc_chunk: do not refurbish num_bytes The variable num_bytes is used to store the chunk length of the chunk that we're allocating. Do not reuse it for something really different in the same function. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:33 +01:00
Nikolay Borisov	fc8a168aa9	btrfs: Check for missing device before bio submission in btrfs_map_bio Before btrfs_map_bio submits all stripe bios it does a number of checks to ensure the device for every stripe is present. However, it doesn't do a DEV_STATE_MISSING check, instead this is relegated to the lower level btrfs_schedule_bio (in the async submission case, sync submission doesn't check DEV_STATE_MISSING at all). Additionally btrfs_schedule_bios does the duplicate device->bdev check which has already been performed in btrfs_map_bio. This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and removes the duplicate device->bdev check. Doing so ensures that no bio cloning/submission happens for both async/sync requests in the face of missing device. This makes the async io submission path slightly shorter in terms of instruction count. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:32 +01:00
Omar Sandoval	60ca842e34	Btrfs: rename and export get_chunk_map The Btrfs swap code is going to need it, so give it a btrfs_ prefix and make it non-static. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:29 +01:00
Omar Sandoval	eede2bf34f	Btrfs: prevent ioctls from interfering with a swap file A later patch will implement swap file support for Btrfs, but before we do that, we need to make sure that the various Btrfs ioctls cannot change a swap file. When a swap file is active, we must make sure that the extents of the file are not moved and that they don't become shared. That means that the following are not safe: - chattr +c (enable compression) - reflink - dedupe - snapshot - defrag Don't allow those to happen on an active swap file. Additionally, balance, resize, device remove, and device replace are also unsafe if they affect an active swapfile. Add a red-black tree of block groups and devices which contain an active swapfile. Relocation checks each block group against this tree and skips it or errors out for balance or resize, respectively. Device remove and device replace check the tree for the device they will operate on. Note that we don't have to worry about chattr -C (disable nocow), which we ignore for non-empty files, because an active swapfile must be non-empty and can't be truncated. We also don't have to worry about autodefrag because it's only done on COW files. Truncate and fallocate are already taken care of by the generic code. Device add doesn't do relocation so it's not an issue, either. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:29 +01:00
Qu Wenruo	05a37c4860	btrfs: volumes: Make sure no dev extent is beyond device boundary Add extra dev extent end check against device boundary. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:27 +01:00
Qu Wenruo	5eb193812a	btrfs: volumes: Make sure there is no overlap of dev extents at mount time Enhance btrfs_verify_dev_extents() to remember previous checked dev extents, so it can verify no dev extents can overlap. Analysis from Hans: "Imagine allocating a DATA|DUP chunk. In the chunk allocator, we first set... max_stripe_size = SZ_1G; max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE ... which is 10GiB. Then... /* we don't want a chunk larger than 10% of writeable space */ max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1), max_chunk_size); Imagine we only have one 7880MiB block device in this filesystem. Now max_chunk_size is down to 788MiB. The next step in the code is to search for max_stripe_size * dev_stripes amount of free space on the device, which is in our example 1GiB * 2 = 2GiB. Imagine the device has exactly 1578MiB free in one contiguous piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail. Next we recalculate the stripe_size (which is actually the device extent length), based on the actual maximum amount of available raw disk space: stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes); stripe_size is now 789MiB. Next we do... data_stripes = num_stripes / ncopies ...where data_stripes ends up as 1, because num_stripes is 2 (the amount of device extents we're going to have), and DUP has ncopies 2. Next there's a check... if (stripe_size * data_stripes > max_chunk_size) ...which matches because 789MiB * 1 > 788MiB. We go into the if code, and next is... stripe_size = div_u64(max_chunk_size, data_stripes); ...which resets stripe_size to max_chunk_size: 788MiB. Next is a fun one... /* bump the answer up to a 16MB boundary */ stripe_size = round_up(stripe_size, SZ_16M); ...which changes stripe_size from 788MiB to 800MiB. We're not done changing stripe_size yet... 
/* But don't go higher than the limits we found while searching * for free extents */ stripe_size = min(devices_info[ndevs - 1].max_avail, stripe_size); This is bad. max_avail is twice the stripe_size (we need to fit 2 device extents on the same device for DUP). The result here is that 800MiB < 1578MiB, so it's unchanged. However, the resulting DUP chunk will need 1600MiB disk space, which isn't there, and the second dev_extent might extend into the next thing (next dev_extent? end of device?) for 22MiB. The last shown line of code relies on a situation where there's twice the value of stripe_size present as value for the variable stripe_size when it's DUP. This was actually the case before commit `92e222df7b` "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote: "[...] in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size." In the previous version of the code, the 16MiB alignment (why is this done, by the way?) would result in a 50% chance that it would actually do an 8MiB alignment for the individual dev_extents, since it was operating on double the size. Does this matter? Does it matter that stripe_size can be set to anything which is not 16MiB aligned because of the amount of remaining available disk space which is just taken? What is the main purpose of this round_up? The most straightforward thing to do seems something like... stripe_size = min( div_u64(devices_info[ndevs - 1].max_avail, dev_stripes), stripe_size ) ..just putting half of the max_avail into stripe_size." 
Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/ Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [ add analysis from report ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:27 +01:00
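The idea of remembering the previously checked dev extent can be sketched in plain C. This is an illustrative userspace model only: struct demo_dev_extent and demo_verify_no_overlap() are hypothetical names, the extents are assumed to arrive sorted by device id and then by start offset, and the real btrfs_verify_dev_extents() works on on-disk items rather than an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a device extent. */
struct demo_dev_extent {
	uint64_t devid;   /* device the extent lives on */
	uint64_t start;   /* physical start offset on that device */
	uint64_t len;     /* length in bytes */
};

/*
 * Walk extents sorted by (devid, start) and remember the end of the
 * previous one. Returns 0 when no two extents on the same device overlap,
 * -1 as soon as an extent starts before the previous one ended.
 */
static int demo_verify_no_overlap(const struct demo_dev_extent *ext, size_t count)
{
	uint64_t prev_devid = 0;
	uint64_t prev_end = 0;
	int have_prev = 0;

	assert(ext != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (have_prev && ext[i].devid == prev_devid &&
		    ext[i].start < prev_end)
			return -1;

		prev_devid = ext[i].devid;
		prev_end = ext[i].start + ext[i].len;
		have_prev = 1;
	}
	return 0;
}
```

In the DUP example from the analysis, a second 800MiB extent placed in a gap that only has room for 778MiB would start a following extent before the previous one ends, which is exactly the condition this loop reports.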
David Sterba	7fb2eced10	btrfs: open code btrfs_dev_replace_clear_lock_blocking There's a single caller and the function name does not say it's actually taking the lock, so open coding makes it more explicit. For now, btrfs_dev_replace_read_lock is used instead of read_lock so it's paired with the unlocking wrapper in the same block. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:37 +02:00
David Sterba	818255feec	btrfs: use common helper instead of open coding a bit test The helper does the same math and we take care about the special case when flags is 0 too. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:35 +02:00
Liu Bo	07e1ce096d	Btrfs: extent_map: use rb_first_cached rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). As evict_inode_truncate_pages() removes all extent mapping by always looking for the first rb entry, it's helpful to use rb_first_cached instead. For more details about the optimization see patch "Btrfs: delayed-refs: use rb_first_cached for href_root". Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:33 +02:00
Nikolay Borisov	a27a94c2b0	btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly Instead of returning an error value and using one of the parameters for returning the actual object we are interested in just refactor the function to directly return btrfs_device *. Also bubble up the error handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into btrfs_rm_device. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:30 +02:00
Nikolay Borisov	6c05040702	btrfs: Make btrfs_find_device_missing_or_by_path return directly a device This function returns a numeric error value and additionally the device found in one of its input parameters. Simplify this by making the function directly return a pointer to btrfs_device. Additionally adjust the caller to handle the case when we want to remove the 'missing' device and ENOENT is returned to return the expected positive error value, parsed by progs. Finally, unexport the function since it's not called outside of volume.c. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:29 +02:00
Nikolay Borisov	b444ad46b2	btrfs: Make btrfs_find_device_by_path return struct btrfs_device Currently this function returns an error code as well as uses one of its arguments as a return value for struct btrfs_device. Change the function so that it returns btrfs_device directly and use the usual "encode error in pointer" mechanics if something goes wrong. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:29 +02:00
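The "encode error in pointer" convention mentioned in these refactoring commits can be illustrated outside the kernel. The sketch below is a userspace re-implementation under stated assumptions: err_ptr(), is_err() and ptr_err() mimic the kernel's ERR_PTR(), IS_ERR() and PTR_ERR() helpers, while struct demo_device and demo_find_device_by_path() are hypothetical stand-ins, not btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Largest errno value that is encoded in the top of the address space. */
#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *err_ptr(intptr_t error)
{
	return (void *)error;
}

/* Decode the errno value from an error pointer. */
static intptr_t ptr_err(const void *ptr)
{
	return (intptr_t)ptr;
}

/* True when the pointer falls in the last MAX_ERRNO addresses. */
static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical device record used only for this demonstration. */
struct demo_device {
	const char *path;
};

static struct demo_device demo_known_device = { "/dev/sda" };

/*
 * Return the device directly, or an encoded -ENOENT, instead of returning
 * an int and filling in an output parameter.
 */
static struct demo_device *demo_find_device_by_path(const char *path)
{
	assert(path != NULL);

	if (strcmp(path, demo_known_device.path) != 0)
		return err_ptr(-ENOENT);
	return &demo_known_device;
}
```

A caller checks is_err() on the returned pointer first and only then dereferences it, which removes the need for a separate output argument.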
Anand Jain	1da739678e	btrfs: add helper to obtain number of devices with ongoing dev-replace When the replace is running the fs_devices::num_devices also includes the replaced device, however some operations like device delete and balance need the actual num_devices without the replaced device. The function btrfs_num_devices() provides just that. And here is a scenario how balance and replace items could co-exist: Consider balance is started and paused, now start the replace followed by an unmount or system power-cycle. During the following mount, open_ctree() first restarts the balance so it must check for the device replace otherwise our num_devices calculation will be wrong. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:26 +02:00
Anand Jain	16220c467a	btrfs: add assertions where number of devices could go below 0 In preparation to add helper function to deduce the num_devices with replace running, use assert instead of BUG_ON or WARN_ON. The number of devices would not normally drop to 0 due to other checks so the assert is sufficient. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog, adjust the assert condition ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:26 +02:00
Anand Jain	801660b040	btrfs: btrfs_shrink_device should call commit transaction at the end Test case btrfs/164 reports use-after-free: [ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP .. [ 6712.195423] btrfs_update_commit_device_size+0x75/0xf0 [btrfs] [ 6712.201424] btrfs_commit_transaction+0x57d/0xa90 [btrfs] [ 6712.206999] btrfs_rm_device+0x627/0x850 [btrfs] [ 6712.211800] btrfs_ioctl+0x2b03/0x3120 [btrfs] Reason for this is that btrfs_shrink_device adds the resized device to the fs_devices::resized_devices after it has called the last commit transaction. So the list fs_devices::resized_devices is not empty when btrfs_shrink_device returns. Now the parent function btrfs_rm_device calls: btrfs_close_bdev(device); call_rcu(&device->rcu, free_device_rcu); and then does the transaction commit. It goes through the fs_devices::resized_devices in btrfs_update_commit_device_size and leads to use-after-free. Fix this by making sure btrfs_shrink_device calls the last needed btrfs_commit_transaction before the return. This is consistent with what the grow counterpart does and this makes sure the on-disk state is persistent when the function returns. Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Tested-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-23 17:37:27 +02:00
Naohiro Aota	39379faaad	btrfs: revert fs_devices state on error of btrfs_init_new_device When btrfs hits error after modifying fs_devices in btrfs_init_new_device() (such as btrfs_add_dev_item() returns error), it leaves everything as is, but frees allocated btrfs_device. As a result, fs_devices->devices and fs_devices->alloc_list contain already freed btrfs_device, leading to later use-after-free bug. Error path also messes the things like ->num_devices. While they go back to the original value by unscanning btrfs devices, it is safe to revert them here. Fixes: `79787eaab4` ("btrfs: replace many BUG_ONs with proper error handling") Signed-off-by: Naohiro Aota <naota@elisp.net> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:13:04 +02:00
Qu Wenruo	64f64f43c8	btrfs: Exit gracefully when chunk map cannot be inserted to the tree It's entirely possible that a crafted btrfs image contains overlapping chunks. Although we can't detect such a problem with the tree-checker, it's not catastrophic: the current extent map code can already detect it and return -EEXIST. We only need to exit gracefully and fail the mount. Reported-by: Xu Wen <wen.xu@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200409 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:13:03 +02:00
Qu Wenruo	cf90d884b3	btrfs: Introduce mount time chunk <-> dev extent mapping check This patch will introduce chunk <-> dev extent mapping check, to protect us against invalid dev extents or chunks. Since chunk mapping is the fundamental infrastructure of btrfs, extra check at mount time could prevent a lot of unexpected behavior (BUG_ON). Reported-by: Xu Wen <wen.xu@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200403 Link: https://bugzilla.kernel.org/show_bug.cgi?id=200407 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:13:03 +02:00
Misono Tomohiro	672d599041	btrfs: Use wrapper macro for rcu string to remove duplicate code Cleanup patch and no functional changes. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:13:02 +02:00
Nikolay Borisov	97aff912a2	btrfs: Remove fs_info from btrfs_finish_chunk_alloc It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:58 +02:00
Nikolay Borisov	f4208794d0	btrfs: Remove fs_info from btrfs_free_chunk It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:57 +02:00
Nikolay Borisov	4f5ad7bd63	btrfs: Remove fs_info from btrfs_destroy_dev_replace_tgtdev This function is always passed a well-formed tgtdevice so the fs_info can be referenced from there. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:57 +02:00
Nikolay Borisov	d6507cf1e2	btrfs: Remove fs_info from btrfs_assign_next_active_device It can be referenced from the passed 'device' argument which is always a well-formed device. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:57 +02:00
Nikolay Borisov	5495f195fc	btrfs: remove fs_info argument from update_dev_stat_item It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:57 +02:00
Nikolay Borisov	68a9db5f23	btrfs: Remove fs_info from btrfs_rm_dev_replace_remove_srcdev It can be referenced from the passed srcdev argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:57 +02:00
Nikolay Borisov	8e87e85627	btrfs: Remove fs_info argument from btrfs_add_dev_item It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:56 +02:00
Gu Jinxiang	315409b009	btrfs: validate type when reading a chunk Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199839, with an image that has an invalid chunk type but does not return an error. Add chunk type check in btrfs_check_chunk_valid, to detect the wrong type combinations. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199839 Reported-by: Xu Wen <wen.xu@gatech.edu> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:55 +02:00
David Sterba	46df06b85e	btrfs: refactor block group replication factor calculation to a helper There are many places that open code the duplicity factor of the block group profiles, create a common helper. This can be easily extended for more copies. Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:53 +02:00
Anand Jain	321a4bf72b	btrfs: use the assigned fs_devices instead of the dereference We have assigned %fs_info->fs_devices to %fs_devices; as it's not modified, just use it for the mutex_lock(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:53 +02:00
Gu Jinxiang	36350e95a2	btrfs: return device pointer from btrfs_scan_one_device Return device pointer (with the IS_ERR semantics) from btrfs_scan_one_device so we don't have to return it through a pointer. And since btrfs_fs_devices can be obtained from btrfs_device, return that. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ fixed conflicts after recent changes to btrfs_scan_one_device ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:48 +02:00
David Sterba	f5194e34ca	btrfs: lift uuid_mutex to callers of btrfs_open_devices Preparatory work to fix race between mount and device scan. The callers will have to manage the critical section, e.g. mount wants to scan and then call btrfs_open_devices without the ioctl scan walking in and modifying the fs devices in the meantime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:47 +02:00
David Sterba	899f9307c3	btrfs: lift uuid_mutex to callers of btrfs_scan_one_device Preparatory work to fix race between mount and device scan. The callers will have to manage the critical section, e.g. mount wants to scan and then call btrfs_open_devices without the ioctl scan walking in and modifying the fs devices in the meantime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:47 +02:00
Anand Jain	7bcb8164ad	btrfs: use device_list_mutex when removing stale devices btrfs_free_stale_devices() finds a stale (not opened) device matching path in the fs_uuid list. We are already under uuid_mutex so when we check for each fs_devices, hold the device_list_mutex too. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:47 +02:00
Anand Jain	fa6d2ae540	btrfs: rename local devices for fs_devices in btrfs_free_stale_devices( Over the years we named %fs_devices and %devices to represent the struct btrfs_fs_devices and the struct btrfs_device. So follow the same scheme here too. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:47 +02:00
Anand Jain	9c6d173ea6	btrfs: extend locked section when adding a new device in device_list_add Make sure the device_list_lock is held the whole time: * when the device is being looked up * new device is initialized and put to the list * the list counters are updated (fs_devices::opened, fs_devices::total_devices) Signed-off-by: Anand Jain <anand.jain@oracle.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:46 +02:00
Anand Jain	4306a97449	btrfs: do btrfs_free_stale_devices outside of device_list_add btrfs_free_stale_devices() looks for device path reused for another filesystem, and deletes the older fs_devices::device entry. In preparation to handle locking in device_list_add, move btrfs_free_stale_devices outside as these two functions serve a different purpose. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:46 +02:00
Nikolay Borisov	959b1c0467	btrfs: close devices without offloading to a temporary list Since commit `88c14590cd` ("btrfs: use RCU in btrfs_show_devname for device list traversal") btrfs_show_devname no longer takes device_list_mutex. As such the deadlock that `0ccd05285e` ("btrfs: fix a possible umount deadlock") aimed to fix no longer exists, we can free the devices immediately and remove the code that does the pending work. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:46 +02:00
Qu Wenruo	621567a28c	btrfs: Remove unused function btrfs_account_dev_extents_size This function is not used since the alloc_start parameter has been obsoleted in commit `0d0c71b317` ("btrfs: obsolete and remove mount option alloc_start"). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:46 +02:00
Anand Jain	b4993e64f7	btrfs: fix in-memory value of total_devices after seed device deletion In case of deleting the seed device the %cur_devices (seed) and the %fs_devices (parent) are different. Now, as the parent fs_devices::total_devices also maintains the total number of devices including the seed device, so decrement its in-memory value for the successful seed delete. We are already updating its corresponding on-disk btrfs_super_block::number_devices value. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:45 +02:00
David Sterba	d7f663fa3f	btrfs: prune unused includes Remove includes if none of the interfaces and exports is used in the given source file. Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:43 +02:00
Anand Jain	694c51fb2e	btrfs: drop unnecessary variable in btrfs_init_new_device There is only one use of the declared devices variable; use its value directly instead. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:42 +02:00
Anand Jain	5da54bc138	btrfs: use a temporary variable for fs_devices in btrfs_init_new_device There are many instances of the %fs_info->fs_devices pointer dereferences, use a temporary variable instead. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:42 +02:00
Qu Wenruo	fce466eab7	btrfs: tree-checker: Verify block_group_item A crafted image with invalid block group items could make free space cache code to cause panic. We could detect such invalid block group item by checking: 1) Item size Known fixed value. 2) Block group size (key.offset) We have an upper limit on block group item (10G) 3) Chunk objectid Known fixed value. 4) Type Only 4 valid type values, DATA, METADATA, SYSTEM and DATA\|METADATA. No more than 1 bit set for profile type. 5) Used space No more than the block group size. This should allow btrfs to detect and refuse to mount the crafted image. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849 Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:41 +02:00
Nikolay Borisov	43a7e99db6	btrfs: Remove fs_info from btrfs_force_chunk_alloc It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:38 +02:00
Nikolay Borisov	451a2c1303	btrfs: Remove fs_info from check_system_chunk It can be referenced from trans since the function is always called within a transaction. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:36 +02:00
Nikolay Borisov	c216b2039a	btrfs: Remove fs_info from btrfs_alloc_chunk It can be referenced from trans since the function is always called within a transaction. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:36 +02:00
Nikolay Borisov	5a98ec0141	btrfs: Remove fs_info from btrfs_remove_block_group This function is always called with a valid transaction handle from where we can reference fs_info. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:34 +02:00
Nikolay Borisov	e7e02096d9	btrfs: Remove fs_info from btrfs_make_block_group This function is always called with a valid transaction handle from where we can reference the fs_info. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:34 +02:00
David Sterba	20c5bbc640	btrfs: restore uuid_mutex in btrfs_open_devices Commit `542c5908ab` ("btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices") switched to device_list_mutex as we need that for the device list traversal, but we also need uuid_mutex to protect access to fs_devices::opened to be consistent with other users of that. Fixes: `542c5908ab` ("btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices") Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-07-13 14:55:46 +02:00
Lu Fengqi	cdb345a877	btrfs: Remove fs_info argument from btrfs_uuid_tree_add This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:52 +02:00
Anand Jain	6dac13f8e2	btrfs: add prefix "balance:" for log messages Kernel logs are very important for forensic investigation of issues in general, so make them easy to use. This patch adds a 'balance:' prefix so that the messages can be easily searched. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:38 +02:00
Anand Jain	d9a071f008	btrfs: use common variable for fs_devices in btrfs_destroy_dev_replace_tgtdev Use a local btrfs_fs_devices variable to access the structure. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:18 +02:00
Anand Jain	ab5c2f65de	btrfs: drop uuid_mutex in btrfs_destroy_dev_replace_tgtdev Delete the uuid_mutex lock here as this thread accesses the btrfs_fs_devices::devices only (counters or called functions do a list traversal). And the device_list_mutex lock is already taken. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:17 +02:00
Anand Jain	542c5908ab	btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices btrfs_open_devices() is using the uuid_mutex, but as btrfs_open_devices is limited to opening all the devices for the given fsid, we don't need uuid_mutex. Instead it should hold the device_list_mutex as it updates the members of the btrfs_fs_devices and btrfs_device and not the whole fs_devs list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:15 +02:00
Anand Jain	3dd0f7a364	btrfs: document uuid_mutex uasge in read_chunk_tree read_chunk_tree() calls read_one_dev(), but for seed device we have to search the fs_uuids list, so we need the uuid_mutex. Add a comment, so that we can improve this part. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:14 +02:00
Anand Jain	41a52a0f1b	btrfs: use existing cur_devices, cleanup btrfs_rm_device Instead of de-referencing the device->fs_devices use cur_devices which points to the same fs_devices and does not change. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:13 +02:00
Anand Jain	b6ed73bcb1	btrfs: reduce uuid_mutex critical section while scanning devices The generic block device lookup or cleanup does not need the uuid mutex, that's only for the device_list_add. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:12 +02:00
David Sterba	6fcf6e2bff	btrfs: remove redundant btrfs_balance_control::fs_info The fs_info is always available from the context so we don't need to store it in the structure. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
Nikolay Borisov	63a9c7b9ce	btrfs: Remove devid parameter from btrfs_rmap_block This function is used in only one place and devid argument is always passed 0. So just remove it, similarly to how it was removed in the userspace code. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Gu Jinxiang	0338dff6e0	btrfs: do reverse path readahead in btrfs_shrink_device In btrfs_shrink_device, before btrfs_search_slot, path->reada is set to READA_FORWARD. But I think READA_BACK is correct. Since: 1. key.offset is set to (u64)-1 2. after btrfs_search_slot, btrfs_previous_item is called So, for readahead previous items, READA_BACK is the correct one. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Anand Jain	f9fbcaa2a3	btrfs: move btrfs_raid_mindev_errorvalues to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::mindev_error so that btrfs_raid_array can maintain the error code to return if the minimum number of devices condition is not met while trying to delete a device in the given raid. And so we can drop btrfs_raid_mindev_error. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Anand Jain	41a6e8913c	btrfs: move btrfs_raid_group values to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::bg_flag so that btrfs_raid_array can maintain the bit map flag of the raid type, and so we can drop btrfs_raid_group. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Anand Jain	ed23467b18	btrfs: move btrfs_raid_type_names values to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::raid_name so that btrfs_raid_array can maintain the name of the raid type, and so we can drop btrfs_raid_type_names. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
David Sterba	833aae18fc	btrfs: open code set_balance_control The helper is quite simple and I'd like to see the locking in the caller. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	1354e1a13e	btrfs: use mutex in btrfs_resume_balance_async While the spinlock does not cause problems, using the mutex is more correct and consistent with others. The global status of balance is eg. checked from btrfs_pause_balance or btrfs_cancel_balance with mutex. Resuming balance happens during mount or ro->rw remount. In the former case, no other user of the balance_ctl exists, in the latter, balance cannot run until the ro/rw transition is finished. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	008ef0969d	btrfs: drop lock parameter from update_ioctl_balance_args and rename The parameter controls locking of the stats part but we can lock it unconditionally, as this only happens once when balance starts. This is not performance critical. Add the prefix for an exported function. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	cf7d20f447	btrfs: move and comment read-only check in btrfs_cancel_balance Balance cannot be started on a read-only filesystem and will have to finish/exit before eg. going to read-only via remount. In case the filesystem is forcibly set to read-only after an error, balance will finish anyway and if the cancel call is too fast it will just wait for that to happen. The last case is when the balance is paused after mount but it's read-only and cancelling would want to delete the item. The test is moved after the check if balance is running at all, as it looks more logical to report "no balance running" instead of "read-only filesystem". Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	3009a62f3b	btrfs: track running balance in a simpler way Currently fs_info::balance_running is 0 or 1 and does not use the semantics of atomics. The pause and cancel check for 0, that can happen only after __btrfs_balance exits for whatever reason. Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple times but will block on the balance_mutex that protects the fs_info::flags bit. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	dccdb07bc9	btrfs: kill btrfs_fs_info::volume_mutex Mutual exclusion of device add/rm and balance was done by the volume mutex up to version 3.7. The commit `5ac00addc7` ("Btrfs: disallow mutually exclusive admin operations from user mode") added a bit that essentially tracked the same information. The status bit has an advantage over a mutex that it can be set without restrictions of function context, so it started to be used in the mount-time resuming of balance or device replace. But we don't really need to track the same information in two ways. 1) After the previous cleanups, the main ioctl handlers for add/del/resize copy the EXCL_OP bit next to the volume mutex, here it's clearly safe. 2) Resuming balance during mount or after rw remount will set only the EXCL_OP bit and the volume_mutex is held in the kernel thread that calls btrfs_balance. 3) Resuming device replace during mount or after rw remount is done after balance and is excluded by the EXCL_OP bit. It does not take the volume_mutex at all and completely relies on the EXCL_OP bit. 4) The resuming of balance and dev-replace cannot happen at the same time as the ioctls cannot be started in parallel. Nevertheless, a crafted image could trigger that and a warning is printed. 5) Balance is normally excluded by EXCL_OP and also uses its own mutex to protect against concurrent access to its status data. There's some trickery to maintain the right lock nesting in case we need to reexamine the status in btrfs_ioctl_balance. The volume_mutex is removed and the unlock/lock sequence is left in place as we might expect other waiters to proceed. 6) Similar to 5, the unlock/lock sequence is kept in btrfs_cancel_balance to allow waiters to continue. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	a0fecc2371	btrfs: remove wrong use of volume_mutex from btrfs_dev_replace_start The volume mutex does not protect against anything in this case, the comment about scrub is right but not related to locking and looks confusing. The comment in btrfs_find_device_missing_or_by_path is wrong and confusing too. The device_list_mutex is not held here to protect device lookup, but in this case device replace cannot run in parallel with device removal (due to exclusive op protection), so we don't need further locking here. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	149196a2ae	btrfs: cleanup helpers that reset balance state The function __cancel_balance name is confusing with the cancel operation of balance and it really resets the state of balance back to zero. The unset_balance_control helper is called only from one place and simple enough to be inlined. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	eee95e3fb0	btrfs: add sanity check when resuming balance after mount Replace a WARN_ON with a proper check and message in case something goes really wrong and resumed balance cannot set up its exclusive status. The check is a user friendly assertion, I don't expect to ever happen under normal circumstances. Also document that the paused balance starts here and owns the exclusive op status. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	a17c95df4c	btrfs: move clearing of EXCL_OP out of __cancel_balance Make the clearing visible in the callers so we can pair it with the test_and_set part. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	72b81abf95	btrfs: move volume_mutex to callers of btrfs_rm_device Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation so it's obvious that the two happen at the same time. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	d48f39d5a5	btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make static The function logically belongs there and there's only a single caller, no need to export it. No code changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	a425f9d475	btrfs: export and rename free_device The function will be used outside of volumes.c, the allocation btrfs_alloc_device is also exported. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:23 +02:00
David Sterba	6fc4749d25	btrfs: make success path out of btrfs_init_dev_replace_tgtdev more clear This is a preparatory cleanup that will make clear that the only successful way out of btrfs_init_dev_replace_tgtdev will also set the device_out to a valid pointer. With this guarantee, the callers can be simplified. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:23 +02:00
Anand Jain	b518519713	btrfs: cleanup btrfs_rm_device() promote fs_devices pointer This function uses fs_info::fs_devices a number of times, however we declare and use it only at the end; instead declare it at the beginning of the function and use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:22 +02:00
Anand Jain	636d2c9d63	btrfs: cleanup find_device() drop list_head pointer find_device() declares struct list_head *head pointer and used only once, instead just use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:22 +02:00
Anand Jain	897fb5734a	btrfs: rename __btrfs_open_devices to open_fs_devices __btrfs_open_devices() is un-exported drop __ prefix and rename it to open_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	0226e0eb65	btrfs: rename __btrfs_close_devices to close_fs_devices __btrfs_close_devices() is un-exported, drop the __ prefix and rename it to close_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	f117e290e8	btrfs: cleanup __btrfs_open_devices() drop head pointer __btrfs_open_devices() declares struct list_head *head, however head is used only once, instead use btrfs_fs_devices::devices directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	c4babc5e38	btrfs: rename struct btrfs_fs_devices::list btrfs_fs_devices::list is the list of BTRFS fsid in the kernel, a generic name 'list' makes searching for it very difficult, rename it to fs_list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	89595e80de	btrfs: add comment about BTRFS_FS_EXCL_OP Adds comments about BTRFS_FS_EXCL_OP to existing comments about the device locks. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor updates ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:15 +02:00
Anand Jain	02ee654d3a	btrfs: fix crash when trying to resume balance without the resume flag We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance() only, which isn't called during the remount. So when resuming from the paused balance we hit the bug: kernel: kernel BUG at fs/btrfs/volumes.c:3890! :: kernel: balance_kthread+0x51/0x60 [btrfs] kernel: kthread+0x111/0x130 :: kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8 Reproducer: On a mounted filesystem: btrfs balance start --full-balance /btrfs btrfs balance pause /btrfs mount -o remount,ro /dev/sdb /btrfs mount -o remount,rw /dev/sdb /btrfs To fix this set the BTRFS_BALANCE_RESUME flag in btrfs_resume_balance_async(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-17 14:38:24 +02:00
David Sterba	c1d7c514f7	btrfs: replace GPL boilerplate by SPDX -- sources Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>	2018-04-12 16:29:51 +02:00
David Sterba	7e79cb86be	btrfs: split dev-replace locking helpers for read and write The current calls are unclear in what way btrfs_dev_replace_lock takes the locks, so drop the argument, split the helpers and use similar naming as for read and write locks. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:07 +02:00
David Sterba	a32bf9a302	btrfs: use lockdep_assert_held for mutexes Using lockdep_assert_held is preferred, replace mutex_is_locked. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:06 +02:00
Jeff Mahoney	75cb379d26	btrfs: defer adding raid type kobject until after chunk relocation Any time the first block group of a new type is created, we add a new kobject to sysfs to hold the attributes for that type. Kobject-internal allocations always use GFP_KERNEL, making them prone to fs-reclaim races. While it appears as if this can occur any time a block group is created, the only times the first block group of a new type can be created in memory is at mount and when we create the first new block group during raid conversion. This patch adds a new list to track pending kobject additions and then handles them after we do chunk relocation. Between relocating the target chunk (or forcing allocation of a new chunk in the case of data) and removing the old chunk, we're in a safe place for fs-reclaim to occur. We're holding the volume mutex, which is already held across page faults, and the delete_unused_bgs_mutex, which will only stall the cleaner thread. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:12 +02:00
Anand Jain	8ba0ae7821	btrfs: drop optimal argument from find_live_mirror() Drop optimal argument from the function find_live_mirror() as we can deduce it in the function itself. Also rename optimal to preferred_mirror. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:11 +02:00
Anand Jain	99f92a7c1e	btrfs: drop num argument from find_live_mirror() Obtain the stripes info from the map directly and so no need to pass it as an argument. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:11 +02:00
Qu Wenruo	ba89b80268	btrfs: volumes: Remove the meaningless condition of minimal nr_devs when allocating a chunk When checking the minimal nr_devs, there is one dead and meaningless condition: if (ndevs < devs_increment * sub_stripes \|\| ndevs < devs_min) { ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This condition is meaningless, @devs_increment has nothing to do with @sub_stripes. In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1 (RAID10) already has the @devs_increment set to 2. So no need to multiply it by @sub_stripes. And above condition is also dead. For RAID10, @devs_increment * @sub_stripes equals 4, which is also the @devs_min of RAID10. For other profiles, @sub_stripes is always 1, and since @ndevs is rounded down to @devs_increment, the condition will always be true. Remove the meaningless condition to make later readers wonder less. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:56 +02:00
Anand Jain	7c829b722d	btrfs: add define for oldest generation Some functions can filter metadata by the generation. Add a define that will annotate such arguments. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:50 +02:00
Anand Jain	9b99b11564	btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devids This function btrfs_close_extra_devices() is about freeing extra devids which may once have belonged to this filesystem. So rename it and add the comment. The _devid suffix is appropriate as this function won't handle devices which are outside of the filesystem being mounted. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:42 +02:00
Anand Jain	16db5758fe	btrfs: remove assert in btrfs_init_dev_replace_tgtdev() In the same function we just ran btrfs_alloc_device() which means the btrfs_device::resized_list is sure to be empty and we are protected with the btrfs_fs_info::volume_mutex. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:40 +02:00
David Sterba	ffc5a3794f	btrfs: add (the only possible) __exit annotation Recently, the __init annotations have been added. There's unfortunately only one case where we can add __exit, because most of the cleanup helpers are also called from the __init phase. As the __exit annotated functions get discarded completely for a built-in code, we'd miss them from the init phase. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:39 +02:00
Qu Wenruo	3e72ee8874	btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index() Function __get_raid_index() is used to convert block group flags into raid index, which can be used to get various info directly from btrfs_raid_array[]. Refactor this function a little: 1) Rename to btrfs_bg_flags_to_raid_index() Double underline prefix is normally for internal functions, while the function is used by both extent-tree and volumes. Although the name is a little longer, but it should explain its usage quite well. 2) Move it to volumes.h and make it static inline Just several if-else branches, really no need to define it as a normal function. This also makes later code re-use between kernel and btrfs-progs easier. 3) Remove function get_block_group_index() Really no need to do such a simple thing as an exported function. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:38 +02:00
Qu Wenruo	793ff2c88c	btrfs: volumes: Cleanup stripe size calculation Cleanup the following things: 1) open coded SZ_16M round up 2) use min() to replace open-coded size comparison 3) code style Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:37 +02:00
Anand Jain	b1b8e38622	btrfs: insert newly opened device to the end of the list Add opened device to the tail of dev_alloc_list instead of head, so that it maintains the same order as dev_list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:37 +02:00
Anand Jain	f8e10cd3f8	btrfs: keep device list sorted Keeping the device list sorted lets us reproduce the problems related to missing chunks in the degraded mode much more consistently. So sort the devices by devid within the kernel, so that we know which device is assigned to the struct fs_info::latest_bdev when all the devices have the same SB generation. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:36 +02:00
Qu Wenruo	4117f207d4	btrfs: Add chunk allocation ENOSPC debug message for enospc_debug mount option Enospc_debug makes extent allocator print more debug messages, however for chunk allocation, there is no debug message for enospc_debug at all. This patch will add message for the following parts of chunk allocator: 1) No rw device at all Quite rare, but at least output one message for this case. 2) Not enough space for some device This debug message is quite handy for unbalanced disks with stripe based profiles (RAID0/10/5/6). 3) Not enough free devices This debug message should tell us if current chunk allocator is working correctly under minimal device requirements. Although in most cases, we will hit other ENOSPC before we even hit a chunk allocator ENOSPC, but such debug info won't help. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:35 +02:00
Nikolay Borisov	e9b919b1f7	btrfs: Remove fs_info argument from btrfs_update_commit_device_bytes_used We already pass the btrfs_transaction which references fs_info so no need to pass the latter as an argument. Also use the opportunity to shorten transaction->trans. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:32 +02:00
Anand Jain	15fc1283f6	btrfs: open code btrfs_init_dev_replace_tgtdev_for_resume() btrfs_init_dev_replace_tgtdev_for_resume() initializes replace target device in a few simple steps, so do it at the parent function. Moreover, there isn't any other caller so just open code it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:30 +02:00
Qu Wenruo	062d4d1f40	btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:28 +02:00
Hans van Kranenburg	92e222df7b	btrfs: alloc_chunk: fix DUP stripe size handling In case of using DUP, we search for enough unallocated disk space on a device to hold two stripes. The devices_info[ndevs-1].max_avail that holds the amount of unallocated space found is directly assigned to stripe_size, while it's actually twice the stripe size. Later on in the code, an unconditional division of stripe_size by dev_stripes corrects the value, but in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size. The unconditional division later tries to correct stripe_size, but will actually make sure we can't allocate more than half the max_chunk_size. Fix this by moving the division by dev_stripes before the max chunk size check, so it always contains the right value, instead of putting a duct tape division in further on to get it fixed again. Since in all other cases than DUP, dev_stripes is 1, this change only affects DUP. Other attempts in the past were made to fix this: * `37db63a400` "Btrfs: fix max chunk size check in chunk allocator" tried to fix the same problem, but still resulted in part of the code acting on a wrongly doubled stripe_size value. * `86db25785a` "Btrfs: fix max chunk size on raid5/6" unintentionally broke this fix again. The real problem was already introduced with the rest of the code in `73c5de0051`. The user visible result however will be that the max chunk size for DUP will suddenly double, while it's actually acting according to the limits in the code again like it was 5 years ago. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html Fixes: `73c5de0051` ("btrfs: quasi-round-robin for chunk allocation") Fixes: `86db25785a` ("Btrfs: fix max chunk size on raid5/6") Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-01 16:16:47 +01:00
Nikolay Borisov	fd649f10c3	btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device Commit `4fde46f0cc` ("Btrfs: free the stale device") introduced btrfs_free_stale_device which iterates the device lists for all registered btrfs filesystems and deletes those devices which aren't mounted. In a btrfs_devices structure has only 1 device attached to it and it is unused then btrfs_free_stale_devices will proceed to also free the btrfs_fs_devices struct itself. Currently this leads to a use after free since list_for_each_entry will try to perform a check on the already freed memory to see if it has to terminate the loop. The fix is to use 'break' when we know we are freeing the current fs_devs. Fixes: `4fde46f0cc` ("Btrfs: free the stale device") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-02-05 17:15:14 +01:00
Anand Jain	3acbcbfc8f	btrfs: drop devid as device_list_add() arg As struct btrfs_disk_super is being passed, so it can get devid the same way its parent does. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-29 19:31:16 +01:00
Anand Jain	e124ece53e	btrfs: get device pointer from device_list_add() Instead of pointer to btrfs_fs_devices as an arg in device_list_add() better to get pointer to btrfs_device as return value, then we have both, pointer to btrfs_device and btrfs_fs_devices. btrfs_device is needed to handle reappearing missing device. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-29 19:31:15 +01:00
Anand Jain	f2788d2f76	btrfs: set the total_devices in device_list_add() There is no other parent for device_list_add() except for btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices if device_list_add is successful and this can be done within device_list_add() itself. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:56 +01:00
Anand Jain	327f18cc7f	btrfs: move pr_info into device_list_add Commit `60999ca4b4` ("btrfs: make device scan less noisy") adds return value 1 to device_list_add(), so that parent function can call pr_info only when new device is added. Move the pr_info() part into device_list_add() so that this function can be kept simple. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:54 +01:00
Anand Jain	d8367db30a	btrfs: make btrfs_free_stale_devices() to match the path The btrfs_free_stale_devices() is updated to match for the given device path and delete it. (It searches for only unmounted list of devices.) Also drop the comment about different path being used for the same device, since now we will have cli to clean any device that's not a concern any more. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:52 +01:00
Anand Jain	0d34097f66	btrfs: rename btrfs_free_stale_devices() arg to skip_dev No functional changes. Rename btrfs_free_stale_devices() arg to skip_dev, so that it reflects what that arg for. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:50 +01:00
Anand Jain	522f1b45e4	btrfs: make btrfs_free_stale_devices() argument optional This updates btrfs_free_stale_devices() helper function to delete all unmounted devices, when arg is NULL. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:48 +01:00
Anand Jain	38cf665d33	btrfs: make btrfs_free_stale_device() to iterate all stales Let the list iterator iterate further and find other stale devices and delete it. This is in preparation to add support for user land request-able stale devices cleanup. Also rename btrfs_free_stale_device() to btrfs_free_stale_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:47 +01:00
Anand Jain	a848b3e547	btrfs: no need to check for btrfs_fs_devices::seeding There is no need to check for btrfs_fs_devices::seeding when we have checked for btrfs_fs_devices::opened, because we can't sprout without its seed FS being opened. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:44 +01:00
Matthew Wilcox	3cbf26da5e	btrfs: Remove unused readahead spinlock The reada_lock in struct btrfs_device was only initialised, and not actually used. That's good because there's another lock also called reada_lock in the btrfs_fs_info that was quite heavily used. Remove this one. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00


1239 commits